Published as a conference paper at ICLR 2025

DISCRETE COPULA DIFFUSION

Anji Liu1,2, Oliver Broadrick1, Mathias Niepert2, Guy Van den Broeck1
1Department of Computer Science, University of California, Los Angeles, USA
2Institute for Artificial Intelligence, University of Stuttgart, Germany
{liuanji,obroadrick,guyvdb}@cs.ucla.edu, mathias.niepert@ki.uni-stuttgart.de

ABSTRACT

Discrete diffusion models have recently shown significant progress in modeling complex data, such as natural languages and DNA sequences. However, unlike diffusion models for continuous data, which can generate high-quality samples in just a few denoising steps, modern discrete diffusion models still require hundreds or even thousands of denoising steps to perform well. In this paper, we identify a fundamental limitation that prevents discrete diffusion models from achieving strong performance with fewer steps: they fail to capture dependencies between output variables at each denoising step. To address this issue, we provide a formal explanation and introduce a general approach to supplement the missing dependency information by incorporating another deep generative model, termed the copula model. Our method does not require fine-tuning either the diffusion model or the copula model, yet it enables high-quality sample generation with significantly fewer denoising steps. When we apply this approach to autoregressive copula models, the combined model outperforms both models individually in unconditional and conditional text generation. Specifically, the hybrid model achieves better (un)conditional text generation using 8 to 32 times fewer denoising steps than the diffusion model alone.
In addition to presenting an effective discrete diffusion generation algorithm, this paper emphasizes the importance of modeling inter-variable dependencies in discrete diffusion.¹

1 INTRODUCTION

Discrete diffusion models have recently achieved significant progress in modeling complex data such as natural languages (Campbell et al., 2022; Sahoo et al., 2024), protein sequences (Gruver et al., 2023; Morehead et al., 2023), and graphs (Vignac et al., 2022; Huang et al., 2023). In particular, recent discrete diffusion models for text generation (Lou et al., 2024; Sahoo et al., 2024; Shi et al., 2024) have matched or even surpassed the performance of autoregressive models at the scale of GPT-2 (Radford et al., 2019). Additionally, discrete diffusion models offer improved inference-time controllability using guidance from auxiliary models such as classifiers (Dhariwal & Nichol, 2021), making them suitable for controlled generation tasks (Li et al., 2022; Han et al., 2023). Despite these promising results, discrete diffusion models still require hundreds to thousands of denoising steps to produce high-quality samples (Austin et al., 2021; Sahoo et al., 2024), significantly affecting their efficiency.

In this paper, we identify a fundamental limitation in most discrete diffusion models that hinders their ability to generate high-quality samples in just a few steps. We illustrate the problem in Figure 1. At each denoising step, a partially completed sample shown in the top-left is fed into a sequence-to-sequence denoising model, which predicts the univariate marginal distributions for each masked token independently. A new output sequence is then sampled based on these univariate marginals before proceeding to the next denoising step. The key issue with this process is that when multiple edits (i.e., replacing masked tokens with data tokens) are made simultaneously, the model does not account for the joint probability of these changes occurring together.
As a result, the generated samples often lack coherence, as shown in the bottom-left of Figure 1.

¹Code is available at https://github.com/liuanji/Copula-Diffusion.

[Figure 1 here; it shows the masked input "The ⟨mask⟩ dog ⟨mask⟩ the neighbors.", per-token marginals predicted by the diffusion model (scared 0.18, frightened 0.15, calmed 0.07; puppy 0.11, barking 0.07, black 0.03), and sample completions such as "The puppy dog scared the neighbors.", "The puppy dog calmed the neighbors.", "The barking dog scared the neighbors.", and "The puppy dog frightened the neighbors.", contrasting independent sampling with joint sampling via the copula model, which is computed implicitly in practice.]

Figure 1: Discrete Copula Diffusion (DCD). At each denoising step, a partially completed sequence is given as input (top-left). The diffusion model independently predicts the univariate marginals for each masked token, which leads to the samples in the bottom-left. DCD introduces an additional copula model (top-right) to capture the inter-variable dependencies, thereby supplementing the information missed by the diffusion model. By combining outputs from both models in a principled way, DCD achieves better performance than either model individually (see improved samples in the bottom-right), enabling few-step discrete diffusion generation.

This problem is exacerbated in few-step generation, where many tokens must be edited simultaneously. We formally demonstrate that if the diffusion model predicts each variable independently, an irreducible term (in addition to the data entropy) remains in the negative evidence lower bound (ELBO), preventing the model from perfectly capturing the data distribution. We propose using a generative model, which we refer to as the copula model, to compensate for the missing dependency information between output variables at each denoising step.
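The coherence failure from sampling tokens independently can be made concrete with a small numerical sketch; the joint distribution below is illustrative (loosely echoing Figure 1's example), not taken from any trained model:

```python
# Toy joint denoising distribution over two masked positions. The mass sits on
# coherent pairs; these numbers are hypothetical, chosen only for illustration.
joint = {
    ("alpine", "skiing"): 0.48,
    ("scuba", "diving"): 0.48,
    ("alpine", "diving"): 0.02,
    ("scuba", "skiing"): 0.02,
}

# Univariate marginals -- all that a per-token denoiser exposes in one step.
def marginal(pos, word):
    return sum(p for pair, p in joint.items() if pair[pos] == word)

# Probability of producing an incoherent pair when each token is sampled
# independently from its marginal, vs. under the true joint distribution.
p_bad_indep = (marginal(0, "alpine") * marginal(1, "diving")
               + marginal(0, "scuba") * marginal(1, "skiing"))
p_bad_joint = joint[("alpine", "diving")] + joint[("scuba", "skiing")]
print(p_bad_indep, p_bad_joint)  # roughly 0.5 vs 0.04
```

Both tokens have roughly 50/50 marginals, so independent sampling mixes them freely, while the joint distribution almost never pairs "alpine" with "diving". This pairing information is exactly what a copula is meant to restore.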
Our method operates only at inference time and can be adapted to any discrete diffusion model and a wide range of copula models. As illustrated on the right side of Figure 1, the input sequence is also fed into a copula model that (implicitly) produces information on inter-variable dependencies. This information is combined with the univariate marginals predicted by the diffusion model to produce a more accurate distribution, resulting in higher-quality samples, shown in the bottom-right corner. We formally show that the univariate marginals from the diffusion model and the dependencies captured by the copula model can be combined in a principled way, leading to a better approximation of the true denoising distribution under mild assumptions. Further, finding this combined distribution reduces to solving a convex optimization problem that can be efficiently approximated in practice.

By instantiating the copula model as an autoregressive deep generative model such as GPT (Radford et al., 2019), we propose an algorithm that combines any pretrained discrete diffusion model with an autoregressive model to form a hybrid model called Discrete Copula Diffusion (DCD). This model is capable of producing high-quality (un)conditional samples with only a few denoising steps. Empirical results on text and antibody generation show that DCD significantly outperforms both of its base models. Moreover, DCD achieves comparable or better performance using 8 to 32 times fewer denoising steps compared to the base discrete diffusion model. In addition to proposing a discrete diffusion model capable of few-step generation, we emphasize the importance of modeling inter-variable dependencies in discrete diffusion models.

2 PRELIMINARIES

We aim to model the joint distribution of variables X0, a set of categorical variables with C categories.
Discrete diffusion models (Austin et al., 2021) learn to sample from p(X_0) by modeling the reversal of the following noising process involving X_0 and a set of auxiliary variables {X_t}_{t=1}^{T}:

∀t ∈ {1, ..., T},  q(x_t | x_{t-1}) := Cat(x_t; Q_t x_{t-1}),   (1)

where Cat(x; p) refers to the categorical distribution over x with class probabilities p, and Q_t is a C × C transition matrix that is applied independently to every variable x^i_{t-1} (denote by x^i_{t-1} the ith variable of x_{t-1}) to get the corresponding categorical distribution of x^i_t. Specifically, each variable x^i_{t-1} is treated as a one-hot vector of size C × 1, which is then multiplied by Q_t to compute the class probabilities of x^i_t. The noising process is designed such that p(x_T) follows a simple distribution regardless of the data distribution.

Instead of using a fixed number of predefined time steps, we can treat t as a continuous variable within the range [0, T]. The noising process is now defined by the rate of change of p(x_t) w.r.t. t: dp(x_t)/dt = Q p(x_t), where Q ∈ R^{C×C} is a transition rate matrix. For any 0 ≤ s < t ≤ T, [...]

[...] Assume p_tar(x) > 0 and p_est(x) > 0. Then p̂ exists and has the form p̂(x) = p_est(x) ∏_i σ_i(x_i), where σ_i is a positive function that depends on x_i. Assuming X consists of N categorical variables, each with C categories, we can represent the factors {σ_i}_i using a matrix V ∈ R^{N×C}. Under this representation, the combined distribution is

p̂(x) = p_est(x) ∏_i exp(V[i, x_i]),   (3)

where V[i, j] denotes the element at the ith row and jth column of V, and V[i, x_i] = log σ_i(x_i). Determining the true matrix V corresponding to p̂, which is the I-projection of p_est onto P_mar^{p_tar}, can be reformulated as solving the following convex optimization problem.
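As a minimal sketch of one forward step in Equation (1), assuming an absorbing-mask transition matrix with an illustrative per-step mask probability (the matrix entries are my own example, not a schedule from the paper):

```python
import numpy as np

C = 4            # 3 data categories plus one absorbing mask state (index 3)
beta_t = 0.3     # illustrative probability of masking at this step

# Columns of Q_t index the source state: a data token stays put with
# probability 1 - beta_t, jumps to the mask state with probability beta_t,
# and the mask state is absorbing.
Q_t = np.eye(C) * (1 - beta_t)
Q_t[3, :] = beta_t
Q_t[3, 3] = 1.0

x_prev = np.eye(C)[1]     # one-hot encoding of category 1 (a C x 1 vector)
probs = Q_t @ x_prev      # class probabilities of x_t, i.e., Cat(x_t; Q_t x_{t-1})
print(probs)              # stays at category 1 w.p. 0.7, masked w.p. 0.3
```

Applying Q_t to each position's one-hot vector independently is exactly why the per-step noising (and hence the learned per-step denoising marginals) factorizes over positions.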
[Figure 2 here; given the general prompt "Let's do outdoor sports! How about ⟨mask⟩ ⟨mask⟩?", it shows a 2×2 probability table over X ∈ {alpine, scuba} and Y ∈ {skiing, diving} (with entries such as p00 = 0.4 and p01 = 0.02), its equivalent representation as univariate marginals (values 0.42, 0.58, 0.52 shown) plus the copula(X, Y), and, given the additional suffix "in Switzerland", an I-projection that adjusts the marginals while keeping the copula unchanged; sampling from the new joint yields "How about alpine skiing?". The strong dependency makes "alpine skiing" and "scuba diving" the most probable phrases.]

Figure 2: Illustration of the decomposition of a distribution into univariate marginals and a copula.

Theorem 1. If V minimizes the following convex objective function, then the corresponding p̂ defined by Equation (3) is the I-projection of p_est onto P_mar^{p_tar}:²

L(V; p_tar, p_est) := Σ_x p_est(x) ∏_i exp(V[i, x_i]) − Σ_i Σ_{x_i=1}^{C} V[i, x_i] p_tar(x_i).   (4)

We proceed to explain why I-projecting p_est leads to a better estimate of p_tar, as suggested by Proposition 2. In general, a joint distribution can be viewed as combining two independent pieces of information: (i) a set of univariate marginal distributions and (ii) a copula describing the association or dependence among the variables. By the classical work of Sklar (1959), for continuous variables the copula can take the form of a joint distribution with uniform margins and can be combined quite simply with univariate marginal distributions to recover the full joint distribution, a fact heavily exploited in statistics (Nelsen, 2006). While the discrete case is somewhat less straightforward, recent work of Geenens (2020) has developed the fundamental notions of discrete copula modeling as well, where the information of a copula can be parameterized by odds ratios. Figure 2 shows an example consisting of two binary variables X and Y. The probability table on the left can be equivalently expressed using univariate marginals (i.e., p0·, p1·, p·0, p·1) and the odds ratio (i.e., copula) ω := (p00 p11)/(p01 p10), as shown in the middle of Figure 2.
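Theorem 1's objective can be checked numerically on a toy problem. The sketch below runs plain gradient descent on L(V) for two binary variables; all numbers are illustrative, and in practice the paper uses a cheaper approximate update rather than full optimization:

```python
import numpy as np

# p_est: an estimate with plausible dependence but wrong univariate marginals.
p_est = np.array([[0.40, 0.02],
                  [0.08, 0.50]])           # rows: X = 0/1, cols: Y = 0/1
p_tar = [np.array([0.30, 0.70]),           # target marginal of X
         np.array([0.60, 0.40])]           # target marginal of Y

V = np.zeros((2, 2))                       # V[i, c] for variables i in {X, Y}
for _ in range(10000):                     # gradient descent on the convex L(V)
    p_hat = p_est * np.exp(V[0])[:, None] * np.exp(V[1])[None, :]
    grad = np.stack([p_hat.sum(1) - p_tar[0],    # dL/dV[0, :]
                     p_hat.sum(0) - p_tar[1]])   # dL/dV[1, :]
    V -= 0.5 * grad

p_hat = p_est * np.exp(V[0])[:, None] * np.exp(V[1])[None, :]
print(p_hat.sum(1), p_hat.sum(0))          # marginals now match p_tar
```

At the optimum, the stationarity condition forces the marginals of p̂ to equal p_tar (and p̂ to normalize), while the odds ratio of p̂ stays equal to that of p_est at every iteration, since per-variable scaling factors cancel.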
Intuitively, ω = 125 indicates that the phrases "alpine skiing" and "scuba diving" are more likely than others (e.g., "alpine diving"), and the marginals decide which of the two phrases appears more frequently. The idea of representing the copula with odds ratios generalizes to the multivariate case and is presented in Appendix C. The following result demonstrates that, under its functional form in Equation (3), I-projecting p_est onto P_mar^{p_tar} only improves the univariate marginals and leaves the copula unchanged regardless of V.

Proposition 4. For a positive distribution p and any V ∈ R^{N×C}, the distribution q(x) ∝ p(x) ∏_i exp(V[i, x_i]) has the same copula as p.

In general, Proposition 4 holds because scaling factors (e.g., exp(V[i, x_i])) cancel in odds ratios. For example, in the 2×2 case in Figure 2, scaling the top row of the probability table by a would result in the odds ratio ω = (a p00 p11)/(a p01 p10) = (p00 p11)/(p01 p10), which is unchanged.

4.2 MODELING DEPENDENCE IN DISCRETE DIFFUSION MODELS

Recall from Section 3 that our goal is to capture inter-variable dependencies between the output variables at each denoising step (e.g., sampling x_t from q(X_t | x_{t+1})). Similar to the general case shown in Section 4.1, we first have a set of univariate marginals {p_dm(X^i_t | x_{t+1})}_i from the diffusion model. Notably, these univariate marginals are fairly accurate since, for both discrete-time and continuous-time diffusion models, if their respective training losses are minimized, the model recovers the true univariate marginals. This is formally justified in Appendix D. Alongside the univariate marginals, we assume access to a copula model that encodes a distribution over X_t. Following Section 4.1, combining the copula model's distribution with the univariate marginals from the diffusion model will lead to an improved estimate of q(X_t | x_{t+1}) (Prop. 2).
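Proposition 4's cancellation is easy to verify numerically. The 2×2 table below uses values consistent with Figure 2's example (p00 = 0.4, p01 = 0.02, ω = 125); the remaining two cells are inferred for illustration:

```python
# Odds-ratio copula of a 2x2 table, and its invariance to per-variable scalings.
p00, p01, p10, p11 = 0.40, 0.02, 0.08, 0.50
omega = (p00 * p11) / (p01 * p10)
print(omega)                 # approximately 125: a strong positive dependence

# Rescale the table entries with per-value factors on X (as exp(V[i, x_i]) would):
a, b = 3.0, 0.25
omega_scaled = ((a * p00) * (b * p11)) / ((a * p01) * (b * p10))
print(omega_scaled)          # unchanged: the factors cancel
```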
The performance of the augmented diffusion model hinges on two key questions: (i) how well can the copula model capture the inter-variable dependencies in q(X_t | x_{t+1}) (defined by the data distribution and the noising process); and (ii) given a good copula distribution, how can it be effectively combined with the univariate marginals obtained from the diffusion model, i.e., how do we solve Equation (4)?

²Equation (4) closely resembles the matrix scaling problem (Idel, 2016). See Appendix B for details.

5 AUTOREGRESSIVE MODELS AS COPULA MODELS

This section answers the two questions above tailored to the case where the copula model is an autoregressive model such as GPT (Radford et al., 2019) or a State Space Model (Dao & Gu, 2024). Specifically, Section 5.1 discusses how to approximate q(X_t | x_{t+1}) using an autoregressive model trained on the clean data distribution p(X_0) under certain noising processes. Section 5.2 explores the process of performing I-projection from the (autoregressive) copula distribution onto the set of distributions with univariate marginals {p_dm(X^i_t | x_{t+1})}_i. Finally, Section 5.3 summarizes the sampling procedure with a discrete diffusion model and an autoregressive copula model.

5.1 EXTRACTING COPULA DISTRIBUTIONS FROM AUTOREGRESSIVE MODELS

At step t, to sample x_t conditioned on x_{t+1}, we need a copula distribution p_copula(X_t) that closely approximates q(X_t | x_{t+1}). While this might suggest that the copula model should also be trained with a diffusion model objective, which brings us back to the problem of modeling inter-variable dependencies, we show that any model trained on the clean data distribution can serve as a copula model that indirectly approximates q(X_t | x_{t+1}) under the absorbing mask forward noising process. The absorbing mask noising process gradually converts data tokens in x_0 ∼ p(X_0) to a new category, the mask token ⟨mask⟩, through the sequence x_1, ..., x_T.
Specifically, each token in x_0 is independently converted to ⟨mask⟩ with probabilities 0 < α_1 < ... < α_T = 1 in x_1, ..., x_T, respectively. This is a widely used noising strategy for discrete diffusion models. Since this process only transforms data tokens into the mask token, it preserves the dependencies between the remaining unmasked tokens. Therefore, we can decompose q(X_t | x_{t+1}) as q(x_t | x_{t+1}) = Σ_{x̄_t} q(x̄_t | x_{t+1}) q(x_t | x̄_t, x_{t+1}), where q(x̄_t | x_{t+1}) intuitively captures the joint distribution of the values of all currently masked tokens, and q(x_t | x̄_t, x_{t+1}) captures only the choice of which currently masked tokens will actually be generated. Formally, define I as the set of variables i such that x^i_{t+1} = ⟨mask⟩ and J as its complement. The auxiliary distributions have the following form.

Proposition 5. Assume p(X_0) is the clean data distribution and {q(X_t | x_{t-1})}_{t=1}^{T} follows the absorbing mask noising process. Let α_t be the probability of conversion to the mask state from X^i_0 to X^i_t (∀i). Define X̄_t as a set of auxiliary variables such that

q(x̄_t | x_{t+1}) = p(X^I_0 = x̄^I_t | X^J_0 = x^J_{t+1}) · 1[x̄^J_t = x^J_{t+1}].   (5)

Then, the distribution q(X_t | x̄_t, x_{t+1}) factorizes as q(X_t | x̄_t, x_{t+1}) = ∏_i q(x^i_t | x̄^i_t, x^i_{t+1}). For i ∈ I, q(x^i_t | x̄^i_t, x^i_{t+1}) equals α_t/α_{t+1} if x^i_t = ⟨mask⟩ and equals 1 − α_t/α_{t+1} if x^i_t = x̄^i_t. For i ∈ J, q(x^i_t | x̄^i_t, x^i_{t+1}) = 1 if and only if x^i_t = x^i_{t+1}.

Since q(X_t | x̄_t, x_{t+1}) is fully factorized, the copula model only needs to account for inter-variable dependencies in q(X̄_t | x_{t+1}). Following Equation (5), we can transform p_copula(X_0), which estimates the clean data distribution, into p_copula(X̄_t | x_{t+1}), which approximates q(X̄_t | x_{t+1}), by conditioning it on the unmasked tokens in x_{t+1} (i.e., x^J_{t+1}). Specifically, for autoregressive copula models (i.e., p_copula(x) := ∏_i p_copula(x^i | x^{<i})), [...] q(X̄^i_t = c | x_{t+1}) ∝ q(X^i_t = c | x_{t+1}) for c ≠ ⟨mask⟩.
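The α_t/α_{t+1} retention probability in Proposition 5 can be sanity-checked by Monte Carlo, coupling each token's masking times across steps with a single uniform draw (schedule values illustrative):

```python
import random

random.seed(0)
alpha_t, alpha_t1 = 0.4, 0.8      # cumulative mask probabilities, step t < t+1

n_cond, n_masked_t = 0, 0
for _ in range(200000):
    u = random.random()           # one token's masking threshold: since masking
    m_t = u < alpha_t             # is monotone, it is masked at step s iff
    m_t1 = u < alpha_t1           # u < alpha_s
    if m_t1:                      # condition on being masked at step t+1
        n_cond += 1
        n_masked_t += m_t

print(n_masked_t / n_cond)        # close to alpha_t / alpha_t1 = 0.5
```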
As a result, for each i, the distribution p_dm(X̄^i_t | x_{t+1}) can be similarly obtained by renormalizing p_dm(X^i_t | x_{t+1}), which is directly obtained from the denoising model, to exclude the mask state.

5.2 APPROXIMATE I-PROJECTION WITH AUTOREGRESSIVE MODELS

Given univariate marginals {p_dm(X̄^i_t | x_{t+1})}_i and an autoregressive copula distribution p_copula(X̄_t | x_{t+1}), both of which estimate the target distribution q(X̄_t | x_{t+1}), our goal is to combine them following the I-projection procedure described in Section 4.1. Specifically, this involves solving the convex optimization problem in Equation (4), which specializes to the following:

L(V) := Σ_{x̄_t} p_copula(x̄_t | x_{t+1}) ∏_i exp(V[i, x̄^i_t]) − Σ_i Σ_{x̄^i_t=1}^{C} V[i, x̄^i_t] p_dm(x̄^i_t | x_{t+1}).   (7)

Following Theorem 1, if V minimizes Equation (7), then the distribution defined by p̂(x̄_t | x_{t+1}) = p_copula(x̄_t | x_{t+1}) ∏_i exp(V[i, x̄^i_t]) is the I-projection of p_copula(x̄_t | x_{t+1}) onto the set of distributions with the univariate marginals {p_dm(X̄^i_t | x_{t+1})}_i, which is the desired combined distribution.

Consider initializing all coefficients in V to zero, i.e., p̂(x̄_t | x_{t+1}) = p_copula(x̄_t | x_{t+1}). For each row i, if we only optimize the values V[i, :] and fix the rest to zero, the optimal coefficients are

∀c, V[i, c] = log p_dm(X̄^i_t = c | x_{t+1}) − log p_copula(X̄^i_t = c | x_{t+1}).   (8)

We approximate the solution to Equation (7) by applying the above update (Eq. (8)) to each row of V independently, as it strikes a proper balance between efficiency and empirical performance. While the first term on the right-hand side of Equation (8) can be acquired from the diffusion model, the second term is not directly accessible through the copula model. Plugging in the definition in Equation (6), the required marginal probabilities can be written as (for j ∈ J, p_copula(x̄^j_t | x_{t+1}) = 1 iff x̄^j_t = x^j_{t+1})

∀i ∈ I, p_copula(x̄^i_t | x_{t+1}) = p_copula(X^i = x̄^i_t | X^{K_i} = x^{K_i}_{t+1}), where K_i = {j : j ∈ J and j < i} (every variable j ∈ I satisfies x^j_{t+1} = ⟨mask⟩ and does not contribute to the distribution of X̄^i_t according to Proposition 5).
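A minimal numeric sketch of the row-wise update in Equation (8), with all rows updated at once from the zero initialization (the joint and the marginals below are illustrative stand-ins for the copula and diffusion models):

```python
import numpy as np

p_cop = np.array([[0.40, 0.02],
                  [0.08, 0.50]])        # "copula model" joint over two binary vars
p_dm = [np.array([0.30, 0.70]),         # "diffusion" marginals for variable 0
        np.array([0.60, 0.40])]         # and for variable 1

# Eq. (8) per row: V[i, c] = log p_dm(c) - log (copula's own marginal of c).
V = np.stack([np.log(p_dm[0]) - np.log(p_cop.sum(1)),
              np.log(p_dm[1]) - np.log(p_cop.sum(0))])

p_hat = p_cop * np.exp(V[0])[:, None] * np.exp(V[1])[None, :]
p_hat /= p_hat.sum()                    # renormalize after the one-shot update
print(p_hat.sum(1), p_hat.sum(0))       # closer to p_dm than p_cop's marginals were
```

One pass does not match the target marginals exactly (that would require iterating, as in the full I-projection), but it moves both marginals toward the diffusion model's predictions while keeping the copula of p_cop.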
Correspondingly, we update V following ∀i, c: V[i, c] = log p_dm(X̄^i_t = c | x_{t+1}) − log p_copula(X^i = c | X^{K_i} = x^{K_i}_{t+1}). Since p̂ is sampled autoregressively, this allows us to use techniques such as KV caching for autoregressive Transformers (Pope et al., 2023) to significantly reduce the computation cost introduced by the copula model. See Appendix E for details and the concrete algorithm.

6 EXPERIMENTS

We empirically validate the proposed method, Discrete Copula Diffusion (DCD), on language modeling tasks (Sec. 6.1 and 6.2) and antibody sequence infilling tasks (Sec. 6.3). For all tasks, we evaluate whether DCD can effectively reduce the number of diffusion steps while maintaining strong performance. Specifically, since DCD combines two pretrained models, a discrete diffusion model and an autoregressive copula model, we examine whether DCD outperforms each individual model.

6.1 UNCONDITIONAL TEXT GENERATION

We first compare the quality of unconditional samples generated by models trained on either WebText (Radford et al., 2019) or OpenWebText (Gokaslan & Cohen, 2019), which contain web content extracted from URLs shared on Reddit with a minimum number of upvotes. We adopt the medium-sized SEDD model (Lou et al., 2024) (SEDDM), since it is a SoTA discrete diffusion model for text generation. The GPT-2-small model (Radford et al., 2019) (GPT-2S) serves as the copula model. We generate samples of 128 tokens each. Following Han et al. (2023) and Dieleman et al. (2022), we evaluate sample quality using generative perplexity, i.e., the perplexity of the samples when evaluated with the GPT-2-large model. Since previous studies have observed that this metric can be affected by distribution annealing methods such as nucleus sampling, we always sample directly from the models. SEDDM is evaluated with 2 to 256 diffusion steps, and DCD (i.e., SEDDM with GPT-2S as the copula model) is run with diffusion steps ranging from 2 to 32.
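Generative perplexity is simply the exponentiated average negative log-likelihood that the evaluator assigns to the generated tokens; a sketch with hypothetical per-token probabilities standing in for GPT-2-large scores:

```python
import math

# Hypothetical evaluator probabilities for five generated tokens.
token_probs = [0.10, 0.25, 0.05, 0.40, 0.20]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(round(ppl, 2))   # equals the geometric mean of the inverse probabilities
```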
We adopt the log-linear noise schedule suggested by the SEDD paper. See Appendix G.1 for more details. For each configuration, we draw 10,000 samples and report the average perplexity in Figure 3. First, when fixing the number of denoising steps between 2 and 32, we observe that DCD outperforms both SEDDM with the same number of denoising steps and GPT-2S. This provides empirical validation of the effectiveness of the I-projection procedure for modeling inter-variable dependencies. Additionally, DCD with just 4 denoising steps achieves performance comparable to SEDDM with 128 steps, representing a 32x reduction in the number of denoising steps. This result not only demonstrates the efficiency of DCD but also underscores the importance of modeling inter-variable dependencies in discrete diffusion models, particularly in few-step generation settings. Finally, as shown in Figure 4, SEDD fails to generate fluent and meaningful sentences given only a few diffusion steps, as too many tokens have to be generated in each step. In contrast, by modeling the inter-variable dependencies, DCD generates smooth sentences with only 4 denoising steps.

[Figure 3 here; a line plot of generative perplexity against the number of denoising steps.]

Figure 3: Generative perplexity (↓) with different numbers of denoising steps.

[Figure 4 here; example generations:]

- "He added the United States should continue double-in-channel media discussions , but stressed the importance of an agreement based on the purpose of the dialogue. Putin said Moscow had envisaged sending navy ships from"
- "Among the dozens of layoffs Detroit inflicted last week in September fell to layoffs of 243,000 workers, or just 7 percent of the city s 3.2 million population.."
- "interesting is that the A+N start using enforcope thewhich Cookbook starts using in made ay antimidesis stuff (the grow and judges 7 And age goods"
- "Singh, who served as chief minister in charge and prime minister in charge of the UK, had asked Russian PM to attend the Iceland meeting of 2005."
Figure 4: Generated text from SEDDM and DCD with different numbers of steps; the four panels correspond to SEDD (4 steps), SEDD (256 steps), DCD (4 steps), and DCD (16 steps). See Appendix I for more.

Table 1: Evaluation of text infilling performance using the MAUVE score (↑) with 5 prompt masks. Scores of DCD are all better than (i) SEDDM with the same number of denoising steps, and (ii) GPT-2S.

| Prompt ranges (remainder is masked) | SSD-LM 100 | SSD-LM 500 | GPT-2S | SEDDM 2 | SEDDM 4 | SEDDM 8 | SEDDM 16 | SEDDM 32 | DCD 2 | DCD 4 | DCD 8 | DCD 16 | DCD 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [0.1,0.2] & [0.5,0.7] | 0.057 | 0.083 | 0.079 | 0.013 | 0.051 | 0.122 | 0.152 | 0.201 | 0.158 | 0.187 | 0.185 | 0.195 | 0.211 |
| [0.25,0.75] | 0.072 | 0.108 | 0.188 | 0.027 | 0.110 | 0.226 | 0.237 | 0.278 | 0.249 | 0.251 | 0.257 | 0.314 | 0.298 |
| [0.0,0.1] & [0.4,0.6] & [0.9,1.0] | 0.333 | 0.681 | 0.928 | 0.827 | 0.940 | 0.972 | 0.980 | 0.979 | 0.962 | 0.976 | 0.979 | 0.982 | 0.983 |
| [0.4,0.5] & [0.8,1.0] | 0.436 | 0.565 | 0.914 | 0.896 | 0.944 | 0.978 | 0.978 | 0.980 | 0.963 | 0.975 | 0.975 | 0.976 | 0.981 |
| [0.2,0.3] & [0.6,0.8] | 0.041 | 0.054 | 0.069 | 0.016 | 0.056 | 0.128 | 0.207 | 0.215 | 0.171 | 0.178 | 0.215 | 0.217 | 0.403 |

Efficiency. We compare the sampling time and the generative perplexity of DCD against competitive baselines in Figure 5. We additionally adopt another recent discrete diffusion baseline, MDLM (Sahoo et al., 2024). We adopt the autoregressive version of DCD as described in Section 5.3 and Appendix E. Compared to the baselines, DCD consistently achieves better generative perplexity given a fixed runtime constraint. It also requires less time to reach a desired perplexity value. We defer a comprehensive study of DCD's efficiency to Appendix F.

6.2 CONDITIONAL TEXT GENERATION

We now move on to conditional text generation, where certain tokens are provided in advance, and the task is to generate the remaining tokens. As shown in the first column of Table 1, we use five mask strategies, where tokens in specific prompt ranges are given (we use a sequence length of 128). We adopt the MAUVE score (Pillutla et al., 2021) with the default settings to compare the difference between the generated and original texts.
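One plausible reading of the prompt-range notation in Table 1 (an assumption on my part, not spelled out in the text): each interval gives fractional positions of the length-128 sequence whose tokens are kept as the prompt, and everything else is masked for infilling:

```python
L = 128                              # sequence length used in Section 6.2
ranges = [(0.1, 0.2), (0.5, 0.7)]    # first row of Table 1 (hypothetical reading)

# keep[i] is True iff token i falls inside one of the prompt ranges.
keep = [any(lo * L <= i < hi * L for lo, hi in ranges) for i in range(L)]
print(sum(keep), L - sum(keep))      # prompt tokens kept vs. tokens to infill
```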
See Appendix G.2 for further details. For all methods, we use the same set of 2,000 text sequences from the validation set of WikiText-103 (Merity et al., 2022). After applying the prompt mask, we generate 5 samples for each prompt, resulting in a total of 10,000 samples. In addition to SEDDM and GPT-2S, we compare against SSD-LM (Han et al., 2023), which is a semi-autoregressive diffusion model designed for text infilling. We adopt the autoregressive unmasking variant of DCD described in the last paragraph of Section 5.3. Results are presented in Table 1. First, DCD outperforms all three baselines in all five tasks. Additionally, when fixing the number of denoising steps between 2 and 32, DCD surpasses both of its base models. Notably, while both GPT-2S and the 2-step SEDDM perform poorly on the first, second, and fifth tasks, combining them in a principled way allows DCD to achieve significantly better performance using only two denoising steps.

6.3 ANTIBODY SEQUENCE INFILLING

We consider the task of unguided antibody infilling, where certain complementarity-determining regions (CDRs) of antibodies (i.e., sequences of amino acids) are missing and must be generated by the model. We adopt NOS-D (Gruver et al., 2023), which is a discrete diffusion model trained on 104K antibody sequences from the Observed Antibody Space dataset (Ruffolo et al., 2023). We further train a GPT model on the same dataset as the copula model.

[Figure 5 here; a plot of generative perplexity against runtime (s/sample).]

Figure 5: Sampling time vs. generative perplexity (the autoregressive version of DCD is used).

Figure 6: Antibody sequence infilling performance measured by sequence recovery rate (↑). We compare DCD against its two base models in two tasks, where amino acids at different locations are masked. DCD outperforms both baselines with only 4 denoising steps.

| Method | # steps | HCDR{1+2+3} | {H+L}CDR1 |
|---|---|---|---|
| GPT | N/A | 57.21 | 90.28 |
| NOS-D | 64 | 51.56 | 88.82 |
| DCD | 4 | 58.28 | 91.58 |
See Appendix G.3 for training details. We follow Gruver et al. (2023) and select the same 10 antibody seed sequences from paired OAS (Olsen et al., 2022). We consider two infilling tasks: (i) three CDRs {HCDR1, HCDR2, HCDR3} are masked, and (ii) two CDRs {HCDR1, LCDR1} are masked. We follow the original paper and run 64 diffusion steps for NOS-D. For DCD (i.e., combining NOS-D with the trained GPT model as the copula model), we use 4 denoising steps. We measure the sequence recovery rate, i.e., the accuracy of the infilled sequences given the ground-truth sequence. As shown in Figure 6, by combining the univariate marginals from NOS-D and the dependencies captured by the GPT model, DCD also performs well in antibody sequence infilling tasks.

7 RELATED WORK AND CONCLUSION

Diffusion models have been widely applied to model discrete data such as text and DNA sequences. Encouraged by the successes of continuous diffusion models (e.g., Ho et al. (2020); Song et al. (2020)), initial attempts converted discrete data into continuous embeddings with either predefined or learned mappings, enabling the use of continuous diffusion models for discrete data (Chen et al., 2022; Dieleman et al., 2022; Li et al., 2022; Lovelace et al., 2023). However, due to the need for ad-hoc mappings between the discrete data space and the continuous embedding space, which have to be predefined or pretrained, continuous diffusion models are not as effective for modeling discrete distributions (Strudel et al., 2022; Li et al., 2022; Dieleman et al., 2022). Austin et al. (2021) proposed the first diffusion model designed directly for discrete data.
Later works further improved discrete diffusion models from various aspects, such as better loss functions and learning objectives (Campbell et al., 2022; Meng et al., 2022; Lou et al., 2024; Benton et al., 2022), better model architectures (Sun et al., 2022), better sampling algorithms (Chen et al., 2023), and unifying and scaling up existing techniques (Sahoo et al., 2024; Shi et al., 2024). Despite the recent breakthroughs of discrete diffusion models, few papers address the challenge of sampling in a few denoising steps. Some works attribute the failure to perform high-quality few-step generation to a scaling problem of the model. However, we show that the fundamental problem lies in the assumption made by discrete diffusion models that each variable is denoised independently at each step. In addition to identifying this problem, we propose a general solution, Discrete Copula Diffusion, which combines a discrete diffusion model with a copula model at inference time to obtain a better estimate of the denoising distribution at each step. Concurrently, Guo et al. (2024) show that energy-based models can also be used as copula models to capture inter-variable dependencies.

There are a few limitations of DCD. First, in addition to a discrete diffusion model, it requires another copula model, which may require additional training for certain applications. Second, although the I-projected distribution is guaranteed to be a better estimate of the target distribution, the I-projection step often needs to be approximated in practice. Finally, although DCD requires fewer denoising steps, the computation cost of each step is higher than in discrete diffusion models, so DCD may not always provide a notable speedup. Nevertheless, DCD highlights the inter-dependency modeling problem and opens up the possibility of combining different types of generative models for better overall performance.
ACKNOWLEDGEMENTS

This work was funded in part by the DARPA ANSR program under award FA8750-23-2-0004, the DARPA CODORD program under award HR00112590089, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC 2075, 390740016), NSF grant #IIS-1943641, and gifts from Adobe Research and Amazon. We acknowledge the support of the Stuttgart Center for Simulation Science (SimTech). MN thanks IMPRS-IS (International Max Planck Research School for Intelligent Systems) for the support.

REFERENCES

Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 890–901. IEEE, 2017.

Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.

Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. From denoising diffusions to denoising Markov models. arXiv preprint arXiv:2211.03595, 2022.

Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 28266–28279, 2022.

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. arXiv preprint arXiv:2407.01392, 2024.

Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. In The Eleventh International Conference on Learning Representations, 2022.

Zixiang Chen, Huizhuo Yuan, Yongqian Li, Yiwen Kou, Junkai Zhang, and Quanquan Gu.
Fast sampling via de-randomization for discrete diffusion models. arXiv preprint arXiv:2312.09193, 2023.

Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning (ICML), 2024.

Jacob Devlin. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 8780–8794, 2021.

Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. arXiv preprint arXiv:2211.15089, 2022.

Gery Geenens. Copula modeling for discrete random vectors. Dependence Modeling, 8(1):417–440, 2020.

Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

Nate Gruver, Samuel Stanton, Nathan Frey, Tim G. J. Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, and Andrew Gordon Wilson. Protein design with guided discrete diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 12489–12517, 2023.

Wei Guo, Yuchen Zhu, Molei Tao, and Yongxin Chen. Plug-and-play controllable generation for discrete masked models. arXiv preprint arXiv:2410.02143, 2024.

Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. SSD-LM: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11575–11596, 2023.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.
Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Conditional diffusion based on discrete graph structures for molecular graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 4302–4311, 2023.

Martin Idel. A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps. arXiv preprint arXiv:1609.06349, 2016.

Bahman Kalantari and Leonid Khachiyan. On the rate of convergence of deterministic and randomized RAS matrix scaling algorithms. Operations Research Letters, 14(5):237–244, 1993.

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Advances in Neural Information Processing Systems, 2022.

Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2024.

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q. Weinberger. Latent diffusion for language generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 56998–57025, 2023.

Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems, 35:34532–34545, 2022.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2022.

Alex Morehead, Jeffrey Ruffolo, Aadyot Bhatnagar, and Ali Madani. Towards joint sequence-structure generation of nucleic acid and protein complexes with SE(3)-discrete diffusion. arXiv preprint arXiv:2401.06151, 2023.

Roger B. Nelsen. An Introduction to Copulas, 2006.

Tobias H. Olsen, Fergus Boyles, and Charlotte M. Deane.
Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science, 31(1):141–146, 2022.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 4816–4828, 2021.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Tamás Rudas. Lectures on Categorical Data Analysis. Springer, 2018.

Jeffrey A. Ruffolo, Lee-Shin Chu, Sai Pooja Mahajan, and Jeffrey J. Gray. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nature Communications, 14(1):2389, 2023.

Ludger Rüschendorf. Convergence of the iterative proportional fitting procedure. The Annals of Statistics, pp. 1160–1174, 1995.

Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. arXiv preprint arXiv:2406.07524, 2024.

Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. Simplified and generalized masked diffusion for discrete data. In Advances in Neural Information Processing Systems, 2024.

Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.

M. Sklar.
Fonctions de répartition à n dimensions et leurs marges. In Annales de l'ISUP, volume 8, pp. 229–231, 1959.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pp. 32211–32252. PMLR, 2023.

Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.

Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2022.

Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. DiGress: Discrete denoising diffusion for graph generation. In The Eleventh International Conference on Learning Representations, 2022.

Tong Wu, Zhihao Fan, Xiao Liu, Hai-Tao Zheng, Yeyun Gong, Jian Jiao, Juntao Li, Jian Guo, Nan Duan, Weizhu Chen, et al. AR-Diffusion: Auto-regressive diffusion model for text generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Lingxiao Zhao, Xueying Ding, Lijun Yu, and Leman Akoglu. Improving and unifying discrete & continuous-time discrete denoising diffusion. arXiv preprint arXiv:2402.03701, 2024.

Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation.
In International Conference on Machine Learning, 2024.

A PROOF OF THE THEORETICAL RESULTS

Proof of Proposition 1. Following Ho et al. (2020) and Sohl-Dickstein et al. (2015), the negative ELBO L can be decomposed as follows:

L = \mathbb{E}_q\Big[-\log p(x_T) - \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\Big]
  = \mathbb{E}_q\Big[-\log p(x_T) - \sum_{t=1}^{T} \log \Big(\frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t)} \cdot \frac{q(x_{t-1})}{q(x_t)}\Big)\Big]
  = \mathbb{E}_q\Big[-\log \frac{p(x_T)}{q(x_T)} - \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t)} - \log p(x_0)\Big]
  = D_{\mathrm{KL}}(q(x_T) \,\|\, p(x_T)) + \mathbb{E}_q\Big[\sum_{t=1}^{T} D_{\mathrm{KL}}(q(x_{t-1}|x_t) \,\|\, p_\theta(x_{t-1}|x_t))\Big] + H(X_0). (11)

The first term equals 0 since we assume the noise distribution p(X_T) is consistent between the noising and the denoising processes. Given the independent denoising assumption, when the denoising distributions are optimal, we have for all t ∈ {1, ..., T}: p_θ(x_{t-1}|x_t) = ∏_i q(x^i_{t-1}|x_t). Plugging this into Equation (11) and using the definition of total correlation, we have:

L = D_{\mathrm{KL}}(q(x_T) \,\|\, p(x_T)) + \mathbb{E}_q\Big[\sum_{t=1}^{T} D_{\mathrm{KL}}\Big(q(x_{t-1}|x_t) \,\Big\|\, \prod_i q(x^i_{t-1}|x_t)\Big)\Big] + H(p(X_0))
  = \sum_{t=1}^{T} \mathbb{E}_q\big[D_{\mathrm{TC}}(q(X_{t-1}|X_t))\big] + H(p(X_0)).

Proof of Proposition 2. By the Pythagorean inequality for information projections: if p̂ is the I-projection of p_est onto P^{p_tar}_{mar}, and P^{p_tar}_{mar} is convex (this can be shown by applying the definition of a convex set), the following holds for any p ∈ P^{p_tar}_{mar}:

D_{\mathrm{KL}}(p \,\|\, p_{\mathrm{est}}) \geq D_{\mathrm{KL}}(p \,\|\, \hat{p}) + D_{\mathrm{KL}}(\hat{p} \,\|\, p_{\mathrm{est}}). (12)

Choosing p = p_tar, we have

D_{\mathrm{KL}}(p_{\mathrm{tar}} \,\|\, \hat{p}) \leq D_{\mathrm{KL}}(p_{\mathrm{tar}} \,\|\, p_{\mathrm{est}}) - D_{\mathrm{KL}}(\hat{p} \,\|\, p_{\mathrm{est}}) < D_{\mathrm{KL}}(p_{\mathrm{tar}} \,\|\, p_{\mathrm{est}}),

where the last inequality holds since D_KL(p̂ ‖ p_est) > 0 if the sets of univariate marginals of p_est and p_tar are different (as assumed in the proposition).

Proof of Proposition 3. Following the definition of p̂, we write down the constrained optimization problem as follows:

\text{minimize}_{p'} \; D_{\mathrm{KL}}(p' \,\|\, p_{\mathrm{est}}) \quad \text{s.t.} \quad \forall i \in \{1, \ldots, N\}, \; \forall x_i \in \{1, \ldots, C\}: \; \sum_{x_{\setminus i}} p'(x_{\setminus i}, x_i) = p_{\mathrm{tar}}(x_i).

To incorporate the constraints, we use the method of Lagrange multipliers.
The Lagrangian for this problem is

L(p', \{\lambda_i\}_{i=1}^N) = \sum_x p'(x) \log \frac{p'(x)}{p_{\mathrm{est}}(x)} + \sum_{i=1}^{N} \sum_{x_i=1}^{C} \lambda_i(x_i) \Big(\sum_{x_{\setminus i}} p'(x_{\setminus i}, x_i) - p_{\mathrm{tar}}(x_i)\Big),

where the Lagrange multipliers {λ_i}_{i=1}^N enforce the univariate marginal constraints. To minimize the Lagrangian with respect to p'(x), we take the partial derivative of L(p', {λ_i}_{i=1}^N) with respect to p'(x) and set it to 0:

\frac{\partial L(p', \{\lambda_i\}_{i=1}^N)}{\partial p'(x)} = \log \frac{p'(x)}{p_{\mathrm{est}}(x)} + 1 + \sum_i \lambda_i(x_i) = 0.

Simplifying this equation gives

p'(x) = p_{\mathrm{est}}(x) \exp\Big(-1 - \sum_i \lambda_i(x_i)\Big).

Defining σ_i(x_i) := exp(−λ_i(x_i) − 1/N) gives p'(x) = p_est(x) ∏_i σ_i(x_i). Existence of the solution follows from the facts that (i) the objective function is convex and bounded (since probability values are in [0, 1]), and (ii) the set of constraints is feasible (e.g., p'(x) = ∏_i p_tar(x_i) or p'(x) = p_tar(x)).

Proof of Theorem 1. We show that for any V that minimizes the objective function L(V; p_tar, p_est), the corresponding p' defined by p'(x) = p_est(x) ∏_i exp(V[i, x_i]) belongs to the set P^{p_tar}_{mar}. Specifically, for any V that minimizes the objective, the partial derivative of L(V; p_tar, p_est) with respect to any V[i, x_i] should be 0:

\frac{\partial L(V; p_{\mathrm{tar}}, p_{\mathrm{est}})}{\partial V[i, x_i]} = \exp(V[i, x_i]) \sum_{x_{\setminus i}} p_{\mathrm{est}}(x_{\setminus i}, x_i) \prod_{j \neq i} \exp(V[j, x_j]) - p_{\mathrm{tar}}(x_i) = 0.

Plugging in the definition of p', we have

\sum_{x_{\setminus i}} p'(x_{\setminus i}, x_i) - p_{\mathrm{tar}}(x_i) = p'(x_i) - p_{\mathrm{tar}}(x_i) = 0. (13)

Since Equation (13) holds for all (i, x_i) pairs, every minimizer of L(V; p_tar, p_est) corresponds to a distribution p' in P^{p_tar}_{mar}. Since L(V; p_tar, p_est) is convex, we can also argue the converse: if a distribution p' of the above-defined form belongs to P^{p_tar}_{mar}, then the corresponding V is a minimizer of L(V; p_tar, p_est). According to Proposition 3, the solution to the following I-projection exists and has the same form as p':

\hat{p} = \arg\min_{p \in P^{p_{\mathrm{tar}}}_{\mathrm{mar}}} D_{\mathrm{KL}}(p \,\|\, p_{\mathrm{est}}).

Since p̂ has the same form as p' (by Prop. 3) and belongs to P^{p_tar}_{mar}, it is a minimizer of L(V; p_tar, p_est).

Proof of Proposition 4.
The copula of p is shown to be invariant under rescalings of the form q(x) ∝ p(x) ∏_i exp(V[i, x_i]) for any V ∈ R^{N×C} by using the parameterization of a discrete copula by conditional odds ratios (Definition 2). The scaling factors cancel in the ratios, as shown, e.g., by Rudas (2018, Theorem 12.3).

Proof of Proposition 5. We start by writing the probability q(x_t|x_{t+1}) using Bayes' rule:

q(x_t|x_{t+1}) = \frac{q(x_{t+1}|x_t)\, q(x_t)}{q(x_{t+1})} = \frac{1}{q(x_{t+1})} \sum_{x_0} q(x_{t+1}|x_t)\, q(x_t|x_0)\, p(x_0), (14)

where the last equality follows from q(x_t) = Σ_{x_0} q(x_t|x_0) p(x_0). Recall from the proposition that I is defined as the set of variables i such that x^i_{t+1} = m (the mask token) and J is its complement. First, we must have x^j_t = x^j_{t+1} for j ∈ J, since for any other value of X^j_t we have q(x_{t+1}|x_t) = 0 in Equation (14); as a result, q(x_t|x_{t+1}) is also zero.

We then move our attention to the variables in I. We first consider the probability q(X^i_t = m | x_{t+1}) for any i ∈ I. Following Equation (14), we have

q(X^i_t = m \,|\, x_{t+1}) = \frac{q(X^i_{t+1} = m \,|\, X^i_t = m)\, q(X^i_t = m)}{q(X^i_{t+1} = m)} = \frac{q(X^i_t = m)}{q(X^i_{t+1} = m)} = \frac{\alpha_t}{\alpha_{t+1}}, (15)

where the second equality uses the fact that the mask state is absorbing.

We then focus on X^I_t = x̃^I_t, where none of the values in x̃^I_t is m. Note that we also need X^J_t = x^J_{t+1}. Then

q(x_t|x_{t+1}) \propto \sum_{x_0} q(x_{t+1}|x_t)\, q(x_t|x_0)\, p(x_0) \overset{(a)}{\propto} q(x_{t+1}|x_t)\, p(X_0 = x_t) \propto \Big(\frac{\alpha_{t+1}-\alpha_t}{1-\alpha_t}\Big)^{|I|} p(X_0 = x_t) \propto p(X_0 = x_t), (16)

where p(X_0) is the data distribution; (a) follows from the fact that no value in x_t is m, hence x_0 = x_t; and (α_{t+1} − α_t)/(1 − α_t) is the probability of a token transitioning into the mask state from time t to time t+1.

Denote X̃_t as a set of variables with the same configuration and semantics as X_t, with the only difference that the mask category m is excluded. By following Equation (16) and applying normalization, we conclude that

q(\tilde{x}_t|x_{t+1}) = p(X^I_0 = \tilde{x}^I_t \,|\, X^J_0 = x^J_{t+1})\, \mathbb{1}[\tilde{x}^J_t = x^J_{t+1}]. (17)

This matches the definition in Equation (5).
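The conclusions of Proposition 5 can be checked by exact enumeration on a toy absorbing-mask process (a hedged sketch with hypothetical names; two binary tokens, mask probabilities α_t = 0.3 and α_{t+1} = 0.7): masked positions stay masked with probability α_t/α_{t+1}, and the unmasked alternatives follow the data posterior given the observed tokens, as in Equation (17).

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# toy data distribution over two binary tokens, with dependence
p0 = rng.random((2, 2)); p0 /= p0.sum()
a_t, a_t1 = 0.3, 0.7   # alpha_t, alpha_{t+1}: P(token is masked by time t)
M = 2                  # mask symbol

# enumerate the coupled forward process exactly
joint = {}             # (x_t, x_{t+1}) -> probability
for x0 in itertools.product(range(2), repeat=2):
    for s in itertools.product(range(3), repeat=2):
        # per-token state: 0 = unmasked at both times, 1 = masked between
        # t and t+1, 2 = already masked at t (mask state is absorbing)
        w = p0[x0]
        for si in s:
            w *= (1 - a_t1, a_t1 - a_t, a_t)[si]
        xt  = tuple(M if si == 2 else x0[i] for i, si in enumerate(s))
        xt1 = tuple(M if si >= 1 else x0[i] for i, si in enumerate(s))
        joint[(xt, xt1)] = joint.get((xt, xt1), 0.0) + w

def cond(xt1):
    """q(x_t | x_{t+1}) by normalizing the enumerated joint."""
    probs = {xt: w for (xt, y), w in joint.items() if y == xt1}
    z = sum(probs.values())
    return {xt: w / z for xt, w in probs.items()}

# condition on x_{t+1} = (M, 0): token 1 masked, token 2 observed as 0
c = cond((M, 0))
p_mask = sum(w for xt, w in c.items() if xt[0] == M)
assert np.isclose(p_mask, a_t / a_t1)            # Equation (15)
# the unmasked alternative follows the data posterior (Equation (17))
p_val = {v: sum(w for xt, w in c.items() if xt[0] == v) for v in (0, 1)}
post = p0[:, 0] / p0[:, 0].sum()                 # p(X0^1 | X0^2 = 0)
assert np.isclose(p_val[0] / p_val[1], post[0] / post[1])
```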
Finally, we verify the correctness of the distribution q(X_t | x̃_t, x_{t+1}) defined in the proposition by verifying the following for any x_t:

q(x_t|x_{t+1}) = \sum_{\tilde{x}_t} q(\tilde{x}_t|x_{t+1})\, q(x_t|\tilde{x}_t, x_{t+1}). (18)

Denote K as the set of variables i such that x^i_t = m and L as its complement. First, if L ⊉ J (i.e., I ⊉ K), then both the left-hand side (LHS) and the right-hand side (RHS) are zero. Specifically, the RHS is zero since, according to the definition, for any i ∈ J ∩ K we have q(x^i_t | x̃^i_t, x^i_{t+1}) = 0. Next, if K ⊆ I, we can decompose q(x_t|x_{t+1}) as follows:

q(x_t|x_{t+1}) = q(x^{I \setminus K}_t|x_{t+1}) \prod_{i \in K} q(x^i_t|x_{t+1}) \prod_{j \in J} q(x^j_t|x_{t+1}). (19)

For any j ∈ J, if x^j_t ≠ x^j_{t+1}, then both the LHS and the RHS of Equation (18) are zero. Otherwise, we always have q(x^j_t|x_{t+1}) = 1. Therefore, Equation (19) can be further simplified as

q(x_t|x_{t+1}) = q(x^{I \setminus K}_t|x_{t+1}) \prod_{i \in K} q(x^i_t|x_{t+1}). (20)

We then proceed to simplify the RHS of Equation (18):

\sum_{\tilde{x}_t} q(\tilde{x}_t|x_{t+1})\, q(x_t|\tilde{x}_t, x_{t+1})
= \sum_{\tilde{x}^K_t} q(\tilde{x}^K_t, x^{I \setminus K}_t|x_{t+1}) \Big(\frac{\alpha_t}{\alpha_{t+1}}\Big)^{|K|} \Big(1 - \frac{\alpha_t}{\alpha_{t+1}}\Big)^{|I \setminus K|}
\overset{(a)}{=} \sum_{\tilde{x}^K_t} q(\tilde{x}^K_t, x^{I \setminus K}_t|x_{t+1}) \Big(1 - \frac{\alpha_t}{\alpha_{t+1}}\Big)^{|I \setminus K|} \prod_{i \in K} q(x^i_t|x_{t+1})
= q(\tilde{X}^{I \setminus K}_t = x^{I \setminus K}_t|x_{t+1}) \Big(1 - \frac{\alpha_t}{\alpha_{t+1}}\Big)^{|I \setminus K|} \prod_{i \in K} q(x^i_t|x_{t+1})
\overset{(b)}{\propto} p(X^{I \setminus K}_0 = x^{I \setminus K}_t \,|\, X^J_0 = x^J_{t+1}) \prod_{i \in K} q(x^i_t|x_{t+1})
\overset{(c)}{\propto} q(X^{I \setminus K}_t = x^{I \setminus K}_t|x_{t+1}) \prod_{i \in K} q(x^i_t|x_{t+1}), (21)

where (a) follows from Equation (15), (b) applies the definition in Equation (17), and (c) is a result of applying Equation (16) to the case where x^L_t = {x^{I\K}_t, x^J_t} are not m. By combining Equations (21) and (20), we conclude that the LHS and the RHS of Equation (18) are proportional to each other. Since they are both properly normalized distributions, they must also match exactly.

Proof of Proposition 6. We first state a more detailed version of the proposition: for each variable i and data category c (c ≠ m), we have

q(\tilde{X}^i_t = c|x_{t+1}) = \frac{1}{Z}\, q(X^i_t = c|x_{t+1}), \quad \text{where } Z = \sum_{c' \neq m} q(X^i_t = c'|x_{t+1}).

According to the proof of Proposition 5, Equation (18) holds for all x_t.
Therefore, we have that for each i and each data category x^i_t ≠ m,

q(x^i_t|x_{t+1}) = \sum_{\tilde{x}_t} q(\tilde{x}_t|x_{t+1})\, q(x^i_t|\tilde{x}_t, x_{t+1}). (22)

If i ∈ J, then both the LHS of the above equation and q(x^i_t|x̃_t, x_{t+1}) equal one if and only if x^i_t = x^i_{t+1}. Therefore, the result holds trivially. Next, if i ∈ I, denote I\i := I \ {i}; Equation (22) simplifies to

q(x^i_t|x_{t+1}) = \sum_{\tilde{x}_t} q(\tilde{x}_t|x_{t+1})\, q(x^i_t|\tilde{x}^i_t, x^i_{t+1})
= \sum_{\tilde{x}^{I \setminus i}_t} q(\tilde{X}^i_t = x^i_t, \tilde{x}^{I \setminus i}_t|x_{t+1})\, q(x^i_t|\tilde{X}^i_t = x^i_t, x^i_{t+1})
= q(\tilde{X}^i_t = x^i_t|x_{t+1})\, q(x^i_t|\tilde{X}^i_t = x^i_t, x^i_{t+1})
= q(\tilde{X}^i_t = x^i_t|x_{t+1})\, \frac{\alpha_{t+1} - \alpha_t}{\alpha_{t+1}}.

Therefore, we have

q(\tilde{X}^i_t = x^i_t|x_{t+1}) = \frac{1}{Z}\, q(X^i_t = x^i_t|x_{t+1}), \quad \text{where } Z = \sum_{x'^i_t \neq m} q(X^i_t = x'^i_t|x_{t+1}).

B RELATION BETWEEN L(V; p_tar, p_est) AND MATRIX SCALING

The matrix scaling problem takes a matrix A as input and asks for diagonal scaling matrices X and Y such that XAY is doubly stochastic (its row and column sums are all one). More generally, target row and column sum vectors r and c are provided and need not contain only ones. The solvability of this problem for positive matrices was established by Sinkhorn (1964), and its algorithms (sometimes called iterative proportional fitting), generalizations, and numerous applications have been studied thoroughly (Kalantari & Khachiyan, 1993; Rüschendorf, 1995; Allen-Zhu et al., 2017); see Idel (2016) for a review. Taking the multidimensional generalization of the problem and interpreting the tensor as an (unnormalized) probability distribution yields the connection to our problem, with the target sums being the univariate marginal distributions.

C PARAMETERIZING DISCRETE COPULAS BY ODDS RATIOS

We start by formally defining odds ratios.

Definition 2 (Rudas (2018)). Let p be a distribution over variables X, each taking values in {0, 1}.
For a partition of X into sets A and B, the conditional odds ratio of the variables in A conditioned on the assignment B = b is

\mathrm{COR}_p(A|B = b) = \frac{\prod_{a \in s} p(a, b)}{\prod_{a \in d} p(a, b)},

where s is the set of assignments to A whose parity is the same as that of the number of variables in A, and d is the set of assignments whose parity is different.

In the case of more than two categories per variable, COR_p(A|B = b) can be generalized further to a set of similarly defined ratios (see, e.g., Rudas (2018)). Together, the set of all conditional odds ratios COR_p(A|B = b) for partitions of X into sets A and B with |A| ≥ 2 completely specifies the association among the variables in the joint distribution p, as established by the following theorem.

Theorem 2 (Rudas (2018)). Let q and r be positive probability distributions on the set of variables X, each taking values in {0, 1, ..., k}. Then there exists a unique probability distribution p such that p has the same univariate marginal distributions as r, that is, for all i, p(x_i) = r(x_i), and p has the same copula as q, that is, for all partitions of X into sets A and B with |A| ≥ 2, COR_p(A|B = b) = COR_q(A|B = b).

Proof. This follows from (Rudas, 2018, Theorem 10.2) by taking the descending set to contain the empty set and all singletons (and the ascending set, its complement).

Theorem 2 shows how any distribution p can be viewed as combining independent marginal distributions (i.e., from r) and odds ratios (i.e., from q). Such a combination has desirable properties. For example, in the case of two variables with possibly many categories, it has been shown that among all distributions with the same margins as r, the distribution p minimizes the KL-divergence to q (Geenens, 2020, Theorem 6.2), i.e., p is the information projection of q onto the set of distributions with the margins of r.
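Theorem 2's decomposition can be illustrated numerically for two binary variables (a minimal sketch; the function names are ours): iterative proportional fitting, i.e., the Sinkhorn-style scaling of Appendix B, produces a distribution that carries the margins of r while preserving the odds ratio (copula) of q, since diagonal scaling factors cancel in the odds ratio.

```python
import numpy as np

def odds_ratio(p):
    """Conditional odds ratio of a 2x2 joint: p00*p11 / (p01*p10)."""
    return p[0, 0] * p[1, 1] / (p[0, 1] * p[1, 0])

def ipf(q, row_marg, col_marg, n_iter=500):
    """Scale q to the target margins via iterative proportional fitting."""
    p = q.copy()
    for _ in range(n_iter):
        p *= (row_marg / p.sum(axis=1))[:, None]  # match row margins
        p *= (col_marg / p.sum(axis=0))[None, :]  # match column margins
    return p

rng = np.random.default_rng(0)
q = rng.random((2, 2)); q /= q.sum()   # provides the copula (odds ratio)
r = rng.random((2, 2)); r /= r.sum()   # provides the margins
p = ipf(q, r.sum(axis=1), r.sum(axis=0))

assert np.allclose(p.sum(axis=1), r.sum(axis=1))   # margins of r
assert np.isclose(odds_ratio(p), odds_ratio(q))    # odds ratio of q
```

By the Geenens result quoted above, the fixed point of this scaling is also the information projection of q onto the set of distributions with the margins of r.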
D UNBIASED UNIVARIATE MARGINALS FROM DISCRETE DIFFUSION MODELS

In this section, we show that when their respective training losses are minimized, discrete-time and continuous-time discrete diffusion models recover the true univariate marginals.

Discrete-Time Diffusion Models. Discrete-time diffusion models (Austin et al., 2021) are trained to maximize the ELBO between the forward joint distribution p(x_0) q(x_{1:T}|x_0), where p(x_0) is the data distribution, and the reverse joint distribution p_θ(x_{0:T}). The ELBO can be simplified to

\mathbb{E}_q\Big[\log \frac{p(x_T)}{q(x_T)} + \sum_{t=1}^{T} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_{t-1}|x_t)} + \log p(x_0)\Big].

Assuming p_θ(x_{t-1}|x_t) encodes a fully factorized distribution, the above objective can be rewritten (up to additive terms independent of p_θ) as

\mathbb{E}_q\Big[\sum_{t=1}^{T} \sum_i \sum_{x^i_{t-1}} q(x^i_{t-1}|x_t) \log \frac{p_\theta(x^i_{t-1}|x_t)}{q(x^i_{t-1}|x_t)}\Big] + \mathbb{E}_q\Big[\log \frac{p(x_T)}{q(x_T)} + \log p(x_0)\Big],

where the second term is independent of p_θ. From the first term, we can conclude that the ELBO objective is maximized when p_θ(x^i_{t-1}|x_t) = q(x^i_{t-1}|x_t) for every t and every i.

Continuous-Time Diffusion Models. As described in Section 2, many continuous-time diffusion models learn to approximate the likelihood ratio (defined as s_θ(x_t, x'_t; t)) at all noise levels t ∈ [0, T]:

s_\theta(x_t, x'_t; t) := \frac{q(X_t = x'_t)}{q(X_t = x_t)}.

Specifically, Lou et al. (2024) and Meng et al. (2022) directly parameterize a neural network to approximate the likelihood ratios, and Sun et al. (2022) approximates them with the conditional distributions p_θ(X^i_t|x^{\setminus i}_t) (∀i, t). For each x_t, since there are exponentially many possible x'_t, it is infeasible for a neural network to directly model the likelihood ratio for all pairs (x_t, x'_t). Instead, these works focus on pairs (x_t, x'_t) that differ in only a single variable, i.e., whose Hamming distance is one. For example, Lou et al. (2024) represent s_θ as s_θ(x_t, y^i_t; t, i), which computes the likelihood ratio between x_t and x'_t = {x^{\setminus i}_t, y^i_t}.
s_θ is trained by minimizing the following objective:

\mathbb{E}_{t,\, x_t \sim q(X_t)}\Big[\sum_i \sum_{y^i_t \neq x^i_t} w_t \Big(s_\theta(x_t, y^i_t; t, i) - \frac{q(X_t = \{x^{\setminus i}_t, y^i_t\})}{q(X_t = x_t)} \log s_\theta(x_t, y^i_t; t, i)\Big)\Big],

where {w_t}_t are positive weights. When the above objective is minimized, s_θ recovers the correct likelihood ratios:

\forall i, t: \quad s_\theta(x_t, y^i_t; t, i) = \frac{q(X_t = \{x^{\setminus i}_t, y^i_t\})}{q(X_t = x_t)}. (23)

At inference time, continuous-time discrete diffusion models select a list of time steps 0 < t_0 < ... < t_k = T to sample from: first sample from the prior p(X_{t_k}) and then sample recursively from {p_θ(x_{t_{i-1}}|x_{t_i})}^k_{i=1}, where p_θ(x_{t_{i-1}}|x_{t_i}) is obtained from s_θ(x_t, y^i_t; t, i) in an indirect manner. Specifically, assuming the forward process follows dp(x_t)/dt = Q p(x_t), we have^3

q(x_{t_{i-1}}|x_{t_i}) = q(x_{t_i}|x_{t_{i-1}})\, \frac{q(x_{t_{i-1}})}{q(x_{t_i})} = \exp(\Delta t\, Q)(x_{t_{i-1}}, x_{t_i})\, \frac{q(X_{t_{i-1}} = x_{t_{i-1}})}{q(X_{t_i} = x_{t_i})},

where Δt := t_i − t_{i-1} and exp(Δt Q)(x_{t_{i-1}}, x) denotes the product of the entries exp(Δt Q)(x^j_{t_{i-1}}, x^j), i.e., the x^j_{t_{i-1}}-th row and x^j-th column of exp(Δt Q), over all coordinates j. Plugging in Equation (23), we can compute the marginal of x^j_{t_{i-1}} (i.e., p_θ(x^j_{t_{i-1}}|x_{t_i})) following

q(X^j_{t_{i-1}} = y\,|\,x_{t_i}) \propto \sum_{y'} \exp(\Delta t\, Q)(y, y')\, s_\theta(x_{t_i}, y'; t_i, j) = \exp(\Delta t\, Q)(y, x^j_{t_i}) + \sum_{y' \neq x^j_{t_i}} \exp(\Delta t\, Q)(y, y')\, s_\theta(x_{t_i}, y'; t_i, j).

Therefore, if s_θ perfectly learns the likelihood ratios between inputs with Hamming distance at most one, then the correct marginals q(x^j_{t_{i-1}}|x_{t_i}) can be computed using s_θ.

^3 This argument largely follows Theorem 4.1 in Lou et al. (2024). We include it for the sake of completeness.

[Figure 7: Sampling time of DCD and its two base models with 2 to 128 denoising steps; x-axis: number of denoising steps, y-axis: runtime (ms).]

E IMPLEMENTATION DETAILS OF DCD

We describe details about the autoregressive version of DCD introduced in Section 5.3. According to Section 5.3, the first (T−t−1)/T fraction of the tokens in x_{t+1} are unmasked. At step t, we only need to additionally unmask the tokens spanning the (T−t−1)/T to (T−t)/T fraction of the sequence x_t.
We do this by caching the keys and values generated by the attention layers for tokens generated in previous denoising steps. So at step t, we have the KV-caches of the first (T−t−1)/T fraction of tokens. As a result, the computational cost of running the autoregressive Transformer is independent of the number of denoising steps.

Algorithm 2 DCD with Autoregressive Copula Models and Using Autoregressive Sampling
1: Inputs: a diffusion model p_dm, an autoregressive model p_ar, number of time steps T, sequence length L
2: Outputs: a sample x_0 from the discrete diffusion model augmented by the autoregressive model
3: Initialize: sample x_T from the prior noise distribution p(X_T)
4: for t = T−1 to 0
5:   i_min, i_max = (L/T)·(T−t−1), (L/T)·(T−t) (w.l.o.g. assume L is divisible by T)
6:   Compute {p_dm(X̃^i_t|x_{t+1})}_i and {p_dm(X̃^i_t|x

GPT. The GPT-2-small model is obtained from Hugging Face: https://huggingface.co/openai-community/gpt2.

DCD. We implement DCD by combining SEDDM and GPT-2S following the steps in Algorithm 1. In line 8, instead of masking tokens independently, we group chunks of 8 tokens together and mask/unmask them with the same probability given the noise schedule (i.e., α_t/α_{t+1} as shown in Prop. 5).

G.2 CONDITIONAL TEXT GENERATION

MAUVE Score. We adopt the MAUVE implementation available in the Python package evaluate. We use the default hyperparameters established by the original paper (Pillutla et al., 2021), which are also the defaults used by the package. We found that the number of samples, and the number of samples per fixed prompt, influenced the score. Therefore, we randomly selected 2,000 prompts and generated 5 samples for each prompt for all methods.

Detailed Runtime Analysis. As shown in Algorithm 1, in each denoising step of DCD, we need to run the discrete diffusion model twice: first to compute {p(X̃^i_t|x_{t+1})}_i and next to compute {p(X̃^i_t|x
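The KV-caching argument of Appendix E can be made concrete with a minimal single-head attention sketch (our own illustration, not the paper's implementation): the keys and values of previously generated tokens never change, so caching them and processing only the newly unmasked chunk yields the same attention output as recomputing the full sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 12
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((L, d))  # token embeddings for a full sequence

def attend(q, K, V):
    """Softmax attention of a single query vector over keys/values."""
    logits = q @ K.T
    w = np.exp(logits - np.max(logits))
    return (w / w.sum()) @ V

# full recomputation: project keys/values for the entire sequence
K_full, V_full = x @ Wk, x @ Wv
out_full = attend(x[-1] @ Wq, K_full, V_full)

# incremental: reuse cached K/V for the prefix (earlier denoising steps)
cache_K, cache_V = x[:8] @ Wk, x[:8] @ Wv   # cached from previous steps
new_K, new_V = x[8:] @ Wk, x[8:] @ Wv       # this step's new chunk only
K_inc = np.vstack([cache_K, new_K])
V_inc = np.vstack([cache_V, new_V])
out_inc = attend(x[-1] @ Wq, K_inc, V_inc)

assert np.allclose(out_full, out_inc)  # identical attention output
```

Because only the new chunk is projected at each step, the per-step cost of the autoregressive copula model depends on the chunk size L/T rather than on how many denoising steps have already been taken.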