# Beta Diffusion

Mingyuan Zhou, Tianqi Chen, Zhendong Wang, and Huangjie Zheng
The University of Texas at Austin, Austin, TX 78712

We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion than negative ELBOs, which can themselves be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.

1 Introduction

Diffusion-based deep generative models have been gaining traction recently. One representative example is Gaussian diffusion [57, 59, 23, 60, 35], which uses a Gaussian Markov chain to gradually diffuse images into Gaussian noise for training. The learned reverse diffusion process, defined by a Gaussian Markov chain in reverse order, iteratively refines noisy inputs towards clean, photorealistic images. Gaussian diffusion can also be viewed through the lens of denoising score matching [28, 64, 59, 60] and stochastic differential equations [61]. These models have shown remarkable success across a wide range of tasks, including but not limited to generating, restoring, and editing images [14, 24, 50, 53, 54, 70, 67], transforming 2D to 3D [49, 1], synthesizing audio [9, 37, 73], reinforcement learning [31, 66, 48], quantifying uncertainty [19], and designing drugs and proteins [56, 44, 32].

Constructing a diffusion-based generative model often follows a general recipe [57, 23, 35, 34] with three basic steps: first, define a forward diffusion process that introduces noise into the data and corrupts it with decreasing signal-to-noise ratio (SNR) as time progresses from 0 to 1; second, define a reverse diffusion process that denoises the corrupted data as time reverses from 1 to 0; third, discretize the time interval from 0 to 1 into a finite number of intervals, and view the discretized forward and reverse processes as a fixed inference network and a learnable generator, respectively. Auto-encoding variational inference [36, 51] is then applied to optimize the parameters of the generator by minimizing a weighted negative ELBO that includes a Kullback–Leibler (KL) divergence-based loss term for each discretized reverse step.
Correspondence to: mingyuan.zhou@mccombs.utexas.edu. PyTorch code is available at https://github.com/mingyuanzhou/Beta-Diffusion.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: Illustration of the beta forward diffusion process for two example images. The first column displays the original images, while the other 21 columns display the images noised and masked by beta diffusion at times $t = 0, 0.05, \ldots, 1$, using $\eta = 10000$ and the sigmoid diffusion schedule with $c_0 = 10$ and $c_1 = 13$.

Figure 2: Illustration of reverse beta diffusion for two example generations. The time $t$ decreases from 1 to 0 when moving from left to right. In each image generation, the top row shows the trajectory of $z_t$ (rescaled for visualization), which has been demasked and denoised using reverse diffusion, whereas the bottom row shows $\hat{x}_0 = f_\theta(z_t, t)$, whose theoretically optimal solution is equal to $\mathbb{E}[x_0\,|\,z_t]$. See Appendix D for more details.

Although the general diffusion-modeling recipe is simple in concept, it requires access to the corrupted data at any time during the forward diffusion process given a data observation, as well as the analytic form of the conditional posterior for any earlier time given both a data observation and its corrupted version at the present time. The latter requirement, according to Bayes' rule, implies access to the analytic form of the distribution of a corrupted data observation at the present time given its value at any previous time. Linear operations on Gaussian distributions naturally satisfy these requirements since they are conjugate to themselves with respect to their mean parameters. This means that the marginal form and the conditional distribution of the mean remain Gaussian when two Gaussian distributions are mixed. Similarly, the requirements can be met under the categorical distribution [26, 2, 18, 27] and the Poisson distribution [10]. However, few additional distributions are known to meet these requirements, and it remains uncertain whether the negative ELBO would be the preferred loss.

While previous works have primarily used Gaussian, categorical, or Poisson distribution-based diffusion processes, this paper introduces beta diffusion as a novel addition to the family of diffusion-based generative models. Beta diffusion is specifically designed to generate data within bounded ranges. Its forward diffusion process is defined by the application of beta distributions in a multiplicative manner, whereas its reverse diffusion process is characterized by the use of scaled and shifted beta distributions. Notably, the distribution at any point in time of the forward diffusion given a data observation remains a beta distribution. We illustrate the forward beta diffusion process in Figure 1, which simultaneously adds noise to and masks the data, and the reverse one in Figure 2, which iteratively performs demasking and denoising for data generation. We provide the details on how these images are obtained in Appendix D. Since the KL divergence between two beta distributions is analytic, one can follow the general recipe to define a negative ELBO to optimize beta diffusion. However, our experiments show that minimizing the negative ELBO of beta diffusion can fail to optimally estimate the parameters of the reverse diffusion process.
For each individual time-dependent KL loss term of the negative ELBO, examining it in terms of the Bregman divergence [4] reveals that the model parameters and the corrupted data are placed in its first and second arguments, respectively. However, to ensure that the optimal solution under the Bregman divergence agrees with the expectation of the clean data given the corrupted data, the order of the two arguments must be swapped. By swapping the Bregman divergence's two arguments, we obtain an upper bound on the KL divergence from the joint distribution of corrupted observations in the reverse chain to that in the forward chain. This bound arises from the convexity of the KL divergence. In addition, there exists another Bregman divergence that upper bounds the KL divergence from the univariate marginal of a corrupted observation in the reverse chain to that in the forward chain. These two Bregman divergences, which can be derived either through KL divergence upper bounds (KLUBs) or the logarithmically convex beta function, share the same optimal solution but have distinct roles in targeting reverse accuracy at each step or counteracting error accumulation over the course of reverse diffusion. We further demonstrate that combining these two KLUBs presents a computationally viable substitute for a KLUB derived at the chain level, which upper bounds the KL divergence from the joint distribution of all latent variables in a forward diffusion chain to that of its reverse. Either KLUB works on its own, which is not unexpected as they both share the same optimal solution in theory, but combining them could lead to the best overall performance.

In beta diffusion, the KL divergence is asymmetric, which enables us to derive an alternative set of KLUBs by swapping its two arguments. We will demonstrate that these argument-swapped KLUBs, which will be referred to as AS-KLUBs, essentially reduce to negative ELBOs. In Gaussian diffusion, the KL divergence is often made symmetric, resulting in KLUBs that are equivalent to (weighted) negative ELBOs.

Our main contributions are the introduction of beta diffusion as a novel multiplicative, diffusion-based generative model for range-bounded data, as well as the proposal of KLUBs as effective loss objectives for optimizing diffusion models, in place of (weighted) negative ELBOs. Additionally, we introduce the log-beta divergence, a Bregman divergence corresponding to the differentiable and strictly convex log-beta function, as a useful tool for analyzing KLUBs. These contributions enhance the existing family of diffusion-based generative models and provide a new perspective on optimizing them.

2 Beta Diffusion and Optimization via KLUBs

We begin by specifying the general requirements for constructing a diffusion-based generative model and establish the notation accordingly [57, 23, 35]. Let $x_0$ denote the observed data, and let $z_s$ and $z_t$ represent their corrupted versions at times $s$ and $t$, respectively, where $0 < s < t < 1$. In the forward diffusion process, we require access to random samples from the marginal distribution $q(z_t\,|\,x_0)$ at any time $t$, as well as an analytic expression of the probability density function (PDF) of the conditional distribution $q(z_s\,|\,z_t, x_0)$ for any $s < t$.
The forward beta diffusion chain uses diffusion scheduling parameters $\alpha_t$ to control the decay of its expected value over the course of forward diffusion, given by $\mathbb{E}[z_t\,|\,x_0] = \alpha_t x_0$, and a positive concentration parameter $\eta$ to control the tightness of the diffusion process around its expected value. We typically set $\alpha_t$ to approach 1 and 0 as $t$ approaches 0 and 1, respectively, and to satisfy $1 \ge \alpha_0 > \alpha_s > \alpha_t > \alpha_1 \ge 0$ for all $s \in (0, t)$, $t \in (0, 1)$.

Let $\Gamma(\cdot)$ denote the gamma function and $B(\cdot,\cdot)$ denote the beta function. The beta distribution $\text{Beta}(x; a, b) = B(a, b)^{-1} x^{a-1}(1-x)^{b-1}$ is a member of the exponential family [6, 65, 7]. Its log partition function is a log-beta function, $\ln B(a, b) = \ln\Gamma(a) + \ln\Gamma(b) - \ln\Gamma(a+b)$, which is differentiable and strictly convex on $(0, \infty)^2$ as a function of two variables [15]. As a result, the KL divergence between two beta distributions can be expressed as the Bregman divergence associated with the log-beta function. Specifically, as in Appendix A, one can show by their definitions that
$$\text{KL}(\text{Beta}(\alpha_p,\beta_p)\,\|\,\text{Beta}(\alpha_q,\beta_q)) = \ln\frac{B(\alpha_q,\beta_q)}{B(\alpha_p,\beta_p)} - (\alpha_q-\alpha_p)\,\partial_{\alpha_p}\ln B(\alpha_p,\beta_p) - (\beta_q-\beta_p)\,\partial_{\beta_p}\ln B(\alpha_p,\beta_p) = D_{\ln B(a,b)}\big((\alpha_q,\beta_q),(\alpha_p,\beta_p)\big). \quad (1)$$
We refer to the above Bregman divergence as the log-beta divergence. Moreover, if $(\alpha_q, \beta_q)$ are random variables, applying Proposition 1 of Banerjee et al. [4], we can conclude that the optimal value of $(\alpha_p, \beta_p)$ that minimizes the expected log-beta divergence is $(\alpha_p^*, \beta_p^*) = \mathbb{E}[(\alpha_q, \beta_q)]$.

Next, we introduce a conditional bivariate beta distribution, which, given a data observation, has (scaled and shifted) beta distributions for not only its two marginals but also its two conditionals. These properties are important for developing the proposed diffusion model with multiplicative transitions.

2.1 Conditional Bivariate Beta Distribution

We first present the conditional bivariate beta distribution in the following lemma, which generalizes previous results on the distribution of the product of independent beta random variables [30, 38, 33].

Lemma 1 (Conditional Bivariate Beta Distribution). Denote $(z_s, z_t)$ as variables over a pair of time points $(s, t)$, with $0 < s < t < 1$. Given a random sample $x_0 \in (0, 1)$ from a probability-valued data distribution $p_{\text{data}}(x_0)$, we define a conditional bivariate beta distribution over $(z_s, z_t)$ with PDF
$$q(z_s, z_t\,|\,x_0) = \frac{\Gamma(\eta)}{\Gamma(\eta\alpha_t x_0)\,\Gamma(\eta(1-\alpha_s x_0))\,\Gamma(\eta(\alpha_s-\alpha_t)x_0)}\, z_t^{\eta\alpha_t x_0-1}\,(1-z_s)^{\eta(1-\alpha_s x_0)-1}\,(z_s-z_t)^{\eta(\alpha_s-\alpha_t)x_0-1}. \quad (2)$$
Marginals: Given $x_0$, the two univariate marginals of this distribution are both beta distributed, as
$$q(z_s\,|\,x_0) = \text{Beta}(\eta\alpha_s x_0,\ \eta(1-\alpha_s x_0)), \quad (3)$$
$$q(z_t\,|\,x_0) = \text{Beta}(\eta\alpha_t x_0,\ \eta(1-\alpha_t x_0)). \quad (4)$$
Conditionals: Given $x_0$, a random sample $(z_t, z_s)$ from this distribution can be either generated in forward order, by multiplying a beta variable from (3) with a beta variable $\pi_{s\to t}$, as
$$z_t = z_s\pi_{s\to t}, \quad \pi_{s\to t} \sim \text{Beta}(\eta\alpha_t x_0,\ \eta(\alpha_s-\alpha_t)x_0), \quad z_s \sim q(z_s\,|\,x_0), \quad (5)$$
or generated in reverse order, by combining a beta variable from (4) with a beta variable $p_{t\to s}$, as
$$z_s = z_t + (1-z_t)p_{t\to s}, \quad p_{t\to s} \sim \text{Beta}(\eta(\alpha_s-\alpha_t)x_0,\ \eta(1-\alpha_s x_0)), \quad z_t \sim q(z_t\,|\,x_0). \quad (6)$$

The proof starts with applying a change of variables to obtain two scaled and shifted beta distributions,
$$q(z_t\,|\,z_s, x_0) = \frac{1}{z_s}\,\text{Beta}\!\Big(\frac{z_t}{z_s};\ \eta\alpha_t x_0,\ \eta(\alpha_s-\alpha_t)x_0\Big), \quad (7)$$
$$q(z_s\,|\,z_t, x_0) = \frac{1}{1-z_t}\,\text{Beta}\!\Big(\frac{z_s-z_t}{1-z_t};\ \eta(\alpha_s-\alpha_t)x_0,\ \eta(1-\alpha_s x_0)\Big), \quad (8)$$
and then takes the products of them with their corresponding marginals, given by (3) and (4), respectively, to show that the PDF of the joint distribution defined in (5) and that defined in (6) are both equal to the PDF of $q(z_s, z_t\,|\,x_0)$ defined in (2). The detailed proof is provided in Appendix B.

To ensure numerical accuracy, we calculate $z_s = z_t + (1-z_t)p_{t\to s}$ in (6) in the logit space as
$$\text{logit}(z_s) = \ln\!\big(e^{\text{logit}(z_t)} + e^{\text{logit}(p_{t\to s})} + e^{\text{logit}(z_t)+\text{logit}(p_{t\to s})}\big). \quad (9)$$
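To make the two equivalent sampling routes in Lemma 1 concrete, the following NumPy sketch draws $(z_s, z_t)$ in forward order (5) and in reverse order (6), checks that the marginal means match (3)-(4), and verifies the logit-space update (9) against the direct computation. The parameter values are illustrative choices for this sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, x0 = 100.0, 0.5            # concentration and a data value in (0, 1); illustrative
alpha_s, alpha_t = 0.8, 0.6     # schedule values at two times s < t, so alpha_s > alpha_t
n = 200_000

# Forward order, Eq. (5): z_s ~ q(z_s | x0), then z_t = z_s * pi_{s->t}
z_s = rng.beta(eta * alpha_s * x0, eta * (1 - alpha_s * x0), size=n)
pi = rng.beta(eta * alpha_t * x0, eta * (alpha_s - alpha_t) * x0, size=n)
z_t_fwd = z_s * pi

# Reverse order, Eq. (6): z_t ~ q(z_t | x0), then z_s = z_t + (1 - z_t) * p_{t->s}
z_t = rng.beta(eta * alpha_t * x0, eta * (1 - alpha_t * x0), size=n)
p = rng.beta(eta * (alpha_s - alpha_t) * x0, eta * (1 - alpha_s * x0), size=n)
z_s_rev = z_t + (1 - z_t) * p

# Both routes share the beta marginals of Eqs. (3)-(4): E[z_t | x0] = alpha_t * x0, etc.
print(z_t_fwd.mean(), alpha_t * x0)   # both close to 0.30
print(z_s_rev.mean(), alpha_s * x0)   # both close to 0.40

# The logit-space update of Eq. (9) agrees with the direct computation of z_s
logit = lambda x: np.log(x) - np.log1p(-x)
lhs = np.logaddexp.reduce([logit(z_t), logit(p), logit(z_t) + logit(p)], axis=0)
print(np.allclose(lhs, logit(z_s_rev)))  # True, up to floating-point error
```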
2.2 Continuous Beta Diffusion

Forward Beta Diffusion. We can use the conditional bivariate beta distribution to construct a forward beta diffusion chain, beginning with the beta distribution from (3) and proceeding with the scaled beta distribution from (7). The marginal at any time $t$ for a given data observation $x_0$, as shown in (4), stays beta distributed in the forward chain. For the beta distribution given by (4), we have
$$\mathbb{E}[z_t\,|\,x_0] = \alpha_t x_0, \quad \text{var}[z_t\,|\,x_0] = \frac{\alpha_t x_0(1-\alpha_t x_0)}{\eta+1}, \quad \text{SNR}_t = \Big(\frac{\mathbb{E}[z_t\,|\,x_0]}{\text{std}[z_t\,|\,x_0]}\Big)^2 = \frac{\alpha_t x_0(\eta+1)}{1-\alpha_t x_0}.$$
Thus when $\alpha_t$ approaches 0 (i.e., $t \to 1$), both $z_t$ and $\text{SNR}_t$ shrink towards 0, and if $\alpha_1 = 0$, we have $z_1 \sim \text{Beta}(0, \eta)$, a degenerate beta distribution that becomes a unit point mass at 0.

Infinite Divisibility. We consider beta diffusion as a form of continuous diffusion, as its forward chain is infinitely divisible given $x_0$. This means that for any time $k \in (s, t)$, we can perform forward diffusion from $z_s$ to $z_t$ by first setting $z_k = z_s\pi_{s\to k}$, where $\pi_{s\to k} \sim \text{Beta}(\eta\alpha_k x_0, \eta(\alpha_s-\alpha_k)x_0)$, and then setting $z_t = z_k\pi_{k\to t}$, where $\pi_{k\to t} \sim \text{Beta}(\eta\alpha_t x_0, \eta(\alpha_k-\alpha_t)x_0)$. The same approach can be used to show the infinite divisibility of reverse beta diffusion given $x_0$.

Reverse Beta Diffusion. We follow Gaussian diffusion in using $q(z_s\,|\,z_t, x_0)$ to help define $p_\theta(z_s\,|\,z_t)$ [58, 35, 72]. To construct a reverse beta diffusion chain, we first need to learn how to reverse from $z_t$ to $z_s$, where $s < t$. If we knew the $x_0$ used to sample $z_t$ as in (3), then we could readily apply the conditional in (8) to sample $z_s$. Since this information is unavailable during inference, we make a weaker assumption that we can exactly sample from $z_t \sim q(z_t) = \mathbb{E}_{x_0\sim p_{\text{data}}}[q(z_t\,|\,x_0)]$, which is the $x_0$-free univariate marginal at time $t$. It is straightforward to sample during training but must be approximated during inference. Under this weaker assumption on $q(z_t)$, utilizing (8) but replacing its true $x_0$ with an approximation $\hat{x}_0 = f_\theta(z_t, t)$, where $f_\theta$ denotes the learned generator parameterized by $\theta$, we introduce our $x_0$-free and hence causal time-reversal distribution as
$$p_\theta(z_s\,|\,z_t) = q(z_s\,|\,z_t, \hat{x}_0 = f_\theta(z_t, t)). \quad (10)$$

2.3 Optimization via KLUBs and Log-Beta Divergence

KLUB for Time Reversal. The time-reversal distribution $p_\theta(z_s\,|\,z_t)$ reaches its optimum when its product with $q(z_t)$ becomes equivalent to $q(z_s, z_t) = \mathbb{E}_{x_0\sim p_{\text{data}}}[q(z_s, z_t\,|\,x_0)]$, which is a marginal bivariate distribution that is $x_0$-free. Thus we propose to optimize $\theta$ by minimizing $\text{KL}(p_\theta(z_s\,|\,z_t)q(z_t)\,\|\,q(z_s, z_t))$ in theory, but introduce a surrogate loss in practice:

Lemma 2 (KLUB (conditional)). The KL divergence from $q(z_s, z_t)$ to $p_\theta(z_s\,|\,z_t)q(z_t)$, two $x_0$-free joint distributions defined by the forward and reverse diffusions, respectively, can be upper bounded as
$$\text{KL}(p_\theta(z_s\,|\,z_t)q(z_t)\,\|\,q(z_s, z_t)) \le \text{KLUB}_{s,t} = \mathbb{E}_{(z_t, x_0)\sim q(z_t|x_0)p_{\text{data}}(x_0)}[\text{KLUB}(s, z_t, x_0)], \quad (11)$$
$$\text{KLUB}(s, z_t, x_0) = \text{KL}(q(z_s\,|\,z_t, \hat{x}_0 = f_\theta(z_t, t))\,\|\,q(z_s\,|\,z_t, x_0)). \quad (12)$$

The proof in Appendix B utilizes the equation $p_\theta(z_s\,|\,z_t)q(z_t) = \mathbb{E}_{x_0\sim p_{\text{data}}}[p_\theta(z_s\,|\,z_t)q(z_t\,|\,x_0)]$ and then applies the convexity of the KL divergence [11, 74] and the definition in (10).

Log-Beta Divergence. To find the optimal solution under the KLUB, following (1), we can express $\text{KLUB}(s, z_t, x_0)$ given by (12) as a log-beta divergence,
$$D_{\ln B(a,b)}\big([\eta(\alpha_s-\alpha_t)x_0,\ \eta(1-\alpha_s x_0)],\ [\eta(\alpha_s-\alpha_t)f_\theta(z_t, t),\ \eta(1-\alpha_s f_\theta(z_t, t))]\big). \quad (13)$$
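Since (1) and (13) are instances of the same Bregman divergence of the log-beta function, the loss can be evaluated with nothing more exotic than log-gamma and digamma functions. The sketch below, written with PyTorch (the paper's released code is in PyTorch), is one way this divergence could be implemented; the function names are ours.

```python
import torch

def log_beta(a, b):
    # ln B(a, b) = ln Gamma(a) + ln Gamma(b) - ln Gamma(a + b)
    return torch.lgamma(a) + torch.lgamma(b) - torch.lgamma(a + b)

def log_beta_divergence(a_q, b_q, a_p, b_p):
    """Bregman divergence D_{ln B(a,b)}((a_q, b_q), (a_p, b_p)) of Eq. (1),
    which equals KL(Beta(a_p, b_p) || Beta(a_q, b_q)); all inputs are positive tensors."""
    grad_a = torch.digamma(a_p) - torch.digamma(a_p + b_p)   # d/da ln B at (a_p, b_p)
    grad_b = torch.digamma(b_p) - torch.digamma(a_p + b_p)   # d/db ln B at (a_p, b_p)
    return (log_beta(a_q, b_q) - log_beta(a_p, b_p)
            - (a_q - a_p) * grad_a - (b_q - b_p) * grad_b)

# Example: the conditional KLUB of Eq. (13) for scalar x0 and its estimate x0_hat
eta, alpha_s, alpha_t = 10000.0, 0.8, 0.6
x0, x0_hat = torch.tensor(0.5), torch.tensor(0.45)
kl = log_beta_divergence(eta * (alpha_s - alpha_t) * x0, eta * (1 - alpha_s * x0),
                         eta * (alpha_s - alpha_t) * x0_hat, eta * (1 - alpha_s * x0_hat))
print(kl)  # a non-negative scalar that vanishes when x0_hat == x0
```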
We note that $\text{KLUB}_{s,t}$ defined in (11) can also be written as $\mathbb{E}_{z_t\sim q(z_t)}\mathbb{E}_{x_0\sim q(x_0|z_t)}[\text{KLUB}(s, z_t, x_0)]$, where the log-beta divergence for $\text{KLUB}(s, z_t, x_0)$, defined as in (13), includes $x_0 \sim q(x_0\,|\,z_t)$ in its first argument and the generator $f_\theta(z_t, t)$ in its second argument. Therefore, applying Proposition 1 of Banerjee et al. [4], we have the following lemma.

Lemma 3. The objective $\text{KLUB}_{s,t}$ defined in (11) for any $s < t$ is minimized when
$$f_{\theta^*}(z_t, t) = \mathbb{E}[x_0\,|\,z_t] = \mathbb{E}_{x_0\sim q(x_0|z_t)}[x_0] \quad \text{for all } z_t \sim q(z_t). \quad (14)$$

Thus under the $\text{KLUB}_{s,t}$-optimized $\theta^*$, we have $p_{\theta^*}(z_s\,|\,z_t) = q(z_s\,|\,z_t, \mathbb{E}[x_0\,|\,z_t])$, which is different from the optimal solution of the original KL loss in (11), namely $p^*_\theta(z_s\,|\,z_t) = q(z_s, z_t)/q(z_t) = q(z_s\,|\,z_t) = \mathbb{E}_{x_0\sim q(x_0|z_t)}[q(z_s\,|\,z_t, x_0)]$. It is interesting to note that they only differ in whether the expectation is carried out inside or outside the conditional posterior. In practice, we need to control the gap between $p_{\theta^*}(z_s\,|\,z_t)$ and $q(z_s\,|\,z_t)$, and hence $s$ needs to be close to $t$. Furthermore, the assumption of having access to unbiased samples from the true marginal $q(z_t)$ is also rarely met. Thus we need to discretize the time from 1 to $t$ into sufficiently fine intervals and perform time-reversal sampling over these intervals. Specifically, we can start with $z_1 = 0$ and iterate (10) over these intervals to obtain an approximate sample from $z_t \sim q(z_t)$. However, the error could accumulate along the way from $z_1$ to $z_t$, to which we present a solution below.

KLUB for Error Accumulation Control. To counteract error accumulation during time reversal, we propose to approximate the true marginal $q(z'_t)$ using a distribution-cycle-consistency approach. This involves feeding a random sample $z_t$ from $q(z_t)$ into the generator $f_\theta$, followed by the forward marginal $q(z'_t\,|\,\hat{x}_0 = f_\theta(z_t, t))$, with the aim of recovering the distribution $q(z'_t)$ itself. Specifically, we propose to approximate $q(z'_t)$ with $p_\theta(z'_t) := \mathbb{E}_{z_t\sim q(z_t)}[q(z'_t\,|\,\hat{x}_0 = f_\theta(z_t, t))]$ by minimizing $\text{KL}(p_\theta(z'_t)\,\|\,q(z'_t))$ in theory, but introducing a surrogate loss in practice:

Lemma 4 (KLUB (marginal)). The KL divergence $\text{KL}(p_\theta(z'_t)\,\|\,q(z'_t))$ can be upper bounded as
$$\text{KL}(p_\theta(z'_t)\,\|\,q(z'_t)) \le \text{KLUB}_t = \mathbb{E}_{(z_t, x_0)\sim q(z_t|x_0)p_{\text{data}}(x_0)}[\text{KLUB}(z_t, x_0)], \quad (15)$$
$$\text{KLUB}(z_t, x_0) = \text{KL}(q(z'_t\,|\,f_\theta(z_t, t))\,\|\,q(z'_t\,|\,x_0)). \quad (16)$$

The proof in Appendix B utilizes the fact that $q(z'_t) = \mathbb{E}_{(z_t, x_0)\sim q(z_t|x_0)p_{\text{data}}(x_0)}[q(z'_t\,|\,x_0)]$. Note that the notation KLUB is reused throughout the paper and can refer to any of the equations (11), (12), (15), or (16), depending on the context. Similar to the previous analysis, we have
$$\text{KLUB}(z_t, x_0) = D_{\ln B(a,b)}\big([\eta\alpha_t x_0,\ \eta(1-\alpha_t x_0)],\ [\eta\alpha_t f_\theta(z_t, t),\ \eta(1-\alpha_t f_\theta(z_t, t))]\big), \quad (17)$$
and can conclude with the following lemma.

Lemma 5. $\text{KLUB}_t$ in (15) is optimized when the same optimal solution given by (14) is met.

Optimization via KLUBs. With the two KLUBs for both time reversal and error-accumulation control, whose optimal solutions are the same as in (14), we are ready to optimize the generator $f_\theta$ via stochastic gradient descent (SGD). Specifically, denoting $\omega, \pi \in [0, 1]$ as two weight coefficients, the loss term for the $i$th data observation $x_0^{(i)}$ in a mini-batch can be computed as
$$\mathcal{L}_i = \omega\,\text{KLUB}(s_i, z_{t_i}, x_0^{(i)}) + (1-\omega)\,\text{KLUB}(z_{t_i}, x_0^{(i)}), \quad (18)$$
$$z_{t_i} \sim q(z_{t_i}\,|\,x_0^{(i)}) = \text{Beta}(\eta\alpha_{t_i}x_0^{(i)},\ \eta(1-\alpha_{t_i}x_0^{(i)})), \quad s_i = \pi t_i, \quad t_i \sim \text{Unif}(0, 1).$$
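Putting the pieces together, a single training step amounts to drawing $(t_i, s_i, z_{t_i})$ as above and evaluating the two log-beta divergences in (13) and (17). The sketch below reuses the hypothetical `log_beta_divergence` helper from the earlier sketch and assumes `f_theta` maps $(z_t, t)$ to an estimate of $x_0$ in $(0, 1)$; it omits the data scaling and shifting used in Algorithm 1.

```python
import torch

def beta_diffusion_loss(f_theta, x0, alpha_fn, eta=10000.0, omega=0.5, pi=0.95):
    """One mini-batch of the KLUB loss in Eq. (18).
    x0: float tensor with values in (0, 1); alpha_fn: elementwise map t -> alpha_t;
    reuses log_beta_divergence from the sketch following Eq. (13)."""
    t = torch.rand_like(x0)                       # t_i ~ Unif(0, 1)
    s = pi * t                                    # s_i = pi * t_i, so alpha_s > alpha_t
    alpha_t, alpha_s = alpha_fn(t), alpha_fn(s)
    # z_{t_i} ~ q(z_{t_i} | x0) = Beta(eta * alpha_t * x0, eta * (1 - alpha_t * x0))
    z_t = torch.distributions.Beta(eta * alpha_t * x0, eta * (1 - alpha_t * x0)).sample()
    x0_hat = f_theta(z_t, t)                      # estimate of E[x0 | z_t]
    kl_cond = log_beta_divergence(                # Eq. (13): KLUB(s, z_t, x0)
        eta * (alpha_s - alpha_t) * x0, eta * (1 - alpha_s * x0),
        eta * (alpha_s - alpha_t) * x0_hat, eta * (1 - alpha_s * x0_hat))
    kl_marg = log_beta_divergence(                # Eq. (17): KLUB(z_t, x0)
        eta * alpha_t * x0, eta * (1 - alpha_t * x0),
        eta * alpha_t * x0_hat, eta * (1 - alpha_t * x0_hat))
    return (omega * kl_cond + (1 - omega) * kl_marg).mean()
```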
2.4 Discretized Beta Diffusion for Generation of Range-Bounded Data

For generating range-bounded data, we discretize the beta diffusion chain. Denote $z_{t_0} = 1$ and let $t_j$ increase with $j$. Repeating (5) for $T$ times, we define a discretized forward beta diffusion chain
$$q(z_{t_{1:T}}\,|\,x_0) = \prod_{j=1}^{T} q(z_{t_j}\,|\,z_{t_{j-1}}, x_0) = \prod_{j=1}^{T} \frac{1}{z_{t_{j-1}}}\,\text{Beta}\!\Big(\frac{z_{t_j}}{z_{t_{j-1}}};\ \eta\alpha_{t_j}x_0,\ \eta(\alpha_{t_{j-1}}-\alpha_{t_j})x_0\Big). \quad (19)$$
A notable feature of (19) is that the marginal at any discrete time step $t_j$ follows a beta distribution, similarly defined as in (4). We also note that while $q(z_{t_{1:T}}\,|\,x_0)$ defines a Markov chain, the marginal
$$q(z_{t_{1:T}}) = \mathbb{E}_{x_0\sim p_{\text{data}}(x_0)}[q(z_{t_{1:T}}\,|\,x_0)] \quad (20)$$
in general does not. Unlike in beta diffusion, where the transitions between $z_t$ and $z_{t-1}$ are applied multiplicatively, in Gaussian diffusion the transitions between $z_t$ and $z_{t-1}$ are related to each other additively and $z_{t_{1:T}}$ forms a Markov chain regardless of whether $x_0$ is marginalized out.

The discretized forward beta diffusion chain given by (19) is reversible assuming $x_0$ is known. This means that given $x_0$, it can be equivalently sampled in reverse order by first sampling $z_{t_T} \sim q(z_{t_T}\,|\,x_0) = \text{Beta}(\eta\alpha_{t_T}x_0, \eta(1-\alpha_{t_T}x_0))$ and then repeating (8) for $t_T, \ldots, t_2$, with PDF
$$q(z_{t_{1:T}}\,|\,x_0) = q(z_{t_T}\,|\,x_0) \prod_{j=2}^{T} \frac{1}{1-z_{t_j}}\,\text{Beta}\!\Big(\frac{z_{t_{j-1}}-z_{t_j}}{1-z_{t_j}};\ \eta(\alpha_{t_{j-1}}-\alpha_{t_j})x_0,\ \eta(1-\alpha_{t_{j-1}}x_0)\Big).$$
This non-causal chain, while not useful by itself, serves as a blueprint for approximate generation. Specifically, we approximate the marginal given by (20) with a Markov chain in reverse order as
$$p_\theta(z_{t_{1:T}}) = p_{\text{prior}}(z_{t_T}) \prod_{j=2}^{T} p_\theta(z_{t_{j-1}}\,|\,z_{t_j}) = p_{\text{prior}}(z_{t_T}) \prod_{j=2}^{T} q(z_{t_{j-1}}\,|\,z_{t_j}, f_\theta(z_{t_j}, \alpha_{t_j})). \quad (21)$$
To start the reverse process, we choose to approximate $q(z_{t_T}) = \mathbb{E}_{x_0\sim p_{\text{data}}(x_0)}[q(z_{t_T}\,|\,x_0)]$ with $p_{\text{prior}}(z_{t_T}) = q(z_{t_T}\,|\,\mathbb{E}[x_0])$, which means we let $z_{t_T} \sim \text{Beta}(\eta\alpha_{t_T}\mathbb{E}[x_0],\ \eta(1-\alpha_{t_T}\mathbb{E}[x_0]))$. To sample $z_{t_{1:T-1}}$, we use the remaining terms in (21), which are scaled and shifted beta distributions that are specified as in (8) and can be sampled as in (6) and (9).

KLUB for Discretized Beta Diffusion. An optimized generator is expected to make the $x_0$-free joint distribution over all $T$ steps of the discretized reverse beta diffusion chain, expressed as $p_\theta(z_{t_{1:T}})$, approach that of the discretized forward chain, expressed as $q(z_{t_{1:T}})$. Thus an optimized $\theta$ is desired to minimize the KL divergence $\text{KL}(p_\theta(z_{t_{1:T}})\,\|\,q(z_{t_{1:T}}))$. While this KL loss is in general intractable to compute, it can also be bounded using the KLUB shown as follows.

Lemma 6 (KLUB for discretized diffusion chain). $\text{KL}(p_\theta(z_{t_{1:T}})\,\|\,q(z_{t_{1:T}}))$ is upper bounded by
$$\text{KLUB} = \mathbb{E}_{x_0\sim p_{\text{data}}(x_0)}[\text{KL}(p_{\text{prior}}(z_{t_T})\,\|\,q(z_{t_T}\,|\,x_0))] + \sum_{j=2}^{T} \widetilde{\text{KLUB}}_{t_{j-1},t_j}, \quad (22)$$
where
$$\widetilde{\text{KLUB}}_{s,t} = \mathbb{E}_{(z_t, x_0)\sim p_\theta(z_t|x_0)p_{\text{data}}(x_0)}[\widetilde{\text{KLUB}}(s, z_t, x_0)], \quad \widetilde{\text{KLUB}}(s, z_t, x_0) = \text{KL}(p_\theta(z_s\,|\,z_t)\,\|\,q(z_s\,|\,z_t, x_0)).$$

We provide the proof in Appendix B. We note that Lemma 6 is a general statement applicable to any diffusion model with a discrete forward chain $q(z_{t_{1:T}}\,|\,x_0) = \prod_{j=1}^{T} q(z_{t_j}\,|\,z_{t_{j-1}}, x_0)$ and a discrete reverse chain $p_\theta(z_{t_{1:T}}) = p_{\text{prior}}(z_{t_T}) \prod_{j=2}^{T} p_\theta(z_{t_{j-1}}\,|\,z_{t_j})$. To estimate the KLUB in (22) during training, however, one would need to sample $z_{t_j} \sim p_\theta(z_{t_j}\,|\,x_0) \propto p(x_0\,|\,z_{t_j})\,p_\theta(z_{t_j})$, which is often infeasible. If we replace $z_{t_j} \sim p_\theta(z_{t_j}\,|\,x_0)$ with $z_{t_j} \sim q(z_{t_j}\,|\,x_0)$, then $\widetilde{\text{KLUB}}_{s,t}$ becomes the same as $\text{KLUB}_{s,t}$ given by (11), and $\text{KLUB}_t$ given by (15) can be considered to remedy the impact of approximating $p_\theta(z_{t_j})$ with $q(z_{t_j})$. Therefore, we can consider the combination of $\text{KLUB}_{s,t}$ given by (11) and $\text{KLUB}_t$ given by (15) as a computationally viable substitute for $\widetilde{\text{KLUB}}_{s,t}$, which hence justifies the use of the loss in (18) to optimize the discretized reverse beta diffusion chain. We summarize the training and sampling algorithms of beta diffusion in Algorithms 1 and 2, respectively.
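As a complement to Algorithm 2, the following sketch condenses the ancestral sampler implied by (21): start from the prior at $t_T$, repeatedly plug the current estimate $\hat{x}_0 = f_\theta(z_{t_j}, \alpha_{t_j})$ into the reverse conditional (8), and apply the update (6). It works in probability space for readability, whereas Algorithm 2 performs the same update in logit space via (9) for numerical accuracy; the `f_theta` signature and the omission of data scaling/shifting are assumptions of this sketch.

```python
import numpy as np

def sample_beta_diffusion(f_theta, alphas, eta=10000.0, x0_mean=0.5, rng=None):
    """Ancestral sampling from the reverse chain of Eq. (21).
    alphas: decreasing schedule values [alpha_{t_1}, ..., alpha_{t_T}];
    f_theta(z, alpha) -> estimate of E[x0 | z] in (0, 1)."""
    rng = rng or np.random.default_rng()
    T = len(alphas)
    # p_prior(z_{t_T}) = q(z_{t_T} | E[x0])
    z = rng.beta(eta * alphas[-1] * x0_mean, eta * (1 - alphas[-1] * x0_mean))
    for j in range(T - 1, 0, -1):                 # walk the chain backwards
        x0_hat = f_theta(z, alphas[j])
        a = eta * (alphas[j - 1] - alphas[j]) * x0_hat
        b = eta * (1 - alphas[j - 1] * x0_hat)
        p = rng.beta(a, b)                        # p drawn from the reverse conditional (8)
        z = z + (1 - z) * p                       # update of Eq. (6)
    return f_theta(z, alphas[0])                  # final estimate hat{x}_0
```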
2.5 Argument-Swapped KLUBs and Negative ELBOs

We note that in theory, instead of the KL divergences in (11) and (15), $f_\theta$ can also be optimized using two argument-swapped KL divergences: $\text{KL}(q(z_t, z_s)\,\|\,p_\theta(z_s\,|\,z_t)q(z_t))$ and $\text{KL}(q(z'_t)\,\|\,p_\theta(z'_t))$. By the same analysis, these KL divergences can also be bounded by KLUBs and log-beta divergences that are equivalent to the previous ones, but with swapped arguments. The argument-swapped KLUBs and log-beta divergences will be shown to be closely related to optimizing a discretized beta reverse diffusion chain via the ELBO, but they do not guarantee an optimal solution that satisfies (14) in beta diffusion and are found to provide clearly inferior empirical performance.

Negative ELBO for Discretized Beta Diffusion. As an alternative to the KLUB, one can also follow the convention in diffusion modeling and minimize the negative ELBO, expressed as
$$\mathbb{E}_{x_0\sim p_{\text{data}}(x_0)}[-\ln p_\theta(x_0)] \le -\text{ELBO} = \mathbb{E}_{x_0\sim p_{\text{data}}(x_0)}\mathbb{E}_{q(z_{t_{1:T}}|x_0)}\Big[-\ln \frac{p(x_0\,|\,z_{t_{1:T}})\,p_\theta(z_{t_{1:T}})}{q(z_{t_{1:T}}\,|\,x_0)}\Big]$$
$$= -\mathbb{E}_{x_0}\mathbb{E}_{q(z_{t_1}|x_0)}\ln p(x_0\,|\,z_{t_1}) + \mathbb{E}_{x_0}\text{KL}[q(z_{t_T}\,|\,x_0)\,\|\,p_{\text{prior}}(z_{t_T})] + \sum_{j=2}^{T}\mathbb{E}_{x_0}\mathbb{E}_{q(z_{t_j}|x_0)}[\mathcal{L}(t_{j-1}, z_{t_j}, x_0)],$$
where the first two terms are often ignored and the focus is placed on the remaining terms,
$$\mathcal{L}(t_{j-1}, z_{t_j}, x_0) = \text{KL}\big(q(z_{t_{j-1}}\,|\,z_{t_j}, x_0)\,\|\,q(z_{t_{j-1}}\,|\,z_{t_j}, \hat{x}_0 = f_\theta(z_{t_j}, t_j))\big)$$
$$= D_{\ln B(a,b)}\big(\{\eta(\alpha_{t_{j-1}}-\alpha_{t_j})f_\theta(z_{t_j}, t_j),\ \eta(1-\alpha_{t_{j-1}}f_\theta(z_{t_j}, t_j))\},\ \{\eta(\alpha_{t_{j-1}}-\alpha_{t_j})x_0,\ \eta(1-\alpha_{t_{j-1}}x_0)\}\big).$$

Lemma 7 ($-$ELBO and argument-swapped KLUB). Optimizing the generator $f_\theta$ with $-\text{ELBO}$ is equivalent to using an upper bound of the argument-swapped KL divergence $\text{KL}(q(z_{t_{1:T}})\,\|\,p_\theta(z_{t_{1:T}}))$.

The proof in Appendix B relies on the convexity of both the KL divergence and the negative logarithmic function. We find that for beta diffusion, optimizing $\text{KL}(p_\theta(z_{t_{1:T}})\,\|\,q(z_{t_{1:T}}))$ via the proposed KLUBs is clearly preferred to optimizing $\text{KL}(q(z_{t_{1:T}})\,\|\,p_\theta(z_{t_{1:T}}))$ via $-$ELBOs (i.e., argument-swapped KLUBs) and leads to stable and satisfactory performance.

KLUBs and (Weighted) Negative ELBOs for Gaussian Diffusion. We note that KLUBs are directly applicable to Gaussian diffusion, but they may not result in new optimization algorithms for Gaussian diffusion that drastically differ from the weighted ELBO, which weighs the KL terms using the corresponding SNRs. Moreover, whether the default or argument-swapped KLUBs are used typically does not make any difference in Gaussian diffusion and would result in the same squared error-based Bregman divergence. We provide the derivation of the (weighted) ELBOs from the lens of KLUBs in Appendix C, providing theoretical support for Gaussian diffusion to use the SNR-weighted ELBO [23, 35, 20], which was often considered a heuristic but crucial modification of the ELBO.
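To make the swap concrete in code: the per-step negative-ELBO term above is the same log-beta Bregman divergence as in (13), only with the roles of $x_0$ and $f_\theta(z_t, t)$ exchanged. A minimal sketch, again reusing the hypothetical `log_beta_divergence` helper from the earlier sketch:

```python
# KLUB term of Eq. (13): the x0-based parameters sit in the first argument.
def klub_term(x0, x0_hat, alpha_s, alpha_t, eta=10000.0):
    return log_beta_divergence(
        eta * (alpha_s - alpha_t) * x0, eta * (1 - alpha_s * x0),
        eta * (alpha_s - alpha_t) * x0_hat, eta * (1 - alpha_s * x0_hat))

# Negative-ELBO (argument-swapped) term: x0 and x0_hat trade places,
# so the Bregman-optimal generator is no longer E[x0 | z_t].
def neg_elbo_term(x0, x0_hat, alpha_s, alpha_t, eta=10000.0):
    return log_beta_divergence(
        eta * (alpha_s - alpha_t) * x0_hat, eta * (1 - alpha_s * x0_hat),
        eta * (alpha_s - alpha_t) * x0, eta * (1 - alpha_s * x0))
```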
3 Related Work, Limitations, and Future Directions

Various diffusion processes, including Gaussian, categorical, Poisson, and beta diffusions, employ specific distributions in both forward and reverse sampling. Gaussian diffusion starts at $N(0, 1)$ in its reverse process, while both Poisson and beta diffusion start at 0. Beta diffusion's reverse sampling is a monotonically non-decreasing process, similar to Poisson diffusion, but while Poisson diffusion takes count-valued discrete jumps, beta diffusion takes probability-valued continuous jumps. A future direction involves extending beta diffusion to encompass the exponential family [6, 65, 45, 7].

Several recent works have explored alternative diffusion processes closely related to Gaussian diffusion. Cold diffusion by Bansal et al. [5] builds models around arbitrary image transformations instead of Gaussian corruption, but it still relies on an L1 loss, resembling Gaussian diffusion's squared Euclidean distance. Rissanen et al. [52] propose an inverse heat dispersion-based diffusion process that reverses the heat equation using inductive biases in Gaussian diffusion-like models. Soft diffusion by Daras et al. [12] uses linear corruption processes such as Gaussian blur and masking. Blurring diffusion by Hoogeboom and Salimans [25] shows that blurring (or heat dissipation) can be equivalently defined using a Gaussian diffusion process with non-isotropic noise and proposes to incorporate blurring into Gaussian diffusion. These alternative diffusion processes share similarities with Gaussian diffusion in loss definition and the use of Gaussian-based reverse diffusion for generation. By contrast, beta diffusion is distinct from all of them in forward diffusion, training loss, and reverse diffusion.

A concurrent work by Avdeyev et al. [3] utilizes the Jacobi diffusion process for discrete-data diffusion models. Unlike Gaussian diffusion's SDE definition, the Jacobi diffusion process in Avdeyev et al. [3] is defined by the SDE
$$dx = \frac{s}{2}\,[a(1-x) - bx]\,dt + \sqrt{s\,x(1-x)}\,dw,$$
with $x \in [0, 1]$ and $s, a, b > 0$, whose stationary distribution is a univariate beta distribution $\text{Beta}(a, b)$. Beta diffusion and the Jacobi process are both related to the beta distribution, but they differ in several aspects: beta diffusion ends its forward process at $\text{Beta}(0, \eta)$, a unit point mass at 0, rather than a $\text{Beta}(a, b)$ random variable, and the marginal distribution of beta diffusion at time $t$ is expressed as $q(z_t\,|\,x_0) = \text{Beta}(\eta\alpha_t x_0, \eta(1-\alpha_t x_0))$, whereas the corresponding marginal of the Jacobi diffusion process involves an infinite sum. Potential connections between beta diffusion and the Jacobi process under specific parameterizations are worth further investigation.

Several recent studies have been actively exploring the adaptation of diffusion models to constrained scenarios [13, 40, 41, 17], where data is bounded within specific ranges or constrained to particular manifolds. These approaches are all rooted in the framework of Gaussian diffusion, which involves the incorporation of additive Gaussian noise. In sharp contrast, beta diffusion introduces a distinct perspective by incorporating multiplicative noise, leading to an inherently hypercubic-constrained diffusion model that offers new development and exploration opportunities.

Classifier-free guidance (CFG), often used in conjunction with heuristic clipping, is a widely used technique to perform conditional generation with Gaussian diffusion [22, 41]. Beta diffusion offers the potential for a seamless integration of CFG by directly applying it in the logit space, the operating space of its $f_\theta$ network, thereby eliminating the necessity for heuristic clipping.

To adapt Gaussian diffusion for high-resolution images, a common approach is to perform diffusion within the latent space of an auto-encoder [53, 62]. A promising avenue to explore is the incorporation of sigmoid or tanh activation functions in the encoder's final layer. This modification would establish a bounded latent space conducive to applying beta diffusion, ultimately leading to the development of latent beta diffusion tailored for high-resolution image generation.

One limitation of beta diffusion is that its training is computationally expensive and data-intensive, akin to Gaussian diffusion.
Specifically, with four Nvidia RTX A5000 GPUs, beta diffusion and Gaussian diffusion (VP-EDM) both take approximately 1.46 seconds to process 1000 images of size 32×32×3. Processing 200 million CIFAR-10 images, the default number required to reproduce the results of VP-EDM, would thus take over 80 hours. Several recent works have explored different techniques to make Gaussian diffusion faster and/or more data-efficient in training [68, 20, 76, 71]. It is worth exploring how to adapt these methods to enhance the training efficiency of beta diffusion.

Beta diffusion has sampling costs comparable to Gaussian diffusion at the same NFE. However, various methods have been developed to accelerate the generation of Gaussian diffusion, including combining it with VAEs, GANs, or conditional transport [77] for faster generation [72, 47, 78, 69], distilling the reverse diffusion chains [43, 55, 76], utilizing reinforcement learning [16], and transforming the SDE associated with Gaussian diffusion into an ODE, followed by fast ODE solvers [58, 39, 42, 75, 34]. Given these existing acceleration techniques for Gaussian diffusion, it is worth exploring their generalization to enhance the sampling efficiency of beta diffusion.

Beta diffusion raises concerns regarding potential negative societal impact when trained on image datasets curated with ill intentions. This issue is not exclusive to beta diffusion but applies to diffusion-based generative models as a whole. It is crucial to address how we can leverage these models for the betterment of society while mitigating any potential negative consequences.

4 Experiments

The training and sampling algorithms for beta diffusion are described in detail in Algorithms 1 and 2, respectively, in the Appendix. Our experiments, conducted on two synthetic datasets and the CIFAR-10 images, primarily aim to showcase beta diffusion's effectiveness in generating range-bounded data. We also underscore the superiority of KLUBs over negative ELBOs as effective objectives for optimizing beta diffusion. Additionally, we highlight the differences between beta and Gaussian diffusion, specifically in whether the data are generated through additive or multiplicative transforms and in their ability to model mixtures of range-bounded distributions with disjoint supports.

We compare the performance of Gauss ELBO, Beta ELBO, and Beta KLUB, which respectively correspond to a Gaussian diffusion model optimized with the SNR-weighted negative ELBO [23, 35], a beta diffusion model optimized with the proposed KLUB loss defined in (18) but with the two arguments inside each KL term swapped, and a beta diffusion model optimized with the proposed KLUB loss defined in (18). On CIFAR-10, we also evaluate beta diffusion alongside a range of non-Gaussian or Gaussian-like diffusion models for comparison.

4.1 Synthetic Data

We consider a discrete distribution that consists of an equal mixture of five unit point masses located at $x_0 \in \mathcal{D} = \{1/7, 2/7, 3/7, 4/7, 5/7\}$. We would like to highlight that a unit point mass can also be seen as an extreme case of range-bounded data, where the range is zero.

Figure 3: Comparison of the true PMF, marked with red squares, and the empirical PMFs of three different methods, Gauss ELBO, Beta ELBO, and Beta KLUB, calculated over 100 equal-sized bins between 0 and 1. Each empirical PMF is marked with solid dots.
Despite being simple, this data could be challenging to model with a continuous distribution, as it would require the generator to concentrate its continuous-valued generations on these five discrete points. We follow previous works in choosing the beta linear diffusion schedule $\alpha_t = e^{-\frac{1}{2}\beta_d t^2 - \beta_{\min} t}$, where $\beta_d = 19.9$ and $\beta_{\min} = 0.1$. This schedule, widely used by Gaussian diffusion [23, 61, 34], is applied consistently across all experiments conducted on synthetic data. We set $\eta = 10000$, $\pi = 0.95$, and $\omega = 0.5$. As the data already fall within the range of 0 to 1, necessitating neither scaling nor shifting, we set Scale = 1 and Shift = 0.

We use the same structured generator $f_\theta$ for both Gaussian and beta diffusion. We choose 20-dimensional sinusoidal position embeddings [63], with the positions set as $1000t$. The network is an MLP structured as (21-256)-ReLU-(256-256)-ReLU-(256-1). We utilize the Adam optimizer with a learning rate of 5e-4 and a mini-batch size of 1000. For data generation, we set NFE = 200.

We provide the generation results in Figure 3, which shows the true probability mass function (PMF) of the discrete distribution and the empirical PMFs over 100 equal-sized bins between 0 and 1. Each empirical PMF is computed based on 100k random data points generated by the model trained after 400k iterations. It is clear from Figure 3 that Gauss ELBO is the worst in terms of misaligning the data supports and placing its data into zero-density regions; Beta ELBO is the worst in terms of systematically overestimating the density at smaller-valued supports; whereas Beta KLUB reaches the best compromise between accurately identifying the data supports and maintaining correct density ratios between different supports. In Appendix E, we further provide quantitative performance comparisons between different diffusion models and conduct an ablation study between KLUB and its two variants for beta diffusion: KLUB Conditional and KLUB Marginal, corresponding to $\omega = 1$ and $\omega = 0$, respectively, in the loss given by (18). Additionally, we evaluate beta diffusion on another synthetic dataset, which comes from a mixture of range-bounded continuous distributions and point masses supported on disjoint regions. A sketch of the diffusion schedules and the toy generator used here is provided below.
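The following PyTorch sketch spells out the two diffusion schedules used in the paper (the beta linear schedule here and the sigmoid schedule used later for images) together with a toy generator matching the (21-256)-ReLU-(256-256)-ReLU-(256-1) description; the exact frequency spacing of the sinusoidal embedding and the sigmoid output activation are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def alpha_beta_linear(t, beta_d=19.9, beta_min=0.1):
    # Beta linear schedule: alpha_t = exp(-0.5 * beta_d * t^2 - beta_min * t)
    return torch.exp(-0.5 * beta_d * t**2 - beta_min * t)

def alpha_sigmoid(t, c0=10.0, c1=13.0):
    # Sigmoid schedule used for images: alpha_t = sigmoid(c0 - (c0 + c1) * t)
    return torch.sigmoid(c0 - (c0 + c1) * t)

class SinusoidalEmbedding(nn.Module):
    """20-dimensional sinusoidal embedding of the position 1000 * t
    (transformer-style frequency spacing; an assumption of this sketch)."""
    def __init__(self, dim=20):
        super().__init__()
        half = dim // 2
        freqs = torch.exp(-torch.arange(half) * torch.log(torch.tensor(10000.0)) / (half - 1))
        self.register_buffer("freqs", freqs)

    def forward(self, t):                       # t: (B,) in [0, 1]
        pos = 1000.0 * t[:, None] * self.freqs  # (B, 10)
        return torch.cat([torch.sin(pos), torch.cos(pos)], dim=-1)  # (B, 20)

class ToyGenerator(nn.Module):
    """(21-256)-ReLU-(256-256)-ReLU-(256-1) MLP; output squashed to (0, 1)."""
    def __init__(self):
        super().__init__()
        self.embed = SinusoidalEmbedding(20)
        self.net = nn.Sequential(nn.Linear(21, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z_t, t):                  # z_t: (B, 1), t: (B,)
        h = torch.cat([z_t, self.embed(t)], dim=-1)
        return torch.sigmoid(self.net(h))       # hat{x}_0 in (0, 1)
```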
4.2 Experiments on Image Generation

We employ the CIFAR-10 dataset and build upon VP-EDM [34] as the foundation of our codebase. Our initial foray into applying beta diffusion to generative modeling of natural images closely mirrors the settings of Gaussian diffusion, including the choice of the generator's network architecture. We introduce a sigmoid diffusion schedule defined as $\alpha_t = 1/(1 + e^{-c_0 + (c_0+c_1)t})$, which has been observed to offer greater flexibility than the beta linear schedule for image generation. This schedule bears resemblance to the sigmoid-based one introduced for Gaussian diffusion [35, 29]. We configure the parameters for beta diffusion as follows: $c_0 = 10$, $c_1 = 13$, Shift = 0.6, Scale = 0.39, $\eta = 10000$, $\omega = 0.99$, and $\pi = 0.95$. We utilize the Adam optimizer with a learning rate of 2e-4. We use EDM's data augmentation approach, but restrict augmented images to a 0-1 range before scaling and shifting. We use the beta diffusion model trained on 200M images to calculate the FID [21]. Explanations regarding the intuition behind these parameter selections can be located in Appendix G. Below we provide a numerical comparison of beta diffusion with not only Gaussian diffusion models but also alternative non-Gaussian or Gaussian-like ones.

We also conduct an ablation study to compare KLUB and negative ELBO across different NFE, $\eta$, and $B$. A broad spectrum of diffusion models is encompassed in Table 1, which shows that beta diffusion outperforms all non-Gaussian diffusion models on CIFAR-10, including Cold Diffusion [5] and Inverse Heat Dispersion [52], as well as categorical and count-based diffusion models [2, 8, 10]. In comparison to Gaussian and Gaussian+blurring diffusion, beta diffusion surpasses VDM [35], Soft Diffusion [12], and Blurring Diffusion [25]. While it may fall slightly short of DDPM [23], improved DDPM [46], TDPM+ [78], and VP-EDM [34], it remains a competitive alternative that uses non-Gaussian diffusion.

Table 1: Comparison of the FID scores of various diffusion models trained on CIFAR-10.

Diffusion Space | Model | FID (↓)
Gaussian | DDPM [23] | 3.17
Gaussian | VDM [35] | 4.00
Gaussian | Improved DDPM [46] | 2.90
Gaussian | TDPM+ [78] | 2.83
Gaussian | VP-EDM [34] | 1.97
Gaussian+Blurring | Soft Diffusion [12] | 3.86
Gaussian+Blurring | Blurring Diffusion [25] | 3.17
Deterministic | Cold Diffusion (image reconstruction) [5] | 80.08 (deblurring), 8.92 (inpainting)
Deterministic | Inverse Heat Dispersion [52] | 18.96
Categorical | D3PM Gauss+Logistic [2] | 7.34
Categorical | τLDR-10 [8] | 3.74
Count | JUMP (Poisson Diffusion) [10] | 4.80
Range-bounded | Beta Diffusion | 3.06

Table 2: Comparing FID scores for KLUB- and negative ELBO-optimized beta diffusion on CIFAR-10 with varying NFE, under $\eta = 10000$ and two different mini-batch sizes $B$.

NFE | ELBO ($B$ = 512) | ELBO ($B$ = 288) | KLUB ($B$ = 512) | KLUB ($B$ = 288)
20 | 16.04 | 16.10 | 17.06 | 16.09
50 | 6.82 | 6.82 | 6.48 | 5.96
200 | 4.55 | 4.84 | 3.69 | 3.31
500 | 4.39 | 4.65 | 3.45 | 3.10
1000 | 4.41 | 4.61 | 3.38 | 3.08
2000 | 4.50 | 4.66 | 3.37 | 3.06

Figure 4: Uncurated, randomly generated images by beta diffusion optimized with (a) ELBO or (b) KLUB, with $\eta = 10000$ and $B = 288$. The generation with NFE = 1000 starts from the same random seed.

Table 2 presents a comparison between KLUB- and negative ELBO-optimized beta diffusion across different NFE under two different mini-batch sizes $B$. Table 3 in the Appendix includes the results under several different combinations of $\eta$ and $B$. We also include Figure 4 to visually compare generated images under KLUB and negative ELBO. The findings presented in Tables 2 and 3 and Figure 4 provide further validation of KLUB's efficacy in optimizing beta diffusion.

As each training run takes a long time and FID evaluation is also time-consuming, we have not yet optimized the combination of these hyperparameters given the limits of our current computation resources. Thus the results reported in this paper, while demonstrating that beta diffusion can provide competitive image generation performance, do not yet reflect the full potential of beta diffusion. These results are likely to be further improved given an optimized hyperparameter setting or a network architecture that is tailored to beta diffusion. We leave these further investigations to our future work.

5 Conclusion

We introduce beta diffusion, characterized by the following properties: 1) Analytic Marginal: Given a probability-valued data observation $x_0 \in (0, 1)$, the distribution at any time point $t \in [0, 1]$ of the forward beta diffusion chain, expressed as $q(z_t\,|\,x_0)$, is a beta distribution. 2) Analytic Conditional: Conditioning on a data observation $x_0$ and a forward-sampled latent variable $z_t \sim q(z_t\,|\,x_0)$, the forward beta diffusion chain can be reversed from time $t$ to the latent variable at any previous time $s \in [0, t)$ by sampling from an analytic conditional posterior $z_s \sim q(z_s\,|\,z_t, x_0)$ that follows a scaled and shifted beta distribution.
3) KLUBs: We introduce the combination of two different Kullback–Leibler divergence upper bounds (KLUBs) for optimization and represent them under the log-beta Bregman divergence, showing that their optimal solutions for the generator are both achieved at $f_{\theta^*}(z_t, t) = \mathbb{E}[x_0\,|\,z_t]$. We also establish the connection between argument-swapped KLUBs and (weighted) negative ELBOs for diffusion models. Our experimental results confirm the distinctive qualities of beta diffusion when applied to generative modeling of range-bounded data spanning disjoint regions or residing in high-dimensional spaces, as well as the effectiveness of KLUBs for optimizing beta diffusion.

Acknowledgements

The authors acknowledge the support of NSF-IIS 2212418, NIH-R37 CA271186, the Fall 2022 McCombs REG award, the Texas Advanced Computing Center (TACC), and the NSF AI Institute for Foundations of Machine Learning (IFML).

References

[1] Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re-imagine the negative prompt algorithm: Transform 2D diffusion into 3D, alleviate Janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023. URL https://perp-neg.github.io/. (The first three authors contributed equally.)
[2] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981-17993, 2021.
[3] Pavel Avdeyev, Chenlai Shi, Yuhao Tan, Kseniia Dudnyk, and Jian Zhou. Dirichlet diffusion score model for biological sequence generation. arXiv preprint arXiv:2305.10699, 2023.
[4] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, Joydeep Ghosh, and John Lafferty. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10), 2005.
[5] Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Cold diffusion: Inverting arbitrary image transforms without noise. arXiv preprint arXiv:2208.09392, 2022.
[6] O. Barndorff-Nielsen. Information and Exponential Families: In Statistical Theory. Wiley Series in Probability and Statistics. John Wiley & Sons, Incorporated, New York, 2nd edition, 2014. ISBN 9781118857373.
[7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[8] Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=DmT862YAieY.
[9] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=NsMLjcFaO8O.
[10] Tianqi Chen and Mingyuan Zhou. Learning to Jump: Thinning and thickening latent counts for generative modeling. In ICML 2023: International Conference on Machine Learning, July 2023. URL http://arxiv.org/abs/2305.18375.
[11] Thomas M Cover and Joy A Thomas. Elements of Information Theory. 1991.
[12] Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alex Dimakis, and Peyman Milanfar. Soft diffusion: Score matching with general corruptions.
Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=W98reb Bxl Q. [13] Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35:2406 2422, 2022. [14] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum? id=AAWu Cvza Vt. [15] Sever Silvestru Dragomir, Ravi P. Agarwal, and N. S. Barnett. Inequalities for beta and gamma functions via some classical and new integral inequalities. Journal of Inequalities and Applications, 2000:504054, 2000. [16] Ying Fan and Kangwook Lee. Optimizing DDPM sampling with shortcut fine-tuning. ar Xiv preprint ar Xiv:2301.13362, 2023. [17] Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, and Michael Hutchinson. Diffusion models for constrained domains. ar Xiv preprint ar Xiv:2304.05364, 2023. [18] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696 10706, 2022. [19] Xizewen Han, Huangjie Zheng, and Mingyuan Zhou. CARD: Classification and regression diffusion models. In Advances in Neural Information Processing Systems, 2022. [20] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via Min-SNR weighting strategy. ar Xiv preprint ar Xiv:2303.09556, 2023. [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626 6637, 2017. [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. [24] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47 1, 2022. [25] Emiel Hoogeboom and Tim Salimans. Blurring diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=Oj Dk C57x5sz. [26] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454 12465, 2021. [27] Minghui Hu, Yujie Wang, Tat-Jen Cham, Jianfei Yang, and Ponnuthurai N Suganthan. Global context with discrete diffusion in vector quantised modelling for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11502 11511, 2022. [28] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695 709, 2005. URL http://jmlr.org/papers/v6/ hyvarinen05a.html. [29] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. 
ar Xiv preprint ar Xiv:2212.11972, 2022. [30] M. V. Jambunathan. Some properties of beta and gamma distributions. The Annals of mathematical statistics, 25(2):401 405, 1954. ISSN 0003-4851. [31] Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. ar Xiv preprint ar Xiv:2205.09991, 2022. [32] Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, and Tommi S. Jaakkola. Torsional diffusion for molecular conformer generation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=w6fj2r62r_H. [33] N. L. Johnson, A. W. Kemp, and S. Kotz. Univariate Discrete Distributions. John Wiley & Sons, 2005. [34] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7Fu TOWMOc7. [35] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696 21707, 2021. [36] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. [37] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diff Wave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-x FK8Ymz5J. [38] I. Kotlarski. On groups of n independent random variables whose product follows the beta distribution. Colloquium Mathematicum, 9(2):325 332, 0 1962. [39] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Pl KWVd2y Bk Y. [40] Xingchao Liu, Lemeng Wu, Mao Ye, and Qiang Liu. Learning diffusion bridges on constrained domains. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WH1y Ca0Tb B. [41] Aaron Lou and Stefano Ermon. Reflected diffusion models. ar Xiv preprint ar Xiv:2304.04740, 2023. [42] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=2u Aa Gwl P_V. [43] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. ar Xiv preprint ar Xiv:2101.02388, 2021. [44] Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigenspecific antibody design and optimization with diffusion-based generative models for protein structures. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/ forum?id=j Sor Gn2Tjg. [45] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. [46] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162 8171. PMLR, 2021. 
[47] Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. Diffusevae: Efficient, controllable and high-fidelity generation from low-dimensional latents. ar Xiv preprint ar Xiv:2201.00308, 2022. [48] Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. ar Xiv preprint ar Xiv:2301.10677, 2023. [49] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dream Fusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Fj Nys5c7Vy Y. [50] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. [51] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278 1286. PMLR, 2014. [52] Severi Rissanen, Markus Heinonen, and Arno Solin. Generative modelling with inverse heat dissipation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4PJUBT9f2Ol. [53] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684 10695, 2022. [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35: 36479 36494, 2022. [55] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=TId IXIpzho I. [56] Chence Shi, Shitong Luo, Minkai Xu, and Jian Tang. Learning gradient fields for molecular conformation generation. In International Conference on Machine Learning, pages 9558 9568. PMLR, 2021. [57] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256 2265. PMLR, 2015. [58] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020. [59] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11918 11930, 2019. [60] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438 12448, 2020. [61] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=Px TIG12RRHS. [62] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. 
In Advances in neural information processing systems, 2021. [63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998 6008, 2017. [64] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011. [65] Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008. [66] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=AHv FDPi-FA. [67] Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, and Mingyuan Zhou. In-context learning unlocked for diffusion models. ar Xiv preprint ar Xiv:2305.01115, 2023. [68] Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. ar Xiv preprint ar Xiv:2304.12526, 2023. URL https://arxiv.org/abs/ 2304.12526. [69] Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Diffusion-GAN: Training GANs with diffusion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= HZf7Ubp WHu A. [70] Zhixin Wang, Xiaoyun Zhang, Ziying Zhang, Huangjie Zheng, Mingyuan Zhou, Ya Zhang, and Yanfeng Wang. DR2: Diffusion-based robust degradation remover for blind face restoration. In CVPR 2023: Conference on Computer Vision and Pattern Recognition, 2023. URL https: //openreview.net/forum?id=Hmc DUPEGA0. [71] Zike Wu, Pan Zhou, Kenji Kawaguchi, and Hanwang Zhang. Fast diffusion model. ar Xiv preprint ar Xiv:2306.06991, 2023. [72] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Jpr M0p-q0Co. [73] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720 1733, 2023. doi: 10.1109/TASLP.2023. 3268730. [74] Mingzhang Yin and Mingyuan Zhou. Semi-implicit variational inference. In International Conference on Machine Learning, pages 5660 5669, 2018. [75] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Loek7hfb46P. [76] Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. In International Conference on Machine Learning, pages 42390 42402. PMLR, 2023. [77] Huangjie Zheng and Mingyuan Zhou. Exploiting chain rule and Bayes theorem to compare probability distributions. Advances in Neural Information Processing Systems, 34:14993 15006, 2021. [78] Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. In The Eleventh International Conference on Learning Representations, 2023. 
Beta Diffusion: Appendix

Algorithm 1 Training of Beta Diffusion
Require: Dataset D whose values are bounded between 0 and 1, mini-batch size B, concentration parameter η = 10000, data shifting parameter Shift = 0.6, data scaling parameter Scale = 0.39, generator fθ, loss balance coefficient ω = 0.99, time reversal coefficient π = 0.95, and a decreasing function that returns the scheduling parameter α_t ∈ (0, 1) given t ∈ [0, 1], such as a beta linear schedule defined by α_t = e^{−0.5(β_max − β_min)t^2 − β_min t}, where β_max = 20 and β_min = 0.1, or a sigmoid schedule defined by α_t = 1/(1 + e^{−c_0 + (c_0 + c_1)t}), where c_0 = 10 and c_1 = 13.
1: repeat
2: Draw a mini-batch X_0 = {x_0^(i)}_{i=1}^B from D
3: for i = 1 to B do   ▷ can be run in parallel
4: t_i ∼ Unif(1e-5, 1)
5: s_i = π t_i
6: Compute α_{s_i} and α_{t_i}
7: x_0^(i) = x_0^(i) · Scale + Shift
8: z_{t_i} ∼ Beta(η α_{t_i} x_0^(i), η(1 − α_{t_i} x_0^(i)))
9: x̂_0^(i) = fθ(z_{t_i}, t_i) · Scale + Shift
10: Use (18) to compute the loss as
L_i = ω KLUB(s_i, z_{t_i}, x_0^(i)) + (1 − ω) KLUB(z_{t_i}, x_0^(i))
= ω D_{ln B(a,b)}( [η(α_{s_i} − α_{t_i}) x_0^(i), η(1 − α_{s_i} x_0^(i))], [η(α_{s_i} − α_{t_i}) x̂_0^(i), η(1 − α_{s_i} x̂_0^(i))] )
+ (1 − ω) D_{ln B(a,b)}( [η α_{t_i} x_0^(i), η(1 − α_{t_i} x_0^(i))], [η α_{t_i} x̂_0^(i), η(1 − α_{t_i} x̂_0^(i))] ),   (24)
or swap the two arguments of both log-beta Bregman divergences in (24) if the loss is the (weighted) negative ELBO.
11: end for
12: Perform SGD with the gradient (1/B) ∇θ Σ_{i=1}^B L_i
13: until converged

Table 3: Comparing FID scores for KLUB- and negative-ELBO-optimized beta diffusion on the CIFAR-10 image dataset with varying NFE, under several different combinations of the concentration parameter η and mini-batch size B. We train each model on 200M images and use it to calculate FID, computing FID once in each experiment. The other model parameters are set as Scale = 0.39, Shift = 0.60, π = 0.95, ω = 0.99, lr = 2e-4, c_0 = 10, and c_1 = 13.

Loss | η × 10^{-4} | B   | NFE=10 | 20    | 50   | 200  | 500  | 1000 | 2000
ELBO | 1           | 512 | 37.64  | 16.04 | 6.82 | 4.55 | 4.39 | 4.41 | 4.50
ELBO | 1           | 288 | 37.54  | 16.10 | 6.82 | 4.84 | 4.65 | 4.61 | 4.66
KLUB | 1           | 512 | 39.05  | 17.06 | 6.48 | 3.69 | 3.45 | 3.38 | 3.37
KLUB | 0.1         | 512 | 41.28  | 20.38 | 9.72 | 5.85 | 4.98 | 4.90 | 4.88
KLUB | 0.3         | 512 | 36.70  | 16.47 | 7.03 | 4.10 | 3.65 | 3.67 | 3.66
KLUB | 1           | 288 | 37.67  | 16.09 | 5.96 | 3.31 | 3.10 | 3.08 | 3.06
KLUB | 0.8         | 288 | 36.46  | 15.58 | 5.98 | 3.49 | 3.22 | 3.21 | 3.23
KLUB | 1.2         | 288 | 38.49  | 16.59 | 6.31 | 3.68 | 3.36 | 3.37 | 3.24
KLUB | 1           | 128 | 38.16  | 16.47 | 6.29 | 3.74 | 3.44 | 3.40 | 3.47
KLUB | 2           | 128 | 39.80  | 17.09 | 6.58 | 3.93 | 3.76 | 3.75 | 3.65
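As a concrete companion to step 10 of Algorithm 1, the following PyTorch sketch (ours, not the released implementation) evaluates the loss in (24) via the log-beta Bregman divergence; tensor shapes are left implicit, and x0 and x0_hat are assumed to already be scaled and shifted as in steps 7 and 9.

```python
# A minimal PyTorch sketch of the KLUB loss in (24); tensor shapes and the
# generator interface are placeholders rather than the released implementation.
import torch

def log_beta(a, b):
    # ln B(a, b) = ln Γ(a) + ln Γ(b) − ln Γ(a + b)
    return torch.lgamma(a) + torch.lgamma(b) - torch.lgamma(a + b)

def log_beta_bregman(aq, bq, ap, bp):
    # D_lnB((aq, bq), (ap, bp)) = KL(Beta(ap, bp) || Beta(aq, bq))
    grad_a = torch.digamma(ap) - torch.digamma(ap + bp)
    grad_b = torch.digamma(bp) - torch.digamma(ap + bp)
    return (log_beta(aq, bq) - log_beta(ap, bp)
            - (aq - ap) * grad_a - (bq - bp) * grad_b)

def klub_loss(x0, x0_hat, alpha_s, alpha_t, eta=10000.0, omega=0.99):
    # KLUB (Conditional): compares q(z_s | z_t, x̂0) against q(z_s | z_t, x0).
    cond = log_beta_bregman(eta * (alpha_s - alpha_t) * x0,
                            eta * (1 - alpha_s * x0),
                            eta * (alpha_s - alpha_t) * x0_hat,
                            eta * (1 - alpha_s * x0_hat))
    # KLUB (Marginal): compares q(z_t | x̂0) against q(z_t | x0).
    marg = log_beta_bregman(eta * alpha_t * x0,
                            eta * (1 - alpha_t * x0),
                            eta * alpha_t * x0_hat,
                            eta * (1 - alpha_t * x0_hat))
    return (omega * cond + (1 - omega) * marg).mean()
```

Swapping the two argument pairs passed to log_beta_bregman would recover the (weighted) negative-ELBO variant mentioned after (24).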
Algorithm 2 Sampling of Beta Diffusion
Require: Number of function evaluations (NFE) J = 200, generator fθ, and timesteps {t_j}_{j=0}^J with t_j = 1 − (1 − 1e-5)(J − j)/(J − 1) for j = 1, . . . , J and t_0 = 0
1: if NFE > 350 then
2: α_{t_j} = 1/(1 + e^{−c_0 + (c_0 + c_1)t_j})
3: else
4: α_{t_j} = (1/(1 + e^{c_1}))^{t_j}
5: end if
6: Initialize x̂_0 = E[x_0] · Scale + Shift
7: z_{t_J} ∼ Beta(η α_{t_J} x̂_0, η(1 − α_{t_J} x̂_0))
8: for j = J to 1 do
9: x̂_0 = fθ(z_{t_j}, α_{t_j}) · Scale + Shift
10: p_{t_{j−1}←t_j} ∼ Beta(η(α_{t_{j−1}} − α_{t_j}) x̂_0, η(1 − α_{t_{j−1}} x̂_0)), which is implemented in the logit space as logit(p_{t_{j−1}←t_j}) = ln u − ln v, u ∼ Gamma(η(α_{t_{j−1}} − α_{t_j}) x̂_0, 1), v ∼ Gamma(η(1 − α_{t_{j−1}} x̂_0), 1)
11: z_{t_{j−1}} = z_{t_j} + (1 − z_{t_j}) p_{t_{j−1}←t_j}, which is implemented in the logit space as logit(z_{t_{j−1}}) = ln( e^{logit(z_{t_j})} + e^{logit(p_{t_{j−1}←t_j})} + e^{logit(z_{t_j}) + logit(p_{t_{j−1}←t_j})} )
12: end for
13: return (x̂_0 − Shift)/Scale or (z_{t_0}/α_{t_0} − Shift)/Scale

A Log-beta Divergence and KL Divergence between Beta Distributions

By the definition of Bregman divergence, the log-beta divergence corresponding to the log-beta function ln B(a, b) = ln Γ(a) + ln Γ(b) − ln Γ(a + b), which is differentiable and strictly convex on (0, ∞)^2 as a function of (a, b), can be expressed as
D_{ln B(a,b)}((α_q, β_q), (α_p, β_p))
= ln [B(α_q, β_q)/B(α_p, β_p)] − (α_q − α_p, β_q − β_p) · (∂_{α_p} ln B(α_p, β_p), ∂_{β_p} ln B(α_p, β_p))
= ln [B(α_q, β_q)/B(α_p, β_p)] − (α_q − α_p, β_q − β_p) · (ψ(α_p) − ψ(α_p + β_p), ψ(β_p) − ψ(α_p + β_p))
= ln [B(α_q, β_q)/B(α_p, β_p)] − (α_q − α_p)ψ(α_p) − (β_q − β_p)ψ(β_p) + (α_q − α_p + β_q − β_p)ψ(α_p + β_p),
which is equivalent to the analytic expression of KL(Beta(α_p, β_p) || Beta(α_q, β_q)).
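As a quick numerical confirmation of this equivalence (a sketch of ours, not part of the released code), the log-beta Bregman divergence can be checked against PyTorch's analytic beta KL divergence:

```python
# Numerically verify D_lnB((αq, βq), (αp, βp)) = KL(Beta(αp, βp) || Beta(αq, βq)).
import torch
from torch.distributions import Beta, kl_divergence

def log_beta(a, b):
    return torch.lgamma(a) + torch.lgamma(b) - torch.lgamma(a + b)

def log_beta_bregman(aq, bq, ap, bp):
    grad_a = torch.digamma(ap) - torch.digamma(ap + bp)
    grad_b = torch.digamma(bp) - torch.digamma(ap + bp)
    return (log_beta(aq, bq) - log_beta(ap, bp)
            - (aq - ap) * grad_a - (bq - bp) * grad_b)

aq, bq = torch.tensor(2.5), torch.tensor(7.0)   # arbitrary test parameters
ap, bp = torch.tensor(4.0), torch.tensor(3.0)
lhs = log_beta_bregman(aq, bq, ap, bp)
rhs = kl_divergence(Beta(ap, bp), Beta(aq, bq))
print(torch.allclose(lhs, rhs))  # expected: True
```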
Proof of Lemma 1. The joint distribution of z_t and z_s in (5) can be expressed as
q(z_t, z_s | x_0) = q(z_t | z_s, x_0) q(z_s | x_0)
= (1/z_s) Beta(z_t/z_s; η α_t x_0, η(α_s − α_t) x_0) · Beta(z_s; η α_s x_0, η(1 − α_s x_0))
= [Γ(η α_s x_0) / (Γ(η α_t x_0) Γ(η(α_s − α_t) x_0))] (1/z_s) (z_t/z_s)^{η α_t x_0 − 1} (1 − z_t/z_s)^{η(α_s − α_t) x_0 − 1} · [Γ(η) / (Γ(η α_s x_0) Γ(η(1 − α_s x_0)))] z_s^{η α_s x_0 − 1} (1 − z_s)^{η(1 − α_s x_0) − 1}
= [Γ(η) / (Γ(η α_t x_0) Γ(η(α_s − α_t) x_0) Γ(η(1 − α_s x_0)))] z_t^{η α_t x_0 − 1} (z_s − z_t)^{η(α_s − α_t) x_0 − 1} (1 − z_s)^{η(1 − α_s x_0) − 1}.   (25)
The joint distribution of z_t and z_s in (6) can be expressed as
q(z_t, z_s | x_0) = q(z_s | z_t, x_0) q(z_t | x_0)
= (1/(1 − z_t)) Beta((z_s − z_t)/(1 − z_t); η(α_s − α_t) x_0, η(1 − α_s x_0)) · Beta(z_t; η α_t x_0, η(1 − α_t x_0))
= [Γ(η(1 − α_t x_0)) / (Γ(η(α_s − α_t) x_0) Γ(η(1 − α_s x_0)))] (1/(1 − z_t)) ((z_s − z_t)/(1 − z_t))^{η(α_s − α_t) x_0 − 1} (1 − (z_s − z_t)/(1 − z_t))^{η(1 − α_s x_0) − 1} · [Γ(η) / (Γ(η α_t x_0) Γ(η(1 − α_t x_0)))] z_t^{η α_t x_0 − 1} (1 − z_t)^{η(1 − α_t x_0) − 1}
= [Γ(η) / (Γ(η α_t x_0) Γ(η(α_s − α_t) x_0) Γ(η(1 − α_s x_0)))] (z_s − z_t)^{η(α_s − α_t) x_0 − 1} (1 − z_s)^{η(1 − α_s x_0) − 1} z_t^{η α_t x_0 − 1}.   (26)
The joint distribution shown in (25) is the same as that in (26), which completes the proof.

Proof of Lemma 2. Since we can re-express q(z_t) and q(z_s, z_t) as q(z_t) = E_{x_0 ∼ p_data(x_0)}[q(z_t | x_0)], q(z_s, z_t) = E_{x_0 ∼ p_data(x_0)}[q(z_s, z_t | x_0)], and q(z_s, z_t | x_0) = q(z_s | z_t, x_0) q(z_t | x_0), using the convexity of the KL divergence, we have
KL(p_θ(z_s | z_t) q(z_t) || q(z_s, z_t)) = KL(E_{x_0 ∼ p_data(x_0)}[p_θ(z_s | z_t) q(z_t | x_0)] || E_{x_0 ∼ p_data(x_0)}[q(z_s, z_t | x_0)])
≤ E_{x_0 ∼ p_data(x_0)}[KL(p_θ(z_s | z_t) q(z_t | x_0) || q(z_s, z_t | x_0))]
= E_{x_0 ∼ p_data(x_0)} E_{(z_s, z_t) ∼ p_θ(z_s | z_t) q(z_t | x_0)}[ ln ( p_θ(z_s | z_t) q(z_t | x_0) / (q(z_s | z_t, x_0) q(z_t | x_0)) ) ]
= E_{x_0 ∼ p_data(x_0)} E_{z_t ∼ q(z_t | x_0)} E_{z_s ∼ p_θ(z_s | z_t)}[ ln ( p_θ(z_s | z_t) / q(z_s | z_t, x_0) ) ]
= E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[ KL(p_θ(z_s | z_t) || q(z_s | z_t, x_0)) ]
= E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[ KL(q(z_s | z_t, x̂_0 = f_θ(z_t, t)) || q(z_s | z_t, x_0)) ].

Proof of Lemma 4. Since we can re-express p_θ(z'_t) and q(z'_t) as
p_θ(z'_t) := E_{z_t ∼ q(z_t)}[q(z'_t | x̂_0 = f_θ(z_t, t))] = E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[q(z'_t | x̂_0 = f_θ(z_t, t))],
q(z'_t) = E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[q(z'_t | x_0)],
using the convexity of the KL divergence, we have
KL(p_θ(z'_t) || q(z'_t)) = KL(E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[q(z'_t | x̂_0 = f_θ(z_t, t))] || E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[q(z'_t | x_0)])
≤ E_{(z_t, x_0) ∼ q(z_t | x_0) p_data(x_0)}[ KL(q(z'_t | f_θ(z_t, t)) || q(z'_t | x_0)) ].

Proof of Lemma 6. The proof utilizes the convexity of the KL divergence to show that
KL(p_θ(z_{t_{1:T}}) || q(z_{t_{1:T}})) = KL(E_{x_0 ∼ p_data(x_0)}[p_θ(z_{t_{1:T}} | x_0)] || E_{x_0 ∼ p_data(x_0)}[q(z_{t_{1:T}} | x_0)])
≤ E_{x_0 ∼ p_data(x_0)} KL(p_θ(z_{t_{1:T}} | x_0) || q(z_{t_{1:T}} | x_0))
= E_{x_0 ∼ p_data(x_0)} E_{z_{t_{1:T}} ∼ p_θ(z_{t_{1:T}} | x_0)}[ ln ( p_θ(z_{t_{1:T}} | x_0) / q(z_{t_{1:T}} | x_0) ) ]
= E_{x_0 ∼ p_data(x_0)}[ KL(p_prior(z_{t_T} | x_0) || q(z_{t_T} | x_0)) + Σ_{j=2}^T E_{z_{t_j} ∼ p_θ(z_{t_j} | x_0)} E_{z_{t_{j−1}} ∼ p_θ(z_{t_{j−1}} | z_{t_j})}[ ln ( p_θ(z_{t_{j−1}} | z_{t_j}) / q(z_{t_{j−1}} | z_{t_j}, x_0) ) ] ].
We note p_θ(z_{t_j} | x_0) ∝ p(x_0 | z_{t_j}) p_θ(z_{t_j}), which in general is intractable to sample from, motivating us to replace p_θ(z_{t_j} | x_0) with q(z_{t_j} | x_0) for tractable computation.

Proof of Lemma 7. Utilizing the convexity of the KL divergence, we present an upper bound of the augmented-swapped KL divergence KL(q(z_{t_{1:T}}) || p_θ(z_{t_{1:T}})), referred to as AS-KLUB, as
KL(q(z_{t_{1:T}}) || p_θ(z_{t_{1:T}})) = KL(E_{x_0 ∼ p_data(x_0)}[q(z_{t_{1:T}} | x_0)] || E_{x_0 ∼ p_data(x_0)}[p_θ(z_{t_{1:T}} | x_0)])
≤ AS-KLUB := E_{x_0 ∼ p_data(x_0)} KL(q(z_{t_{1:T}} | x_0) || p_θ(z_{t_{1:T}} | x_0))
= E_{x_0 ∼ p_data(x_0)} E_{z_{t_{1:T}} ∼ q(z_{t_{1:T}} | x_0)}[ ln ( q(z_{t_{1:T}} | x_0) / p_θ(z_{t_{1:T}} | x_0) ) ]
= E_{x_0 ∼ p_data(x_0)}[ KL(q(z_{t_T} | x_0) || p_prior(z_{t_T} | x_0)) + Σ_{j=2}^T E_{z_{t_j} ∼ q(z_{t_j} | x_0)} E_{z_{t_{j−1}} ∼ q(z_{t_{j−1}} | z_{t_j}, x_0)}[ ln ( q(z_{t_{j−1}} | z_{t_j}, x_0) / p_θ(z_{t_{j−1}} | z_{t_j}) ) ] ].
Utilizing the convexity of the negative logarithmic function, we have
−E_{x_0 ∼ p_data(x_0)} ln p_θ(x_0) = −E_{x_0 ∼ p_data(x_0)} ln E_{p_θ(z_{t_{1:T}})}[p(x_0 | z_{t_{1:T}})]
= −E_{x_0 ∼ p_data(x_0)} ln E_{q(z_{t_{1:T}} | x_0)}[ p(x_0 | z_{t_{1:T}}) p_θ(z_{t_{1:T}}) / q(z_{t_{1:T}} | x_0) ]
≤ −ELBO = −E_{x_0 ∼ p_data(x_0)} E_{q(z_{t_{1:T}} | x_0)}[ ln ( p(x_0 | z_{t_{1:T}}) p_θ(z_{t_{1:T}}) / q(z_{t_{1:T}} | x_0) ) ]
= E_{x_0 ∼ p_data(x_0)}[ −E_{q(z_{t_1} | x_0)} ln p(x_0 | z_{t_1}) + KL(q(z_{t_T} | x_0) || p_prior(z_{t_T})) + Σ_{j=2}^T E_{z_{t_j} ∼ q(z_{t_j} | x_0)} E_{z_{t_{j−1}} ∼ q(z_{t_{j−1}} | z_{t_j}, x_0)}[ ln ( q(z_{t_{j−1}} | z_{t_j}, x_0) / p_θ(z_{t_{j−1}} | z_{t_j}) ) ] ].
Thus, when we disregard its first term, the AS-KLUB is equivalent to the negative ELBO without its first two terms. Since these three terms typically do not affect optimization, optimizing the generator f_θ with the AS-KLUB is equivalent to optimizing it with the negative ELBO.

C Derivation of (Weighted) ELBOs of Gaussian Diffusion from KLUB

Let us denote α_0 = 1 and z_0 = x_0. Following the definitions in Ho et al. [23] and Song et al. [58], we define a Gaussian diffusion-based generative model through the forward chain
z_1 ∼ N(√α_1 x_0, 1 − α_1), z_2 ∼ N(√(α_2/α_1) z_1, 1 − α_2/α_1), . . . , z_t ∼ N(√(α_t/α_{t−1}) z_{t−1}, 1 − α_t/α_{t−1}), . . . , z_T ∼ N(√(α_T/α_{T−1}) z_{T−1}, 1 − α_T/α_{T−1}),
where the diffusion scheduling parameters α_t in this paper are related to β_t in Ho et al. [23] via β_t := 1 − α_t/α_{t−1}. The same as the derivations in Ho et al. [23], we can express the forward marginal distribution at time t as z_t ∼ N(√α_t x_0, 1 − α_t). Assuming E[x_0] = 0 and var[x_0] = 1, the signal-to-noise ratio at time t is defined as
SNR_t = (E[z_t | x_0] / std[z_t | x_0])^2 = (α_t/(1 − α_t)) E[x_0^2] = α_t/(1 − α_t).
Since z_t ∼ N(√(α_t/α_{t−1}) z_{t−1}, 1 − α_t/α_{t−1}) and z_{t−1} ∼ N(√α_{t−1} x_0, 1 − α_{t−1}), using the conjugacy of the Gaussian distribution with respect to its mean, we can express the conditional posterior as
q(z_{t−1} | x_0, z_t) = N( [√α_{t−1}(1 − α_t/α_{t−1}) x_0 + √(α_t/α_{t−1})(1 − α_{t−1}) z_t] / (1 − α_t), (1 − α_{t−1})(1 − α_t/α_{t−1}) / (1 − α_t) ).
The forward Gaussian diffusion chain q(z_{1:T} | x_0) = ∏_{t=1}^T q(z_t | z_{t−1}) can be equivalently sampled in reverse order as q(z_{1:T} | x_0) = q(z_T | x_0) ∏_{t=2}^T q(z_{t−1} | z_t, x_0). As this reverse chain is non-causal and non-Markovian, we need to approximate it with a Markov chain during inference, which is expressed as p_θ(z_{1:T}) = p_prior(z_T) ∏_{t=2}^T p_θ(z_{t−1} | z_t). The usual strategy [58, 35, 72] is to utilize the conditional posterior to define p_θ(z_{t−1} | z_t) = q(z_{t−1} | z_t, x̂_0 = f_θ(z_t, t)). In what follows, we redefine the time t as a continuous variable between 0 and 1. We let t_{1:T} be a set of T discrete time points within that interval, with 0 ≤ t_1 < t_2 < · · · < t_T ≤ 1. The corrupted data observations over these T time points are denoted as z_{t_{1:T}}.
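As a quick sanity check on the forward marginal stated above (a Monte Carlo sketch of ours, not part of the paper's code), one can simulate the two-step composition and compare its empirical mean and variance against √α_t x_0 and 1 − α_t:

```python
# Check that composing z_{t-1} ~ N(sqrt(alpha_{t-1}) x0, 1 - alpha_{t-1}) with
# z_t | z_{t-1} ~ N(sqrt(alpha_t / alpha_{t-1}) z_{t-1}, 1 - alpha_t / alpha_{t-1})
# reproduces the marginal z_t ~ N(sqrt(alpha_t) x0, 1 - alpha_t).
import torch

torch.manual_seed(0)
x0, alpha_prev, alpha_t, n = 0.7, 0.8, 0.5, 1_000_000   # illustrative values
z_prev = alpha_prev ** 0.5 * x0 + (1 - alpha_prev) ** 0.5 * torch.randn(n)
z_t = (alpha_t / alpha_prev) ** 0.5 * z_prev + (1 - alpha_t / alpha_prev) ** 0.5 * torch.randn(n)
print(z_t.mean().item(), alpha_t ** 0.5 * x0)   # both ≈ 0.495
print(z_t.var().item(), 1 - alpha_t)            # both ≈ 0.5
```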
C.1 Optimization of Gaussian Diffusion via Negative ELBO

Viewing p_θ(x_0 | z_{1:T}) as the decoder, p_θ(z_{1:T}) as the prior, and q(z_{1:T} | x_0) as the inference network, we can optimize the generator parameters via the negative ELBO as
−E_{x_0} ln p_θ(x_0) ≤ −ELBO = −E_{x_0} E_{q(z_{t_{1:T}} | x_0)}[ ln ( p(x_0 | z_{t_{1:T}}) p_θ(z_{t_{1:T}}) / q(z_{t_{1:T}} | x_0) ) ]
= −E_{x_0} E_{q(z_{t_1} | x_0)} ln p(x_0 | z_{t_1}) + E_{x_0} KL(q(z_{t_T} | x_0) || p_prior(z_{t_T})) + Σ_{j=2}^T E_{x_0} E_{q(z_{t_j} | x_0)}[ L(t_{j−1}, z_{t_j}, x_0) ],
where the first two terms are often ignored and the focus is placed on the remaining terms, defined as
L(s, z_t, x_0) = KL( q(z_s | z_t, x_0) || q(z_s | z_t, x̂_0 = f_θ(z_t, t)) ) = (1/2)( α_s/(1 − α_s) − α_t/(1 − α_t) ) ‖x_0 − f_θ(z_t, t)‖_2^2,   (27)
where 0 < s < t < 1. We choose s = max(t − 1/T, 0) during training. Since z_t = √α_t x_0 + √(1 − α_t) ϵ_t, ϵ_t ∼ N(0, 1), it is true that x_0 = (z_t − √(1 − α_t) ϵ_t)/√α_t. Instead of directly predicting x_0 given z_t, we can equivalently predict ϵ̂_t = ϵ_θ(z_t, t) given z_t and let x̂_0 = f_θ(z_t, t) = (z_t − √(1 − α_t) ϵ_θ(z_t, t))/√α_t. Thus (27) can also be written as
L(s, z_t, x_0) = (1/2)( α_s/(1 − α_s) − α_t/(1 − α_t) ) ‖x_0 − f_θ(z_t, t)‖_2^2 = (1/2)(SNR_s − SNR_t) ‖x_0 − f_θ(z_t, t)‖_2^2
= (1/2)( α_s/(1 − α_s) − α_t/(1 − α_t) )((1 − α_t)/α_t) ‖ϵ_t − ϵ_θ(z_t, t)‖_2^2 = (1/2)(SNR_s/SNR_t − 1) ‖ϵ_t − ϵ_θ(z_t, t)‖_2^2,   (28)
which agrees with Equations 13 and 14 in Kingma et al. [35].

C.2 Optimization of Gaussian Diffusion via KLUB Conditional

Following the definition of KLUB (Conditional), for Gaussian diffusion we have
KLUB(s, z_t, x_0) = KL( q(z_s | x̂_0 = f_θ(z_t, t), z_t) || q(z_s | x_0, z_t) ).
For two Gaussian distributions q_1 and q_2 that have the same variance, we have KL(q_1 || q_2) = KL(q_2 || q_1), and hence
KLUB(s, z_t, x_0) = KL( q(z_s | x̂_0 = f_θ(z_t, t), z_t) || q(z_s | x_0, z_t) ) = KL( q(z_s | x_0, z_t) || q(z_s | x̂_0 = f_θ(z_t, t), z_t) ) = L(s, z_t, x_0).
Therefore, over the same set of time points t_{1:T}, optimizing via the KLUB Conditional is identical to optimizing via the negative ELBO. As analyzed before, optimizing via the KLUB Conditional (i.e., the negative ELBO) may not be able to directly counteract the error accumulated over the course of diffusion, and hence could lead to slow convergence.

C.3 Optimization of Gaussian Diffusion via KLUB Marginal and SNR-weighted Negative ELBO

Following the definition of KLUB (Marginal), for Gaussian diffusion we have
KLUB(z_t, x_0) = KL( N(√α_t x_0, 1 − α_t) || N(√α_t f_θ(z_t, t), 1 − α_t) ) = (α_t/(2(1 − α_t))) ‖x_0 − f_θ(z_t, t)‖_2^2 = (1/2) SNR_t ‖x_0 − f_θ(z_t, t)‖_2^2 = (1/2) ‖ϵ_t − ϵ_θ(z_t, t)‖_2^2.
It is somewhat surprising to find that the KLUB Marginal is identical to the SNR-weighted negative ELBO first introduced in Ho et al. [23] and further discussed in Hang et al. [20].

C.4 Optimization of Gaussian Diffusion via KLUB

Following beta diffusion, we can also combine the two KLUBs as
ω KLUB(s, z_t, x_0) + (1 − ω) KLUB(z_t, x_0) = (1/2)[ ω( α_s(1 − α_t)/((1 − α_s)α_t) − 1 ) + (1 − ω) ] ‖ϵ_t − ϵ_θ(z_t, t)‖_2^2.
Since α_s(1 − α_t)/((1 − α_s)α_t) = SNR_s/SNR_t is in general close to 1 when s is not too far from t, when ω is not too close to 1, a combination of these two KLUBs does not result in an algorithm that clearly differs from the use of an SNR-weighted negative ELBO.
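To make the weighting comparison above concrete, the following sketch (ours, not part of the paper's code) evaluates the per-step coefficients that multiply ‖ϵ_t − ϵ_θ‖² under the negative ELBO of (28), the KLUB Marginal, and their ω-weighted combination, using the beta linear schedule stated in Algorithm 1; the choices of t, T, and ω below are illustrative.

```python
# Per-step weights on ||eps_t - eps_theta||^2 for Gaussian diffusion losses.
import math

def alpha_beta_linear(t, beta_min=0.1, beta_max=20.0):
    # alpha_t = exp(-0.5 (beta_max - beta_min) t^2 - beta_min t)
    return math.exp(-0.5 * (beta_max - beta_min) * t ** 2 - beta_min * t)

def snr(t):
    a = alpha_beta_linear(t)
    return a / (1.0 - a)

def elbo_weight(s, t):
    # Coefficient in (28): (SNR_s / SNR_t - 1) / 2
    return 0.5 * (snr(s) / snr(t) - 1.0)

def klub_marginal_weight():
    # Coefficient of the KLUB Marginal: 1/2
    return 0.5

t = 0.6
s = max(t - 1.0 / 1000, 0.0)        # e.g., a T = 1000 discretization
omega = 0.99
combined = omega * elbo_weight(s, t) + (1 - omega) * klub_marginal_weight()
print(elbo_weight(s, t), klub_marginal_weight(), combined)
```

With s close to t, elbo_weight(s, t) is small, so the combined weight is dominated by the (1 − ω)/2 term, illustrating why the combination behaves like an SNR-weighted negative ELBO.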
D Illustration of Forward and Reverse Beta Diffusion

Illustration of Forward Beta Diffusion. We first visualize the forward beta diffusion process by displaying a true image x_0 and its noise-corrupted versions over the course of the forward diffusion process. Specifically, we display the noised and masked image at time t = 0, 0.05, 0.1, . . . , 1 as
z̄_t = max( min( (1/Scale)(z_t/α_t − Shift), 1 ), 0 ),   z_t ∼ Beta(η α_t x_0, η(1 − α_t x_0)).
It is clear from Figure 1 that with multiplicative beta-distributed noise, the image becomes both noisier and sparser as time increases, and eventually becomes almost completely dark. Thus the forward process of beta diffusion can be considered as simultaneously noising and masking the pixels. This clearly distinguishes beta diffusion from Gaussian diffusion, whose forward process gradually applies additive random noise and eventually ends at pure Gaussian noise. In addition, the reverse diffusion process of Gaussian diffusion can be considered a denoising process, whereas that of beta diffusion simultaneously demasks and denoises the data, as illustrated below.

Illustration of Reverse Beta Diffusion. We further illustrate the reverse process of beta diffusion by displaying
z̄_{t_{j−1}} = max( min( (1/Scale)(z_{t_{j−1}}/α_{t_{j−1}} − Shift), 1 ), 0 ),
where z_{t_{j−1}} = z_{t_j} + (1 − z_{t_j}) p_{t_{j−1}←t_j} is iteratively computed as in Algorithm 2. We also display x̂_0 = f_θ(z_{t_j}, t_j) ≈ E[x_0 | z_{t_j}], where the approximation becomes more and more accurate as θ approaches its theoretically optimal value θ*, under which we have f_{θ*}(z_{t_j}, t_j) = E[x_0 | z_{t_j}]. As shown in Figure 2, starting from a random image drawn from z_{t_J} ∼ Beta(η α_{t_J} x̂_0, η(1 − α_{t_J} x̂_0)) with x̂_0 = E[x_0], most of whose pixels are almost completely dark, beta diffusion gradually demasks and denoises the image towards a clean image through multiplicative transforms, as shown in Algorithm 2.

Figure 5: Comparison of the statistical distances between the true and generated data distributions over the course of training. The blue, green, and orange curves are for Gauss ELBO, Beta ELBO, and Beta KLUB, respectively. From left to right are the plots for the Wasserstein-1 distance, Jensen-Shannon divergence, and Hellinger distance, respectively.

Figure 6: Analogy to Figure 5 for comparing KLUB Conditional, KLUB Marginal, and KLUB.

E Additional Results for Synthetic Data

Quantitative Evaluation Metrics. We consider the Wasserstein-1 distance, Jensen-Shannon divergence (JSD), and Hellinger distance as three complementary evaluation metrics. The Wasserstein-1 distance is highly tolerant of misalignment between the supports of the true data density and those of the generated data. In other words, it is not sensitive to either misaligning the modes of the true density with those of the generated density or placing density in regions where there are no data. By contrast, both the JSD and the Hellinger distance well reflect misalignments between the supports of high-density regions.

Wasserstein-1: We monitor the performance of different algorithms during training by generating 100k data points using the trained model, drawing 100k data points from the data distribution, and computing the Wasserstein-1 distance between their empirical distributions, which can be done by sorting both sets and taking the mean of their element-wise absolute differences. If the true distribution is discrete, in that it takes values uniformly at random from a discrete set D, then we compute the Wasserstein-1 distance by sorting a set of 10k generated samples from small to large and calculating the mean of their absolute element-wise differences with a non-decreasing vector of the same size, which consists of an equal number of copies of each unique value in D.

JSD and Hellinger distance: We discretize the 100k generated data points into 100 equal-sized bins between 0 and 1 and compute the frequency of the data in each bin, which provides an empirical probability mass function (PMF). We then compute both the JSD and the Hellinger distance between the empirical PMF of the generated data and the empirical (true) PMF of the true (discrete) data. A code sketch of all three metrics is provided below.
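The following NumPy sketch (ours, not the paper's evaluation code) implements the three metrics as described above: Wasserstein-1 via sorting, and JSD and Hellinger via 100-bin histograms on [0, 1]; the placeholder samples at the bottom only illustrate the calling convention.

```python
import numpy as np

def wasserstein1(samples_a, samples_b):
    # Assumes both sample sets have the same size.
    return np.mean(np.abs(np.sort(samples_a) - np.sort(samples_b)))

def empirical_pmf(samples, n_bins=100):
    counts, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()

def jsd(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Placeholder samples standing in for the true and generated data.
rng = np.random.default_rng(0)
true_samples = rng.uniform(0.1, 0.2, size=100_000)
gen_samples = rng.uniform(0.1, 0.2, size=100_000)
p, q = empirical_pmf(true_samples), empirical_pmf(gen_samples)
print(wasserstein1(true_samples, gen_samples), jsd(p, q), hellinger(p, q))
```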
Additional Results. The observations in Figure 3 are further confirmed by Figure 5, which shows that while Gaussian diffusion performs well as measured by the Wasserstein-1 distance, it is considerably worse than beta diffusion at aligning the supports of the true and generated data distributions. Within beta diffusion, either KLUB or its argument-swapped version leads to good performance as measured by both the JSD and the Hellinger distance, suggesting their excellent ability to concentrate their generations around the true data supports. However, the proposed KLUB recovers the true underlying density much better than its argument-swapped version, as confirmed by examining the first subplot in Figure 5 and comparing the second and third subplots in Figure 3.

Figure 7: Analogy to Figure 3 for the mixture of continuous range-bounded data and point masses.

Figure 8: Analogy to Figure 5 for the mixture of continuous range-bounded data and point masses.

Figure 9: Analogy to Figure 8 for comparing KLUB Conditional, KLUB Marginal, and KLUB.

We further conduct an ablation study between KLUB and its two variants, KLUB Conditional and KLUB Marginal, corresponding to ω = 1 and ω = 0, respectively, in the loss given by (18). The results suggest that KLUB Conditional and KLUB Marginal have comparable performance, which is not that surprising considering that, in theory, they share the same optimal solution given by (14). A closer examination of the plots suggests that KLUB Marginal converges faster than KLUB Conditional in aligning the generation with the data supports but eventually delivers comparable performance, and that combining them leads to the KLUB, which has good overall performance. We note that we have conducted additional experiments by varying model parameters and adopting different diffusion schedules. Our observation is that while each algorithm can do well with careful tuning of these parameters, Beta KLUB is the least sensitive to parameter variations and is consistently better than or comparable to the other methods across various combinations of model parameters. While the toy data are valuable for showcasing the unique properties of beta diffusion, the performance of Beta KLUB is not that sensitive to model parameters on them, and the observed patterns may be disrupted by image-specific settings. Hence, their utility in tuning beta diffusion for image generation, where the data dimension and model size/architecture are much larger and more complex, is limited.

E.1 Mixture of Continuous Range-bounded Distributions and Point Masses

We consider an equal mixture of three range-bounded distributions and two unit point masses, including a uniform distribution between 0.1 and 0.2, a Beta(1, 5) distribution first scaled by 0.1 and then shifted by 0.3, a unit point mass at x_0 = 0.5, a Beta(0.5, 0.5) distribution first scaled by 0.1 and then shifted by 0.6, and a unit point mass at x_0 = 0.8. A random sample is generated as
x_0 = y_k, k ∼ Discrete({1, 2, 3, 4, 5}), y_1 ∼ Unif(0.1, 0.2), y_2 ∼ (1/0.1) Beta((y_2 − 0.3)/0.1; 1, 5), y_3 = 0.5, y_4 ∼ (1/0.1) Beta((y_4 − 0.6)/0.1; 0.5, 0.5), y_5 = 0.9.   (29)
We follow the same experimental protocol as in Section 4.1; a sketch of a sampler for (29) is given below.
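The following NumPy sketch (ours) draws samples from the toy mixture, following (29) as written:

```python
import numpy as np

def sample_mixture(n, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    k = rng.integers(1, 6, size=n)                  # k ~ Discrete({1, ..., 5}), equal weights
    y1 = rng.uniform(0.1, 0.2, size=n)              # Unif(0.1, 0.2)
    y2 = 0.1 * rng.beta(1.0, 5.0, size=n) + 0.3     # Beta(1, 5) scaled by 0.1, shifted by 0.3
    y3 = np.full(n, 0.5)                            # unit point mass
    y4 = 0.1 * rng.beta(0.5, 0.5, size=n) + 0.6     # Beta(0.5, 0.5) scaled by 0.1, shifted by 0.6
    y5 = np.full(n, 0.9)                            # unit point mass, as specified in (29)
    ys = np.stack([y1, y2, y3, y4, y5], axis=0)
    return ys[k - 1, np.arange(n)]                  # pick the k-th component per sample

x0 = sample_mixture(100_000)
print(x0.min(), x0.max())   # all samples lie strictly within (0, 1)
```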
The results are shown in Figures 7 and 8, from which we again observe that Gauss ELBO does a poor job of aligning its generation with the true data supports and places a portion of its generated samples so as to form smooth transitions between the boundaries of high-density regions; Beta ELBO captures the data supports and density shapes well but tends to overestimate the density of smaller-valued data; whereas Beta KLUB aligns its generated data very well with the true data supports and captures both the overall density shapes and the sharp transitions between the five range-bounded density regions, largely avoiding placing generated data in zero-density regions.

F Preconditioning of Beta Diffusion for Image Generation

It is often good practice to precondition the input of a neural network. For beta diffusion, most of our computation operates in the logit space, and hence we consider how to precondition logit(z_t) before feeding it as the input to the generator f_θ. Specifically, since z_t ∼ Beta(η α_t x_0, η(1 − α_t x_0)), we can draw the logit of z_t as
logit(z_t) = ln u_t − ln v_t, u_t ∼ Gamma(η α_t x_0, 1), v_t ∼ Gamma(η(1 − α_t x_0), 1).
Assuming x_0 ∼ Unif[x_min, x_max], where x_min = Shift and x_max = Scale + Shift, we have
E[logit(z_t)] = E[ln u_t] − E[ln v_t] = E_{x_0}[ψ(η α_t x_0)] − E_{x_0}[ψ(η(1 − α_t x_0))],
E_{x_0}[ψ(η α_t x_0)] = (1/(η α_t (x_max − x_min))) ∫_{η α_t x_min}^{η α_t x_max} dψ(z)... more precisely, integrating the digamma function, = (1/(η α_t (x_max − x_min))) [ln Γ(η α_t x_max) − ln Γ(η α_t x_min)],
E_{x_0}[ψ(η(1 − α_t x_0))] = (1/(η α_t (x_max − x_min))) ∫_{η(1 − α_t x_max)}^{η(1 − α_t x_min)} ψ(z) dz = (1/(η α_t (x_max − x_min))) [ln Γ(η(1 − α_t x_min)) − ln Γ(η(1 − α_t x_max))].
We further estimate the variance of logit(z_t) as
var[logit(z_t)] = var[ln u_t] + var[ln v_t] = E_{x_0}[ψ^{(1)}(η α_t x_0)] + E_{x_0}[ψ^{(1)}(η(1 − α_t x_0))] + var_{x_0}[ψ(η α_t x_0)] + var_{x_0}[ψ(η(1 − α_t x_0))],
E_{x_0}[ψ^{(1)}(η α_t x_0)] = (1/(η α_t (x_max − x_min))) [ψ(η α_t x_max) − ψ(η α_t x_min)],
E_{x_0}[ψ^{(1)}(η(1 − α_t x_0))] = (1/(η α_t (x_max − x_min))) [ψ(η(1 − α_t x_min)) − ψ(η(1 − α_t x_max))],
var_{x_0}[ψ(η α_t x_0)] = E_{x_0}[ψ^2(η α_t x_0)] − (E_{x_0}[ψ(η α_t x_0)])^2 ≈ max( (1/100) Σ_{i=0}^{100} ψ^2(η α_t (x_min + (i/100)(x_max − x_min))) / 2^{1(i=0)+1(i=100)} − (E_{x_0}[ψ(η α_t x_0)])^2, 0 ),
var_{x_0}[ψ(η(1 − α_t x_0))] = E_{x_0}[ψ^2(η(1 − α_t x_0))] − (E_{x_0}[ψ(η(1 − α_t x_0))])^2 ≈ max( (1/100) Σ_{i=0}^{100} ψ^2(η(1 − α_t (x_min + (i/100)(x_max − x_min)))) / 2^{1(i=0)+1(i=100)} − (E_{x_0}[ψ(η(1 − α_t x_0))])^2, 0 ),
where the expectations of ψ^2 are approximated with the trapezoidal rule over 101 equally spaced grid points. Now we are ready to precondition logit(z_t) as
g(z_t) = (logit(z_t) − E[logit(z_t)]) / √(var[logit(z_t)]).
We use g(z_t) as the input to the generator f_θ and add a skip connection that adds g(z_t) to the output of the generator. Specifically, following the notation of EDM, we define the network as
f_θ(z_t, t) = c_skip(t) g(z_t) + c_out(t) F_θ(c_in(t) g(z_t), c_noise(t)),
where for simplicity we set c_skip(t) = c_out(t) = c_in(t) = 1 and c_noise(t) = logit(α_t)/8. A more sophisticated selection of these parameters has the potential to enhance the performance of beta diffusion. However, due to our current limitations in computing resources, we defer this investigation to future studies.
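The following PyTorch sketch (ours, not the released implementation) computes the mean and variance estimates above and applies the standardization g(z_t); the 101-point trapezoidal grid, the clamping at zero, and the default values x_min = 0.60 and x_max = 0.99 are assumptions made for illustration.

```python
import torch

def mean_var_logit_zt(alpha_t, eta=10000.0, x_min=0.60, x_max=0.99):
    alpha_t = torch.as_tensor(alpha_t, dtype=torch.float64)
    a_lo, a_hi = eta * alpha_t * x_min, eta * alpha_t * x_max
    b_lo, b_hi = eta * (1 - alpha_t * x_max), eta * (1 - alpha_t * x_min)
    norm = eta * alpha_t * (x_max - x_min)
    # Closed-form E_x0[psi(eta*alpha_t*x0)] and E_x0[psi(eta*(1 - alpha_t*x0))].
    e_psi_a = (torch.lgamma(a_hi) - torch.lgamma(a_lo)) / norm
    e_psi_b = (torch.lgamma(b_hi) - torch.lgamma(b_lo)) / norm
    mean = e_psi_a - e_psi_b
    # Closed-form E_x0[psi'(.)] terms.
    e_psi1_a = (torch.digamma(a_hi) - torch.digamma(a_lo)) / norm
    e_psi1_b = (torch.digamma(b_hi) - torch.digamma(b_lo)) / norm
    # var_x0[psi(.)] terms via the trapezoidal rule on a 101-point grid.
    x_grid = torch.linspace(x_min, x_max, 101, dtype=torch.float64)
    w = torch.ones(101, dtype=torch.float64)
    w[0] = w[-1] = 0.5
    w = w / 100.0
    var_psi_a = torch.clamp((w * torch.digamma(eta * alpha_t * x_grid) ** 2).sum() - e_psi_a ** 2, min=0.0)
    var_psi_b = torch.clamp((w * torch.digamma(eta * (1 - alpha_t * x_grid)) ** 2).sum() - e_psi_b ** 2, min=0.0)
    var = e_psi1_a + e_psi1_b + var_psi_a + var_psi_b
    return mean, var

def precondition(logit_zt, alpha_t, **kwargs):
    mean, var = mean_var_logit_zt(alpha_t, **kwargs)
    return (logit_zt - mean) / torch.sqrt(var)
```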
G Parameter Settings of Beta Diffusion on CIFAR-10

We present the intuition behind how we set the model parameters, including Shift, Scale, η, ω, π, c_0, and c_1. Given the limitations of our computational resources, we leave the careful tuning of these model parameters to future work. We set ω = 0.99 and π = 0.95 across all experiments conducted on CIFAR-10 with beta diffusion.

We pre-process the range-bounded data by linearly scaling and shifting them to lie within [Shift, Shift + Scale], where 0 < Shift < Shift + Scale < 1. This linear transformation serves two purposes: first, it enables us to use beta diffusion to model any range-bounded data, and second, it helps avoid numerical challenges in computing the log-beta divergence when its arguments are small. When evaluating performance, we linearly transform the generated data back to their original scale by reversing the scaling and shifting operations applied during pre-processing.

For CIFAR-10 images, as beta diffusion diffuses the pixels towards Beta(0, η), our intuition is to set both Scale and Shift large enough to differentiate the diffusion trajectories of different pixel values. However, Scale + Shift needs to be smaller than 1, motivating us to choose Scale = 0.39 and Shift = 0.60. For the diffusion concentration parameter η, our intuition is that a larger η provides a greater ability to differentiate different pixel values and allows a finer discretization of the reverse diffusion process, but leads to slower training and demands more discretized steps during sampling. We set η = 10000 and the mini-batch size to B = 288 by default, but also perform an ablation study on CIFAR-10 with several different values, as shown in Table 3. We use a sigmoid-based diffusion schedule α_t = 1/(1 + e^{−c_0 + (c_0 + c_1)t}), where we set c_0 = 10 and c_1 = 13 by default. We note that a similar schedule has been introduced in Kingma et al. [35] and Jabri et al. [29] for Gaussian diffusion.

For the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), we utilize the parameterization of EDM (https://github.com/NVlabs/edm) [34] as the code base. We replace the variance preserving (VP) loss and the corresponding network parameterization implemented in EDM, which corresponds to the weighted negative ELBO of Gaussian diffusion, with the KLUB loss of beta diffusion given by (24). We keep the generator network the same as that of VP-EDM, except that we set c_noise = logit(α_t)/8 and simplify the other parameters as c_skip = 1, c_out = 1, and c_in = 1. The image pixel values between 0 and 255 are divided by 255 and then scaled by Scale and shifted by Shift, and hence x_0 ∈ [Shift, Shift + Scale]. The two inputs to the generator network f_θ(·, ·), originally designed for VP-EDM and adopted directly for beta diffusion, are g(logit(z_t)) and logit(α_t), where g(·) is the preconditioning function described in detail in Appendix F. The output of the generator network is first transformed by a sigmoid function and then scaled by Scale and shifted by Shift, and hence x̂_0 = f_θ(z_t, t) ∈ [Shift, Shift + Scale].