# Star-Shaped Denoising Diffusion Probabilistic Models

Andrey Okhotin (HSE University, MSU University, Moscow, Russia, andrey.okhotin@gmail.com) · Dmitry Molchanov (BAYESG, Budva, Montenegro, dmolch111@gmail.com) · Vladimir Arkhipkin (Sber AI, Moscow, Russia, arkhipkin.v98@gmail.com) · Grigory Bartosh (AMLab, Informatics Institute, University of Amsterdam, Amsterdam, Netherlands, g.bartosh@uva.nl) · Viktor Ohanesian (Independent Researcher, v.v.oganesyan@gmail.com) · Aibek Alanov (AIRI, HSE University, Moscow, Russia, alanov.aibek@gmail.com) · Dmitry Vetrov (Constructor University, Bremen, Germany, dvetrov@constructor.university)

Equal contribution.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Denoising Diffusion Probabilistic Models (DDPMs) provide the foundation for the recent breakthroughs in generative modeling. Their Markovian structure makes it difficult to define DDPMs with distributions other than Gaussian or discrete. In this paper, we introduce Star-Shaped DDPM (SS-DDPM). Its star-shaped diffusion process allows us to bypass the need to define the transition probabilities or compute posteriors. We establish duality between star-shaped and specific Markovian diffusions for the exponential family of distributions and derive efficient algorithms for training and sampling from SS-DDPMs. In the case of Gaussian distributions, SS-DDPM is equivalent to DDPM. However, SS-DDPMs provide a simple recipe for designing diffusion models with distributions such as Beta, von Mises Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM. Our implementation is available at https://github.com/andrey-okhotin/star-shaped.

1 Introduction

Deep generative models have shown outstanding sample quality in a wide variety of modalities. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Karras et al., 2021), autoregressive models (Ramesh et al., 2021), Variational Autoencoders (Kingma & Welling, 2013; Rezende et al., 2014), Normalizing Flows (Grathwohl et al., 2018; Chen et al., 2019) and energy-based models (Xiao et al., 2020) show impressive abilities to synthesize objects. However, GANs are not robust to the choice of architecture and optimization method (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2019; Brock et al., 2018), and they often fail to cover modes of the data distribution (Zhao et al., 2018; Thanh-Tung & Tran, 2020). Likelihood-based models avoid mode collapse but may overestimate the probability in low-density regions (Zhang et al., 2021).

Figure 1: Markovian forward process of DDPM (left) and the star-shaped forward process of SS-DDPM (right).

Recently, diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020) have received a lot of attention. These models generate samples using a trained Markov process that starts with white noise and iteratively removes noise from the sample. Recent works have shown that diffusion models can generate samples comparable in quality to or even better than GANs (Song et al., 2020b; Dhariwal & Nichol, 2021), they do not suffer from mode collapse by design, and they achieve log-likelihoods comparable to autoregressive models (Kingma et al., 2021).
Moreover, diffusion models show these results in various modalities such as images (Saharia et al., 2021), sound (Popov et al., 2021; Liu et al., 2022) and shapes (Luo & Hu, 2021; Zhou et al., 2021).

The main principle of diffusion models is to destroy information during the forward process and then restore it during the reverse process. In conventional diffusion models like denoising diffusion probabilistic models (DDPMs), information is destroyed by the injection of Gaussian noise, which is reasonable for some types of data, such as images. However, for data distributed on manifolds, bounded volumes, or with other structural features, the injection of Gaussian noise can be unnatural and break the data structure. Unfortunately, it is not clear how to replace the noise distribution within traditional diffusion models. The problem is that we have to maintain a connection between the distributions defining the Markov noising process that gradually destroys information and its marginal distributions. While some papers explore other distributions, such as delta functions (Bansal et al., 2022) or the Gamma distribution (Nachmani et al., 2021), they provide ad hoc solutions for special cases that are not easily generalized.

In this paper, we present Star-Shaped Denoising Diffusion Probabilistic Models (SS-DDPM), a new approach that generalizes Gaussian DDPM to an exponential family of noise distributions. In SS-DDPM, one only needs to define marginal distributions at each diffusion step (see Figure 1). We provide a derivation of SS-DDPM, design efficient sampling and training algorithms, and show its equivalence to DDPM (Ho et al., 2020) in the case of Gaussian noise. Then, we outline a number of practical considerations that aid in training and applying SS-DDPMs. In Section 5, we demonstrate the ability of SS-DDPM to work with distributions like von Mises Fisher, Dirichlet and Wishart. Finally, we evaluate SS-DDPM on image and text generation. Categorical SS-DDPM matches the performance of Multinomial Text Diffusion (Hoogeboom et al., 2021) on the text8 dataset, while our Beta diffusion model achieves results comparable to a Gaussian DDPM on CIFAR-10.

We start with a brief introduction of DDPMs. The Gaussian DDPM (Ho et al., 2020) is defined as a forward (diffusion) process $q^{\mathrm{DDPM}}(x_{0:T})$ and a corresponding reverse (denoising) process $p^{\mathrm{DDPM}}_\theta(x_{0:T})$. The forward process is defined as a Markov chain with Gaussian conditionals:

$$q^{\mathrm{DDPM}}(x_{0:T}) = q(x_0)\prod_{t=1}^{T} q^{\mathrm{DDPM}}(x_t \mid x_{t-1}), \qquad (1)$$
$$q^{\mathrm{DDPM}}(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \qquad (2)$$

where $q(x_0)$ is the data distribution. Parameters $\beta_t$ are typically chosen in advance and fixed, defining the noise schedule of the diffusion process. The noise schedule is chosen in such a way that the final $x_T$ no longer depends on $x_0$ and follows a standard Gaussian distribution $q^{\mathrm{DDPM}}(x_T) = \mathcal{N}(x_T; 0, I)$.

(a) Denoising Diffusion Probabilistic Models. (b) Star-Shaped Denoising Diffusion Probabilistic Models. Figure 2: Model structure of DDPM and SS-DDPM.

The reverse process $p^{\mathrm{DDPM}}_\theta(x_{0:T})$ follows a similar structure and constitutes the generative part of the model:

$$p^{\mathrm{DDPM}}_\theta(x_{0:T}) = q^{\mathrm{DDPM}}(x_T)\prod_{t=1}^{T} p^{\mathrm{DDPM}}_\theta(x_{t-1} \mid x_t), \qquad (3)$$
$$p^{\mathrm{DDPM}}_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right). \qquad (4)$$

The forward process $q^{\mathrm{DDPM}}(x_{0:T})$ of DDPM is typically fixed, and all the parameters of the model are contained in the generative part $p^{\mathrm{DDPM}}_\theta(x_{0:T})$.
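To make the Markovian forward process (1-2) concrete, here is a minimal NumPy sketch that rolls out the Gaussian transitions for a toy data point. The linear beta schedule, the array shapes and the constants are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def ddpm_forward_trajectory(x0, betas, rng=np.random.default_rng(0)):
    """Sample x_1, ..., x_T from the Markov chain in eq. (2):
    x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    xs, x_prev = [], x0
    for beta_t in betas:
        x_t = np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * rng.standard_normal(x0.shape)
        xs.append(x_t)
        x_prev = x_t
    return xs

# Toy usage with a placeholder linear schedule.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
x0 = np.zeros(3)  # a toy "data point"
trajectory = ddpm_forward_trajectory(x0, betas)

# Composing the Gaussian transitions gives the standard closed-form marginal
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
# where alpha_bar_t is the cumulative product of (1 - beta_s).
alpha_bar = np.cumprod(1.0 - betas)
```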
The parameters $\theta$ are tuned to maximize the variational lower bound (VLB) on the likelihood of the training data:

$$\mathcal{L}^{\mathrm{DDPM}}(\theta) = \mathbb{E}_{q^{\mathrm{DDPM}}}\left[\log p^{\mathrm{DDPM}}_\theta(x_0 \mid x_1) - \sum_{t=2}^{T} D_{\mathrm{KL}}\left(q^{\mathrm{DDPM}}(x_{t-1} \mid x_t, x_0)\,\big\|\, p^{\mathrm{DDPM}}_\theta(x_{t-1} \mid x_t)\right)\right], \qquad (5)$$
$$\mathcal{L}^{\mathrm{DDPM}}(\theta) \to \max_\theta. \qquad (6)$$

One of the main challenges in defining DDPMs is the computation of the posterior $q^{\mathrm{DDPM}}(x_{t-1} \mid x_t, x_0)$. Specifically, the transition probabilities $q^{\mathrm{DDPM}}(x_t \mid x_{t-1})$ have to be defined in such a way that this posterior is tractable. Specific DDPM-like models are available for Gaussian (Ho et al., 2020), Categorical (Hoogeboom et al., 2021) and Gamma (Kawar et al., 2022) distributions. Defining such models remains challenging in more general cases.

2.2 Star-Shaped DDPMs

As previously discussed, extending DDPMs to other distributions poses significant challenges. In light of these difficulties, we propose to construct a model that only relies on the marginal distributions $q(x_t \mid x_0)$ in its definition and in the derivation of the loss function. We define star-shaped diffusion as a non-Markovian forward process $q^{\mathrm{SS}}(x_{0:T})$ with the following structure:

$$q^{\mathrm{SS}}(x_{0:T}) = q(x_0)\prod_{t=1}^{T} q^{\mathrm{SS}}(x_t \mid x_0), \qquad (7)$$

where $q(x_0)$ is the data distribution. We note that, in contrast to DDPM, all noisy variables $x_t$ are conditionally independent given $x_0$ instead of constituting a Markov chain. This structure of the forward process allows us to utilize other noise distributions, which we discuss in more detail later.

2.3 Defining the reverse model

In DDPMs the true reverse model $q^{\mathrm{DDPM}}(x_{0:T})$ has a Markovian structure (Ho et al., 2020), allowing for an efficient sequential generation algorithm:

$$q^{\mathrm{DDPM}}(x_{0:T}) = q^{\mathrm{DDPM}}(x_T)\prod_{t=1}^{T} q^{\mathrm{DDPM}}(x_{t-1} \mid x_t). \qquad (8)$$

For the star-shaped diffusion, however, the Markovian assumption breaks:

$$q^{\mathrm{SS}}(x_{0:T}) = q^{\mathrm{SS}}(x_T)\prod_{t=1}^{T} q^{\mathrm{SS}}(x_{t-1} \mid x_{t:T}). \qquad (9)$$

Consequently, we now need to approximate the true reverse process by a parametric model which is conditioned on the whole tail $x_{t:T}$:

$$p^{\mathrm{SS}}_\theta(x_{0:T}) = p^{\mathrm{SS}}_\theta(x_T)\prod_{t=1}^{T} p^{\mathrm{SS}}_\theta(x_{t-1} \mid x_{t:T}). \qquad (10)$$

Figure 3: A Markov reverse process fails to recover realistic images from star-shaped diffusion, while a general reverse process produces realistic images. The top row is equivalent to DDIM at $\sigma_t^2 = 1 - \alpha_{t-1}$. A similar effect was also observed by Bansal et al. (2022).

It is crucial to use the whole tail $x_{t:T}$ rather than just one variable $x_t$ when predicting $x_{t-1}$ in a star-shaped model. As we show in Appendix B, if we try to approximate the true reverse process with a Markov model, we introduce a substantial irreducible gap into the variational lower bound. Such a sampling procedure fails to generate realistic samples, as can be seen in Figure 3. Intuitively, in DDPMs the information about $x_0$ that is contained in $x_{t+1}$ is nested in the information about $x_0$ that is contained in $x_t$. That is why knowing $x_t$ allows us to discard $x_{t+1}$. In star-shaped diffusion, however, all variables contain independent pieces of information about $x_0$ and should all be taken into account when making predictions.

We can write down the variational lower bound as follows:

$$\mathcal{L}^{\mathrm{SS}}(\theta) = \mathbb{E}_{q^{\mathrm{SS}}}\left[\log p_\theta(x_0 \mid x_{1:T}) - \sum_{t=2}^{T} D_{\mathrm{KL}}\left(q^{\mathrm{SS}}(x_{t-1} \mid x_0)\,\big\|\, p^{\mathrm{SS}}_\theta(x_{t-1} \mid x_{t:T})\right)\right]. \qquad (11)$$

With this VLB, we only need the marginal distributions $q(x_{t-1} \mid x_0)$ to define and train the model, which allows us to use a wider variety of noising distributions. Since conditioning the predictive model $p_\theta(x_{t-1} \mid x_{t:T})$ on the whole tail $x_{t:T}$ is typically impractical, we propose a more efficient way to implement the reverse process next.
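Before turning to the reverse process, here is a minimal sketch of the star-shaped forward process (7): every noisy variable is drawn independently from the marginal $q(x_t \mid x_0)$, illustrated here with Gaussian marginals $\mathcal{N}(\sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I)$. The schedule values below are placeholders, not the schedules used in the paper.

```python
import numpy as np

def star_shaped_forward(x0, alpha_bars, rng=np.random.default_rng(0)):
    """Sample x_1, ..., x_T from the star-shaped forward process (7).
    Unlike the Markov chain (1-2), each x_t depends only on x_0,
    so all noisy variables are conditionally independent given x_0."""
    samples = []
    for a in alpha_bars:
        x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
        samples.append(x_t)
    return samples  # x_1, ..., x_T

# Toy usage: a decreasing schedule so that x_T carries almost no information about x_0.
alpha_bars = np.linspace(0.999, 1e-4, 100)
x0 = np.ones(3)
tail = star_shaped_forward(x0, alpha_bars)
```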
2.4 Efficient tail conditioning

Instead of using the full tail $x_{t:T}$, we would like to define some statistic $G_t = G_t(x_{t:T})$ that extracts all information about $x_0$ from the tail $x_{t:T}$. Formally, we call $G_t$ a sufficient tail statistic if the following equality holds:

$$q^{\mathrm{SS}}(x_{t-1} \mid x_{t:T}) = q^{\mathrm{SS}}(x_{t-1} \mid G_t). \qquad (12)$$

One way to define $G_t$ is to concatenate all the variables $x_{t:T}$ into a single vector. This, however, is impractical, as its dimension would grow with the size of the tail $T - t + 1$. The Pitman-Koopman-Darmois (Pitman, 1936) theorem (PKD) states that exponential families admit a sufficient statistic with constant dimensionality. It also states that no other distribution admits one: if such a statistic exists, the distribution has to be a member of the exponential family. Inspired by the PKD, we turn to the exponential family of distributions.

In the case of star-shaped diffusion, we cannot apply the PKD directly, as it was formulated for i.i.d. samples and our samples are not identically distributed. However, we can still define a sufficient tail statistic $G_t$ for a specific subset of the exponential family, which we call an exponential family with linear parameterization:

Theorem 1. Assume the forward process of a star-shaped model takes the following form:

$$q^{\mathrm{SS}}(x_t \mid x_0) = h_t(x_t)\exp\left\{\eta_t(x_0)^\top T(x_t) - \Omega_t(x_0)\right\}, \qquad (13)$$
$$\eta_t(x_0) = A_t f(x_0) + b_t. \qquad (14)$$

Let $G_t$ be a tail statistic, defined as follows:

$$G_t = G_t(x_{t:T}) = \sum_{s=t}^{T} A_s^\top T(x_s). \qquad (15)$$

Then, $G_t$ is a sufficient tail statistic:

$$q^{\mathrm{SS}}(x_{t-1} \mid x_{t:T}) = q^{\mathrm{SS}}(x_{t-1} \mid G_t). \qquad (16)$$

Here, definition (13) is the standard definition of the exponential family, where $h_t(x_t)$ is the base measure, $\eta_t(x_0)$ is the vector of natural parameters with corresponding sufficient statistics $T(x_t)$, and $\Omega_t(x_0)$ is the log-partition function. The key added assumption is the linear parameterization of the natural parameters (14). We provide the proof in Appendix C. When $A_t$ is scalar, we denote it as $a_t$ instead.

For the most part, the premise of Theorem 1 restricts the parameterization of the distributions rather than the family of the distributions involved. As we discuss in Appendix F, we found it easy to come up with a linear parameterization for a wide range of distributions in the exponential family. For example, we can obtain a linear parameterization for the Beta distribution $q(x_t \mid x_0) = \mathrm{Beta}(x_t; \alpha_t, \beta_t)$ by using $x_0$ as the mode of the distribution and introducing a new concentration parameter $\nu_t$:

$$\alpha_t = 1 + \nu_t x_0, \qquad (17)$$
$$\beta_t = 1 + \nu_t(1 - x_0). \qquad (18)$$

In this case, $\eta_t(x_0) = \nu_t x_0$, $T(x_t) = \log\frac{x_t}{1 - x_t}$, and we can use equation (15) to define the sufficient tail statistic $G_t$. We provide more examples in Appendix F. We also provide an implementation-ready reference sheet for a wide range of distributions in the exponential family in Table 6.

We suspect that, just like in the PKD, this trick is only possible for a subset of the exponential family. In the general case, the dimensionality of the sufficient tail statistic $G_t$ would have to grow with the size of the tail $x_{t:T}$. It is still possible to apply SS-DDPM in this case; however, crafting the (now only approximately) sufficient statistic $G_t$ would require more careful consideration, and we leave it for future work.
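To make the Beta example above concrete, the sketch below samples a star-shaped Beta tail under the parameterization (17-18) and accumulates the sufficient tail statistic $G_t = \sum_{s\ge t}\nu_s \log\frac{x_s}{1-x_s}$ from equation (15). The concentration schedule $\nu_t$ is a made-up placeholder, not the one used in the experiments.

```python
import numpy as np

def beta_tail_statistic(x0, nus, rng=np.random.default_rng(0)):
    """Star-shaped Beta forward process with mode x0 (eqs. 17-18) and its
    sufficient tail statistics G_t = sum_{s >= t} nu_s * logit(x_s) (eq. 15)."""
    T = len(nus)
    xs = np.empty(T)
    for t, nu in enumerate(nus):
        alpha, beta = 1.0 + nu * x0, 1.0 + nu * (1.0 - x0)
        xs[t] = rng.beta(alpha, beta)
    suff = nus * np.log(xs / (1.0 - xs))          # A_s^T T(x_s) = nu_s * logit(x_s)
    G = np.cumsum(suff[::-1])[::-1]               # G[t] = sum over s = t, ..., T
    return xs, G

# Toy usage: nu decreases from a very concentrated Beta towards the uniform distribution.
nus = np.geomspace(1e4, 1e-2, num=100)            # placeholder concentration schedule
x_tail, G = beta_tail_statistic(x0=0.7, nus=nus)
```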
2.5 Final model definition

To maximize the VLB (11), each step of the reverse process should approximate the true reverse distribution:

$$p^{\mathrm{SS}}_\theta(x_{t-1} \mid x_{t:T}) \approx q^{\mathrm{SS}}(x_{t-1} \mid x_{t:T}) = \int q^{\mathrm{SS}}(x_{t-1} \mid x_0)\, q^{\mathrm{SS}}(x_0 \mid x_{t:T})\, dx_0. \qquad (19)$$

Similarly to DDPM (Ho et al., 2020), we choose to approximate $q^{\mathrm{SS}}(x_0 \mid x_{t:T})$ with a delta function centered at the prediction of some model $x_\theta(G_t(x_{t:T}), t)$. This results in the following definition of the reverse process of SS-DDPM:

$$p^{\mathrm{SS}}_\theta(x_{t-1} \mid x_{t:T}) = q^{\mathrm{SS}}(x_{t-1} \mid x_0)\big|_{x_0 = x_\theta(G_t(x_{t:T}),\, t)}. \qquad (20)$$

The distribution $p^{\mathrm{SS}}_\theta(x_0 \mid x_{1:T})$ can be fixed to some small-variance distribution $p^{\mathrm{SS}}_\theta(x_0 \mid \hat{x}_0)$ centered at the final prediction $\hat{x}_0 = x_\theta(G_1(x_{1:T}), 1)$, similar to the dequantization term commonly used in DDPM. If this distribution has no trainable parameters, the corresponding term can be removed from the training objective. This dequantization distribution would then only be used for log-likelihood estimation and, optionally, for sampling.

Together with the forward process (7) and the VLB objective (11), this concludes the general definition of the SS-DDPM model. The model structure is illustrated in Figure 2. The corresponding training and sampling algorithms are provided in Algorithms 1 and 2.

The resulting model is similar to DDPM in spirit. We follow the same principles when designing the forward process: starting from a low-variance distribution centered at $x_0$ at $t = 1$, we gradually increase the entropy of the distribution $q^{\mathrm{SS}}(x_t \mid x_0)$ until there is no information shared between $x_0$ and $x_t$ at $t = T$. We provide concrete definitions for Beta, Gamma, Dirichlet, von Mises, von Mises Fisher, Wishart, Gaussian and Categorical distributions in Appendix F.

Algorithm 1: SS-DDPM training
repeat
  $x_0 \sim q(x_0)$
  $t \sim \mathrm{Uniform}(1, \ldots, T)$
  $x_{t:T} \sim q^{\mathrm{SS}}(x_{t:T} \mid x_0)$
  $G_t = \sum_{s=t}^{T} A_s^\top T(x_s)$
  Move along $\nabla_\theta D_{\mathrm{KL}}\left(q^{\mathrm{SS}}(x_{t-1} \mid x_0)\,\|\, p^{\mathrm{SS}}_\theta(x_{t-1} \mid G_t)\right)$
until convergence

Algorithm 2: SS-DDPM sampling
$x_T \sim q^{\mathrm{SS}}(x_T)$
$G_T = A_T^\top T(x_T)$
for $t = T$ to $2$ do
  $\hat{x}_0 = x_\theta(G_t, t)$
  $x_{t-1} \sim q^{\mathrm{SS}}(x_{t-1} \mid x_0)\big|_{x_0 = \hat{x}_0}$
  $G_{t-1} = G_t + A_{t-1}^\top T(x_{t-1})$
end for
$x_0 \sim p^{\mathrm{SS}}_\theta(x_0 \mid G_1)$

2.6 Duality between star-shaped and Markovian diffusion

While the variables $x_{1:T}$ follow a star-shaped diffusion process, the corresponding tail statistics $G_{1:T}$ form a Markov chain:

$$G_t = \sum_{s=t}^{T} A_s^\top T(x_s) = G_{t+1} + A_t^\top T(x_t), \qquad (21)$$

since $x_t$ is conditionally independent from $G_{t+2:T}$ given $G_{t+1}$ (see Appendix E for details). Moreover, we can rewrite the probabilistic model in terms of $G_t$ and see that the variables $(x_0, G_{1:T})$ form a (not necessarily Gaussian) DDPM. In the case of Gaussian distributions, this duality makes SS-DDPM and DDPM equivalent. This equivalence can be shown explicitly:

Theorem 2. Let $\bar\alpha^{\mathrm{DDPM}}_t$ define the noising schedule for a DDPM model (1-2) via $\beta_t = (\bar\alpha^{\mathrm{DDPM}}_{t-1} - \bar\alpha^{\mathrm{DDPM}}_t)/\bar\alpha^{\mathrm{DDPM}}_{t-1}$. Let $q^{\mathrm{SS}}(x_{0:T})$ be a Gaussian SS-DDPM forward process with the following noising schedule and sufficient tail statistic:

$$q^{\mathrm{SS}}(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha^{\mathrm{SS}}_t}\, x_0,\ 1 - \bar\alpha^{\mathrm{SS}}_t\right), \qquad (22)$$
$$G_t(x_{t:T}) = \frac{1 - \bar\alpha^{\mathrm{DDPM}}_t}{\sqrt{\bar\alpha^{\mathrm{DDPM}}_t}} \sum_{s=t}^{T} \frac{\sqrt{\bar\alpha^{\mathrm{SS}}_s}\, x_s}{1 - \bar\alpha^{\mathrm{SS}}_s}, \quad \text{where} \qquad (23)$$
$$\frac{\bar\alpha^{\mathrm{SS}}_t}{1 - \bar\alpha^{\mathrm{SS}}_t} = \frac{\bar\alpha^{\mathrm{DDPM}}_t}{1 - \bar\alpha^{\mathrm{DDPM}}_t} - \frac{\bar\alpha^{\mathrm{DDPM}}_{t+1}}{1 - \bar\alpha^{\mathrm{DDPM}}_{t+1}}. \qquad (24)$$

Then the tail statistic $G_t$ follows a Gaussian DDPM noising process $q^{\mathrm{DDPM}}(x_{0:T})\big|_{x_{1:T} = G_{1:T}}$ defined by the schedule $\bar\alpha^{\mathrm{DDPM}}_t$. Moreover, the corresponding reverse processes and VLB objectives are also equivalent.

We show this equivalence in Appendix D. We make use of this connection when choosing the noising schedule for other distributions. This equivalence means that SS-DDPM is a direct generalization of Gaussian DDPM. While admitting the Gaussian case, SS-DDPM can also be used to implicitly define a non-Gaussian DDPM in the space of sufficient tail statistics for a wide range of distributions.
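To show how Theorem 2 can be used in practice, the NumPy sketch below converts a DDPM cumulative schedule $\bar\alpha^{\mathrm{DDPM}}_t$ into the equivalent Gaussian SS-DDPM schedule $\bar\alpha^{\mathrm{SS}}_t$ via equation (24), with the convention $\bar\alpha^{\mathrm{DDPM}}_{T+1} = 0$. The cosine schedule follows Nichol & Dhariwal (2021) and stands in for whatever DDPM schedule one wishes to reuse; the constants are assumptions, not the exact settings of the paper.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative alpha_bar_t of the cosine schedule (Nichol & Dhariwal, 2021), t = 1..T."""
    t = np.arange(1, T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    f0 = np.cos((s / (1 + s)) * np.pi / 2) ** 2
    return f / f0

def ss_schedule_from_ddpm(alpha_bar_ddpm):
    """Equation (24): alpha_bar_t^SS / (1 - alpha_bar_t^SS) is the difference of the
    DDPM signal-to-noise ratios at steps t and t + 1 (with alpha_bar_{T+1}^DDPM = 0)."""
    snr = alpha_bar_ddpm / (1.0 - alpha_bar_ddpm)   # alpha_bar / (1 - alpha_bar)
    snr_next = np.append(snr[1:], 0.0)              # alpha_bar_{T+1} = 0  =>  SNR_{T+1} = 0
    r = snr - snr_next                              # left-hand side of (24)
    return r / (1.0 + r)                            # solve for alpha_bar_t^SS

alpha_bar_ddpm = cosine_alpha_bar(T=1000)
alpha_bar_ss = ss_schedule_from_ddpm(alpha_bar_ddpm)  # the kind of schedule plotted in Figure 4
```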
3 Practical considerations

While the model is properly defined, there are several practical considerations that are important for the efficiency of star-shaped diffusion.

Choosing the right schedule. It is important to choose the right noising schedule for an SS-DDPM model. The schedule significantly depends on the number of diffusion steps T and behaves differently for different noising schedules typical of DDPMs. This is illustrated in Figure 4, where we show the noising schedules for Gaussian SS-DDPMs that are equivalent to DDPMs with the same cosine schedule.

Figure 4: The noising schedule $\bar\alpha^{\mathrm{SS}}_t$ for Gaussian star-shaped diffusion, defined for different numbers of steps T (50, 100, 250, 1000, 4000) using eq. (24). All the corresponding equivalent DDPMs have the same cosine schedule $\bar\alpha^{\mathrm{DDPM}}_t$.

Figure 5: Top: samples $x_t$ from a Gaussian DDPM forward process with a cosine noise schedule. Bottom: samples $G_t$ from a Beta SS-DDPM forward process with a noise schedule obtained by matching the mutual information. Middle: corresponding samples $x_t$ from that Beta SS-DDPM forward process. The tail statistics have the same level of noise as $x^{\mathrm{DDPM}}_t$, while the samples $x^{\mathrm{Beta\,SS}}_t$ are diffused much faster.

Since the variables $G_t$ follow a DDPM-like process, we would like to reuse those DDPM noising schedules that are already known to work well. For Gaussian distributions, we can transform a DDPM noising schedule into the corresponding SS-DDPM noising schedule analytically by equating $I(x_0; G_t) = I(x_0; x^{\mathrm{DDPM}}_t)$. In the general case, we look for schedules that have approximately the same level of mutual information $I(x_0; G_t)$ as the corresponding mutual information $I(x_0; x^{\mathrm{DDPM}}_t)$ of a DDPM model for all timesteps t. We estimate the mutual information using the Kraskov (Kraskov et al., 2004) and DSIVI (Molchanov et al., 2019) estimators and build a look-up table to match the noising schedules. This procedure is described in more detail in Appendix G. The resulting schedule for the Beta SS-DDPM is illustrated in Figure 5. Note how, with the right schedule, appropriately normalized tail statistics $G_t$ look and function similarly to the samples $x_t$ from the corresponding Gaussian DDPM. We further discuss this in Appendix H.

Implementing the sampler. During sampling, we can grow the tail statistic $G_t$ without any overhead, as described in Algorithm 2. However, during training, we need to sample the tail statistic for each object to estimate the loss function. For this we need to sample the full tail $x_{t:T}$ from the forward process $q^{\mathrm{SS}}(x_{t:T} \mid x_0)$ and then compute the tail statistic $G_t$. In practice, this does not add a noticeable overhead and can be computed in parallel with the training process if needed.
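As a sketch of the training-time sampler just described, the snippet below draws the full tail and reduces it to $G_t$ for a batch of objects, reusing the Beta parameterization from Section 2.4. The concentration schedule, batch shape and data values are illustrative placeholders only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_tail_statistic(x0_batch, nus):
    """One SS-DDPM training draw (Algorithm 1): pick a random timestep t per object,
    sample the tail x_{t:T} from the star-shaped Beta forward process and reduce it
    to the sufficient tail statistic G_t = sum_{s >= t} nu_s * logit(x_s)."""
    B, T = x0_batch.shape[0], len(nus)
    t = rng.integers(1, T + 1, size=B)                       # uniform t in {1, ..., T}
    # Sample full tails x_{1:T} for the whole batch; only s >= t is used below.
    alpha = 1.0 + nus[None, :] * x0_batch[:, None]            # eq. (17), shape (B, T)
    beta = 1.0 + nus[None, :] * (1.0 - x0_batch[:, None])     # eq. (18)
    xs = rng.beta(alpha, beta)
    suff = nus[None, :] * np.log(xs / (1.0 - xs))             # nu_s * logit(x_s)
    mask = np.arange(1, T + 1)[None, :] >= t[:, None]         # keep only s >= t
    G_t = (suff * mask).sum(axis=1)
    return t, G_t                                             # fed to x_theta(G_t, t)

# Toy usage with scalar data points in (0, 1) and a placeholder schedule.
nus = np.geomspace(1e4, 1e-2, num=100)
x0_batch = rng.uniform(0.05, 0.95, size=(8,))
t, G_t = sample_training_tail_statistic(x0_batch, nus)
```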
Reducing the number of steps. We can sample from DDPMs more efficiently by skipping some timestamps. This does not directly work for SS-DDPM, because changing the number of steps would require changing the noising schedule and, consequently, retraining the model. However, we can still use a similar trick to reduce the number of function evaluations. Instead of skipping some timestamps $x_{t_1+1:t_2-1}$, we can draw them from the forward process using the current prediction $x_\theta(G_{t_2}, t_2)$, and then use these samples to obtain the tail statistic $G_{t_1}$. For Gaussian SS-DDPM this is equivalent to skipping these timestamps in the corresponding DDPM. In the general case, it amounts to approximating the reverse process with a different reverse process:

$$p^{\mathrm{SS}}_\theta(x_{t_1:t_2} \mid G_{t_2}) = \prod_{t=t_1}^{t_2} q^{\mathrm{SS}}(x_t \mid x_0)\big|_{x_0 = x_\theta(G_t,\, t)} \approx \prod_{t=t_1}^{t_2} q^{\mathrm{SS}}(x_t \mid x_0)\big|_{x_0 = x_\theta(G_{t_2},\, t_2)}. \qquad (25)$$

We observe a similar dependence on the number of function evaluations for SS-DDPMs and DDPMs.

Time-dependent tail normalization. As defined in Theorem 1, the tail statistics can have vastly different scales at different timestamps. The values of the coefficients $a_t$ can range from thousandths when t approaches T to thousands when t approaches zero. To make the tail statistics suitable for use in neural networks, proper normalization is crucial. In most cases, we collect the time-dependent means and variances of the tail statistics across the training dataset and normalize the tail statistics to zero mean and unit variance (a short sketch of this procedure is given at the end of this section). We further discuss this issue in Appendix H.

Architectural choices. To make training the model easier, we make some minor adjustments to the neural network architecture and the loss function. Our neural networks $x_\theta(G_t, t)$ take the tail statistic $G_t$ as an input and are expected to produce an estimate of $x_0$ as an output. In SS-DDPM the data $x_0$ might lie on some manifold, like the unit sphere or the space of positive definite matrices. Therefore, we need to map the neural network output to that manifold. We do that on a case-by-case basis, as described in Appendices I-L. Different terms of the VLB can have drastically different scales. For this reason, it is common practice to train DDPMs with a modified loss function like $L_{\mathrm{simple}}$ rather than the VLB to improve the stability of training (Ho et al., 2020). Similarly, we can optimize a reweighted variational lower bound when training SS-DDPMs.
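Following up on the time-dependent tail normalization described above, here is a minimal sketch of one way to implement it. Collecting statistics via a user-supplied `sample_G` callback and a fixed number of Monte Carlo draws is an assumption made for brevity, not the exact procedure used by the authors.

```python
import numpy as np

def fit_tail_normalizer(sample_G, T, num_draws=10_000):
    """Estimate per-timestep mean and std of the tail statistics G_t over the
    training data, so that normalized statistics have zero mean and unit variance.
    `sample_G(t, n)` is assumed to return n draws of G_t for random training objects."""
    means, stds = np.empty(T), np.empty(T)
    for t in range(1, T + 1):
        g = sample_G(t, num_draws)
        means[t - 1] = g.mean()
        stds[t - 1] = g.std() + 1e-8   # guard against degenerate timesteps
    return means, stds

def normalize_tail(G_t, t, means, stds):
    """Time-dependent normalization applied before feeding G_t to x_theta(G_t, t)."""
    return (G_t - means[t - 1]) / stds[t - 1]
```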
4 Related works

Our work builds upon Denoising Diffusion Probabilistic Models (Ho et al., 2020). Interest in diffusion models has increased recently due to their impressive results in image (Ho et al., 2020; Song et al., 2020b; Dhariwal & Nichol, 2021) and audio (Popov et al., 2021; Liu et al., 2022) generation.

SS-DDPM is most closely related to DDPM. Like DDPM, we only rely on variational inference when defining and working with our model. SS-DDPM can be seen as a direct generalization of DDPM and essentially is a recipe for defining DDPMs with non-Gaussian distributions. The underlying non-Gaussian DDPM is constructed implicitly and can be seen as dual to the star-shaped formulation that we use throughout the paper. Other ways to construct non-Gaussian DDPMs include Binomial diffusion (Sohl-Dickstein et al., 2015), Multinomial diffusion (Hoogeboom et al., 2021) and Gamma diffusion (Nachmani et al., 2021). Each of these works presents a separate derivation of the resulting objective, and extending them to other distributions is not straightforward. In contrast, SS-DDPM provides a single recipe for a wide range of distributions.

DDPMs have several important extensions. Song et al. (2020a) provide a family of non-Markovian diffusions that all result in the same training objective as DDPM. One of them, the Denoising Diffusion Implicit Model (DDIM), results in an efficient deterministic sampling algorithm that requires a much lower number of function evaluations than conventional stochastic sampling. Their derivations also admit a star-shaped forward process (at $\sigma_t^2 = 1 - \alpha_{t-1}$); however, the model is not studied in the star-shaped regime. Their reverse process remains Markovian, which we show to be insufficient to invert a star-shaped forward process. Denoising Diffusion Restoration Models (Kawar et al., 2022) provide a way to solve general linear inverse problems using a trained DDPM. They can be used for image restoration, inpainting, colorization and other conditional generation tasks. Both DDIMs and DDRMs rely on the explicit form of the underlying DDPM model and are derived for Gaussian diffusion. Extending these models to SS-DDPMs is a promising direction for future work.

Song et al. (2020b) established the connection between DDPMs and models based on score matching. This connection gives rise to continuous-time variants of the models, deterministic solutions and more precise density estimation using ODEs. We suspect that a similar connection might hold for SS-DDPMs as well, and it can be investigated further in future works.

Other works that present diffusion-like models with other types of noise, or that apply diffusion to manifold data, generally stem from score matching rather than variational inference. Flow Matching (Lipman et al., 2022) is an alternative probabilistic framework that works with any differentiable degradation process. De Bortoli et al. (2022) and Huang et al. (2022) extended score matching to Riemannian manifolds, and Chen & Lipman (2023) proposed Riemannian Flow Matching. Bansal et al. (2022) proposed Cold Diffusion, a non-probabilistic approach to reversing general degradation processes. To the best of our knowledge, an approach for constructing diffusion without a consecutive process was first proposed by Rissanen et al. (2022) (IHDM) and further expanded on by Daras et al. (2022) and Hoogeboom & Salimans (2022). IHDM uses a similar star-shaped structure that results in a similar variational lower bound. Adding a deterministic process based on the heat equation allows the authors to keep the reverse process Markovian without having to introduce the tail statistics. As IHDM heavily relies on blurring rather than adding noise, the resulting diffusion dynamics become very different. Conceptually, our work is much closer to DDPM than to IHDM.

5 Experiments

We evaluate SS-DDPM with different families of noising distributions. The experiment setup, training details and hyperparameters are listed in Appendices I-L.

Table 1: KL divergence between the real data distribution and the model distribution $D_{\mathrm{KL}}(q(x_0)\,\|\,p_\theta(x_0))$ for Gaussian DDPM and SS-DDPM.

|         | Dirichlet | Wishart |
|---------|-----------|---------|
| DDPM    | 0.200     | 0.096   |
| SS-DDPM | 0.011     | 0.037   |

Synthetic data. We consider two examples of star-shaped diffusion processes with Dirichlet and Wishart noise to generate data from the probabilistic simplex and from the manifold of p.d. matrices, respectively. We compare them to DDPM, where the predictive network $x_\theta(x_t, t)$ is parameterized to always satisfy the manifold constraints. As seen in Table 1, using the appropriate distribution results in a better approximation. The data and modeled distributions are illustrated in Table 3. This shows the ability of SS-DDPM to work with different distributions and generate data from exotic domains.

Geodesic data. We apply SS-DDPM to a geodesic dataset of fires on the Earth's surface (EOSDIS, 2020) using a three-dimensional von Mises Fisher distribution. The resulting samples and the source data are illustrated in Table 3. We find that SS-DDPM is not too sensitive to the distribution family and can fit data in different domains.
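To illustrate the geodesic setup, the sketch below draws the star-shaped von Mises Fisher forward process $q(x_t \mid x_0) = \mathrm{vMF}(x_t; x_0, \kappa_t)$ described in Appendix F.6 and accumulates the tail statistic $G_t = \sum_{s \ge t} \kappa_s x_s$. The concentration schedule is a placeholder, and `scipy.stats.vonmises_fisher` (available in recent SciPy versions) is assumed as the sampler rather than being the implementation used in the paper.

```python
import numpy as np
from scipy.stats import vonmises_fisher  # assumed available (SciPy >= 1.11)

def vmf_star_shaped_tail(x0, kappas, seed=0):
    """Star-shaped vMF forward process with mode x0 on the unit sphere and its
    tail statistics G_t = sum_{s >= t} kappa_s * x_s (Appendix F.6)."""
    rng = np.random.default_rng(seed)
    xs = np.stack([
        vonmises_fisher(mu=x0, kappa=k).rvs(1, random_state=rng)[0]
        for k in kappas
    ])                                            # shape (T, 3)
    weighted = kappas[:, None] * xs               # kappa_s * x_s
    G = np.cumsum(weighted[::-1], axis=0)[::-1]   # G[t] = sum over s = t, ..., T
    return xs, G

# Toy usage: kappa goes from very concentrated (near x0) to nearly uniform on the sphere.
kappas = np.geomspace(1e3, 1e-2, num=50)
x0 = np.array([0.0, 0.0, 1.0])                    # a point on the unit sphere
xs, G = vmf_star_shaped_tail(x0, kappas)
```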
Table 2: Comparison of Categorical SS-DDPM and Multinomial Text Diffusion on text8. NLL is estimated via the ELBO.

| Model   | NLL (bits/char) |
|---------|-----------------|
| MTD     | 1.72            |
| SS-DDPM | 1.69            |

Discrete data. Categorical SS-DDPM is similar to Multinomial Text Diffusion (MTD, Hoogeboom et al., 2021). However, unlike in the Gaussian case, these models are not strictly equivalent. We follow a setup similar to MTD and apply Categorical SS-DDPM to unconditional text generation on the text8 dataset (Mahoney, 2011). As shown in Table 2, SS-DDPM achieves similar results to MTD, allowing us to use different distributions in a unified manner. While D3PM (Austin et al., 2021) provides some improvements, we follow the simpler setup from MTD. We expect the improvements from D3PM to apply directly to Categorical SS-DDPM.

Figure 6: Quality of images (FID) generated using Beta SS-DDPM, DDPM and DDIM with different numbers of sampling steps. Models are trained and evaluated on CIFAR-10. DDPM and DDIM results were taken from Nichol & Dhariwal (2021).

Image data. Finally, we evaluate SS-DDPM on CIFAR-10. Since the training data is constrained to the [0, 1] segment, we use Beta distributions. We evaluate our model with various numbers of generation steps, as described in equation (25), and report the resulting Fréchet Inception Distance (FID, Heusel et al., 2017) in Figure 6. Beta SS-DDPM achieves quality comparable to Improved DDPM (Nichol & Dhariwal, 2021) and is slightly better at lower numbers of steps. As expected, DDIM performs better when the number of diffusion steps is low, but both SS-DDPM and DDPM outperform DDIM on longer runs. The best FID score achieved by Beta SS-DDPM is 3.17. Although the FID curves for DDPM and DDIM do not achieve this score in Figure 6, Ho et al. (2020) reported an FID score of 3.17 for 1000 DDPM steps, meaning that SS-DDPM performs similarly to DDPM in this setting.

Table 3: Experiment results (columns: von Mises Fisher, Dirichlet, Wishart). The first row is real data and the second is generated samples. For the von Mises Fisher and Dirichlet models, we show two-dimensional histograms of samples. For the Wishart model, we draw ellipses corresponding to the p.d. matrices $x_0$ and $x_\theta$. The darker the pixel, the more ellipses pass through that pixel.

6 Conclusion

We propose an alternative view on diffusion-like probabilistic models. We reveal the duality between star-shaped and Markov diffusion processes that allows us to go beyond Gaussian noise by switching to a star-shaped formulation. It allows us to define diffusion-like models with arbitrary noising distributions and to establish diffusion processes on specific manifolds. We propose an efficient way to construct a reverse process for such models when the noising process lies in a general subset of the exponential family and show that star-shaped diffusion models can be trained on a variety of domains with different noising distributions. On image data, star-shaped diffusion with Beta-distributed noise attains performance comparable to Gaussian DDPM, challenging the optimality of Gaussian noise in this setting. The star-shaped formulation opens new applications of diffusion-like probabilistic models, especially for data from exotic domains where domain-specific non-Gaussian diffusion is more appropriate.

Acknowledgements

We are grateful to Sergey Kholkin for additional experiments with von Mises Fisher SS-DDPM on geodesic data and to Tingir Badmaev for helpful recommendations in the experiments on image data. We'd also like to thank Viacheslav Meshchaninov for sharing his knowledge of diffusion models for text data.
This research was supported in part through computational resources of HPC facilities at HSE University. The results on image data (Image data in Section 5; Section L) were obtained by Andrey Okhotin and Aibek Alanov with the support of the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arxiv 2017. ar Xiv preprint ar Xiv:1701.07875, 30:4, 2017. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981 17993, 2021. Bansal, A., Borgnia, E., Chu, H.-M., Li, J. S., Kazemi, H., Huang, F., Goldblum, M., Geiping, J., and Goldstein, T. Cold diffusion: Inverting arbitrary image transforms without noise. ar Xiv preprint ar Xiv:2208.09392, 2022. Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. ar Xiv preprint ar Xiv:1809.11096, 2018. Chen, R. T. and Lipman, Y. Riemannian flow matching on general geometries. ar Xiv preprint ar Xiv:2302.03660, 2023. Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019. Daras, G., Delbracio, M., Talebi, H., Dimakis, A. G., and Milanfar, P. Soft diffusion: Score matching for general corruptions. ar Xiv preprint ar Xiv:2209.05442, 2022. De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., and Doucet, A. Riemannian score-based generative modeling. ar Xiv preprint ar Xiv:2202.02763, 2022. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. ar Xiv preprint ar Xiv:2105.05233, 2021. EOSDIS. Land, atmosphere near real-time capability for eos (lance) system operated by nasa s earth science data and information system (esdis). https://earthdata.nasa.gov/earth-observation-data/nearreal-time/firms/active-fire-data, 2020. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. ar Xiv preprint ar Xiv:1810.01367, 2018. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. corr abs/1512.03385 (2015), 2015. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. ar Xiv preprint ar Xiv:2006.11239, 2020. Hoogeboom, E. and Salimans, T. Blurring diffusion models. ar Xiv preprint ar Xiv:2209.05557, 2022. Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P., and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. 
Advances in Neural Information Processing Systems, 34:12454 12465, 2021. Huang, C.-W., Aghajohari, M., Bose, J., Panangaden, P., and Courville, A. C. Riemannian diffusion models. Advances in Neural Information Processing Systems, 35:2750 2761, 2022. Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401 4410, 2019. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., and Aila, T. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 2021. Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. ar Xiv preprint ar Xiv:2107.00630, 2, 2021. Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical review E, 69(6):066138, 2004. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. ar Xiv preprint ar Xiv:2210.02747, 2022. Liu, S., Su, D., and Yu, D. Diffgan-tts: High-fidelity and efficient text-to-speech with denoising diffusion gans. ar Xiv preprint ar Xiv:2201.11972, 2022. Luo, S. and Hu, W. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837 2845, 2021. Mahoney, M. Large text compression benchmark, 2011. URL http://www.mattmahoney.net/ dc/text.html. Molchanov, D., Kharitonov, V., Sobolev, A., and Vetrov, D. Doubly semi-implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2593 2602. PMLR, 2019. Nachmani, E., Roman, R. S., and Wolf, L. Denoising diffusion gamma models. ar Xiv preprint ar Xiv:2110.05948, 2021. Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. ar Xiv preprint ar Xiv:2102.09672, 2021. Pitman, E. J. G. Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4):567 579, 1936. doi: 10.1017/S0305004100019307. Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pp. 8599 8608. PMLR, 2021. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zeroshot text-to-image generation. In International Conference on Machine Learning, pp. 8821 8831. PMLR, 2021. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pp. 1278 1286. PMLR, 2014. Rissanen, S., Heinonen, M., and Solin, A. Generative modelling with inverse heat dissipation. ar Xiv preprint ar Xiv:2206.13397, 2022. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. ar Xiv preprint ar Xiv:2104.07636, 2021. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. 
In International Conference on Machine Learning, pp. 2256 2265. PMLR, 2015. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. ar Xiv preprint ar Xiv:2010.02502, 2020a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020b. Thanh-Tung, H. and Tran, T. Catastrophic forgetting and mode collapse in gans. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1 10. IEEE, 2020. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Xiao, Z., Kreis, K., Kautz, J., and Vahdat, A. Vaebm: A symbiosis between variational autoencoders and energy-based models. ar Xiv preprint ar Xiv:2010.00654, 2020. Yin, M. and Zhou, M. Semi-implicit variational inference. In International Conference on Machine Learning, pp. 5660 5669. PMLR, 2018. Zhang, L., Goldstein, M., and Ranganath, R. Understanding failures in out-of-distribution detection with deep generative models. In International Conference on Machine Learning, pp. 12427 12436. PMLR, 2021. Zhao, S., Ren, H., Yuan, A., Song, J., Goodman, N., and Ermon, S. Bias and generalization in deep generative models: An empirical study. Advances in Neural Information Processing Systems, 31, 2018. Zhou, L., Du, Y., and Wu, J. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5826 5835, 2021. A Variational lower bound for the SS-DDPM model L SS(θ) = Eq SS(x0:T ) log p SS θ (x0:T ) q SS(x1:T |x0) = Eq SS(x0:T ) log p SS θ (x0|x1:T )p SS θ (x T ) QT t=2 p SS θ (xt 1|xt:T ) QT t=1 q SS(xt|x0) = (26) = Eq SS(x0:T ) log p SS θ (x0|x1:T ) + t=2 log p SS θ (xt 1|xt:T ) q SS(xt 1|x0) + log p SS θ (x T ) q SS(x T |x0) = Eq SS(x0:T ) log p SS θ (x0|x1:T ) t=2 DKL (q SS(xt 1|x0) p SS θ (xt 1|xt:T )) B True reverse process for Markovian and Star-Shaped DDPM If the forward process is Markovian, the corresponding true reverse process is Markovian too: q DDPM(xt 1|xt:T ) = q DDPM(xt 1:T ) q DDPM(xt:T ) = q DDPM(xt 1)q DDPM(xt|xt 1)((((((((((( QT s=t+1 q DDPM(xs|xs 1) q DDPM(xt)((((((((((( QT s=t+1 q DDPM(xs|xs 1) = q DDPM(xt 1|xt) q DDPM(x0:T ) = q(x0) t=1 q DDPM(xt|xt 1) = q DDPM(x T ) t=1 q DDPM(xt 1|xt:T ) = q DDPM(x T ) t=1 q DDPM(xt 1|xt) (30) For star-shaped models, however, the reverse process has a general structure that cannot be reduced further: q SS(x0:T ) = q(x0) t=1 q SS(xt|x0) = q SS(x T ) t=1 q SS(xt 1|xt:T ) (31) In this case, a Markovian reverse process can be a very poor approximation to the true reverse process. Choosing such an approximation adds an irreducible gap to the variational lower bound: L SS Markov(θ) = Eq SS(x0:T ) log pθ(x T ) QT t=1 pθ(xt 1|xt) q SS(x1:T |x0) = (32) = Eq SS(x0:T ) log pθ(x T ) QT t=1 pθ(xt 1|xt)q(x0) QT t=1 q SS(xt 1|xt) q SS(x T ) QT t=1 q SS(xt 1|xt:T ) QT t=1 = (33) = Eq SS(x0:T ) log q(x0) DKL (q SS(x T ) pθ(x T )) t=1 DKL (q SS(xt 1|xt) pθ(xt 1|xt)) | {z } Reducible t=1 DKL (q SS(xt 1|xt:T ) q SS(xt 1|xt)) | {z } Irreducible Intuitively, there is little information shared between xt 1 and xt, as they are conditionally independent given x0. Therefore, we would expect the distribution q SS(xt 1|xt) to have a much higher entropy than the distribution q SS(xt 1|xt:T ), making the irreducible gap (35) large. 
The dramatic effect of this gap is illustrated in Figure 3. This gap can also be computed analytically for Gaussian DDPMs when the data is coming from a standard Gaussian distribution q(x0) = N (x0; 0, 1). According to equations (34 35), the best Markovian reverse process in this case is pθ(xt 1|xt) = q SS(xt 1|xt). It results in the following value of the variational lower bound: L SS Markov = H[q(x0)] + H[q SS(x0:T )] H[q SS(x T )] 1 1 + log(2π(1 α SS t 1α SS t )) (36) 0 200 400 600 800 1000 T Markovian reverse True reverse Figure 7: Variational lower bound for a Gaussian Star-Shaped DDPM model computed with the true reverse process and with the best Markovian approximation. If the reverse process is matched exactly (and has a general structure), the variational lower bound reduces to the negative entropy of the data distribution: L SS = Eq SS(x0:T ) log p SS (x0:T ) q SS(x1:T |x0) = (37) = Eq SS(x0:T ) log q SS(x0:T ) q SS(x1:T |x0) = (38) = Eq(x0) log q(x0) = 1 2 log(2π) 1 As shown in Figure 7, the Markovian approximation adds a substantial irreducible gap to the variational lower bound. C Proof of Theorem 1 We start by establishing the following lemma. It would allow us to change the variables in the condition of a conditional distribution. Essentially, we would like to show that if a variable y is only dependent on x through some function h(x), we can write the distribution p(y|x) as p(y|z)|z=h(x). Lemma 1. Assume random variables x and y follow a joint distribution p(x, y) = p(x)p(y|x), where p(y|x) is given by f(y, h(x)). Then p(y|x) = p(y|z)|z=h(x), where the variable z is defined as z = h(x). Proof. We can write down the joint distribution over all variables as follows: p(x, y, z) = p(x)f(y, h(x))δ(z h(x)). (40) By integrating x out, we get p(y, z) = Z p(x, y, z)dx = Z p(x)f(y, h(x))δ(z h(x))dx = Z p(x)f(y, z)δ(z h(x))dx = (41) = f(y, z) Z p(x)δ(z h(x))dx = f(y, z)p(z) (42) Finally, we obtain p(y|z)|z=h(x) = p(y, z) |z=h(x) = f(y, z)|z=h(x) = f(y, h(x)) = p(y|x) (43) Theorem 1. Assume the forward process of a star-shaped model takes the following form: q SS(xt|x0) = ht(xt) exp ηt(x0)TT (xt) Ωt(x0) , (13) ηt(x0) = Atf(x0) + bt. (14) Let Gt be a tail statistic, defined as follows: Gt = Gt(xt:T ) = s=t AT s T (xs). (15) Then, Gt is a sufficient tail statistic: q SS(xt 1|xt:T ) = q SS(xt 1|Gt). (16) q SS(xt 1|xt:T ) = Z q SS(xt 1|x0)q SS(x0|xt:T )dx0 (44) q SS(x0|xt:T ) = q(x0) QT s=t q SS(xs|x0) q SS(xt:T ) = q(x0) q SS(xt:T ) ηs(x0)TT (xs) Ωs(x0) ) = q(x0) q SS(xt:T ) (Asf(x0) + bs)TT (xs) Ωs(x0) ) = q(x0) q SS(xt:T ) s=t AT s T (xs) + b T s T (xs) Ωs(x0) ) = q(x0) q SS(xt:T ) b T s T (xs) Ωs(x0) ) = ha distribution must be normalized = q(x0) exp n f(x0)TGt PT s=t Ωs(x0) o R q(x0) exp n f(x0)TGt PT s=t Ωs(x0) o dx0 = h Lemma 1 for Gt as z, x0 as y and xt:T as x i = q SS(x0|Gt) (49) q SS(xt 1|xt:T ) = Z q SS(xt 1|x0)q SS(x0|xt:T )dx0 = Z q SS(xt 1|x0)q SS(x0|Gt)dx0 = q SS(xt 1|Gt) (50) D Gaussian SS-DDPM is equivalent to Gaussian DDPM Theorem 2. Let αDDPM t define the noising schedule for a DDPM model (1 2) via βt = (αDDPM t 1 αDDPM t )/αDDPM t 1 . Let q SS(x0:T ) be a Gaussian SS-DDPM forward process with the following noising schedule and sufficient tail statistic: q SS(xt|x0) = N xt; p αSS t x0, 1 α SS t , (22) Gt(xt:T ) = 1 αDDPM t p αSS s xs 1 αSS s , where (23) αSS t 1 αSS t = αDDPM t 1 αDDPM t αDDPM t+1 1 αDDPM t+1 . (24) Then the tail statistic Gt follows a Gaussian DDPM noising process q DDPM(x0:T )|x1:T =G1:T defined by the schedule αDDPM t . 
Moreover, the corresponding reverse processes and VLB objectives are also equivalent. Proof. We start by listing the necessary definitions and expressions used in the Gaussian DDPM model (Ho et al., 2020). q DDPM(x0:T ) = q(x0) i=1 q DDPM(xt|xt 1) (51) q DDPM(xt|xt 1) = N xt; p 1 βtxt 1, βt I (52) α DDPM s = 1 β DDPM s (53) s=1 α DDPM s (54) q DDPM(xt|x0) = N xt; p αDDPM t x0, (1 α DDPM t )I (55) µt(xt, x0) = αDDPM t 1 βt 1 αDDPM t x0 + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t xt (56) β DDPM t = 1 αDDPM t 1 1 αDDPM t βt (57) q DDPM(xt 1|xt, x0) = N xt 1; µt(xt, x0), β DDPM t I (58) p DDPM θ (x0:T ) = q DDPM(x T ) i=1 p DDPM θ (xt 1|xt) (59) p DDPM θ (xt 1|xt) = N xt 1; µt(xt, x DDPM θ (xt, t)), β DDPM t I (60) In this notation, the variational lower bound for DDPM can be written as follows (here we omit the term corresponding to x T because it s either zero or a small constant): L DDPM(θ) = Eq DDPM(x0:T ) log p DDPM θ (x0|x1) t=2 DKL (q DDPM(xt 1|xt, x0) p DDPM θ (xt 1|xt)) = Eq DDPM(x0:T ) log p DDPM θ (x0|x1) t=2 DKL q DDPM(xt 1|xt, x0) q DDPM(xt 1|xt, x0)|x0=x DDPM θ (xt,t) # DKL q DDPM(xt 1|xt, x0) q DDPM(xt 1|xt, x0)|x0=x DDPM θ (xt,t) = αDDPM t 1 βDDPM t 1 αDDPM t (x0 x DDPM θ (xt, t)) 2 2 βDDPM t (63) For convenience, we additionally define αDDPM T +1 = 0. Since all the involved distributions are typically isotropic, we consider a one-dimensional case without the loss of generality. First, we show that the forward processes q SS(x0, G1:T ) and q DDPM(x0:T ) are equivalent. Gt = 1 αDDPM t p αSS s xs 1 αSS s (64) Since Gt is a linear combination of independent (given x0) Gaussian random variables, its distribution (given x0) is also Gaussian. All the terms conveniently cancel out, and we recover the same form as the forward process of DDPM. q SS(Gt|x0) = N Gt; 1 αDDPM t p αSS s x0 1 αSS s ; 1 αDDPM t p αSS s (1 αSS s )2 (1 α SS s ) Gt; 1 αDDPM t p αDDPM t 1 αDDPM t x0; 1 αDDPM t p αDDPM t 1 αDDPM t αDDPM t x0; (1 α DDPM t ) = q DDPM(xt|x0)|xt=Gt (67) Since both processes share the same q(x0), and have the same Markovian structure, matching these marginals sufficiently shows that the processes are equivalent: q SS(x0, G1:T ) = q DDPM(x0:T )|x1:T =G1:T (68) Consequently, the posteriors are matching too: q SS(Gt 1|Gt, x0) = q DDPM(xt 1|xt, x0)|xt 1,t=Gt 1,t (69) Next, we show that the reverse processes p SS θ (Gt 1|Gt) and p DDPM θ (xt 1|xt) are the same. Since Gt in SS-DDPM is the same as xt in DDPM, we can assume the predictive models x SS θ (Gt, t) and x DDPM θ (xt, t) to coincide at xt = Gt. Gt 1 = 1 αDDPM t 1 p αSS s xs 1 αSS s = 1 αDDPM t 1 p αSS t 1xt 1 1 αSS t 1 + 1 αDDPM t 1 p αDDPM t 1 αDDPM t Gt = (70) = 1 αDDPM t 1 p αSS s xs 1 αSS s = 1 αDDPM t 1 p αSS t 1xt 1 1 αSS t 1 + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt, (71) where xt 1 p SS θ (xt 1|Gt) = q SS(xt 1|x0)|x0=xθ(Gt,t). Therefore, p SS θ (Gt 1|Gt) is also a Gaussian distribution. 
Let s take care of the mean first: Ep SS θ (Gt 1|Gt)Gt 1 = 1 αDDPM t 1 p αSS t 1 1 αSS t 1 xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt = (72) = 1 αDDPM t 1 p αDDPM t 1 1 αDDPM t 1 αDDPM t 1 αDDPM t xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt = (73) αDDPM t 1 (1 αDDPM t 1 ) p αDDPM t 1 αDDPM t 1 αDDPM t xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt = (74) = 1 αDDPM t (1 αDDPM t 1 )αDDPM t 1 αDDPM t αDDPM t 1 xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt = (75) = 1 αDDPM t αDDPM t + αDDPM t 1 αDDPM t αDDPM t 1 xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt = (76) αDDPM t 1 βt 1 αDDPM t xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt (77) Now, let s derive the variance: Dp SS θ (Gt 1|Gt)Gt 1 = 1 αDDPM t 1 p αSS t 1 1 αSS t 1 (1 α SS t 1) = (78) 1 αDDPM t 1 p αSS t 1 1 αSS t 1 = (1 αDDPM t 1 )2 αSS t 1 1 αSS t 1 = (1 αDDPM t 1 )2 αDDPM t 1 1 αDDPM t 1 αDDPM t 1 αDDPM t = (1 α DDPM t 1 ) 1 (1 αDDPM t 1 )αDDPM t 1 αDDPM t = (1 αDDPM t 1 )(1 αDDPM t (1 αDDPM t 1 )αDDPM t ) 1 αDDPM t = (80) = 1 αDDPM t 1 1 αDDPM t (1 α DDPM t α DDPM t + α DDPM t ) = 1 αDDPM t 1 1 αDDPM t β DDPM t = β DDPM t (81) p SS θ (Gt 1|Gt) = N αDDPM t 1 βt 1 αDDPM t xθ(Gt, t) + αDDPM t (1 αDDPM t 1 ) 1 αDDPM t Gt, β DDPM t = p DDPM θ (xt 1|xt)|xt 1,t=Gt 1,t Finally, we show that the variational lower bounds are the same. L SS(θ) = Eq SS(x0:T ) log p SS θ (x0|G1) t=2 DKL (q SS(xt 1|x0) p SS θ (xt 1|Gt)) = Eq SS(x0:T ) log p SS θ (x0|G1) t=2 DKL q SS(xt 1|x0) q SS(xt 1|x0)|x0=xθ(Gt,t) # Since G1 from the star-shaped model is the same as x1 from the DDPM model, the first term log p SS θ (x0|G1) coincides with log p DDPM θ (x0|x1)|x1=G1. DKL q SS(xt 1|x0) q SS(xt 1|x0)|x0=xθ(Gt,t) = αSS t 1(x0 xθ(Gt, t)) 2 2(1 αSS t 1) = (85) = αSS t 1 2(1 αSS t 1)(x0 xθ(Gt, t))2 = 1 αDDPM t 1 1 αDDPM t 1 αDDPM t 1 αDDPM t (x0 xθ(Gt, t))2 = (86) αDDPM t 1 αDDPM t (1 αDDPM t 1 )(1 αDDPM t )(x0 xθ(Gt, t))2 = 1 αDDPM t 1 βDDPM t (1 αDDPM t 1 )(1 αDDPM t )(x0 xθ(Gt, t))2 = (87) = αDDPM t 1 βDDPM t (x0 x DDPM θ (xt, t))2 2(1 αDDPM t 1 )(1 αDDPM t ) = αDDPM t 1 (βDDPM t )2 (1 αDDPM t )2 (x0 x DDPM θ (xt, t))2 2 1 αDDPM t 1 1 αDDPM t βDDPM t = (88) αDDPM t 1 βDDPM t 1 αDDPM t (x0 x DDPM θ (xt, t)) 2 2 βDDPM t = DKL q DDPM(xt 1|xt, x0) q DDPM(xt 1|xt, x0)|x0=x DDPM θ (xt,t) (89) Therefore, we finally obtain LSS(θ) = LDDPM(θ). E Duality between star-shaped and Markovian diffusion: general case We assume that T (xt) and all matrices At (except possibly AT ) are invertible. Then, there is a bijection between the set of tail statistics Gt:T and the tail xt:T . First, we show that the variables G1:T form a Markov chain. q(Gt 1|Gt:T ) = q(Gt 1|xt:T ) = Z q(Gt 1|x0, xt:T )q(x0|xt:T )dx0 = (90) = Z q(Gt 1|Gt, x0)q(x0|Gt)dx0 = q(Gt 1|Gt) (91) q(x0, Gt:T ) = q(x0|G1:T ) t=2 q(Gt 1|Gt:T ) = q(x0|G1) t=2 q(Gt 1|Gt) = q(x0)q(G1|x0) t=2 q(Gt|Gt 1) (92) The last equation holds since the reverse of a Markov chain is also a Markov chain. This means that a star-shaped diffusion process on x1:T implicitly defines some Markovian diffusion process on the tail statistics G1:T . 
Due to the definition of Gt 1 = Gt + AT t 1T (xt 1), we can write down the following factorization of that process: q(x0, Gt:T ) = q(x0)q(GT ) t=2 q(Gt 1|Gt, x0) (93) The posteriors q(Gt 1|Gt, x0) can then be computed by a change of variables: q(Gt 1|Gt, x0) = q(xt 1|x0)|xt 1=T 1(A T t 1(Gt 1 Gt)) det d d Gt 1 T 1 A T t 1(Gt 1 Gt) (94) This also allows us to define the reverse model like it was defined in DDPM: pθ(Gt 1|Gt) = q(Gt 1|Gt, x0)|x0=xθ(Gt,t) (95) This definition is consistent with the reverse process of SS-DDPM pθ(xt 1|xt:T ) = q(xt 1|x0)|x0=xθ(Gt,t). Now that both the forward and reverse processes are defined, we can write down the corresponding variational lower bound. Because the model is structured exactly like a DDPM, the variational lower bound is going to look the same. However, we can show that it is equivalent to the variational lower bound of SS-DDPM: L Dual(θ) = Eq(x0,G1:T ) log pθ(x0|G1) t=2 DKL (q(Gt 1|Gt, x0) pθ(Gt 1|Gt)) DKL (q(GT |x0) pθ(GT )) = Eq(x0,G1:T ) log pθ(x0|G1) t=2 Eq(Gt 1|Gt,x0) log q(Gt 1|Gt, x0) pθ(Gt 1|Gt) DKL (q(x T |x0) pθ(x T )) = Eq(x0:T ) log pθ(x0|G1) t=2 Eq(xt 1|x0) log q(xt 1|x0) pθ(xt 1|Gt) DKL (q(x T |x0) pθ(x T )) = Eq(x0:T ) log pθ(x0|G1) t=2 DKL (q(xt 1|x0) pθ(xt 1|Gt)) DKL (q(x T |x0) pθ(x T )) = L SS(θ) (99) As we can see, there are two equivalent ways to write down the model. One is SS-DDPM, a star-shaped diffusion model, where we only need to define the marginal transition probabilities q(xt|x0). Another way is to rewrite the model in terms of the tail statistics Gt:T . This way we obtain a non-Gaussian DDPM that is implicitly defined by the star-shaped model. Because of this equivalence, we see the star-shaped diffusion of x1:T and the Markovian diffusion of G1:T as a pair of dual processes. F SS-DDPM in different families The main principles for designing a SS-DDPM model are similar to designing a DDPM model. When t goes to zero, we wish to recover the Dirac delta, centered at x0: q SS(xt|x0) t 0 δ(xt x0) (100) When t goes to T, we wish to obtain some standard distribution that doesn t depend on x0: q SS(xt|x0) t T q SS T (x T ) (101) In exponential families with linear parameterization of the natural parameter ηt(x0) = atf(x0) + bt, we can define the schedule by choosing the parameters at and bt that satisfy conditions (100 101). After that, we can use Theorem 1 to define the tail statistic Gt(xt:T ) using the sufficient statistic T (xt) of the corresponding family. However, as shown in the following sections, in some cases linear parameterization admits a simpler sufficient tail statistic. The following sections contain examples of defining the SS-DDPM model for different families. These results are summarized in Table 6. F.1 Gaussian q SS(xt|x0) = N xt; p αSS t x0, (1 α SS t )I (102) Since Gaussian SS-DDPM is equivalent to a Markovian DDPM, it is natural to directly reuse the schedule from a Markovian DDPM. As we show in Theorem 2, given a Markovian DDPM defined by αDDPM t , the following parameterization will produce the same process in the space of tail statistics: αSS t 1 αSS t = αDDPM t 1 αDDPM t αDDPM t+1 1 αDDPM t+1 (103) Gt(xt:T ) = 1 αDDPM t p αSS s xs 1 αSS s (104) The KL divergence is computed as follows: DKL (q SS(xt|x0) p SS θ (xt|Gt)) = αSS t (x0 xθ)2 2(1 αSS t ) (105) q(xt|x0) = Beta(xt; αt, βt) (106) There are many ways to define a noising schedule. We choose to fix the mode of the distribution at x0 and introduce a concentration parameter νt, parameterizing αt = 1 + νtx0 and βt = 1 + νt(1 x0). 
By setting νt to zero, we recover a uniform distribution, and by setting it to infinity we obtain the Dirac delta, centered at x0. Generally, the Beta distribution has a two-dimensional sufficient statistic T (x) = log x log(1 x) . However, under this parameterization, we can derive a one-dimensional tail statistic: q(xt|x0) exp {(αt 1) log xt + (βt 1) log(1 xt)} = exp {νtx0 log xt + νt(1 x0) log(1 xt)} = (107) νtx0log xt 1 xt + νt log(1 xt) = exp {ηt(x0)T (xt) + log ht(xt)} (108) Therefore we can use ηt(x0) = νtx0 and T (x) = log x 1 x to define the tail statistic: Gt(xt:T ) = s=t νs log xs 1 xs (109) The KL divergence can then be calculated as follows: DKL (q SS(xt|x0) p SS θ (xt|Gt)) = log Beta(αt(xθ), βt(xθ)) Beta(αt(x0), βt(x0)) + νt(x0 xθ)(ψ(αt(x0)) ψ(βt(x0))) (110) Figure 8: Visualization of the forward process in Dirichlet SS-DDPM on a three-dimensional probabilistic simplex. F.3 Dirichlet q(xt|x0) = Dirichlet(xt; α1 t, . . . , αK t ) (111) Similarly to the Beta distribution, we choose to fix the mode of the distribution at x0 and introduce a concentration parameter νt, parameterizing αk t = 1 + νtxk 0. This corresponds to using the natural parameter ηt(x0) = νtx0. By setting νt to zero, we recover a uniform distribution, and by setting it to infinity we obtain the Dirac delta, centered at x0. We can use the sufficient statistic T (x) = (log x1, . . . , log x K)T to define the tail statistic: Gt(xt:T ) = s=t νs(log x1 s, . . . , log x K s )T (112) The KL divergence can then be calculated as follows: DKL (q SS(xt|x0) p SS θ (xt|Gt)) = log Γ(αk t (xθ)) Γ(αk t (x0)) + νt(xk 0 xk θ)ψ(αk t (x0)) (113) F.4 Categorical q(xt|x0) = Cat(xt; pt) (114) In this case, we mimic the definition of the categorical diffusion model, used in D3PM. The noising process is parameterized by the probability vector pt = x0Qt. By setting Q0 to identity, we recover the Dirac delta, centered at x0. In this parameterization, the natural parameter admits linearization (after some notation abuse): ηt(x0) = log(x0Qt) = x0 log Qt, where (115) log is taken element-wise, and we assume 0 log 0 = 0. The sufficient statistic here is a vector x T t . Therefore, the tail statistic can be defined as follows: Gt(xt:T ) = log Qs x T s = s=t log(Qsx T s ) (116) The KL divergence can then be calculated as follows: DKL (q SS(xt|x0) p SS θ (xt|Gt)) = i=1 (x0Qt)i log (x0Qt)i (xθQt)i (117) Note that, unlike in the Gaussian case, the categorical star-shaped diffusion is not equivalent to the categorical DDPM. The main difference here is that the input Gt to the predictive model xθ(Gt) is now a continuous vector instead of a one-hot vector. As discussed in Section H, proper normalization is important for training SS-DDPMs. 
F.4 Categorical
$$q(x_t|x_0) = \mathrm{Cat}(x_t;\ p_t) \quad (114)$$
In this case, we mimic the definition of the categorical diffusion model used in D3PM. The noising process is parameterized by the probability vector $p_t = x_0 \bar{Q}_t$. By setting $\bar{Q}_0$ to the identity, we recover the Dirac delta centered at $x_0$. In this parameterization, the natural parameter admits a linearization (after some notation abuse):
$$\eta_t(x_0) = \log(x_0 \bar{Q}_t) = x_0 \log \bar{Q}_t, \quad (115)$$
where $\log$ is taken element-wise, and we assume $0 \log 0 = 0$. The sufficient statistic here is the vector $x_t^T$. Therefore, the tail statistic can be defined as follows:
$$G_t(x_{t:T}) = \sum_{s=t}^{T} \log(\bar{Q}_s)\, x_s^T = \sum_{s=t}^{T} \log\left(\bar{Q}_s x_s^T\right) \quad (116)$$
The KL divergence can then be calculated as follows:
$$D_{KL}\left(q^{SS}(x_t|x_0)\,\|\, p^{SS}_\theta(x_t|G_t)\right) = \sum_{i=1}^{D} (x_0 \bar{Q}_t)_i \log\frac{(x_0 \bar{Q}_t)_i}{(x_\theta \bar{Q}_t)_i} \quad (117)$$
Note that, unlike in the Gaussian case, the categorical star-shaped diffusion is not equivalent to the categorical DDPM. The main difference is that the input $G_t$ to the predictive model $x_\theta(G_t)$ is now a continuous vector instead of a one-hot vector. As discussed in Appendix H, proper normalization is important for training SS-DDPMs. In the case of Categorical distributions, we can use the $\mathrm{SoftMax}(\cdot)$ function to normalize the tail statistics without breaking sufficiency:
$$\bar{G}_t = \mathrm{SoftMax}\left(G_t(x_{t:T})\right) \quad (118)$$
To see this, we can retrace the steps of the proof of Theorem 1:
$$q^{SS}(x_0|x_{t:T}) \propto q(x_0) \exp\{x_0 G_t\} = q(x_0) \exp\left\{x_0 \log \bar{G}_t + \log \sum_{i=1}^{D} \exp\left((G_t)_i\right)\right\} \propto q(x_0) \exp\left\{x_0 \log \bar{G}_t\right\} \quad (119)$$
$$q^{SS}(x_0|x_{t:T}) = \frac{q(x_0) \exp\left\{x_0 \log \bar{G}_t\right\}}{\sum_{\tilde{x}_0} q(\tilde{x}_0) \exp\left\{\tilde{x}_0 \log \bar{G}_t\right\}} = q^{SS}(x_0|\bar{G}_t) \quad (120)$$
Also, Categorical distributions admit a convenient way to compute fractions involving $q(G_t|x_0)$:
$$q(G_t|x_0) = \sum_{x_{t:T}} \prod_{s=t}^{T} q(x_s|x_0)\, \mathbb{1}\left[G_t = G_t(x_{t:T})\right] = \quad (121)$$
$$= \sum_{x_{t:T}} \exp\left\{\sum_{s=t}^{T} x_0 \log(\bar{Q}_s)\, x_s^T\right\} \mathbb{1}\left[G_t = G_t(x_{t:T})\right] = \sum_{x_{t:T}} \exp\{x_0 G_t\}\, \mathbb{1}\left[G_t = G_t(x_{t:T})\right] = \quad (122)$$
$$= \exp\{x_0 G_t\} \sum_{x_{t:T}} \mathbb{1}\left[G_t = G_t(x_{t:T})\right] = \exp\{x_0 G_t\}\, \#G_t \quad (123)$$
This allows us to define the likelihood term $p(x_0|x_{1:T})$ as follows:
$$p(x_0|G_1) = \frac{q(G_1|x_0)\, p(x_0)}{\sum_{\tilde{x}_0} q(G_1|\tilde{x}_0)\, p(\tilde{x}_0)} = \frac{\exp\{x_0 G_1\}\, p(x_0)}{\sum_{\tilde{x}_0} \exp\{\tilde{x}_0 G_1\}\, p(\tilde{x}_0)}, \quad (124)$$
where $p(x_0)$ can be defined as the frequency of the token $x_0$ in the dataset. It also allows us to estimate the mutual information between $x_0$ and $G_t$:
$$I(x_0; G_t) = \mathbb{E}_{q(x_0) q(G_t|x_0)} \log\frac{q(G_t|x_0)}{q(G_t)} = \mathbb{E}_{q(x_0) q(G_t|x_0)} \log\frac{q(G_t|x_0)}{\sum_{\tilde{x}_0} q(G_t|\tilde{x}_0)\, q(\tilde{x}_0)} = \quad (125)$$
$$= \mathbb{E}_{q(x_0) q(G_t|x_0)} \log\frac{\exp\{x_0 G_t\}\, \#G_t}{\sum_{\tilde{x}_0} \exp\{\tilde{x}_0 G_t\}\, \#G_t\, q(\tilde{x}_0)} = \mathbb{E}_{q(x_0) q(G_t|x_0)} \log\frac{\exp\{x_0 G_t\}}{\sum_{\tilde{x}_0} \exp\{\tilde{x}_0 G_t\}\, q(\tilde{x}_0)} \quad (126)$$
It can be estimated using Monte Carlo. We use it when defining the noising schedule for Categorical SS-DDPM.
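Before moving on to the next families, here is a minimal sketch of the Categorical tail statistic (116), its SoftMax normalization (118) and the likelihood term (124). The array shapes and function names are our assumptions, not part of the paper's implementation.

```python
import numpy as np

def categorical_tail_statistic(xs_onehot, log_Q_bars):
    """G_t = sum_s log(Q_bar_s) x_s^T, Eq. (116).

    xs_onehot:  (num_steps, D) one-hot rows for x_s, s = t..T.
    log_Q_bars: (num_steps, D, D) element-wise logs of the cumulative transition matrices.
    """
    # For a one-hot x_s, log(Q_bar_s) x_s^T selects the column of log(Q_bar_s) indexed by x_s.
    return np.einsum('sij,sj->i', log_Q_bars, xs_onehot)

def softmax(v):
    v = v - np.max(v)          # numerical stabilization
    e = np.exp(v)
    return e / np.sum(e)

def normalized_tail(G_t):
    """Eq. (118): the softmax keeps the tail statistic sufficient for x_0."""
    return softmax(G_t)

def likelihood_term(G_1, token_freqs):
    """p(x_0 | G_1) from Eq. (124), with p(x_0) given by token frequencies."""
    return softmax(G_1 + np.log(token_freqs))
```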
F.5 von Mises
$$q(x_t|x_0) = \mathrm{vonMises}(x_t;\ x_0, \kappa_t) \quad (127)$$
The von Mises distribution has two parameters, the mode $x$ and the concentration $\kappa$. It is natural to set the mode of the noising distribution to $x_0$ and vary the concentration parameter $\kappa_t$. When $\kappa_t$ goes to infinity, the von Mises distribution approaches the Dirac delta centered at $x_0$. When $\kappa_t$ goes to 0, it approaches a uniform distribution on the unit circle. The sufficient statistic is $T(x_t) = (\cos x_t, \sin x_t)^T$, and the corresponding natural parameter is $\eta_t(x_0) = \kappa_t (\cos x_0, \sin x_0)^T$. The tail statistic $G_t(x_{t:T})$ is therefore defined as follows:
$$G_t(x_{t:T}) = \sum_{s=t}^{T} \kappa_s (\cos x_s, \sin x_s)^T \quad (128)$$
The KL divergence term can be calculated as follows:
$$D_{KL}\left(q^{SS}(x_t|x_0)\,\|\, p^{SS}_\theta(x_t|G_t)\right) = \kappa_t \frac{I_1(\kappa_t)}{I_0(\kappa_t)}\left(1 - \cos(x_0 - x_\theta)\right) \quad (129)$$

Figure 9: Visualization of the forward process of the von Mises-Fisher SS-DDPM on the three-dimensional unit sphere.

F.6 von Mises-Fisher
$$q(x_t|x_0) = \mathrm{vMF}(x_t;\ x_0, \kappa_t) \quad (130)$$
Similar to the one-dimensional case, we set the mode of the distribution to $x_0$ and define the schedule using the concentration parameter $\kappa_t$. When $\kappa_t$ goes to infinity, the von Mises-Fisher distribution approaches the Dirac delta centered at $x_0$. When $\kappa_t$ goes to 0, it approaches a uniform distribution on the unit sphere. The sufficient statistic is $T(x) = x$, and the corresponding natural parameter is $\eta_t(x_0) = \kappa_t x_0$. The tail statistic $G_t(x_{t:T})$ is therefore defined as follows:
$$G_t(x_{t:T}) = \sum_{s=t}^{T} \kappa_s x_s \quad (131)$$
The KL divergence term can be calculated as follows:
$$D_{KL}\left(q^{SS}(x_t|x_0)\,\|\, p^{SS}_\theta(x_t|G_t)\right) = \kappa_t \frac{I_{K/2}(\kappa_t)}{I_{K/2-1}(\kappa_t)}\, x_0^T (x_0 - x_\theta) \quad (132)$$

F.7 Gamma
$$q(x_t|x_0) = \Gamma(x_t;\ \alpha_t, \beta_t) \quad (133)$$
There are many ways to define a schedule. We choose to interpolate the mean of the distribution from $x_0$ at $t = 0$ to $1$ at $t = T$. This can be achieved with the following parameterization:
$$\beta_t(x_0) = \alpha_t\left(\xi_t + (1 - \xi_t)\, x_0^{-1}\right) \quad (134)$$
The mean of the distribution is $\frac{\alpha_t}{\beta_t}$, and the variance is $\frac{\alpha_t}{\beta_t^2}$. Therefore, we recover the Dirac delta centered at $x_0$ when we set $\xi_t$ to 0 and $\alpha_t$ to infinity. To achieve some standard distribution that doesn't depend on $x_0$, we can set $\xi_t$ to 1 and $\alpha_t$ to some fixed value $\alpha_T$. In this parameterization, the natural parameters are $\alpha_t - 1$ and $-\beta_t(x_0)$, and the corresponding sufficient statistics are $\log x_t$ and $x_t$. Since the parameter $\alpha_t$ doesn't depend on $x_0$, we only need the sufficient statistic $T(x_t) = x_t$ to define the tail statistic:
$$G_t(x_{t:T}) = \sum_{s=t}^{T} \alpha_s (1 - \xi_s)\, x_s \quad (135)$$
The KL divergence can be computed as follows:
$$D_{KL}\left(q^{SS}(x_t|x_0)\,\|\, p^{SS}_\theta(x_t|G_t)\right) = \alpha_t\left(\log\frac{\beta_t(x_0)}{\beta_t(x_\theta)} + \frac{\beta_t(x_\theta)}{\beta_t(x_0)} - 1\right) \quad (136)$$

Figure 10: Visualization of the forward process in Wishart SS-DDPM on positive definite matrices of size 2×2.

F.8 Wishart
$$q(X_t|X_0) = \mathcal{W}_p(X_t;\ V_t, n_t) \quad (137)$$
The natural parameters for the Wishart distribution are $-\frac{1}{2} V_t^{-1}$ and $\frac{n_t - p - 1}{2}$. To achieve a linear parameterization, we need to linearly parameterize the inverse of $V_t$ rather than $V_t$ directly. Similar to the Gamma distribution, we interpolate the mean of the distribution from $X_0$ at $t = 0$ to $I$ at $t = T$. This can be achieved with the following parameterization:
$$\mu_t(X_0) = \xi_t I + (1 - \xi_t)\, X_0^{-1} \quad (138)$$
$$V_t(X_0) = n_t^{-1}\, \mu_t^{-1}(X_0) \quad (139)$$
We recover the Dirac delta centered at $X_0$ when we set $\xi_t$ to 0 and $n_t$ to infinity. To achieve some standard distribution that doesn't depend on $X_0$, we set $\xi_t$ to 1 and $n_t$ to some fixed value $n_T$. The sufficient statistic for this distribution is $T(X_t) = (X_t, \log|X_t|)$. Since the parameter $n_t$ doesn't depend on $X_0$, we don't need the corresponding sufficient statistic $\log|X_t|$ and can use $T(X_t) = X_t$ to define the tail statistic:
$$G_t(X_{t:T}) = \sum_{s=t}^{T} n_s (1 - \xi_s)\, X_s \quad (140)$$
The KL divergence can then be calculated as follows:
$$D_{KL}\left(q^{SS}(X_t|X_0)\,\|\, p^{SS}_\theta(X_t|G_t)\right) = \frac{n_t}{2}\left(-\log\left|V_t^{-1}(X_\theta)\, V_t(X_0)\right| + \mathrm{tr}\left(V_t^{-1}(X_\theta)\, V_t(X_0)\right) - p\right) \quad (141)$$

G Choosing the noise schedule

In DDPMs, we train a neural network to predict $x_0$ given the current noisy sample $x_t$. In SS-DDPMs, we use the tail statistic $G_t$ as the neural network input instead. Therefore, it is natural to search for a schedule where the level of noise in $G_t$ is similar to the level of noise in $x_t$ from some DDPM model. Similarly to D3PM, we formalize the level of noise as the mutual information between the clean and noisy samples. For SS-DDPM it is $I^{SS}(x_0; G_t)$, and for DDPM it is $I^{DDPM}(x_0; x_t)$. We would like to start from a well-performing DDPM model and define a similar schedule by matching $I^{SS}(x_0; G_t) \approx I^{DDPM}(x_0; x_t)$. Since Gaussian SS-DDPM is equivalent to DDPM, the desired schedule can be found using Theorem 2. In other cases, however, it is difficult to find a matching schedule. In our experiments, we found the following heuristic to work well enough. First, we find a Gaussian SS-DDPM schedule that is equivalent to the DDPM with the desired schedule. We denote the corresponding mutual information as $I^{SS}_{\mathcal{N}}(x_0; G_t) = I^{DDPM}(x_0; x_t)$. Then, we can match it in the original space of $x_t$ and hope that the resulting mutual information in the space of tail statistics is close too:
$$I^{SS}(x_0; x_t) \approx I^{SS}_{\mathcal{N}}(x_0; x_t) \quad \stackrel{?}{\Longrightarrow} \quad I^{SS}(x_0; G_t) \approx I^{SS}_{\mathcal{N}}(x_0; G_t) \quad (142)$$
Assuming the schedule is parameterized by a single parameter $\nu$, we can build a look-up table $I^{SS}(x_0; x_\nu)$ for a range of parameters $\nu$. Then we can use binary search over this table to build a schedule that matches the mutual information $I^{SS}(x_0; x_t)$ to the target schedule $I^{SS}_{\mathcal{N}}(x_0; x_t)$. While this procedure doesn't allow us to match the target schedule exactly, it provides a good enough approximation and yields an adequate schedule. We used this procedure to find the schedule for the Beta diffusion in our experiments with image generation.
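A minimal sketch of this look-up-table matching is given below, assuming the schedule is governed by a single concentration parameter $\nu$ and that the tabulated mutual information is monotone in $\nu$; `estimate_mi` is a placeholder for the estimators discussed next.

```python
import numpy as np

def build_lookup(nu_grid, estimate_mi):
    """Tabulate I^SS(x_0; x_nu) on a grid of concentration parameters nu."""
    return np.array([estimate_mi(nu) for nu in nu_grid])

def match_schedule(target_mi, nu_grid, mi_table):
    """For each target value I^SS_N(x_0; x_t), pick nu_t by binary search in the look-up table.

    Assumes nu_grid is sorted and mi_table is monotonically increasing in nu
    (higher concentration preserves more information about x_0).
    """
    schedule = []
    for target in target_mi:
        idx = int(np.searchsorted(mi_table, target))   # binary search in the tabulated values
        idx = min(max(idx, 0), len(nu_grid) - 1)
        schedule.append(nu_grid[idx])
    return np.array(schedule)
```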
To build the look-up table, we need a robust way to estimate the mutual information. The target mutual information $I^{SS}_{\mathcal{N}}(x_0; x_t)$ can be computed analytically when the data $q(x_0)$ follows a Gaussian distribution. When the data follows an arbitrary distribution, it can be approximated with a Gaussian mixture, and the mutual information can then be calculated using numerical integration. Estimating the mutual information for arbitrary noising distributions is more difficult. We find that the Kraskov estimator (Kraskov et al., 2004) works well when the mutual information is high ($I > 2$). When the mutual information is lower, we build a different estimator using DSIVI bounds (Molchanov et al., 2019):
$$I^{SS}(x_0; x_\nu) = D_{KL}\left(q^{SS}(x_0, x_t)\,\|\, q^{SS}(x_0)\, q^{SS}(x_t)\right) = H[x_t] - H[x_t|x_0] \quad (143)$$
This conditional entropy is available in closed form for many distributions in the exponential family. Since the marginal distribution $q^{SS}(x_t) = \int q^{SS}(x_t|x_0)\, q(x_0)\, dx_0$ is a semi-implicit distribution (Yin & Zhou, 2018), we can use the DSIVI sandwich (Molchanov et al., 2019) to obtain an upper and a lower bound on the marginal entropy $H[x_t]$:
$$H[x_t] = -\mathbb{E}_{q^{SS}} \log q^{SS}(x_t) = -\mathbb{E}_{q^{SS}} \log \int q^{SS}(x_t|x_0)\, q(x_0)\, dx_0 \quad (144)$$
$$H[x_t] \geq -\mathbb{E}_{x_0^{0:K} \sim q(x_0)}\, \mathbb{E}_{x_t \sim q(x_t|x_0)|_{x_0 = x_0^0}} \log\frac{1}{K+1} \sum_{k=0}^{K} q^{SS}(x_t|x_0^k) \quad (145)$$
$$H[x_t] \leq -\mathbb{E}_{x_0^{0:K} \sim q(x_0)}\, \mathbb{E}_{x_t \sim q(x_t|x_0)|_{x_0 = x_0^0}} \log\frac{1}{K} \sum_{k=1}^{K} q^{SS}(x_t|x_0^k) \quad (146)$$
These bounds are asymptotically exact and can be estimated using Monte Carlo. We use $K = 1000$ when the mutual information is high ($\frac{1}{2} \leq I < 2$), $K = 100$ when the mutual information is lower ($0.002 \leq I < \frac{1}{2}$), and estimate the expectations using $M = 10^8 K^{-1}$ samples for each timestamp. For values $I > 2$ we use the Kraskov estimator with $M = 10^5$ samples and $k = 10$ neighbors. For values $I < 0.002$ we fit an exponential curve $i(t) = e^{at + b}$ to interpolate between the noisy SIVI estimates obtained with $K = 50$ and $M = 10^5$. For evaluating the mutual information $I(x_0; G_t)$ between the clean data and the tail statistics, we use the Kraskov estimator with $k = 10$ and $M = 10^5$. The mutual information look-up table for the Beta star-shaped diffusion, as well as the used estimates of the mutual information, are presented in Figure 11. The resulting schedule for the Beta diffusion is presented in Figure 12, and the comparison of the mutual information schedules for the tail statistics between Beta SS-DDPM and the reference Gaussian SS-DDPM is presented in Figure 13.

Figure 11: The mutual information look-up table for Beta star-shaped diffusion (DSIVI upper bound, DSIVI lower bound, Kraskov and combined estimates).
Figure 12: The schedule for Beta star-shaped diffusion.
Figure 13: Mutual information between clean data and the tail statistics for Beta star-shaped diffusion and the reference Gaussian SS-DDPM.

For Categorical SS-DDPM we estimate the mutual information using Monte Carlo. We then choose the noising schedule to match the cosine schedule used by Hoogeboom et al. (2021) using a similar technique.

H Normalizing the tail statistics

We illustrate different strategies for normalizing the tail statistics in Figure 14. Normalizing by the sum of coefficients is not enough, so we resort to matching the mean and the variance empirically. We refer to this trick as time-dependent tail normalization. Proper normalization also allows us to visualize the tail statistics by projecting them back into the original domain using $T^{-1}(\bar{G}_t)$. The effect of normalization is illustrated in Figure 15.

Figure 14: Example trajectories of the normalized tail statistics with different normalization strategies: (a) no normalization, $\bar{G}_t = G_t$; (b) normalized by the sum of coefficients, $\bar{G}_t = G_t / \sum_{s=t}^{T} a_s$; (c) matched mean and variance, $\bar{G}_t = (G_t - \mathbb{E} G_t)\, \frac{\mathrm{std}(T(x_0))}{\mathrm{std}(G_t)} + \mathbb{E}\, T(x_0)$. The trajectories come from a single dimension of Beta SS-DDPM.
Figure 15: Visualizing the tail statistics for Beta SS-DDPM with different normalizations by mapping the normalized tail statistics $\bar{G}_t$ into the data domain using $T^{-1}(\bar{G}_t)$. Top row: normalized by the sum of coefficients, $\bar{G}_t = G_t / \sum_{s=t}^{T} a_s$. Bottom row: matched mean and variance, $\bar{G}_t = (G_t - \mathbb{E} G_t)\, \frac{\mathrm{std}(T(x_0))}{\mathrm{std}(G_t)} + \mathbb{E}\, T(x_0)$.
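The mean-and-variance matching of strategy (c) in Figure 14 can be sketched as follows. The text above does not specify how the per-timestep statistics are obtained, so the simulation-based helper below is our assumption, as are the function names.

```python
import numpy as np

def precompute_normalization_stats(simulate_tail, data_x0, T_stat, num_runs=1000):
    """Estimate per-timestep mean/std of G_t and the mean/std of T(x_0) from forward simulations.

    simulate_tail(x0) -> array of shape (T, D) with tail statistics G_1..G_T for one trajectory.
    T_stat(x0)        -> sufficient statistic of a clean sample, shape (D,).
    """
    tails = np.stack([simulate_tail(x0) for x0 in data_x0[:num_runs]])   # (N, T, D)
    stats_x0 = np.stack([T_stat(x0) for x0 in data_x0[:num_runs]])       # (N, D)
    return {
        "mean_G": tails.mean(axis=0), "std_G": tails.std(axis=0) + 1e-8,
        "mean_T": stats_x0.mean(axis=0), "std_T": stats_x0.std(axis=0),
    }

def normalize_tail(G_t, t, stats):
    """Time-dependent tail normalization: match the mean and variance of G_t to those of T(x_0)."""
    return (G_t - stats["mean_G"][t]) / stats["std_G"][t] * stats["std_T"] + stats["mean_T"]
```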
I Synthetic data

We compare the performance of DDPM and SS-DDPM on synthetic tasks with exotic data domains. In these tasks, we train and sample from DDPM and SS-DDPM using $T = 64$ steps. We use an MLP with 3 hidden layers of size 512, swish activations and residual connections through the hidden layers (He et al., 2015). We use sinusoidal positional time embeddings (Vaswani et al., 2017) of size 32 and concatenate them with the normalized tail statistics $\bar{G}_t$. We add a mapping to the corresponding domain on top of the network. During training, we use gradient clipping and EMA weights to improve stability. All models on synthetic data were trained for 350k iterations with batch size 128. In all our experiments with SS-DDPM on synthetic data we use time-dependent tail normalization. We use DDPM with a linear schedule and $L_{simple}$ or $L_{vlb}$ as the loss function. We choose between the $L_{simple}$ and $L_{vlb}$ objectives based on the KL divergence between the data distribution and the model distribution, $D_{KL}(q(x_0)\,\|\, p_\theta(x_0))$. To make an honest comparison, we precompute normalization statistics for DDPM in the same way that we do in time-dependent tail normalization. In DDPM, the neural network makes predictions using $x_t$ as an input, so we precompute normalization statistics for $x_t$ and fix them during the training and sampling stages.

Probabilistic simplex. We evaluate Dirichlet SS-DDPM on a synthetic problem of generating objects on a three-dimensional probabilistic simplex. We use a mixture of three Dirichlet distributions with different parameters as training data. The Dirichlet SS-DDPM forward process is illustrated in Figure 8. To map the predictions to the domain, we put a Softmax function on top of the MLP. We optimize Dirichlet SS-DDPM on the VLB objective without any modifications, using Adam with a learning rate of 0.0004. The DDPM was trained on $L_{vlb}$ using Adam with a learning rate of 0.0002. An illustration of the samples is presented in Table 4.

Table 4: Generating objects from a three-dimensional probabilistic simplex (columns: Original, DDPM, Dirichlet SS-DDPM).

Symmetric positive definite matrices. We evaluate Wishart SS-DDPM on a synthetic problem of generating symmetric positive definite matrices of size 2×2. We use a mixture of three Wishart distributions with different parameters as training data. The Wishart SS-DDPM forward process is illustrated in Figure 10. For the case of symmetric positive definite matrices $V$, the MLP predicts a lower triangular factor $L_\theta$ from the Cholesky decomposition $V_\theta = L_\theta L_\theta^T$. For stability of sampling from the Wishart distribution and estimation of the loss function, we add a scalar matrix $10^{-4} I$ to the predicted symmetric positive definite matrix $V_\theta$. In Wishart SS-DDPM we use an analogue of $L_{simple}$ as the loss function.
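The mapping from the MLP output to a symmetric positive definite matrix described above can be sketched as follows; this is a minimal illustration, and the function name and the flat parameterization of $L_\theta$ are our assumptions.

```python
import numpy as np

def spd_from_mlp_output(raw, p=2, jitter=1e-4):
    """Map an unconstrained MLP output to a symmetric positive definite p x p matrix.

    raw: flat vector with p*(p+1)/2 entries parameterizing a lower-triangular factor L_theta.
    Returns V_theta = L_theta L_theta^T + jitter * I, mirroring the parameterization above.
    """
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = np.asarray(raw, dtype=np.float64)   # fill the lower triangle
    return L @ L.T + jitter * np.eye(p)                         # PSD factor plus a small jitter
```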
To improve the training stability, we divide each KL term corresponding to a timestamp $t$ by $n_t$, the de-facto concentration parameter from the used noising schedule (see Table 6). Wishart SS-DDPM was trained using Adam with a learning rate of 0.0004. The DDPM was trained on $L_{simple}$ using Adam with a learning rate of 0.0004. Since positive definite 2×2 matrices can be interpreted as ellipses, we visualize the samples by drawing the corresponding ellipses. An illustration of the samples is presented in Table 5.

Table 5: Generating symmetric positive definite 2×2 matrices (columns: Original, DDPM, Wishart SS-DDPM).

J Geodesic data

We evaluate von Mises-Fisher SS-DDPM on the problem of restoring the distribution on the sphere from empirical data of fires on the Earth's surface (EOSDIS, 2020). We illustrate points on the sphere using a 2D projection of the Earth's map. We train and sample from von Mises-Fisher SS-DDPM using $T = 100$ steps. The forward process is illustrated in Figure 9. We use the same MLP architecture as described in Appendix I and also use time-dependent tail normalization. To map the prediction onto the unit sphere, we normalize the three-dimensional output of the MLP. We optimize $L_{vlb}$ using the AdamW optimizer with a learning rate of 0.0002 and exponential decay with $\gamma = 0.999997$. The model is trained for 2,000,000 iterations with batch size 100. For inference, we also use EMA weights with a decay of 0.9999.

K Discrete data

We evaluate Categorical SS-DDPM on the task of unconditional character-level text generation on the text8 dataset. We evaluate our model on sequences of length 256. Using the property of Categorical SS-DDPM that we can directly compute the mutual information between $x_0$ and $G_t$ (see Appendix F.4), we match the noising schedule of $G_t$ to the noising schedule of $x_t$ used by Austin et al. (2021). We optimize Categorical SS-DDPM on the VLB objective. Following Austin et al. (2021), we use the default T5 encoder architecture with 12 layers, 12 heads, mlp_dim 3072 and qkv_dim 768. We add positional embeddings to the sequence of tail statistics $\bar{G}_t$ and also add the sinusoidal positional time embedding to the beginning of the sequence. We use Adam (Kingma & Ba, 2015) with a learning rate of $5 \cdot 10^{-4}$ and a 10,000-step learning rate warmup, but instead of inverse sqrt decay we use exponential decay with $\gamma = 0.999995$. We use a standard 90,000,000/5,000,000/500,000 train-test-validation split and train the neural network for 512 epochs (time costs when using 3 NVIDIA A100 GPUs: training took approx. 112 hours, and estimating NLL on the test set took approx. 2.5 hours). Since the Softmax function does not break the sufficiency of the tail statistic in Categorical SS-DDPM (see Appendix F.4), we can use it as an efficient input normalization for the neural network. Categorical SS-DDPM also provides a convenient way to define $\log p(x_0|G_1)$, and we make use of it to estimate the NLL score (see Appendix F.4).

L Image data

In our experiments with Beta SS-DDPM, we use the NCSN++ neural network architecture and training strategy of Song et al. (2020b) (time costs when using 4 NVIDIA 1080 GPUs: training took approx. 96 hours, sampling 50,000 images took approx. 10 hours). We put a sigmoid function on top of NCSN++ to map the predictions to the data domain. We use time-dependent tail normalization and train Beta SS-DDPM with $T = 1000$. We matched the noise schedule of Beta SS-DDPM to the cosine noise schedule of Gaussian DDPM (Nichol & Dhariwal, 2021) in terms of mutual information, as described in Appendix G.
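To make the overall procedure concrete, the following is a minimal sketch of the SS-DDPM sampling loop for the Beta model (Appendix F.2), assuming a trained predictor; `predict_x0`, the clipping constants and the uniform initialization at $t = T$ are our assumptions, and the input normalization is omitted for brevity.

```python
import numpy as np

def sample_beta_ss_ddpm(predict_x0, nu, D, rng):
    """Sketch of the SS-DDPM sampling loop for the Beta model.

    predict_x0(G, t) -> prediction of x_0 in (0, 1)^D (a stand-in for the trained network);
    nu[1..T]: concentration schedule, large near t = 1 and close to zero near t = T.
    """
    T = len(nu) - 1
    G = np.zeros(D)                                    # empty tail statistic before any x_s is drawn
    x_t = rng.uniform(size=D)                          # x_T is close to uniform since nu_T is near zero
    for t in range(T, 0, -1):
        x_t = np.clip(x_t, 1e-6, 1.0 - 1e-6)           # avoid log(0) in the tail statistic
        G = G + nu[t] * np.log(x_t / (1.0 - x_t))      # G_t = G_{t+1} + nu_t * log(x_t / (1 - x_t)), Eq. (109)
        x0_hat = predict_x0(G, t)                      # x_theta(G_t, t)
        if t > 1:
            # x_{t-1} ~ q(x_{t-1} | x_0 = x0_hat), Eq. (106) with the mode fixed at the prediction
            x_t = rng.beta(1.0 + nu[t - 1] * x0_hat, 1.0 + nu[t - 1] * (1.0 - x0_hat))
    return x0_hat
```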
M Changing discretization

When sampling from DDPMs, we can skip some timestamps to trade off computation for the quality of the generated samples. For example, we can generate $x_{t_1}$ from $x_{t_2}$ in one step, without generating the intermediate variables $x_{t_1+1:t_2-1}$:
$$p^{DDPM}_\theta(x_{t_1}|x_{t_2}) = q^{DDPM}(x_{t_1}|x_{t_2}, x_0)\big|_{x_0 = x^{DDPM}_\theta(x_{t_2}, t_2)} = \quad (147)$$
$$= \mathcal{N}\left(x_{t_1};\ \frac{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}/\bar\alpha^{DDPM}_{t_1}\right)}{1 - \bar\alpha^{DDPM}_{t_2}}\, x^{DDPM}_\theta(x_{t_2}, t_2) + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}/\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{1 - \bar\alpha^{DDPM}_{t_2}}\, x_{t_2},\ \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(1 - \bar\alpha^{DDPM}_{t_2}/\bar\alpha^{DDPM}_{t_1}\right)}{1 - \bar\alpha^{DDPM}_{t_2}}\, I\right) = \quad (148)$$
$$= \mathcal{N}\left(x_{t_1};\ \frac{\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, x^{DDPM}_\theta(x_{t_2}, t_2) + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, x_{t_2},\ \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}\right)}{\left(1 - \bar\alpha^{DDPM}_{t_2}\right)\bar\alpha^{DDPM}_{t_1}}\, I\right) \quad (149)$$
In the general case of SS-DDPM, we cannot simply skip the variables. If we skip them, the corresponding tail statistics become atypical and the generative process fails. To keep the tail statistics adequate, we can sample all the intermediate variables $x_t$, but do it in a way that doesn't use additional function evaluations:
$$G_{t_1} = \sum_{s=t_1}^{t_2 - 1} A_s^T T(x_s) + G_{t_2}, \quad \text{where} \quad (150)$$
$$x_s \sim q(x_s|x_0)\big|_{x_0 = x^{SS}_\theta(G_{t_2}, t_2)} \quad (151)$$
In the case of Gaussian SS-DDPM this trick is equivalent to skipping the variables in DDPM:
$$G_{t_1} = \frac{1 - \bar\alpha^{DDPM}_{t_1}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}} \sum_{s=t_1}^{t_2 - 1} \frac{\sqrt{\bar\alpha^{SS}_s}}{1 - \bar\alpha^{SS}_s}\, x_s + \frac{1 - \bar\alpha^{DDPM}_{t_1}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}} \cdot \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}}{1 - \bar\alpha^{DDPM}_{t_2}}\, G_{t_2} = \quad (152)$$
$$= \frac{1 - \bar\alpha^{DDPM}_{t_1}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}} \sum_{s=t_1}^{t_2 - 1} \frac{\bar\alpha^{SS}_s}{1 - \bar\alpha^{SS}_s}\, x^{SS}_\theta(G_{t_2}, t_2) + \frac{1 - \bar\alpha^{DDPM}_{t_1}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}} \sum_{s=t_1}^{t_2 - 1} \frac{\sqrt{\bar\alpha^{SS}_s}}{\sqrt{1 - \bar\alpha^{SS}_s}}\, \epsilon_s + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, G_{t_2} \quad (153)$$
$$\mathbb{E}\, G_{t_1} = \frac{1 - \bar\alpha^{DDPM}_{t_1}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}}\left(\frac{\bar\alpha^{DDPM}_{t_1}}{1 - \bar\alpha^{DDPM}_{t_1}} - \frac{\bar\alpha^{DDPM}_{t_2}}{1 - \bar\alpha^{DDPM}_{t_2}}\right) x^{SS}_\theta(G_{t_2}, t_2) + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, G_{t_2} = \quad (154)$$
$$= \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, x^{SS}_\theta(G_{t_2}, t_2) + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, G_{t_2} = \quad (155)$$
$$= \frac{\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, x^{SS}_\theta(G_{t_2}, t_2) + \frac{\sqrt{\bar\alpha^{DDPM}_{t_2}}\left(1 - \bar\alpha^{DDPM}_{t_1}\right)}{\sqrt{\bar\alpha^{DDPM}_{t_1}}\left(1 - \bar\alpha^{DDPM}_{t_2}\right)}\, G_{t_2} \quad (156)$$
$$\mathbb{D}\, G_{t_1} = \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)^2}{\bar\alpha^{DDPM}_{t_1}} \sum_{s=t_1}^{t_2 - 1} \frac{\bar\alpha^{SS}_s}{1 - \bar\alpha^{SS}_s} = \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)^2}{\bar\alpha^{DDPM}_{t_1}} \cdot \frac{\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}}{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(1 - \bar\alpha^{DDPM}_{t_2}\right)} = \quad (157)$$
$$= \frac{\left(1 - \bar\alpha^{DDPM}_{t_1}\right)\left(\bar\alpha^{DDPM}_{t_1} - \bar\alpha^{DDPM}_{t_2}\right)}{\left(1 - \bar\alpha^{DDPM}_{t_2}\right)\bar\alpha^{DDPM}_{t_1}} \quad (158)$$
The resulting mean and variance match the parameters of the skipping distribution (149), with $G_{t_2}$ in place of $x_{t_2}$, as expected from the duality of Theorem 2. In the general case, this trick can be formalized as the following approximation to the reverse process:
$$p^{SS}_\theta(x_{t_1:t_2}|G_{t_2}) = \prod_{t=t_1}^{t_2} q^{SS}(x_t|x_0)\big|_{x_0 = x_\theta(G_t(x_{t:T}), t)} \approx \prod_{t=t_1}^{t_2} q^{SS}(x_t|x_0)\big|_{x_0 = x_\theta(G_{t_2}(x_{t_2:T}), t_2)} \quad (159)$$
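A minimal sketch of the reduced-NFE update of Eqs. (150)-(151) is given below; all helper functions (`sample_xt`, `A_T_stat`) are hypothetical stand-ins for the forward kernel and the statistic map of the chosen family.

```python
import numpy as np

def jump_tail_statistic(G_t2, x0_hat, skipped_steps, sample_xt, A_T_stat):
    """Update the tail statistic across skipped steps without extra network calls, Eqs. (150)-(151).

    G_t2:          tail statistic at the current (coarser) step t2.
    x0_hat:        x_theta(G_t2, t2), reused for all skipped steps.
    skipped_steps: the indices s = t1, ..., t2 - 1.
    sample_xt(x0, s) -> draw from q(x_s | x_0); A_T_stat(x_s, s) -> A_s^T T(x_s).
    """
    G = np.array(G_t2, copy=True)
    for s in skipped_steps:
        x_s = sample_xt(x0_hat, s)   # cheap: only the forward kernel is sampled
        G = G + A_T_stat(x_s, s)     # accumulate the sufficient statistics of the skipped steps
    return G
```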
Table 6: Examples of SS-DDPM in different families. There are many different ways to parameterize the distributions and the corresponding schedules. Further details are discussed in Appendices F.1-F.8. For each family we list the forward marginal $q^{SS}(x_t|x_0)$ with its domain, the noising schedule, the tail statistic $G_t(x_{t:T})$ and the KL divergence $D_{KL}(q^{SS}(x_t|x_0)\,\|\, p^{SS}_\theta(x_t|x_{t+1:T}))$.

Gaussian: $\mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1 - \bar\alpha_t) I)$, $x_t \in \mathbb{R}$. Schedule: $\bar\alpha_t \to 1$ as $t \to 0$, $\bar\alpha_t \to 0$ as $t \to T$. Tail statistic and KL divergence: Eqs. (104)-(105).

Beta: $\mathrm{Beta}(x_t; \alpha_t(x_0), \beta_t(x_0))$, $x_t \in [0, 1]$. Schedule: $\alpha_t(x_0) = 1 + \nu_t x_0$, $\beta_t(x_0) = 1 + \nu_t(1 - x_0)$; $\nu_t \to +\infty$ as $t \to 0$, $\nu_t \to 0$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \nu_s \log\frac{x_s}{1 - x_s}$. KL divergence: $\log\frac{\mathrm{Beta}(\alpha_t(x_\theta), \beta_t(x_\theta))}{\mathrm{Beta}(\alpha_t(x_0), \beta_t(x_0))} + \nu_t(x_0 - x_\theta)\left(\psi(\alpha_t(x_0)) - \psi(\beta_t(x_0))\right)$.

Dirichlet: $\mathrm{Dir}(x_t; \alpha^1_t(x_0), \ldots, \alpha^K_t(x_0))$, $x_t \in [0, 1]^K$, $\sum_{k=1}^{K} x^k_t = 1$. Schedule: $\alpha^k_t(x_0) = 1 + \nu_t x^k_0$; $\nu_t \to +\infty$ as $t \to 0$, $\nu_t \to 0$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \nu_s \log x_s$. KL divergence: $\sum_{k=1}^{K}\left[\log\frac{\Gamma(\alpha^k_t(x_\theta))}{\Gamma(\alpha^k_t(x_0))} + \nu_t(x^k_0 - x^k_\theta)\,\psi(\alpha^k_t(x_0))\right]$.

Categorical: $\mathrm{Cat}(x_t; p_t(x_0))$, $x_t \in \{0, 1\}^D$, $\sum_{i=1}^{D} x^i_t = 1$. Schedule: $p_t(x_0) = x_0 \bar{Q}_t$; $\bar{Q}_t \to I$ as $t \to 0$, $\bar{Q}_t \to \bar{Q}_T$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \log(\bar{Q}_s)\, x_s^T$. KL divergence: $\sum_{i=1}^{D} (p_t(x_0))_i \log\frac{(p_t(x_0))_i}{(p_t(x_\theta))_i}$.

von Mises: $\mathrm{vM}(x_t; x_0, \kappa_t)$, $x_t \in [-\pi, \pi]$. Schedule: $\kappa_t \to +\infty$ as $t \to 0$, $\kappa_t \to 0$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \kappa_s (\cos x_s, \sin x_s)^T$. KL divergence: $\kappa_t \frac{I_1(\kappa_t)}{I_0(\kappa_t)}\left(1 - \cos(x_0 - x_\theta)\right)$.

von Mises-Fisher: $\mathrm{vMF}(x_t; x_0, \kappa_t)$, $x_t \in [-1, 1]^K$. Schedule: $\kappa_t \to +\infty$ as $t \to 0$, $\kappa_t \to 0$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \kappa_s x_s$. KL divergence: $\kappa_t \frac{I_{K/2}(\kappa_t)}{I_{K/2-1}(\kappa_t)}\, x_0^T (x_0 - x_\theta)$.

Gamma: $\Gamma(x_t; \alpha_t, \beta_t(x_0))$, $x_t \in (0, +\infty)$. Schedule: $\beta_t(x_0) = \alpha_t(\xi_t + (1 - \xi_t) x_0^{-1})$; $\alpha_t \to +\infty$ as $t \to 0$, $\alpha_t \to \alpha_T$ as $t \to T$; $\xi_t \to 0$ as $t \to 0$, $\xi_t \to 1$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} \alpha_s(1 - \xi_s)\, x_s$. KL divergence: $\alpha_t\left(\log\frac{\beta_t(x_0)}{\beta_t(x_\theta)} + \frac{\beta_t(x_\theta)}{\beta_t(x_0)} - 1\right)$.

Wishart: $\mathcal{W}(X_t; n_t, V_t(X_0))$, $X_t \in \mathbb{R}^{p \times p}$ positive definite. Schedule: $\mu_t(X_0) = \xi_t I + (1 - \xi_t) X_0^{-1}$, $V_t(X_0) = n_t^{-1} \mu_t^{-1}(X_0)$; $n_t \to +\infty$ as $t \to 0$, $n_t \to n_T$ as $t \to T$; $\xi_t \to 0$ as $t \to 0$, $\xi_t \to 1$ as $t \to T$. Tail statistic: $\sum_{s=t}^{T} n_s(1 - \xi_s)\, X_s$. KL divergence: $\frac{n_t}{2}\left(-\log\left|V_t^{-1}(X_\theta)\, V_t(X_0)\right| + \mathrm{tr}\left(V_t^{-1}(X_\theta)\, V_t(X_0)\right) - p\right)$.