# Reflected Diffusion Models

Aaron Lou¹  Stefano Ermon¹

¹Department of Computer Science, Stanford University. Correspondence to: Aaron Lou.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Code: https://github.com/louaaron/Reflected-Diffusion/

Abstract

Score-based diffusion models learn to reverse a stochastic differential equation that maps data to noise. However, for complex tasks, numerical error can compound and result in highly unnatural samples. Previous work mitigates this drift with thresholding, which projects to the natural data domain (such as pixel space for images) after each diffusion step, but this leads to a mismatch between the training and generative processes. To incorporate data constraints in a principled manner, we present Reflected Diffusion Models, which instead reverse a reflected stochastic differential equation evolving on the support of the data. Our approach learns the perturbed score function through a generalized score matching loss and extends key components of standard diffusion models, including diffusion guidance, likelihood-based training, and ODE sampling. We also bridge the theoretical gap with thresholding: such schemes are just discretizations of reflected SDEs. On standard image benchmarks, our method is competitive with or surpasses the state of the art without architectural modifications and, for classifier-free guidance, our approach enables fast exact sampling with ODEs and produces more faithful samples under high guidance weight.

1. Introduction

Originally introduced in Sohl-Dickstein et al. (2015) and later augmented in Song & Ermon (2019b); Ho et al. (2020); Song et al. (2021b), diffusion models have quickly become one of the most ubiquitous deep generative models, with applications in many domains including images (Dhariwal & Nichol, 2021), natural language (Li et al., 2022), and molecule generation (Xu et al., 2022). Additionally, their stability and scalability have enabled the deployment of large text-to-image systems (Ramesh et al., 2022).

Diffusion models learn to reverse a stochastic process that maps data to noise but, as a result of the inherent approximation error of both the SDE discretization and score matching, they often follow incorrect trajectories. This behavior compounds error, so models can diverge and generate highly unnatural samples on more complex tasks (see, for instance, Figure 3). To mitigate this degeneration, many diffusion models modify the sampling process by projecting to the support of the data after each diffusion step (Ho et al., 2020; Li et al., 2022), a technique known as thresholding. This incorporates the known constraints of the data distribution, stabilizing sampling and avoiding divergent behavior. Notably, this oft-overlooked detail underlies many pixel-based image diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021) (appearing as a [0, 255] clipping function at each diffusion step) and is essential for text-to-image generation (Saharia et al., 2022). Although thresholding avoids failure, it is theoretically unprincipled because it leads to a mismatch between the training and generative processes. Furthermore, this mismatch can introduce artifacts such as oversaturation (Ho & Salimans, 2022) that necessitate further modifications (Saharia et al., 2022).
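To make the thresholding mechanism concrete, the following is a minimal sketch (not the paper's code) of one clipped reverse-diffusion step, assuming pixel values rescaled to [0, 1] and generic callables f, g, and score_fn supplied by the user.

```python
import torch

def thresholded_reverse_step(x, t, dt, f, g, score_fn):
    """One reverse-diffusion Euler-Maruyama step followed by "static
    thresholding": the iterate is clipped back to the data domain
    (here [0, 1]^d) after every step."""
    drift = f(x, t) - g(t) ** 2 * score_fn(x, t)   # reverse-SDE drift
    z = torch.randn_like(x)
    x = x - drift * dt + g(t) * dt ** 0.5 * z      # step from t to t - dt
    return x.clamp(0.0, 1.0)                       # project back to [0, 1]^d
```

The clipping line is exactly the post hoc projection that the training objective of a standard diffusion model never sees, which is the mismatch this paper addresses.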
In this work, we present Reflected Diffusion Models, a class of diffusion models that, by design, respects the known support of the data distribution. Unlike standard diffusion models, which perturb the data density with Brownian motion, our method evolves the distribution with reflected Brownian motion that always stays within the domain. We then parameterize the reversed diffusion process with the scores of the perturbed density, which we learn using a new score matching method on bounded domains. The resulting generative model is a reflected SDE that automatically incorporates the data constraints without altering the generative process. We provide an overview of our method in Figure 1. Our proposed methodology has several merits:

Scales to high dimensions. To learn the score function on a general bounded domain, we introduce constrained denoising score matching (CDSM). Unlike previous methods (Hyvärinen, 2007), CDSM scales to high dimensions, and we develop an algorithm for fast computation. Since reflection operations are negligible compared to neural network computation, our training and inference times are effectively equivalent to those of standard diffusion models.

Figure 1. Overview of Reflected Diffusion Models. We map a data distribution p_0 supported on Ω to the prior distribution p_T through a reflected stochastic differential equation (Section 3.1). Whenever a Brownian trajectory hits ∂Ω, it is reflected back inside instead of escaping (circled in red), so p_t is supported on Ω for all t. We can recover p_0 from p_T with a reversed reflected stochastic differential equation (Section 3.2) by learning the Stein score ∇_x log p_t (Section 4). Our generative model is guaranteed to be constrained to Ω.

Key features transfer over. We show that ODE sampling (Song et al., 2021b), diffusion guidance (Ho & Salimans, 2022), and maximum likelihood bounds (Song et al., 2021a) extend to the reflected setting. As such, our method can be modularly applied to preexisting diffusion model systems.

Justifying and correcting previous methods. We draw connections with the thresholding methods used in pixel-space diffusion models (Saharia et al., 2022). These methods all sample from a reflected stochastic differential equation despite being trained on a standard diffusion process. Correctly training with our CDSM loss avoids pathological behavior and allows for equivalent ODE sampling.

Broad applicability. We apply our method to high-dimensional simplices (e.g. class probabilities) and hypercubes (e.g. images). Using a synthetic example, we show that our method is the first simplex diffusion model that scales to high dimensions. On common image generation benchmarks, our results are competitive with or surpass the current state of the art. In particular, on unconditional CIFAR-10 generation (Krizhevsky, 2009), we achieve a state-of-the-art Inception Score of 10.42 and a comparable FID score of 2.72. For likelihood estimation, our method achieves a second-best score of 2.68 and 3.74 bits per dimension on CIFAR-10 and ImageNet32 (van den Oord et al., 2016) without relying on either importance sampling or learned noise schedules.

2. Background

To introduce diffusion models (in the continuous-time formalism of Song et al. (2021b)), we first transform a data density q_0 on R^d by applying a forward diffusion process.
This takes the form of perturbing points x_0 ∼ q_0 with an SDE with a fixed drift coefficient f : R^d × R → R^d, diffusion coefficient g : R → R, and Brownian motion B_t:

dx_t = f(x_t, t) dt + g(t) dB_t    (1)

The resulting family of time-varied distributions x_t ∼ q_t approaches a known prior distribution q_T ≈ N(0, σ_T² I). This density evolution process can be reversed by perturbing samples x_T ∼ q_T with a reversed SDE (Anderson, 1982):

dx_t = (f(x_t, t) − g(t)² ∇_x log q_t(x_t)) dt + g(t) dB̄_t    (2)

where B̄_t is time-reversed Brownian motion. Diffusion models approximate this reverse process by learning ∇_x log q_t, known as the score function, through a λ-weighted score matching loss:

E_{t, x_t∼q_t} λ_t ‖s_θ(x_t, t) − ∇_x log q_t(x_t)‖²    (3)

which most commonly takes the form of the more tractable denoising score matching loss (Vincent, 2011):

E_{t, x_0∼q_0, x_t∼q_t(·|x_0)} λ_t ‖s_θ(x_t, t) − ∇_x log q_t(x_t|x_0)‖²    (4)

Here, q_t(x_t|x_0) is the transition kernel induced by the SDE in Equation 1. With a learned score s_θ(x, t) ≈ ∇_x log q_t, one can define a generative model by first sampling y_T ∼ N(0, σ_T² I) and then solving the reverse SDE

dy_t = (f(y_t, t) − g(t)² s_θ(y_t, t)) dt + g(t) dB̄_t    (5)

from time T to 0, giving an approximate sample from q_0.

Diffusion models enjoy many special properties. For example, for certain λ_t, Equation 4 can be reformulated as an ELBO using Girsanov's theorem (Song et al., 2021a; Kingma et al., 2021; Huang et al., 2021), allowing for maximum likelihood training. Furthermore, one can derive an equivalent neural ODE that can be used for sampling and exact likelihood evaluation (Chen et al., 2018).

Guidance. One can also control the diffusion model to sample from a synthetic distribution q_t(x_t|c) ∝ q_t(c|x_t)^w q_t(x_t). Here, c is a desired condition such as a class or text description, and interpolating the guidance weight w controls the fidelity and diversity of the samples. This requires the score

∇_x log q_t(x_t|c) = w ∇_x log q_t(c|x_t) + ∇_x log q_t(x_t)    (6)

which can be learned, s_θ(x_t, t, c) ≈ ∇_x log q_t(x_t|c), without requiring explicit training on q_t. For example, classifier-guided methods (Song et al., 2021b; Dhariwal & Nichol, 2021) combine a pretrained score function s_θ(x_t, t) and classifier q_t(c|x_t):

s_θ(x_t, t, c) := w ∇_x log q_t(c|x_t) + s_θ(x_t, t)    (7)

and classifier-free guidance methods (Ho & Salimans, 2022) use a c-conditioned score function and an implicit Bayes classifier q_t(c|x_t) = q_t(x_t|c) / (q_t(x_t) q_t(c)):

s_θ(x_t, t, c) := (w + 1) s_θ(x_t, t, c) − w s_θ(x_t, t)    (8)

Thresholding. However, since s_θ is not a perfect score function, there is a mismatch between the modeled backward process and the true forward process. Thus, diffusion models can push a sample to areas where q_t(x_t) is small, which creates a negative feedback loop since score matching struggles in low-probability areas (Song & Ermon, 2019a; Koehler et al., 2022). This causes the sampling process to diverge and commonly occurs for more complex tasks, especially those involving diffusion guidance. To combat this, many previous works alter the diffusion sampling procedure using thresholding (Saharia et al., 2022; Li et al., 2022), which stabilizes the sampling process with inductive biases from the data.
In particular, thresholding applies an operator O that projects back to the data domain Ω during each discretized SDE step:

y_{t−Δt} = O(y_t − (f(y_t, t) − g(t)² s_θ(y_t, t)) Δt) + g(t) B_Δt    (9)

For the case of images, O can be static thresholding, which clips each dimension to the pixel range [0, 255], or dynamic thresholding, which first normalizes all pixels by the p-th percentile pixel before clipping (Saharia et al., 2022).

Thresholding alleviates divergent sampling but comes with considerable downsides. For example, it breaks the theoretical setup, since the generative model no longer approximates the reverse diffusion process. This mismatch induces artifacts during sampling and precludes the use of ODE sampling (Song et al., 2021b).

3. Reflected Diffusion Models

In this section, we present Reflected Diffusion Models. These define a generative model on a data domain Ω (assumed to be connected and compact with nonempty interior and uniform Hausdorff dimension) which outer-bounds the support of the data distribution p_0. Our method retains the theoretical underpinnings of diffusion models while incorporating inductive biases from thresholding. We highlight the core mechanisms in Figure 1.

3.1. Reflected Stochastic Differential Equations

To model diffusion processes on a compact domain Ω, we use reflected SDEs. For ease of presentation, we only give an intuitive definition of reflected SDEs and simplify so that g is scalar and L_t reflects in the normal direction. In Appendix A.1, we provide a more rigorous mathematical definition and generalize to matrix diffusion coefficients and oblique reflections. For a full introduction, we recommend the reader consult a monograph such as Pilipenko (2014).

Our reflected SDEs perturb an initial datum x_0 ∼ p_0 and are parameterized by a drift coefficient f : Ω × R → R^d and diffusion coefficient g : R → R:

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (10)

The first two terms on the right-hand side of Equation 10 are exactly those of Equation 1, showing that our reflected SDE behaves like a regular SDE in the interior of Ω. L_t is the additional boundary constraint that, intuitively, forces the particle to stay inside Ω: when x_t hits ∂Ω, L_t neutralizes the outward normal-pointing component. This reflected SDE has a unique strong solution as long as f and g are Lipschitz in state and time and Ω satisfies the uniform exterior sphere condition (Pilipenko, 2014, Theorem 2.5.4), which ensures that ∂Ω is sufficiently regular. In particular, the uniform exterior sphere condition holds when ∂Ω is smooth and even when Ω is a convex polytope.

3.2. Density Evolution and Time Reversal

When we perturb p_0 with the reflected SDE in Equation 10, our density evolves according to the Fokker–Planck equation with a Neumann boundary condition (Schuss, 2013):

∂_t p_t = div(−p_t f + (g²/2) ∇_x p_t)
(−p_t f + (g²/2) ∇_x p_t) · n = 0,  x ∈ ∂Ω, n normal, t > 0    (11)

In addition to allowing us to characterize the limiting density p_T, this induces a reversed reflected stochastic differential equation (Cattiaux, 1988; Williams, 1988):

dx_t = (f(x_t, t) − g(t)² ∇_x log p_t(x_t)) dt + g(t) dB̄_t + dL̄_t    (12)

where L̄_t is the reversed boundary condition. For our case, L̄_t also reflects in the normal direction.

Remark 3.1. The reversed reflected SDE closely resembles the reversed standard SDE given in Equation 2. On one hand, this is natural because the local dynamics match: when Ω = R^d, L_t disappears since x_t can never hit ∂R^d = ∅.
On the other hand, it is surprising that we can reverse a reflected diffusion process with another reflected diffusion process, something that does not hold in the discrete-time case.

3.3. Reflected SDEs in Practice

In our experiments, Ω will be either the unit cube C^d := {x ∈ R^d : 0 ≤ x_i ≤ 1} or the unit simplex ∆^d := {x ∈ R^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}. We often find it more convenient to work with the projected simplex ∆̄^d := {x ∈ R^{d−1} : Σ_{i=1}^{d−1} x_i ≤ 1, x_i ≥ 0}, as it is full-dimensional instead of lying in a hyperplane.

We will diffuse with the Reflected Variance Exploding SDE (RVE SDE), a generalization of the Variance Exploding SDE introduced in Song et al. (2021b). An RVE SDE is parameterized by σ_0 ≪ σ_1 and is defined for t ∈ [0, 1] by

dx_t = g(t) dB_t + dL_t    (13)

where g(t) := σ_0^{1−t} σ_1^t √(2 log(σ_1/σ_0)). The reverse is

dx_t = −g(t)² ∇_x log p_t(x_t) dt + g(t) dB̄_t + dL̄_t    (14)

Note that the RVE SDE corresponds to a time-dilated version of reflected Brownian motion: time t of an RVE SDE corresponds to time σ_t of reflected Brownian motion, where σ_t := σ_0^{1−t} σ_1^t. As a result of Equation 11, p_0 evolves under a heat equation with Neumann boundary conditions:

∂_t p_t = (g(t)²/2) Δ_x p_t,   ∇_x p_t · n = 0 on ∂Ω    (15)

Note that p_1 becomes a uniform density over Ω for large enough σ_1. To see this, we can draw intuition from physics: heat homogenizes in a closed container.

4. Score Matching on Bounded Domains

While the reflected SDE framework provides a nice theoretical pathway to construct a reflected diffusion model, it requires one to learn the score function s_θ ≈ ∇_x log p_t on Ω. We minimize the constrained score matching loss:

(1/2) E^Ω_{x∼p} ‖s_θ(x) − ∇_x log p(x)‖²    (16)

where we omit time dependence for presentation purposes. Furthermore, E^Ω indicates the domain of the expectation (as opposed to E, which is an integral over R^d). This is because p can be discontinuous at ∂Ω (since it is 0 outside of Ω and can be nonzero on ∂Ω), so constraining the integral ensures the regularity properties used for theorems (such as Stokes'). In this section, we review previous methods for score matching on bounded domains, discuss their fundamental limitations, and propose constrained denoising score matching to overcome these difficulties. Additionally, for the RVE SDE introduced in Section 3.3, we show how to quickly compute the score matching training objective.

4.1. Pitfalls of Implicit Score Matching

One may hope to draw inspiration from the standard paradigm, which transforms the score matching integral

(1/2) E_{x∼q} ‖s_θ(x) − ∇_x log q(x)‖²    (17)

into the implicit score matching loss (Hyvärinen, 2005):

E_{x∼q} [div(s_θ)(x) + (1/2) ‖s_θ(x)‖²]    (18)

This removes the intractable ∇_x log q(x), allowing for estimation using Monte Carlo sampling. However, the derivation requires the use of Stokes' theorem; applying Stokes' theorem to Equation 16 would instead result in

E^Ω_{x∼p} [div(s_θ)(x) + (1/2) ‖s_θ(x)‖²] + ∫_{∂Ω} p(x) ⟨s_θ(x), n(x)⟩ dx    (19)

where n(x) is the interior-pointing normal vector. Unlike the case of Ω = R^d, where the second term disappears since ∂R^d = ∅, this result is computationally intractable.
Thus, previous work instead proposes to re-weight the loss function with a nonnegative function h that vanishes on the boundary (Hyvärinen, 2007; Yu et al., 2020), minimizing

(1/2) E^Ω_{x∼p} h(x) ‖s_θ(x) − ∇_x log p(x)‖²    (20)

Since h vanishes on ∂Ω, we can cleanly apply Stokes' theorem and derive a result without a boundary term, giving an implicit score matching loss:

E^Ω_{x∼p} [div(h s_θ)(x) + (h(x)/2) ‖s_θ(x)‖²]    (21)

However, this formulation is not suitable for high dimensions, even with fast numerical algorithms for the divergence operator (Hutchinson, 1989; Song et al., 2019). This is because the loss is downweighted near the boundaries, so, for a fixed budget, the error can become unbounded as x → ∂Ω. In high dimensions, the space near the boundary becomes an increasingly larger proportion of the total volume,¹ which greatly hampers the sample efficiency of the loss.

¹ Consider the case when Ω = [0, 1]^d. For large d, almost all of the mass is close to the boundary.

4.2. Constrained Denoising Score Matching

Inspired by the empirical success of denoising score matching (Vincent, 2011; Song & Ermon, 2019a), we present constrained denoising score matching (CDSM). Crucially, denoising score matching, unlike implicit score matching, can directly generalize to bounded domains due to how it handles discontinuities. This means that, unlike previous methods for constrained score matching, the derivation transfers smoothly. The core mechanism is presented in the following proposition, which we prove in Appendix A.2.

Proposition 4.1. Suppose that we perturb an Ω-supported density a(x) with noise b(x|·) (also supported on Ω) to get a new density b(x) := ∫_Ω a(y) b(x|y) dy. Then, under suitable regularity conditions for the smoothness of a and b, the score matching loss for b:

(1/2) E^Ω_{x∼b} ‖s_θ(x) − ∇_x log b(x)‖²    (22)

is equal (up to a constant that does not depend on s_θ) to the CDSM loss:

(1/2) E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ‖s_θ(x) − ∇_x log b(x|x_0)‖²    (23)

With the constrained denoising score matching loss, we are then able to define a training objective for reflected diffusion models. In particular, since p_t(x) is by definition a perturbed density of p_0(x) with transition kernel p_t(x|·), the weighted score matching loss directly becomes:

E^Ω_{t, x_0∼p_0, x_t∼p_t(·|x_0)} λ_t ‖s_θ(x_t, t) − ∇_x log p_t(x_t|x_0)‖²    (24)

For our reflected SDE, we will set λ_t ∝ g(t)², mirroring previous work and minimizing variance during optimization. Interestingly, as we prove in Section 7, this corresponds to an ELBO loss when we reverse an RVE SDE.

4.3. Scaling Score Matching Computation

We finalize by showing how to sample from and compute the score of the transition density p_t(x_t|x_0) for the RVE SDE. Note that this is the transition density of a reflected Brownian motion (Harrison & Reiman, 1981). We highlight the key features of our method in Figure 2.

Sampling. To sample from p_t(x_t|x_0), we can repeatedly reflect a sample y from N(x_0, σ_t² I). In particular, we follow the line segment s ↦ s y + (1 − s) x_0, reflecting in the normal direction whenever it crosses ∂Ω and repeating until we reach s = 1. This works because, intuitively, the boundary redirects the Brownian motion but does not change its magnitude. In practice, this process can be quickly computed with classic computational geometry techniques.
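For the hypercube used in our experiments, repeated reflection at the faces reduces to a closed-form "fold." The sketch below illustrates the sampling step under that assumption (writing sigma_t for the transition-kernel standard deviation); it is an illustration rather than the paper's exact routine.

```python
import numpy as np

def sample_reflected_gaussian(x0, sigma_t, rng=None):
    """Draw x_t ~ p_t(. | x0) for reflected Brownian motion on [0, 1]^d:
    sample y ~ N(x0, sigma_t^2 I) and fold it back into the cube.
    Repeated reflection at the faces 0 and 1 equals a triangle-wave fold."""
    rng = np.random.default_rng(rng)
    y = x0 + sigma_t * rng.standard_normal(np.shape(x0))
    y = np.mod(y, 2.0)                      # reflections tile the line with period 2
    return np.where(y <= 1.0, y, 2.0 - y)   # fold [1, 2) back onto (0, 1]
```

Because the fold acts coordinate-wise, this sampler costs essentially the same as drawing ordinary Gaussian noise.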
Score Computation. There are two approaches for computing the score of p_t(x_t|x_0) on general geometric domains.

Approximation with a sum of Gaussians (Jing et al., 2022b). This method decomposes p_t(x_t|x_0) into an infinite sum of Gaussian densities that depend on x_0, x_t, and the geometry of the domain. For our bounded Ω with the reflection condition, this gives us the equation

p_t(x_t|x_0) = Σ_{x'∈R(x_t)} p_{N(x_0, σ_t² I)}(x')    (25)

where p_{N(x_0, σ_t² I)} is the pdf of the Gaussian centered at x_0 with variance σ_t² I and R(x_t) is the set of all x' ∈ R^d such that the repeated reflection of the path s ↦ s x' + (1 − s) x_0 ends at x_t. Note that this reflection scheme is the same one we use for sampling. Furthermore, through elementary derivations, this gives us a formula for the score ∇_x log p_t(x_t|x_0). Generally, this method works quite well for small σ_t, as we only need to take a small number of local reflections to approximate p_t(x_t|x_0). However, for larger σ_t, we need to take many more reflections since the underlying Gaussian is too dispersed, greatly increasing the computational cost.

Approximation with Laplacian eigenfunctions (Bortoli et al., 2022). This method instead computes using Laplacian eigenfunctions, a standard technique for solving the heat equation (Evans, 2010). For our problem, these are a (known for each Ω) set of functions f_i ∈ L²(Ω), i ∈ N, that satisfy Δf_i = −λ_i f_i and ∇f_i · n = 0 on ∂Ω. In particular, these form an orthonormal basis for L²(Ω), allowing us to solve Equation 15 directly for an initial density of δ_{x_0}:

p_t(x_t|x_0) = Σ_{i=0}^∞ e^{−λ_i σ_t²/2} f_i(x_t) f_i(x_0)    (26)

This method works well for large σ_t because e^{−λ_i σ_t²/2} ≈ 0 for all but the lowest frequencies, removing the need to evaluate many of the terms. However, for small σ_t, this method becomes costly because it requires many more terms. Similar to the above method, we can derive a formula for ∇_x log p_t(x_t|x_0) through this sum.

Our Method. We instead propose to combine the above two approaches. In particular, we note that they complement each other: the Gaussian sum is accurate for small σ_t and the eigenfunction sum is accurate for large σ_t. We can therefore set a σ* ∈ (σ_0, σ_1) and compute with the Gaussian sum when σ_t < σ* and with the eigenfunction sum when σ_t > σ*. In practice, this allows us to upper-bound the number of reflections/eigenfunctions used to 5, much fewer than the exponential amount required by each method individually.

Scaling to High Dimensions. By itself, this branching method is unfortunately not enough to scale to very high dimensions. In particular, our computation is, in the worst case, O(d^k), where d is the dimension of Ω and k is the number of reflection steps or the highest eigenfunction frequency. We have only bounded k to something more manageable.

Figure 2. An overview of our computational method for constrained denoising score matching with Brownian transition probabilities. (i) We can draw samples by sampling N(x_0, σ_t² I) and then applying reflections at the boundary. (ii) When t is small, we compute the transition density by summing up a mixture of Gaussians (shown for Ω = [0, 1]). (iii) When t is large, we compute using the frequencies of Ω (shown for Ω = [0, 1]). (iv) We diffeomorphically transform Ω → [0, 1]^d, where the transition score is tractable.
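To make the hybrid rule concrete for the one-dimensional interval pictured in Figure 2, the sketch below evaluates the transition-kernel score by switching between a truncated image (Gaussian) sum and a truncated cosine eigenfunction sum. The crossover sigma_star, the truncation level, and the assumption that the kernel is a reflected N(x_0, σ_t²) are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def reflected_kernel_score_1d(y, x, sigma, sigma_star=0.5, n_terms=5):
    """Score d/dy log p_sigma(y | x) of the reflected heat kernel on [0, 1]."""
    if sigma < sigma_star:
        # Method of images: sum Gaussians centred at the reflections of x.
        p, dp = 0.0, 0.0
        for n in range(-n_terms, n_terms + 1):
            for arg in (y - x + 2 * n, y + x + 2 * n):
                phi = np.exp(-arg**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
                p += phi
                dp += -arg / sigma**2 * phi
    else:
        # Neumann Laplacian eigenfunction (cosine) expansion; see Appendix B.1.
        p, dp = 1.0, 0.0
        for k in range(1, n_terms + 1):
            decay = np.exp(-(k * np.pi) ** 2 * sigma**2 / 2)
            p += 2 * decay * np.cos(k * np.pi * x) * np.cos(k * np.pi * y)
            dp += -2 * k * np.pi * decay * np.cos(k * np.pi * x) * np.sin(k * np.pi * y)
    return dp / p
```

Both branches approximate the same kernel; the small number of retained terms is what keeps the cost comparable to ordinary denoising score matching.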
To overcome this scaling issue, we consider the simple case of the hypercube [0, 1]^d. Since our Brownian motion does not have inter-dimensional interactions, reflections do not interact between non-parallel hyperplanes, and Laplacian eigenfunctions factorize by dimension, so we can decompose the probability along each component interval:

p_t(x_t|x_0) = ∏_{i=1}^d p_t^i(x_t^i | x_0^i)    (27)

where x_t^i and x_0^i are the i-th components of x_t and x_0, respectively, and p_t^i is the marginal probability on the i-th coordinate. Note that the RVE SDE on [0, 1]^d marginalizes to an RVE SDE on [0, 1] for each dimension:

dx_t^i = g(t) dB_t^i + dL_t^i    (28)

where B_t^i and L_t^i are the Brownian motion and boundary condition (respectively) for dimension i. We can therefore compute on each Ω_i = [0, 1] and combine the results, reducing the cost from O(d^k) to O(kd). Regular score matching is O(d), and since k is small, we can train Reflected Diffusion Models just as quickly as regular diffusion models.

For more general domains, under certain conditions, we can smoothly and bijectively map int(Ω) → (0, 1)^d. Thus, we can instead learn a diffusion model on [0, 1]^d and then project back to Ω. More details are given in Appendix B.2. In particular, this mapping procedure allows us to learn a diffusion model on high-dimensional simplices ∆^d.

5. Simulating Reflected SDEs

Combining a score s_θ learned through CDSM and the reverse reflected SDE, we have a Reflected Diffusion Model: sample x_T ∼ U(Ω) and solve the reflected SDE

dx_t = −g(t)² s_θ(x_t, t) dt + g(t) dB̄_t + dL̄_t    (29)

In this section, we examine numerical methods for simulating samples from this reflected SDE.

5.1. Euler–Maruyama Discretizations and Thresholding

The typical Euler–Maruyama discretization of a standard SDE (Equation 1) is given by

x_{t+Δt} = x_t + f(x_t, t) Δt + g(t) B_Δt    (30)

where B_Δt ∼ N(0, Δt I). For reflected SDEs, one can adapt this discretization by approximating the effect of L_t with a suitable operator O:

x_{t+Δt} = O(x_t + f(x_t, t) Δt + g(t) B_Δt)    (31)

Common examples of O include the projection operator proj(x) = argmin_{y∈Ω} d(x, y) (Liu, 1993) and the reflection operator refl used in Section 4.3 (Schuss, 2013). One can see that, as Δt → 0, both the projection and reflection schemes converge in distribution. Empirically, we find that reflection generates better samples.

Interestingly, this closely mirrors the thresholding step given in Equation 9, with the only difference being the choice of operator O and whether O is applied before or after the noise step. This difference disappears as Δt → 0:

Proposition 5.1 (Thresholding solves a reflected SDE). Both types of thresholding solve the reflected SDE (Equation 10) as Δt → 0 under suitable conditions.

The full proposition and proof are given in Appendix A.5.
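The following is a sketch of a reflected Euler–Maruyama sampler for the reverse RVE SDE on the unit cube. The score_model interface, the noise schedule endpoints, and the step count are illustrative assumptions, not the released implementation.

```python
import torch

def reflect(x):
    """Reflection operator on [0, 1]^d (repeated reflection = folding)."""
    y = torch.remainder(x, 2.0)
    return torch.where(y <= 1.0, y, 2.0 - y)

@torch.no_grad()
def sample_reflected_ve(score_model, shape, sigma0=0.01, sigma1=5.0, n_steps=1000):
    """Reflected Euler-Maruyama sampler for the reverse RVE SDE (Eq. 29):
    start from x_1 ~ U([0,1]^d) and integrate from t = 1 down to t = 0."""
    x = torch.rand(shape)                       # prior p_1 is (approximately) uniform
    dt = 1.0 / n_steps
    log_ratio = torch.log(torch.tensor(sigma1 / sigma0))
    for i in range(n_steps):
        t = 1.0 - i * dt
        sigma = sigma0 ** (1 - t) * sigma1 ** t
        g2 = 2 * log_ratio * sigma ** 2         # g(t)^2 for the RVE SDE
        score = score_model(x, torch.full(shape[:1], t))
        x = x + g2 * score * dt + (g2 * dt) ** 0.5 * torch.randn_like(x)
        x = reflect(x)                          # discretized boundary term dL_t
    return x
```

Swapping `reflect` for a clamp recovers the projection variant of Equation 31; the only extra cost over standard sampling is the fold, which is negligible next to the network evaluation.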
5.2. Predictor–Corrector

We extend the predictor–corrector (PC) framework of Song et al. (2021b), which has been shown to improve results. In particular, our learned scores can be used to augment the sampling procedure using Langevin dynamics (Song & Ermon, 2019a). However, this requires Langevin dynamics for a constrained domain (Bubeck et al., 2015), which, for a probability density p, are given by the reflected SDE:

dx_t = (1/2) ∇_x log p(x_t) dt + dB_t + dL_t    (32)

During our reversed diffusion iterations, we can discretize the Langevin dynamics using reflected Euler–Maruyama and apply our learned score s_θ(·, t):

x' = refl(x_t + (ε/2) s_θ(x_t, t) + √ε z),   z ∼ N(0, I)    (33)

In practice, we find that PC sampling with a small signal-to-noise ratio noticeably improves image generation results.

CIFAR-10 Quality Results. With these components, we test our method for image generation on the CIFAR-10 dataset and report Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) in Table 1. Our models remain competitive, achieving a state-of-the-art Inception Score of 10.42. However, Tweedie's formula does not generalize to reflected diffusion (Efron, 2011) (more details are in Appendix B.3), so our model generates images with imperceptible noise (on the scale of 1-2 pixels), which degrades the FID score to 2.72 (Jolicoeur-Martineau et al., 2020). Despite this, our samples are diverse and visually indistinguishable (Appendix D).

| Model | IS | FID |
|---|---|---|
| NCSN++ (Song et al., 2021b) | 9.89 | 2.20 |
| DDPM++ (Song et al., 2021b) | 9.68 | 2.41 |
| Styleformer (Park & Kim, 2021) | 9.94 | 2.82 |
| UNCSN++ (Kim et al., 2021) | 10.11 | – |
| ViTGAN (Lee et al., 2021) | 9.89 | 4.87 |
| Subspace NCSN++ (Jing et al., 2022a) | 9.99 | 2.17 |
| EDM (Karras et al., 2022) | – | 1.97 |
| Reflected Diffusion (ours) | 10.46 | 2.72 |

Table 1. CIFAR-10 sample quality results. We test Reflected Diffusion Models on CIFAR-10 image generation and report IS and FID scores. Our model is highly competitive, achieving a state-of-the-art Inception Score for unconditional generation. However, FID lags behind due to noise (as discussed in Appendix B.3).

5.3. Probability Flow ODE

Similarly to the probability flow ODE derived in Song et al. (2021b), one can construct an equivalent deterministic process for a reflected SDE. Interestingly, doing this removes the boundary reflection term, so our deterministic process is exactly the original probability flow ODE derived in Song et al. (2021b):

dx = (f(x, t) − (1/2) g(t)² ∇_x log p_t(x)) dt    (34)

Crucially, the thresholding effect is maintained due to the Neumann condition for ∇_x log p_t (Equation 11, line 2) and cannot be replicated for standard diffusion models. We elaborate on this construction, as well as connections with DDIM (Song et al., 2020), in Appendix A.3.

Figure 3. Without thresholding, standard diffusion models easily diverge. We sample using classifier-free guidance (w = 1) from a standard diffusion model without using thresholding. Around half of the samples diverge (generating blank images).

6. Diffusion Guidance

Both classifier and classifier-free guidance (Equations 7 and 8) extend to Reflected Diffusion Models by logarithm and gradient rules. Since thresholding is primarily useful for diffusion guidance, we investigate the relationship between thresholding, diffusion guidance, and Reflected Diffusion Models on the relatively simple downsampled 64x64 ImageNet dataset (Russakovsky et al., 2014).

Thresholding is critical. We corroborate Saharia et al. (2022), showing that pixel-space diffusion guidance requires thresholding. We show this for classifier-free guidance in Figure 3, where even a low weight w = 1 causes about half of the samples to diverge. For classifier guidance, around 75% of samples diverge (Figure 14).

Our method retains fidelity under high guidance weight. Thresholding produces oversaturated images under high guidance weight w (Ho & Salimans, 2022; Saharia et al., 2022), hampering applications that require high-fidelity generation. We hypothesize that this is caused by the training and sampling mismatch, and we show in Figure 4 that our method retains fidelity under high guidance weight. We did not find the dynamic thresholding method to perform better.
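As a reference point for how the guided score is assembled, here is a minimal sketch of the classifier-free combination of Equation 8; the score_model signature (with a None condition for the unconditional branch) is an assumption for illustration.

```python
import torch

def cfg_score(score_model, x, t, c, w):
    """Classifier-free guidance (Eq. 8): (w + 1) * s(x, t, c) - w * s(x, t).
    The same linear combination is used for reflected models."""
    s_cond = score_model(x, t, c)         # conditioned on class / text c
    s_uncond = score_model(x, t, None)    # unconditional branch
    return (w + 1.0) * s_cond - w * s_uncond
```

Because the combination is linear, properties shared by both branches (such as the boundary behavior discussed next) carry over to the guided score.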
ODE sampling works for classifier-free guidance. The composed score function in classifier-free guidance (Equation 8) maintains the Neumann boundary condition (Equation 11), allowing for ODE sampling. Using this, we demonstrate the first case of high-fidelity classifier-free guided generation using ODEs in Figure 5. Interestingly, ODE-equivalent DDIM sampling fails for classifier-free guidance but works for classifier guidance, despite classifier guidance being worse without thresholding (Appendix D).

Figure 4. Non-cherry-picked guided samples from a reflected and a standard diffusion model with high guidance weight. We compare Reflected Diffusion Models with standard diffusion models for generating class-conditioned 64x64 ImageNet samples with a guidance weight of w = 15. Our generated images are shown on the left, and the baseline is shown on the right (the same positions have the same classes). Our method retains fidelity while the baseline suffers from oversaturation.

Figure 5. Guided ODE samples. We sample using our ODE with a guidance weight of w = 1.5, retaining image fidelity with fewer forward evaluations (around 100 compared with 1000).

7. Likelihood Bound

Incidentally, our weighted score matching loss corresponds to an ELBO for our generative model. To show this, we extend Girsanov's theorem (Øksendal, 1987), which is used to derive the ELBO for standard diffusion models (Song et al., 2021a; Kingma et al., 2021; Huang et al., 2021):

Theorem 7.1 (Reflected Girsanov for KL divergence). Suppose we have two reflected SDEs on the same domain Ω,

dx_t = f_1(x_t, t) dt + g(t) dB_t + dL_t    (35)
dy_t = f_2(y_t, t) dt + g(t) dB_t + dL_t    (36)

from t = 0 to T with x_0 = y_0 = z ∈ Ω. Let µ, ν be the path measures for (respectively) x and y. Then,

D_KL(µ ‖ ν) = (1/2) ∫_0^T E_{y∼p_{x_t}} [g(t)^{−2} ‖(f_1 − f_2)(y, t)‖²] dt    (37)

The full theorem and proof are given in Appendix A.4. Note that, by also incorporating the prior and reconstruction loss, Equation 37 gives us an upper bound on the negative log-likelihood (Appendix A.4). Furthermore, for our reversed RVE SDE, Equation 37 becomes

(1/2) ∫_0^T E_{x_t∼p_t} [g(t)² ‖s_θ(x_t, t) − ∇_x log p_t(x_t)‖²] dt    (38)

which is a scaled version of our proposed weighted score matching loss in Equation 24. Therefore, we already implicitly train with maximum likelihood. Furthermore, when applied to an individual data point x, we recover the constrained denoising score matching loss:

(1/2) ∫_0^T E_{x_t∼p_t(·|x)} [g(t)² ‖s_θ(x_t, t) − ∇_x log p_t(x_t|x)‖²] dt    (39)

which allows us to derive an upper bound on −log p(x).
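As a rough illustration of how this bound can be estimated for a single datum, the sketch below Monte Carlo samples t and x_t and averages the weighted score error of Equation 39. The helpers sample_kernel and kernel_score stand in for the routines of Section 4.3, and the prior and reconstruction terms are omitted; this is an assumed interface, not the paper's training code.

```python
import numpy as np

def diffusion_nll_term(x, score_fn, sample_kernel, kernel_score,
                       sigma0=1e-4, sigma1=5.0, n_mc=256, rng=None):
    """Estimate (1/2) * int_0^1 E[ g(t)^2 ||s_theta - grad log p_t(.|x)||^2 ] dt,
    the diffusion term of the likelihood upper bound (Eq. 39)."""
    rng = np.random.default_rng(rng)
    log_ratio = np.log(sigma1 / sigma0)
    total = 0.0
    for _ in range(n_mc):
        t = rng.uniform()
        sigma = sigma0 ** (1 - t) * sigma1 ** t
        g2 = 2 * log_ratio * sigma ** 2
        xt = sample_kernel(x, sigma, rng)               # x_t ~ p_t(. | x)
        err = score_fn(xt, t) - kernel_score(xt, x, sigma)
        total += 0.5 * g2 * np.sum(err ** 2)
    return total / n_mc   # in nats; divide by d * log(2) to report per-dimension bits
```

This is the same quantity minimized during training, which is why the method implicitly optimizes a likelihood bound.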
Image Likelihood Results. We test Reflected Diffusion Models on CIFAR-10 (Krizhevsky, 2009) and ImageNet32 (van den Oord et al., 2016) for likelihoods, both without data augmentation. Our method performs comparably to the state of the art while reducing the number of hyperparameters (in the form of importance sampling and learned noise schedules).² Note that we can compute exact likelihoods through the probability flow ODE, which typically improves results (Song et al., 2021a), but, for a fair comparison with VDM, we report the likelihood bound.

| Model | CIFAR-10 | ImageNet32 |
|---|---|---|
| *Non-diffusion* | | |
| Flow++ (Ho et al., 2019) | 3.08 | – |
| PixelCNN++ (Salimans et al., 2017) | 2.92 | – |
| Sparse Transformer (Child et al., 2019) | 2.80 | – |
| *Diffusion: modified noise schedule* | | |
| Score Flow (Song et al., 2021a) | 2.83 | 3.76 |
| VDM (Kingma et al., 2021) | 2.65 | 3.72 |
| *Diffusion: no noise modifications* | | |
| Score SDE (Song et al., 2021b) | 2.99 | – |
| ARDM (Hoogeboom et al., 2021) | 2.71 | – |
| Score Flow (Song et al., 2021a) | 2.86 | 3.83 |
| VDM (Kingma et al., 2021) | 2.70 | – |
| Reflected Diffusion (ours) | 2.68 | 3.74 |

Table 2. CIFAR-10 and ImageNet32 bits per dimension (BPD). No data augmentation; lower is better. We test the likelihood of Reflected Diffusion Models on CIFAR-10 and downsampled ImageNet32 without data augmentation. Our method is second best, nearly matching the state of the art (VDM), without requiring importance sampling or a learned noise schedule.

² We omit several results which report a better BPD than VDM (Kingma et al., 2021) on ImageNet32 but a much worse CIFAR-10 result, as they test on the ImageNet32 dataset used for classification (Chrabaszcz et al., 2017), which is significantly easier and incomparable due to the use of anti-aliasing.

8. Simplex Diffusion

We also demonstrate that our reflected diffusion model can scale to high-dimensional simplices. We train on softmaxed Inception classifier logits for ImageNet (Szegedy et al., 2014), which take values in a 1000-dimensional simplex. Our training dynamics are reported in Figure 6 (with a 0.99 EMA), showing that our method is able to optimize the loss (and thus maximize the ELBO) even in high dimensions. Our diffusion process is fundamentally different from the simplex diffusion method of Richemond et al. (2022), as we evolve our dynamics directly on the simplex, while the previous method diffuses in a higher-dimensional space (the positive orthant) and projects to the simplex.

Figure 6. Simplex Diffusion training and validation curves. Our method trains stably in high dimensions.

9. Conclusion

We introduced Reflected Diffusion Models, a class of diffusion models that respects natural data constraints through reflected SDEs. Our method scales score matching to general bounded geometries and retains theoretical constructs from standard diffusion models. Our analysis also sheds light on the commonly used thresholding sampling method and provides improvements through correct training. We did not explore architecture or noise scheduling, which are critical for state-of-the-art results; we leave this (and scaling to text-to-image generation) for future work.

Latent Diffusion (LD) (Rombach et al., 2021) is a diffusion model method that also incidentally does not require thresholding. We hypothesize that this is because both our method and LD directly incorporate data-space constraints. Notably, we work over an outer bound of the support of the data distribution, while LD works over a submanifold learned by a VAE. Future work could try to find a middle ground between these two data-support approaches.

10. Acknowledgements

This project was supported by NSF (#1651565), ARO (W911NF-21-1-0125), ONR (N00014-23-1-2159), CZ Biohub, and Stanford HAI GCP Grants. AL is supported by an NSF Graduate Research Fellowship. We would also like to thank Chenlin Meng for helpful discussions.

References

Anderson, B. D. O. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12:313 326, 1982. Ba, J., Kiros, J. R., and Hinton, G. E. Layer normalization. Ar Xiv, abs/1607.06450, 2016. Bortoli, V. D., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., and Doucet, A. Riemannian score-based generative modeling. Ar Xiv, abs/2202.02763, 2022. Bubeck, S., Eldan, R., and Lehec, J. Finite-time analysis of projected langevin monte carlo. In NIPS, 2015. Cattiaux, P. Time reversal of diffusion processes with a boundary condition. Stochastic Processes and their Applications, 28:275 292, 1988. Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations.
In Neural Information Processing Systems, 2018. Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. Ar Xiv, abs/1904.10509, 2019. Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsampled variant of imagenet as an alternative to the cifar datasets. Ar Xiv, abs/1707.08819, 2017. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Ar Xiv, abs/2105.05233, 2021. Dormand, J. R. and Prince, P. J. A family of embedded runge-kutta formulae. Journal of Computational and Applied Mathematics, 6:19 26, 1980. Efron, B. Tweedies formula and selection bias. Journal of the American Statistical Association, 106:1602 1614, 2011. Evans, L. C. Partial differential equations. American Mathematical Society, Providence, R.I., 2010. ISBN 9780821849743 0821849743. Harrison, J. M. and Reiman, M. I. Reflected brownian motion on an orthant. Annals of Probability, 9:302 308, 1981. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. Ho, J. and Salimans, T. Classifier-free diffusion guidance. Ar Xiv, abs/2207.12598, 2022. Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. Ar Xiv, abs/1902.00275, 2019. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Ar Xiv, abs/2006.11239, 2020. Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., van den Berg, R., and Salimans, T. Autoregressive diffusion models. Ar Xiv, abs/2110.02037, 2021. Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. In Neural Information Processing Systems, 2021. Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18:1059 1076, 1989. Hyv arinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695 709, 2005. Hyv arinen, A. Some extensions of score matching. Comput. Stat. Data Anal., 51:2499 2512, 2007. Jing, B., Corso, G., Berlinghieri, R., and Jaakkola, T. Subspace diffusion generative models. In European Conference on Computer Vision, 2022a. Jing, B., Corso, G., Chang, J., Barzilay, R., and Jaakkola, T. Torsional diffusion for molecular conformer generation. Ar Xiv, abs/2206.01729, 2022b. Jolicoeur-Martineau, A., Piche-Taillefer, R., des Combes, R. T., and Mitliagkas, I. Adversarial score matching and improved sampling for image generation. Ar Xiv, abs/2009.05475, 2020. Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Ar Xiv, abs/2206.00364, 2022. Kim, D., Shin, S.-J., Song, K., Kang, W., and Moon, I.-C. Soft truncation: A universal training technique of scorebased diffusion model for high precision score estimation. In International Conference on Machine Learning, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. Co RR, abs/1412.6980, 2014. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Ar Xiv, abs/2107.00630, 2021. Koehler, F., Heckett, A., and Risteski, A. Statistical efficiency of score matching: The view from isoperimetry. Ar Xiv, abs/2210.00726, 2022. Reflected Diffusion Models Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. 
Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. Ar Xiv, abs/2107.04589, 2021. Li, X. L., Thickstun, J., Gulrajani, I., Liang, P., and Hashimoto, T. Diffusion-lm improves controllable text generation. Ar Xiv, abs/2205.14217, 2022. Liu, Y. Numerical approaches to stochastic differential equations with boundary conditions. 1993. Øksendal, B. Stochastic differential equations : an introduction with applications. Journal of the American Statistical Association, 82:948, 1987. Park, J. and Kim, Y. Styleformer: Transformer based generative adversarial networks with style vector. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8973 8982, 2021. Pilipenko, A. An introduction to stochastic differential equations with reflection. 2014. Ramachandran, P., Zoph, B., and Le, Q. V. Swish: a selfgated activation function. ar Xiv: Neural and Evolutionary Computing, 2017. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. Ar Xiv, abs/2204.06125, 2022. Richemond, P. H., Dieleman, S., and Doucet, A. Categorical sdes with simplex diffusion. Ar Xiv, abs/2210.14784, 2022. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674 10685, 2021. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211 252, 2014. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. Ar Xiv, abs/2205.11487, 2022. Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Ar Xiv, abs/1606.03498, 2016. Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. Ar Xiv, abs/1701.05517, 2017. Schuss, Z. Brownian dynamics at boundaries and interfaces. 2013. Skorokhod, A. V. Stochastic equations for diffusion processes in a bounded region. Theory of Probability and Its Applications, 6:264 274, 1961. Sohl-Dickstein, J. N., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. Ar Xiv, abs/1503.03585, 2015. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. Ar Xiv, abs/2010.02502, 2020. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Ar Xiv, abs/1907.05600, 2019a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019b. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Conference on Uncertainty in Artificial Intelligence, 2019. Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. In Neural Information Processing Systems, 2021a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. 
Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum? id=Px TIG12RRHS. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1 9, 2014. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818 2826, 2015. Reflected Diffusion Models van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. Ar Xiv, abs/1601.06759, 2016. Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661 1674, 2011. Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, 2008. Williams, R. J. On time-reversal of reflected brownian motions. 1988. Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. Geodiff: a geometric diffusion model for molecular conformation generation. Ar Xiv, abs/2203.02923, 2022. Yu, S., Drton, M., and Shojaie, A. Generalized score matching for general domains. Information and inference : a journal of the IMA, 11 2:739 780, 2020. Reflected Diffusion Models A. Theoretical Constructs A.1. Reflected Stochastic Differential Equations We follow Pilipenko (2014) in our presentation. Given a domain Ωand an oblique reflection vector field v that satisfies v(x) n(x) = 1, where n is the inward pointing unit normal vector field, the reflected SDE is defined as dxt = f(xt, t)dt + G(xt, t)d Bt + v(xt)d Lt (40) where Lt is defined recursively as R t 0 1xt Ωd Ls. Here, we see that Lt is a process that determines whether xt hits the boundary and then applies a reflection. For our purposes in the main paper, v = n and we surppress the notation for compactness. Define σ = 1 v(xt, t) = σ(xt, t)n(x) σ(xt, t)n(x) (41) then Equation 11 generalizes (under suitable regularity conditions) (Schuss, 2013): tpt(x) = div( pt(x)f(x, t) + σ(x, t) xpt) (pt(x)f(x, t) σ(x, t) xpt) n(x) = 0 when x Ω, n is normal, t > 0 (42) As such, this induces a reverse process (Williams, 1988; Cattiaux, 1988) that one can easily check has the same marginal probability distributions: dxt = [f(xt, t) σ(xt, t) x log pt(xt)] dt + G(xt, t)d Bt + v(xt)d Lt (43) Here v is a vector field that satisfies the condition v n = 1 and v + v is a positive multiple of n. A.2. Constrained Denoising Score Matching Proposition A.1. Suppose that we perturb an Ω-supported density a(x) with noise b(x| ) (also supported on Ω) to get a new density b(x) := R Ωa(y)b(x|y)dy. Then, under suitable regularity conditions for the smoothness of a and b, the score matching loss for b: 1 2EΩ x b sθ(x) x log b(x) 2 (44) is equal (up to a constant factor that does not depend on s) to the CSDM loss: 1 2EΩ x0 a EΩ x b( |x0) sθ(x) x log b(x|x0) 2 (45) Proof. 
This proof comes down to showing that

E^Ω_{x∼b} ⟨s_θ(x), ∇_x log b(x)⟩ = E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ⟨s_θ(x), ∇_x log b(x|x_0)⟩    (46)

which can be done directly:

E^Ω_{x∼b} ⟨s_θ(x), ∇_x log b(x)⟩
= ∫_Ω ⟨s_θ(x), ∇_x log b(x)⟩ b(x) dx    (47)
= ∫_Ω ⟨s_θ(x), ∇_x b(x)/b(x)⟩ b(x) dx    (48)
= ∫_Ω ⟨s_θ(x), ∇_x b(x)⟩ dx    (49)
= ∫_Ω ⟨s_θ(x), ∇_x ∫_Ω a(y) b(x|y) dy⟩ dx    (50)
= ∫_Ω ⟨s_θ(x), ∫_Ω a(y) ∇_x b(x|y) dy⟩ dx    (51)
= ∫_Ω ⟨s_θ(x), ∫_Ω a(y) b(x|y) ∇_x log b(x|y) dy⟩ dx    (52)
= ∫_Ω ∫_Ω a(y) b(x|y) ⟨s_θ(x), ∇_x log b(x|y)⟩ dy dx    (53)
= E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ⟨s_θ(x), ∇_x log b(x|x_0)⟩    (54)

This proof is exactly the same as the one presented in Vincent (2011). The only difference is that we replace the domain of integration with Ω. Note that the key property that allows us to complete the proof is the convolution identity, which generalizes, unlike Stokes' theorem for implicit score matching.

A.3. Probability Flow ODE and Connections to DDIM

We now derive the probability flow ODE, show how to use it to sample, and discuss connections with DDIM. For convenience, we will work with the assumptions given in the paper (that the diffusion coefficient is a scalar depending only on time and that reflection is in the normal direction), but our results directly generalize (given sufficient regularity conditions) to general noise schedules and oblique reflections.

Proposition A.2 (Probability Flow ODE). For the reflected SDE

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (55)

the ODE given by

dx_t = (f(x_t, t) − (g(t)²/2) ∇_x log p_t(x_t)) dt    (56)

follows the same probability evolution p_t.

Proof. By the forward Kolmogorov equation, we can see that the ODE follows

∂_t p_t(x) = div(−p_t(x) f(x, t) + (g(t)²/2) ∇_x p_t)    (57)

However, we must confirm that the ODE does not exit Ω. By the Neumann boundary conditions for the SDE, we see that

⟨f(x, t) − (g(t)²/2) ∇_x log p_t(x), n(x)⟩ = 0    (58)

on the boundary, so the flow induced by the ODE is indeed a valid diffeomorphism from Ω → Ω.

Similar to DDIM, we can derive equivalent processes by annealing the noise.

Proposition A.3 (Annealing Noise Level). For the reflected SDE

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (59)

the reflected SDE

dx_t = (f(x_t, t) − ((g(t)² − ĝ(t)²)/2) ∇_x log p_t(x_t)) dt + ĝ(t) dB_t + dL_t    (60)

follows the same probability evolution p_t for all noise levels ĝ > 0.

Proof. This follows directly from our Fokker–Planck equation.

Remark A.4. In the above proposition, ĝ > 0 as there is no concept of a reflected ordinary differential equation. However, when the noise is 0, our limiting process yields an ODE.

To sample with our score function s_θ, we simply solve the reversed process, which is

dx_t = (f(x_t, t) − (g(t)²/2) s_θ(x_t, t)) dt    (61)

for our ODE and

dx_t = (f(x_t, t) − ((g(t)² + ĝ(t)²)/2) s_θ(x_t, t)) dt + ĝ(t) dB̄_t + dL̄_t    (62)

for our annealed reflected SDE. When s_θ(x_t, t) is trained with our CDSM objective, the ODE sampler (Equation 61) mimics standard reflected diffusion sampling, which includes thresholding. Conversely, when the score is trained with standard score matching, the sampler simply removes thresholding, causing the process to simulate the diffusion path without thresholding. To mimic the thresholding effect, one must instead turn to the annealed reflected SDE sampler of Equation 62. If we discretize the equation and (with an abuse of notation) set ĝ = 0, then we recover the thresholded DDIM sampler. Unfortunately, changing ĝ necessarily causes the sampled distribution to shift, since s_θ(x_t, t) is not trained to mimic the correct ∇_x log p_t(x), so the reverse process necessarily results in divergent behavior.
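For the drift-free RVE SDE, Equation 61 can be handed to a black-box ODE solver such as the RK45 scheme mentioned in Appendix C.3. The sketch below assumes a numpy-compatible score_fn and illustrative tolerances; it is not the released sampler.

```python
import numpy as np
from scipy.integrate import solve_ivp

def ode_sample(score_fn, x1, sigma0=0.01, sigma1=5.0):
    """Integrate the probability flow ODE (Eq. 61) for the drift-free RVE SDE,
    dx/dt = -(g(t)^2 / 2) * s_theta(x, t), from t = 1 down to t = 0,
    starting from a uniform sample x1 on [0, 1]^d."""
    shape = x1.shape
    log_ratio = np.log(sigma1 / sigma0)

    def rhs(t, x_flat):
        x = x_flat.reshape(shape)
        g2 = 2 * log_ratio * (sigma0 ** (1 - t) * sigma1 ** t) ** 2
        return (-0.5 * g2 * score_fn(x, t)).ravel()

    sol = solve_ivp(rhs, (1.0, 0.0), x1.ravel(), method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(shape)
```

No explicit reflection appears here: as in Proposition A.2, the boundary behavior is carried entirely by the learned score's Neumann condition.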
A.4. Girsanov Theorem for Reflected SDEs and Likelihood Evaluation

We derive our likelihood bounds. We first recall Girsanov's theorem for SDEs (Øksendal, 1987).

Theorem A.5 (Girsanov Theorem). Let Φ be a bounded functional on the space of continuous functions C([0, T]). For the SDE evolving on [0, T] with

dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t    (63)

we have

E[Φ(X)] = E[Φ(X⁰) exp(∫_0^T ⟨σ(s, X⁰_s)^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ(t, X⁰_t)^{−1} µ(t, X⁰_t)‖² dt)]    (64)

where X⁰ is the corresponding driftless process dX⁰_t = σ(t, X⁰_t) dB_t and the expectation on the right-hand side is taken over its path measure.

We then prove the analogue of this for reflected SDEs:

Theorem A.6 (Girsanov Theorem for Reflected SDEs). Let Φ be a bounded functional on the space of continuous functions C([0, T]). For the reflected SDE evolving on Ω in space and [0, T] in time with

dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t + dL_t    (65)

where L_t is assumed to have normal reflection, we have

E[Φ(X)] = E[Φ(X⁰) exp(∫_0^T ⟨σ(s, X⁰_s)^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ(t, X⁰_t)^{−1} µ(t, X⁰_t)‖² dt)]    (66)

where X⁰ is the corresponding driftless reflected process and the expectation is taken over its path measure.

Proof. We first smoothly extend µ and σ to all of R^d such that their values decay to 0 quickly outside of Ω. We then consider the processes X^i, defined for i ∈ R_+ by

dX^i_t = µ_i(t, X^i_t) dt + σ(t, X^i_t) dB_t    (67)

where µ_i(t, x) = µ(t, x) + i d(x, Ω) v(x), d is the distance function, and v(x) is the unit vector pointing from x to y, where y := argmin_{z∈Ω} d(x, z). It is well known that X^i_t → X_t in measure as i → ∞ (Liu, 1993). Since Φ is a bounded (and thus continuous) functional, we thus have EΦ(X^i) → EΦ(X) as i → ∞. We finalize by noting that, by Theorem A.5,

EΦ(X^i) = E[Φ(X^{i,0}) exp(∫_0^T ⟨σ^{−1} µ_i(s, X^{i,0}_s), dB_s⟩ − (1/2) ∫_0^T ‖σ^{−1} µ_i(t, X^{i,0}_t)‖² dt)]    (68)

As i → ∞, X^i remains in Ω with probability 1 and µ_i = µ on Ω. Therefore, we have the desired convergence to

E[Φ(X⁰) exp(∫_0^T ⟨σ^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ^{−1} µ(t, X⁰_t)‖² dt)]    (69)

as desired.

Corollary A.7. As a corollary, when Φ is log(dµ/dν), this gives us Theorem 7.1.

Remark A.8. It is possible that our theorem can be generalized to obliquely reflected SDEs, although we did not pursue this line of inquiry.

Remark A.9. Theorem 7.1 recovers the denoising score matching loss if we slice an initial δ_x distribution. In particular, this is the continuous-time diffusion loss L_T that is used to form the ELBO for standard diffusion models (Kingma et al., 2021; Ho et al., 2020).

A.5. Thresholding

On [−1, 1]^D, the dynamic thresholding operator is defined by

dynthresh_p(x) = (proj_{[−1,1]}(x_i / max(s, 1)))_i,   s is the p-th percentile of |x_i|    (70)

which can of course be rescaled to [0, 1]^D (for our setup).

Proposition A.10 (Static Thresholding Solves the Reflected SDE). On domains Ω between times [0, T], the discretization

x_{t+Δt} = proj(x_t + f(x_t, t) Δt) + g(t) B_Δt    (71)

solves the reflected SDE

dX_t = f(X_t, t) dt + g(t) dB_t + n dL_t    (72)

as Δt → 0, when f and g are uniformly Lipschitz in time and space and satisfy the linear growth condition for any Lipschitz extension of f to the whole space R^d.

Proof. This closely mirrors the proof showing that the standard projection scheme

y_{t+Δt} = proj(y_t + f(y_t, t) Δt + g(t) B_Δt)    (73)

converges to the solution of the reflected SDE as Δt → 0 (Skorokhod, 1961; Schuss, 2013; Liu, 1993). The key difference is that the process is not supported on Ω, since the noise is added after each projection. However, since our extension of f is Lipschitz, this error is well behaved and disappears as Δt → 0.

Corollary A.11 (Dynamic Thresholding Solves the Reflected SDE). With the same conditions as given above, on [−1, 1]^D, the discretization

x_{t+Δt} = dynthresh_p(x_t + f(x_t, t) Δt) + g(t) B_Δt    (74)

converges to the solution of the reflected SDE when f(x_t, t) does not point outside of [−1, 1]^D on more than (1 − p)D dimensions.

Proof. Under our conditions, dynthresh_p becomes the projection operator since the p-th percentile of |x_i| will always be at most 1. This replicates the above proposition.

Remark A.12. In practice, we found that learned score networks f satisfy the pointing-inside condition above. In particular, as Δt → 0, dynthresh_p tends to behave exactly like proj for all p < 1.
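For completeness, here is a minimal sketch of the dynamic thresholding operator of Equation 70 for a single sample; the percentile value is an illustrative choice rather than one prescribed by the paper.

```python
import numpy as np

def dynamic_threshold(x, p=0.995):
    """Dynamic thresholding on [-1, 1]^D (Eq. 70): rescale one sample by the
    p-th percentile of its absolute entries (only if that percentile exceeds 1),
    then clip to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    s = np.quantile(np.abs(x), p)   # per-sample percentile of |x_i|
    s = max(s, 1.0)                 # only shrink, never amplify
    return np.clip(x / s, -1.0, 1.0)
```

When the percentile stays below 1, the division is a no-op and the operator reduces to plain clipping, which is exactly the behavior used in Corollary A.11.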
This replicates the above proposition. Remark A.12. In practice, we found that learned score networks f satisfy the pointing inside condition above. In particular, as t 0, dynthreshp tends to behave exactly like proj for all p < 1. B. Practical Implementation B.1. Exact Equations for Reflected Brownian Transition Probabilities For [0, 1], the reflected transition probability for a source x with diffusion value σ (which correspond to the mean and standard deviation for the standard normal distribution) is given by p R(x,σ2)(y) = X z:y+z Z p N(x,σ)(z) = 1 + 2 k=1 e k2π2σ2/2 cos(kπx) cos(kπy) (75) Note that this means that the eigenfunctions of [0, 1] under our Neumann boundary condition are 1 and cos(kπx), with eigenvalues of 0 and πk2 B.2. Mapping Ωto [0, 1]d Our domain Ωhas an interior which maps bijectively to (0, 1)d iff Ωis simply connected. Note that this encompasses a wide variety of domains, notably convex sets. To construct a map f : [0, 1]d t, we use a variant of the common stick breaking procedure: (f(x))i = xi j=i+1 (1 xi) (76) which admits an inverse (f 1(y))i = yi 1 Pd j=i+1 yi (77) B.3. Denoising The Final Proability Distribution We note that Tweedies formula (Efron, 2011) does not hold for general bounded domains Ω. We show this for [0, 1]: given an initial distribution X, a perturbed distribution Y constructed by y R(x, σ2), where x X has a Tweedie denoiser: E[x|y] = Z 1 0 xp(x|y)dx (78) 0 xp(y|x)p X(x) p Y (y) dx (79) 0 xp R(x,σ2)(y)p X(x) p Y (y) dy (80) dy log p Y (y) (81) The reason why this works in the standard case is because the score of the Gaussian distribution is y x σ , which allows us to extract out the desired value. For reflected Gaussians, this does not hold. Reflected Diffusion Models Instead, for our experimental results on CIFAR-10, we denoise by training a denoising autoencoder (Vincent et al., 2008) trained on our reflected Gaussian noise. This follows the same architecture as our score network, and is trained to predict the noise (so that subtracting it recovers the initial sample). In general, this is required to get a decent FID score, but makes little difference in terms of perceptual quality as our images are accurate to within a 2.5 standard deviation noise to begin with. C. Experimental Setup C.1. Image Generation We exactly follow Song et al. (2021b) for both models and training hyperparameters. The only differences are that we set σ1 = 5 instead of 50 (for the VE SDE) since we mix well with σ1 = 5 while VE SDE needs a much larger σ1 to mix. Furthermore, we use the deep DDPM++ architecture, but we rescale the output 1 σ as is done for the NCSN++ architecture (for VE SDE). For sampling, we sample with 1000 predictor (Reflected Euler-Maruyama) steps with 1000 corrector (Reflected Langevin) steps (Song et al., 2021b). We use a signal-to-noise ratio of 0.03. C.2. Image Likelihood We almost exactly follow Kingma et al. (2021) for both models and training hyperparameters, replacing the standard diffusion with our reflected diffusion. We do not train with the noise schedule, instead setting σ0 = 10 4 and σ1 = 5, which causes the reconstruction and prior losses to be (numerically) 0. C.3. Guided Diffusion We exactly follow Ho & Salimans (2022) and train using the ADM architecture (Dhariwal & Nichol, 2021) with the same parameters (for standard diffusion). For reflected diffusion, we train with σ0 = 0.01 and σ1 = 5, following our CIFAR-10 experiments. Furthermore, we scale the output by 1/σ as the neural network outputs the noise vector and not the score. 
For the classifier-free guidance baseline, we retrain an ImageNet64 model following Ho & Salimans (2022). For the classifier guidance baseline, we use the pretrained models from Dhariwal & Nichol (2021). We sample using 1000 steps each. For our diffusion model, we use reflected Euler–Maruyama. For the standard model, we use standard Euler–Maruyama with thresholding after each step. For ODE sampling, we sample using an RK45 solver (Dormand & Prince, 1980).

C.4. Simplex Diffusion

We consider class probabilities output by the Inception-v3 ImageNet classifier (Szegedy et al., 2015). In particular, if {X_i} is a set of images and f is a classifier that outputs a 1000-dimensional vector of class probabilities, then we learn a distribution over {L_i := f(X_i)}. For learning purposes, we clip this to a value in the 999-dimensional projected simplex and apply the transformation given in Appendix B.2.

Our model is a simple MLP autoencoder with 4 intermediate layers of width 512. We use the Swish activation (Ramachandran et al., 2017) and apply LayerNorm (Ba et al., 2016). We train with Adam (Kingma & Ba, 2014) at a 2 × 10^{−4} learning rate. We apply an exponential moving average with a rate of 0.9999 before evaluating/generating our data. We visualize our full training and evaluation curves below, as well as some samples taken from our model. Overall, we seem to be able to match the distribution reasonably well.

Figure 7. Training dynamics for Simplex Diffusion.

To ensure that we are able to generate data, we generated 10,000 samples and compare the generated histograms of the (most likely) classes. Results are shown below:

Figure 8. Generated simplex probabilities for Simplex Diffusion. (a) Ground truth. (b) Generated.

D. Additional Generated Images

Figure 9. CIFAR-10 generated images.

Figure 10. ImageNet64 w = 1 classifier-free guided samples.

Figure 11. ImageNet64 w = 2.5 classifier-free guided samples.

Figure 12. CIFAR-10 generated images (trained for BPD).

Figure 13. ImageNet32 generated images (trained for BPD).

Figure 14. w = 1 baseline classifier-guided images without thresholding. We sample from the pretrained model from Dhariwal & Nichol (2021). Around 75% of samples diverge, while most of the rest have noticeable artifacts such as a glaring white background.

Figure 15. w = 1 baseline classifier-guided images without thresholding and with DDIM sampling. Interestingly, these samples don't diverge.

Figure 16. w = 1 baseline classifier-free guided images sampled with DDIM without thresholding. This corresponds to ODE sampling.

Figure 17. Dynamically thresholded images; matches Figure 4.

Figure 18. w = 0.5 ODE samples, Reflected Diffusion.

Figure 19. ODE number of forward evaluations (NFE) vs. guidance weight for Reflected Diffusion. Increasing the guidance weight tends to increase the number of forward evaluations, but this is still relatively low.