# Reflected Diffusion Models

Aaron Lou¹  Stefano Ermon¹

¹Department of Computer Science, Stanford University. Correspondence to: Aaron Lou.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Code: https://github.com/louaaron/Reflected-Diffusion/

Abstract

Score-based diffusion models learn to reverse a stochastic differential equation that maps data to noise. However, for complex tasks, numerical error can compound and result in highly unnatural samples. Previous work mitigates this drift with thresholding, which projects to the natural data domain (such as pixel space for images) after each diffusion step, but this leads to a mismatch between the training and generative processes. To incorporate data constraints in a principled manner, we present Reflected Diffusion Models, which instead reverse a reflected stochastic differential equation evolving on the support of the data. Our approach learns the perturbed score function through a generalized score matching loss and extends key components of standard diffusion models, including diffusion guidance, likelihood-based training, and ODE sampling. We also bridge the theoretical gap with thresholding: such schemes are just discretizations of reflected SDEs. On standard image benchmarks, our method is competitive with or surpasses the state of the art without architectural modifications and, for classifier-free guidance, our approach enables fast exact sampling with ODEs and produces more faithful samples under high guidance weight.

1. Introduction

Originally introduced in Sohl-Dickstein et al. (2015) and later augmented in Song & Ermon (2019b); Ho et al. (2020); Song et al. (2021b), diffusion models have quickly become one of the most ubiquitous deep generative models, with applications in many domains including images (Dhariwal & Nichol, 2021), natural language (Li et al., 2022), and molecule generation (Xu et al., 2022). Additionally, their stability and scalability have enabled the deployment of large text-to-image systems (Ramesh et al., 2022).

Diffusion models learn to reverse a stochastic process that maps data to noise but, as a result of the inherent approximation error of both the SDE discretization and score matching, they often follow incorrect trajectories. This behavior compounds error, so models can diverge and generate highly unnatural samples on more complex tasks (see, for instance, Figure 3). To mitigate this degeneration, many diffusion models modify the sampling process by projecting to the support of the data after each diffusion step (Ho et al., 2020; Li et al., 2022), a technique known as thresholding. This incorporates the known constraints of the data distribution, stabilizing sampling and avoiding divergent behavior. Notably, this oft-overlooked detail underlies many pixel-based image diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021) (appearing as a [0, 255] clipping function at each diffusion step) and is essential for text-to-image generation (Saharia et al., 2022). Although thresholding avoids failure, it is theoretically unprincipled because it leads to a mismatch between the training and generative processes. Furthermore, this mismatch can introduce artifacts such as oversaturation (Ho & Salimans, 2022) that necessitate further modifications (Saharia et al., 2022).
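To make the thresholding mechanism concrete, the following is a minimal sketch (not the paper's code) of one clipped reverse-diffusion step, assuming pixel values rescaled to [0, 1] and generic callables f, g, and score_fn supplied by the user.

```python
import torch

def thresholded_reverse_step(x, t, dt, f, g, score_fn):
    """One reverse-diffusion Euler-Maruyama step followed by "static
    thresholding": the iterate is clipped back to the data domain
    (here [0, 1]^d) after every step."""
    drift = f(x, t) - g(t) ** 2 * score_fn(x, t)   # reverse-SDE drift
    z = torch.randn_like(x)
    x = x - drift * dt + g(t) * dt ** 0.5 * z      # step from t to t - dt
    return x.clamp(0.0, 1.0)                       # project back to [0, 1]^d
```

The clipping line is exactly the post hoc projection that the training objective of a standard diffusion model never sees, which is the mismatch this paper addresses.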
In this work, we present Reflected Diffusion Models, a class of diffusion models that, by design, respects the known support of the data distribution. Unlike standard diffusion models, which perturb the data density with Brownian motion, our method evolves the distribution with reflected Brownian motion that always stays within the domain. We then parameterize the reversed diffusion process with the scores of the perturbed density, which we learn using a new score matching method on bounded domains. The resulting generative model is a reflected SDE that automatically incorporates the data constraints without altering the generative process. We provide an overview of our method in Figure 1. Our proposed methodology has several merits:

Scales to high dimensions. To learn the score function on a general bounded domain, we introduce constrained denoising score matching (CDSM). Unlike previous methods (Hyvärinen, 2007), CDSM scales to high dimensions, and we develop an algorithm for fast computation. Since reflection operations are negligible compared to neural network computation, our training and inference times are effectively equivalent to those of standard diffusion models.

Figure 1. Overview of Reflected Diffusion Models. We map a data distribution p_0 supported on Ω to the prior distribution p_T through a reflected stochastic differential equation (Section 3.1). Whenever a Brownian trajectory hits ∂Ω, it is reflected back inside instead of escaping (circled in red), so p_t is supported on Ω for all t. We can recover p_0 from p_T with a reversed reflected stochastic differential equation (Section 3.2) by learning the Stein score ∇_x log p_t (Section 4). Our generative model is guaranteed to be constrained to Ω.

Key features transfer over. We show that ODE sampling (Song et al., 2021b), diffusion guidance (Ho & Salimans, 2022), and maximum likelihood bounds (Song et al., 2021a) extend to the reflected setting. As such, our method can be modularly applied to preexisting diffusion model systems.

Justifying and correcting previous methods. We draw connections with the thresholding methods used in pixel-space diffusion models (Saharia et al., 2022). These methods all sample from a reflected stochastic differential equation despite being trained on a standard diffusion process. Correctly training with our CDSM loss avoids pathological behavior and allows for equivalent ODE sampling.

Broad applicability. We apply our method to high-dimensional simplices (e.g. class probabilities) and hypercubes (e.g. images). Using a synthetic example, we show that our method is the first simplex diffusion model that scales to high dimensions. On common image generation benchmarks, our results are competitive with or surpass the current state of the art. In particular, on unconditional CIFAR-10 generation (Krizhevsky, 2009), we achieve a state-of-the-art Inception Score of 10.42 and a comparable FID score of 2.72. For likelihood estimation, our method achieves a second-best score of 2.68 and 3.74 bits per dimension on CIFAR-10 and ImageNet32 (van den Oord et al., 2016) without relying on either importance sampling or learned noise schedules.

2. Background

To introduce diffusion models (in the continuous-time formalism of Song et al. (2021b)), we first transform a data density q_0 on R^d by applying a forward diffusion process.
This takes the form of perturbing points x_0 ∼ q_0 with an SDE with a fixed drift coefficient f : R^d × R → R^d, diffusion coefficient g : R → R, and Brownian motion B_t:

dx_t = f(x_t, t) dt + g(t) dB_t    (1)

The resulting family of time-varied distributions x_t ∼ q_t approaches a known prior distribution q_T ≈ N(0, σ_T² I). This density evolution process can be reversed by perturbing samples x_T ∼ q_T with a reversed SDE (Anderson, 1982):

dx_t = (f(x_t, t) − g(t)² ∇_x log q_t(x_t)) dt + g(t) dB̄_t    (2)

where B̄_t is time-reversed Brownian motion. Diffusion models approximate this reverse process by learning ∇_x log q_t, known as the score function, through a λ-weighted score matching loss:

E_{t, x_t∼q_t} λ_t ‖s_θ(x_t, t) − ∇_x log q_t(x_t)‖²    (3)

which most commonly takes the form of the more tractable denoising score matching loss (Vincent, 2011):

E_{t, x_0∼q_0, x_t∼q_t(·|x_0)} λ_t ‖s_θ(x_t, t) − ∇_x log q_t(x_t|x_0)‖²    (4)

Here, q_t(x_t|x_0) is the transition kernel induced by the SDE in Equation 1. With a learned score s_θ(x, t) ≈ ∇_x log q_t, one can define a generative model by first sampling y_T ∼ N(0, σ_T² I) and then solving the reverse SDE

dy_t = (f(y_t, t) − g(t)² s_θ(y_t, t)) dt + g(t) dB̄_t    (5)

from time T to 0, giving an approximate sample from q_0.

Diffusion models enjoy many special properties. For example, for certain λ_t, Equation 4 can be reformulated as an ELBO using Girsanov's theorem (Song et al., 2021a; Kingma et al., 2021; Huang et al., 2021), allowing for maximum likelihood training. Furthermore, one can derive an equivalent neural ODE that can be used for sampling and exact likelihood evaluation (Chen et al., 2018).

Guidance. One can also control the diffusion model to sample from a synthetic distribution q_t(x_t|c) ∝ q_t(c|x_t)^w q_t(x_t). Here, c is a desired condition such as a class or text description, and interpolating the guidance weight w controls the fidelity and diversity of the samples. This requires the score

∇_x log q_t(x_t|c) = w ∇_x log q_t(c|x_t) + ∇_x log q_t(x_t)    (6)

which can be learned, s_θ(x_t, t, c) ≈ ∇_x log q_t(x_t|c), without requiring explicit training on q_t. For example, classifier-guided methods (Song et al., 2021b; Dhariwal & Nichol, 2021) combine a pretrained score function s_θ(x_t, t) and classifier q_t(c|x_t):

s_θ(x_t, t, c) := w ∇_x log q_t(c|x_t) + s_θ(x_t, t)    (7)

and classifier-free guidance methods (Ho & Salimans, 2022) use a c-conditioned score function and an implicit Bayes classifier q_t(c|x_t) = q_t(x_t|c) / (q_t(x_t) q_t(c)):

s_θ(x_t, t, c) := (w + 1) s_θ(x_t, t, c) − w s_θ(x_t, t)    (8)

Thresholding. However, since s_θ is not a perfect score function, there is a mismatch between the modeled backward process and the true forward process. Thus, diffusion models can push a sample to areas where q_t(x_t) is small, which creates a negative feedback loop since score matching struggles in low-probability areas (Song & Ermon, 2019a; Koehler et al., 2022). This causes the sampling process to diverge and commonly occurs for more complex tasks, especially those involving diffusion guidance. To combat this, many previous works alter the diffusion sampling procedure using thresholding (Saharia et al., 2022; Li et al., 2022), which stabilizes the sampling process with inductive biases from the data.
In particular, thresholding applies an operator O that projects back to the data domain Ω during each discretized SDE step:

y_{t−Δt} = O(y_t − (f(y_t, t) − g(t)² s_θ(y_t, t)) Δt) + g(t) B_Δt    (9)

For the case of images, O can be static thresholding, which clips each dimension to the pixel range [0, 255], or dynamic thresholding, which first normalizes all pixels by the p-th percentile pixel before clipping (Saharia et al., 2022).

Thresholding alleviates divergent sampling but comes with considerable downsides. For example, it breaks the theoretical setup, since the generative model no longer approximates the reverse diffusion process. This mismatch induces artifacts during sampling and precludes the use of ODE sampling (Song et al., 2021b).

3. Reflected Diffusion Models

In this section, we present Reflected Diffusion Models. These define a generative model on a data domain Ω (assumed to be connected and compact with nonempty interior and uniform Hausdorff dimension) which outer-bounds the support of the data distribution p_0. Our method retains the theoretical underpinnings of diffusion models while incorporating inductive biases from thresholding. We highlight the core mechanisms in Figure 1.

3.1. Reflected Stochastic Differential Equations

To model diffusion processes on a compact domain Ω, we use reflected SDEs. For ease of presentation, we only give an intuitive definition of reflected SDEs and simplify so that g is scalar and L_t reflects in the normal direction. In Appendix A.1, we provide a more rigorous mathematical definition and generalize to matrix diffusion coefficients and oblique reflections. For a full introduction, we recommend the reader consult a monograph such as Pilipenko (2014).

Our reflected SDEs perturb an initial datum x_0 ∼ p_0 and are parameterized by a drift coefficient f : Ω × R → R^d and diffusion coefficient g : R → R:

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (10)

The first two terms on the right-hand side of Equation 10 are exactly those of Equation 1, showing that our reflected SDE behaves like a regular SDE in the interior of Ω. L_t is the additional boundary constraint that, intuitively, forces the particle to stay inside Ω: when x_t hits ∂Ω, L_t neutralizes the outward normal-pointing component. This reflected SDE has a unique strong solution as long as f and g are Lipschitz in state and time and Ω satisfies the uniform exterior sphere condition (Pilipenko, 2014, Theorem 2.5.4), which ensures that ∂Ω is sufficiently regular. In particular, the uniform exterior sphere condition holds when ∂Ω is smooth and even when Ω is a convex polytope.

3.2. Density Evolution and Time Reversal

When we perturb p_0 with the reflected SDE in Equation 10, our density evolves according to the Fokker–Planck equation with a Neumann boundary condition (Schuss, 2013):

∂_t p_t = div(−p_t f + (g²/2) ∇_x p_t)
(−p_t f + (g²/2) ∇_x p_t) · n = 0,  x ∈ ∂Ω, n normal, t > 0    (11)

In addition to allowing us to characterize the limiting density p_T, this induces a reversed reflected stochastic differential equation (Cattiaux, 1988; Williams, 1988):

dx_t = (f(x_t, t) − g(t)² ∇_x log p_t(x_t)) dt + g(t) dB̄_t + dL̄_t    (12)

where L̄_t is the reversed boundary condition. For our case, L̄_t also reflects in the normal direction.

Remark 3.1. The reversed reflected SDE closely resembles the reversed standard SDE given in Equation 2. On one hand, this is natural because the local dynamics match: when Ω = R^d, L_t disappears since x_t can never hit ∂R^d = ∅.
On the other hand, it is surprising that we can reverse a reflected diffusion process with another reflected diffusion process, something that does not hold in the discrete-time case.

3.3. Reflected SDEs in Practice

In our experiments, Ω will be either the unit cube C^d := {x ∈ R^d : 0 ≤ x_i ≤ 1} or the unit simplex ∆^d := {x ∈ R^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}. We often find it more convenient to work with the projected simplex ∆̄^d := {x ∈ R^{d−1} : Σ_{i=1}^{d−1} x_i ≤ 1, x_i ≥ 0}, as it is full-dimensional instead of lying in a hyperplane.

We will diffuse with the Reflected Variance Exploding SDE (RVE SDE), a generalization of the Variance Exploding SDE introduced in Song et al. (2021b). An RVE SDE is parameterized by σ_0 ≪ σ_1 and is defined for t ∈ [0, 1] by

dx_t = g(t) dB_t + dL_t    (13)

where g(t) := σ_0^{1−t} σ_1^t √(2 log(σ_1/σ_0)). The reverse is

dx_t = −g(t)² ∇_x log p_t(x_t) dt + g(t) dB̄_t + dL̄_t    (14)

Note that the RVE SDE corresponds to a time-dilated version of reflected Brownian motion: time t of an RVE SDE corresponds to time σ_t of reflected Brownian motion, where σ_t := σ_0^{1−t} σ_1^t. As a result of Equation 11, p_0 evolves under a heat equation with Neumann boundary conditions:

∂_t p_t = (g(t)²/2) Δ_x p_t,   ∇_x p_t · n = 0 on ∂Ω    (15)

Note that p_1 becomes a uniform density over Ω for large enough σ_1. To see this, we can draw intuition from physics: heat homogenizes in a closed container.

4. Score Matching on Bounded Domains

While the reflected SDE framework provides a nice theoretical pathway to construct a reflected diffusion model, it requires one to learn the score function s_θ ≈ ∇_x log p_t on Ω. We minimize the constrained score matching loss:

(1/2) E^Ω_{x∼p} ‖s_θ(x) − ∇_x log p(x)‖²    (16)

where we omit time dependence for presentation purposes. Furthermore, E^Ω indicates the domain of the expectation (as opposed to E, which is an integral over R^d). This is because p can be discontinuous at ∂Ω (since it is 0 outside of Ω and can be nonzero on ∂Ω), so constraining the integral ensures the regularity properties used for theorems (such as Stokes'). In this section, we review previous methods for score matching on bounded domains, discuss their fundamental limitations, and propose constrained denoising score matching to overcome these difficulties. Additionally, for the RVE SDE introduced in Section 3.3, we show how to quickly compute the score matching training objective.

4.1. Pitfalls of Implicit Score Matching

One may hope to draw inspiration from the standard paradigm, which transforms the score matching integral

(1/2) E_{x∼q} ‖s_θ(x) − ∇_x log q(x)‖²    (17)

into the implicit score matching loss (Hyvärinen, 2005):

E_{x∼q} [div(s_θ)(x) + (1/2) ‖s_θ(x)‖²]    (18)

This removes the intractable ∇_x log q(x), allowing for estimation using Monte Carlo sampling. However, the derivation requires the use of Stokes' theorem; applying Stokes' theorem to Equation 16 would instead result in

E^Ω_{x∼p} [div(s_θ)(x) + (1/2) ‖s_θ(x)‖²] + ∫_{∂Ω} p(x) ⟨s_θ(x), n(x)⟩ dx    (19)

where n(x) is the interior-pointing normal vector. Unlike the case of Ω = R^d, where the second term disappears since ∂R^d = ∅, this result is computationally intractable.
Thus, previous work instead proposes to re-weight the loss function with a nonnegative function h that vanishes on the boundary (Hyvärinen, 2007; Yu et al., 2020), minimizing

(1/2) E^Ω_{x∼p} h(x) ‖s_θ(x) − ∇_x log p(x)‖²    (20)

Since h vanishes on ∂Ω, we can cleanly apply Stokes' theorem and derive a result without a boundary term, giving an implicit score matching loss:

E^Ω_{x∼p} [div(h s_θ)(x) + (h(x)/2) ‖s_θ(x)‖²]    (21)

However, this formulation is not suitable for high dimensions, even with fast numerical algorithms for the divergence operator (Hutchinson, 1989; Song et al., 2019). This is because the loss is downweighted near the boundaries, so, for a fixed budget, the error can become unbounded as x → ∂Ω. In high dimensions, the space near the boundary becomes an increasingly larger proportion of the total volume,¹ which greatly hampers the sample efficiency of the loss.

¹ Consider the case when Ω = [0, 1]^d. For large d, almost all of the mass is close to the boundary.

4.2. Constrained Denoising Score Matching

Inspired by the empirical success of denoising score matching (Vincent, 2011; Song & Ermon, 2019a), we present constrained denoising score matching (CDSM). Crucially, denoising score matching, unlike implicit score matching, can directly generalize to bounded domains due to how it handles discontinuities. This means that, unlike previous methods for constrained score matching, the derivation transfers smoothly. The core mechanism is presented in the following proposition, which we prove in Appendix A.2.

Proposition 4.1. Suppose that we perturb an Ω-supported density a(x) with noise b(x|·) (also supported on Ω) to get a new density b(x) := ∫_Ω a(y) b(x|y) dy. Then, under suitable regularity conditions for the smoothness of a and b, the score matching loss for b:

(1/2) E^Ω_{x∼b} ‖s_θ(x) − ∇_x log b(x)‖²    (22)

is equal (up to a constant that does not depend on s_θ) to the CDSM loss:

(1/2) E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ‖s_θ(x) − ∇_x log b(x|x_0)‖²    (23)

With the constrained denoising score matching loss, we are then able to define a training objective for reflected diffusion models. In particular, since p_t(x) is by definition a perturbed density of p_0(x) with transition kernel p_t(x|·), the weighted score matching loss directly becomes:

E^Ω_{t, x_0∼p_0, x_t∼p_t(·|x_0)} λ_t ‖s_θ(x_t, t) − ∇_x log p_t(x_t|x_0)‖²    (24)

For our reflected SDE, we will set λ_t ∝ g(t)², mirroring previous work and minimizing variance during optimization. Interestingly, as we prove in Section 7, this corresponds to an ELBO loss when we reverse an RVE SDE.

4.3. Scaling Score Matching Computation

We finalize by showing how to sample from and compute the score of the transition density p_t(x_t|x_0) for the RVE SDE. Note that this is the transition density of a reflected Brownian motion (Harrison & Reiman, 1981). We highlight the key features of our method in Figure 2.

Sampling. To sample from p_t(x_t|x_0), we can repeatedly reflect a sample y from N(x_0, σ_t² I). In particular, we follow the line segment s ↦ s y + (1 − s) x_0, reflecting in the normal direction whenever it crosses ∂Ω and repeating until we reach s = 1. This works because, intuitively, the boundary redirects the Brownian motion but does not change its magnitude. In practice, this process can be quickly computed with classic computational geometry techniques.
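For the hypercube used in our experiments, repeated reflection at the faces reduces to a closed-form "fold." The sketch below illustrates the sampling step under that assumption (writing sigma_t for the transition-kernel standard deviation); it is an illustration rather than the paper's exact routine.

```python
import numpy as np

def sample_reflected_gaussian(x0, sigma_t, rng=None):
    """Draw x_t ~ p_t(. | x0) for reflected Brownian motion on [0, 1]^d:
    sample y ~ N(x0, sigma_t^2 I) and fold it back into the cube.
    Repeated reflection at the faces 0 and 1 equals a triangle-wave fold."""
    rng = np.random.default_rng(rng)
    y = x0 + sigma_t * rng.standard_normal(np.shape(x0))
    y = np.mod(y, 2.0)                      # reflections tile the line with period 2
    return np.where(y <= 1.0, y, 2.0 - y)   # fold [1, 2) back onto (0, 1]
```

Because the fold acts coordinate-wise, this sampler costs essentially the same as drawing ordinary Gaussian noise.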
Score Computation. There are two approaches for computing the score of p_t(x_t|x_0) on general geometric domains.

Approximation with a sum of Gaussians (Jing et al., 2022b). This method decomposes p_t(x_t|x_0) into an infinite sum of Gaussian densities that depend on x_0, x_t, and the geometry of the domain. For our bounded Ω with the reflection condition, this gives us the equation

p_t(x_t|x_0) = Σ_{x'∈R(x_t)} p_{N(x_0, σ_t² I)}(x')    (25)

where p_{N(x_0, σ_t² I)} is the pdf of the Gaussian centered at x_0 with variance σ_t² I and R(x_t) is the set of all x' ∈ R^d such that the repeated reflection of the path s ↦ s x' + (1 − s) x_0 ends at x_t. Note that this reflection scheme is the same one we use for sampling. Furthermore, through elementary derivations, this gives us a formula for the score ∇_x log p_t(x_t|x_0). Generally, this method works quite well for small σ_t, as we only need to take a small number of local reflections to approximate p_t(x_t|x_0). However, for larger σ_t, we need to take many more reflections since the underlying Gaussian is too dispersed, greatly increasing the computational cost.

Approximation with Laplacian eigenfunctions (Bortoli et al., 2022). This method instead computes using Laplacian eigenfunctions, a standard technique for solving the heat equation (Evans, 2010). For our problem, these are a (known for each Ω) set of functions f_i ∈ L²(Ω), i ∈ N, that satisfy Δf_i = −λ_i f_i and ∇f_i · n = 0 on ∂Ω. In particular, these form an orthonormal basis for L²(Ω), allowing us to solve Equation 15 directly for an initial density of δ_{x_0}:

p_t(x_t|x_0) = Σ_{i=0}^∞ e^{−λ_i σ_t²/2} f_i(x_t) f_i(x_0)    (26)

This method works well for large σ_t because e^{−λ_i σ_t²/2} ≈ 0 for all but the lowest frequencies, removing the need to evaluate many of the terms. However, for small σ_t, this method becomes costly because it requires many more terms. Similar to the above method, we can derive a formula for ∇_x log p_t(x_t|x_0) through this sum.

Our Method. We instead propose to combine the above two approaches. In particular, we note that they complement each other: the Gaussian sum is accurate for small σ_t and the eigenfunction sum is accurate for large σ_t. We can therefore set a σ* ∈ (σ_0, σ_1) and compute with the Gaussian sum when σ_t < σ* and with the eigenfunction sum when σ_t > σ*. In practice, this allows us to upper-bound the number of reflections/eigenfunctions used to 5, much fewer than the exponential amount required by each method individually.

Scaling to High Dimensions. By itself, this branching method is unfortunately not enough to scale to very high dimensions. In particular, our computation is, in the worst case, O(d^k), where d is the dimension of Ω and k is the number of reflection steps or the highest eigenfunction frequency. We have only bounded k to something more manageable.

Figure 2. An overview of our computational method for constrained denoising score matching with Brownian transition probabilities. (i) We can draw samples by sampling N(x_0, σ_t² I) and then applying reflections at the boundary. (ii) When t is small, we compute the transition density by summing up a mixture of Gaussians (shown for Ω = [0, 1]). (iii) When t is large, we compute using the frequencies of Ω (shown for Ω = [0, 1]). (iv) We diffeomorphically transform Ω → [0, 1]^d, where the transition score is tractable.
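To make the hybrid rule concrete for the one-dimensional interval pictured in Figure 2, the sketch below evaluates the transition-kernel score by switching between a truncated image (Gaussian) sum and a truncated cosine eigenfunction sum. The crossover sigma_star, the truncation level, and the assumption that the kernel is a reflected N(x_0, σ_t²) are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def reflected_kernel_score_1d(y, x, sigma, sigma_star=0.5, n_terms=5):
    """Score d/dy log p_sigma(y | x) of the reflected heat kernel on [0, 1]."""
    if sigma < sigma_star:
        # Method of images: sum Gaussians centred at the reflections of x.
        p, dp = 0.0, 0.0
        for n in range(-n_terms, n_terms + 1):
            for arg in (y - x + 2 * n, y + x + 2 * n):
                phi = np.exp(-arg**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
                p += phi
                dp += -arg / sigma**2 * phi
    else:
        # Neumann Laplacian eigenfunction (cosine) expansion; see Appendix B.1.
        p, dp = 1.0, 0.0
        for k in range(1, n_terms + 1):
            decay = np.exp(-(k * np.pi) ** 2 * sigma**2 / 2)
            p += 2 * decay * np.cos(k * np.pi * x) * np.cos(k * np.pi * y)
            dp += -2 * k * np.pi * decay * np.cos(k * np.pi * x) * np.sin(k * np.pi * y)
    return dp / p
```

Both branches approximate the same kernel; the small number of retained terms is what keeps the cost comparable to ordinary denoising score matching.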
To overcome this scaling issue, we consider the simple case of the hypercube [0, 1]^d. Since our Brownian motion does not have inter-dimensional interactions, reflections do not interact between non-parallel hyperplanes, and Laplacian eigenfunctions factorize by dimension, so we can decompose the probability along each component interval:

p_t(x_t|x_0) = ∏_{i=1}^d p_t^i(x_t^i | x_0^i)    (27)

where x_t^i and x_0^i are the i-th components of x_t and x_0, respectively, and p_t^i is the marginal probability on the i-th coordinate. Note that the RVE SDE on [0, 1]^d marginalizes to an RVE SDE on [0, 1] for each dimension:

dx_t^i = g(t) dB_t^i + dL_t^i    (28)

where B_t^i and L_t^i are the Brownian motion and boundary condition (respectively) for dimension i. We can therefore compute on each Ω_i = [0, 1] and combine the results, reducing the cost from O(d^k) to O(kd). Regular score matching is O(d), and since k is small, we can train Reflected Diffusion Models just as quickly as regular diffusion models.

For more general domains, under certain conditions, we can smoothly and bijectively map int(Ω) → (0, 1)^d. Thus, we can instead learn a diffusion model on [0, 1]^d and then project back to Ω. More details are given in Appendix B.2. In particular, this mapping procedure allows us to learn a diffusion model on high-dimensional simplices ∆^d.

5. Simulating Reflected SDEs

Combining a score s_θ learned through CDSM and the reverse reflected SDE, we have a Reflected Diffusion Model: sample x_T ∼ U(Ω) and solve the reflected SDE

dx_t = −g(t)² s_θ(x_t, t) dt + g(t) dB̄_t + dL̄_t    (29)

In this section, we examine numerical methods for simulating samples from this reflected SDE.

5.1. Euler–Maruyama Discretizations and Thresholding

The typical Euler–Maruyama discretization of a standard SDE (Equation 1) is given by

x_{t+Δt} = x_t + f(x_t, t) Δt + g(t) B_Δt    (30)

where B_Δt ∼ N(0, Δt I). For reflected SDEs, one can adapt this discretization by approximating the effect of L_t with a suitable operator O:

x_{t+Δt} = O(x_t + f(x_t, t) Δt + g(t) B_Δt)    (31)

Common examples of O include the projection operator proj(x) = argmin_{y∈Ω} d(x, y) (Liu, 1993) and the reflection operator refl used in Section 4.3 (Schuss, 2013). One can see that, as Δt → 0, both the projection and reflection schemes converge in distribution. Empirically, we find that reflection generates better samples.

Interestingly, this closely mirrors the thresholding step given in Equation 9, with the only difference being the choice of operator O and whether O is applied before or after the noise step. This difference disappears as Δt → 0:

Proposition 5.1 (Thresholding solves a reflected SDE). Both types of thresholding solve the reflected SDE (Equation 10) as Δt → 0 under suitable conditions.

The full proposition and proof are given in Appendix A.5.
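The following is a sketch of a reflected Euler–Maruyama sampler for the reverse RVE SDE on the unit cube. The score_model interface, the noise schedule endpoints, and the step count are illustrative assumptions, not the released implementation.

```python
import torch

def reflect(x):
    """Reflection operator on [0, 1]^d (repeated reflection = folding)."""
    y = torch.remainder(x, 2.0)
    return torch.where(y <= 1.0, y, 2.0 - y)

@torch.no_grad()
def sample_reflected_ve(score_model, shape, sigma0=0.01, sigma1=5.0, n_steps=1000):
    """Reflected Euler-Maruyama sampler for the reverse RVE SDE (Eq. 29):
    start from x_1 ~ U([0,1]^d) and integrate from t = 1 down to t = 0."""
    x = torch.rand(shape)                       # prior p_1 is (approximately) uniform
    dt = 1.0 / n_steps
    log_ratio = torch.log(torch.tensor(sigma1 / sigma0))
    for i in range(n_steps):
        t = 1.0 - i * dt
        sigma = sigma0 ** (1 - t) * sigma1 ** t
        g2 = 2 * log_ratio * sigma ** 2         # g(t)^2 for the RVE SDE
        score = score_model(x, torch.full(shape[:1], t))
        x = x + g2 * score * dt + (g2 * dt) ** 0.5 * torch.randn_like(x)
        x = reflect(x)                          # discretized boundary term dL_t
    return x
```

Swapping `reflect` for a clamp recovers the projection variant of Equation 31; the only extra cost over standard sampling is the fold, which is negligible next to the network evaluation.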
5.2. Predictor–Corrector

We extend the predictor–corrector (PC) framework of Song et al. (2021b), which has been shown to improve results. In particular, our learned scores can be used to augment the sampling procedure using Langevin dynamics (Song & Ermon, 2019a). However, this requires Langevin dynamics for a constrained domain (Bubeck et al., 2015), which, for a probability density p, are given by the reflected SDE:

dx_t = (1/2) ∇_x log p(x_t) dt + dB_t + dL_t    (32)

During our reversed diffusion iterations, we can discretize the Langevin dynamics using reflected Euler–Maruyama and apply our learned score s_θ(·, t):

x' = refl(x_t + (ε/2) s_θ(x_t, t) + √ε z),   z ∼ N(0, I)    (33)

In practice, we find that PC sampling with a small signal-to-noise ratio noticeably improves image generation results.

CIFAR-10 Quality Results. With these components, we test our method for image generation on the CIFAR-10 dataset and report Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) in Table 1. Our models remain competitive, achieving a state-of-the-art Inception Score of 10.42. However, Tweedie's formula does not generalize to reflected diffusion (Efron, 2011) (more details are in Appendix B.3), so our model generates images with imperceptible noise (on the scale of 1-2 pixels), which degrades the FID score to 2.72 (Jolicoeur-Martineau et al., 2020). Despite this, our samples are diverse and visually indistinguishable (Appendix D).

| Model | IS | FID |
|---|---|---|
| NCSN++ (Song et al., 2021b) | 9.89 | 2.20 |
| DDPM++ (Song et al., 2021b) | 9.68 | 2.41 |
| Styleformer (Park & Kim, 2021) | 9.94 | 2.82 |
| UNCSN++ (Kim et al., 2021) | 10.11 | – |
| ViTGAN (Lee et al., 2021) | 9.89 | 4.87 |
| Subspace NCSN++ (Jing et al., 2022a) | 9.99 | 2.17 |
| EDM (Karras et al., 2022) | – | 1.97 |
| Reflected Diffusion (ours) | 10.46 | 2.72 |

Table 1. CIFAR-10 sample quality results. We test Reflected Diffusion Models on CIFAR-10 image generation and report IS and FID scores. Our model is highly competitive, achieving a state-of-the-art Inception Score for unconditional generation. However, FID lags behind due to noise (as discussed in Appendix B.3).

5.3. Probability Flow ODE

Similarly to the probability flow ODE derived in Song et al. (2021b), one can construct an equivalent deterministic process for a reflected SDE. Interestingly, doing this removes the boundary reflection term, so our deterministic process is exactly the original probability flow ODE derived in Song et al. (2021b):

dx = (f(x, t) − (1/2) g(t)² ∇_x log p_t(x)) dt    (34)

Crucially, the thresholding effect is maintained due to the Neumann condition for ∇_x log p_t (Equation 11, line 2) and cannot be replicated for standard diffusion models. We elaborate on this construction, as well as connections with DDIM (Song et al., 2020), in Appendix A.3.

Figure 3. Without thresholding, standard diffusion models easily diverge. We sample using classifier-free guidance (w = 1) from a standard diffusion model without using thresholding. Around half of the samples diverge (generating blank images).

6. Diffusion Guidance

Both classifier and classifier-free guidance (Equations 7 and 8) extend to Reflected Diffusion Models by logarithm and gradient rules. Since thresholding is primarily useful for diffusion guidance, we investigate the relationship between thresholding, diffusion guidance, and Reflected Diffusion Models on the relatively simple downsampled 64x64 ImageNet dataset (Russakovsky et al., 2014).

Thresholding is critical. We corroborate Saharia et al. (2022), showing that pixel-space diffusion guidance requires thresholding. We show this for classifier-free guidance in Figure 3, where even a low weight w = 1 causes about half of the samples to diverge. For classifier guidance, around 75% of samples diverge (Figure 14).

Our method retains fidelity under high guidance weight. Thresholding produces oversaturated images under high guidance weight w (Ho & Salimans, 2022; Saharia et al., 2022), hampering applications that require high-fidelity generation. We hypothesize that this is caused by the training and sampling mismatch, and we show in Figure 4 that our method retains fidelity under high guidance weight. We did not find the dynamic thresholding method to perform better.
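As a reference point for how the guided score is assembled, here is a minimal sketch of the classifier-free combination of Equation 8; the score_model signature (with a None condition for the unconditional branch) is an assumption for illustration.

```python
import torch

def cfg_score(score_model, x, t, c, w):
    """Classifier-free guidance (Eq. 8): (w + 1) * s(x, t, c) - w * s(x, t).
    The same linear combination is used for reflected models."""
    s_cond = score_model(x, t, c)         # conditioned on class / text c
    s_uncond = score_model(x, t, None)    # unconditional branch
    return (w + 1.0) * s_cond - w * s_uncond
```

Because the combination is linear, properties shared by both branches (such as the boundary behavior discussed next) carry over to the guided score.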
ODE sampling works for classifier-free guidance. The composed score function in classifier-free guidance (Equation 8) maintains the Neumann boundary condition (Equation 11), allowing for ODE sampling. Using this, we demonstrate the first case of high-fidelity classifier-free guided generation using ODEs in Figure 5. Interestingly, ODE-equivalent DDIM sampling fails for classifier-free guidance but works for classifier guidance, despite classifier guidance being worse without thresholding (Appendix D).

Figure 4. Non-cherry-picked guided samples from a reflected and a standard diffusion model with high guidance weight. We compare Reflected Diffusion Models with standard diffusion models for generating class-conditioned 64x64 ImageNet samples with a guidance weight of w = 15. Our generated images are shown on the left, and the baseline is shown on the right (the same positions have the same classes). Our method retains fidelity while the baseline suffers from oversaturation.

Figure 5. Guided ODE samples. We sample using our ODE with a guidance weight of w = 1.5, retaining image fidelity with fewer forward evaluations (around 100 compared with 1000).

7. Likelihood Bound

Incidentally, our weighted score matching loss corresponds to an ELBO for our generative model. To show this, we extend Girsanov's theorem (Øksendal, 1987), which is used to derive the ELBO for standard diffusion models (Song et al., 2021a; Kingma et al., 2021; Huang et al., 2021):

Theorem 7.1 (Reflected Girsanov for KL divergence). Suppose we have two reflected SDEs on the same domain Ω,

dx_t = f_1(x_t, t) dt + g(t) dB_t + dL_t    (35)
dy_t = f_2(y_t, t) dt + g(t) dB_t + dL_t    (36)

from t = 0 to T with x_0 = y_0 = z ∈ Ω. Let µ, ν be the path measures for (respectively) x and y. Then,

D_KL(µ ‖ ν) = (1/2) ∫_0^T E_{y∼p_{x_t}} [g(t)^{−2} ‖(f_1 − f_2)(y, t)‖²] dt    (37)

The full theorem and proof are given in Appendix A.4. Note that, by also incorporating the prior and reconstruction loss, Equation 37 gives us an upper bound on the negative log-likelihood (Appendix A.4). Furthermore, for our reversed RVE SDE, Equation 37 becomes

(1/2) ∫_0^T E_{x_t∼p_t} [g(t)² ‖s_θ(x_t, t) − ∇_x log p_t(x_t)‖²] dt    (38)

which is a scaled version of our proposed weighted score matching loss in Equation 24. Therefore, we already implicitly train with maximum likelihood. Furthermore, when applied to an individual data point x, we recover the constrained denoising score matching loss:

(1/2) ∫_0^T E_{x_t∼p_t(·|x)} [g(t)² ‖s_θ(x_t, t) − ∇_x log p_t(x_t|x)‖²] dt    (39)

which allows us to derive an upper bound on −log p(x).
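As a rough illustration of how this bound can be estimated for a single datum, the sketch below Monte Carlo samples t and x_t and averages the weighted score error of Equation 39. The helpers sample_kernel and kernel_score stand in for the routines of Section 4.3, and the prior and reconstruction terms are omitted; this is an assumed interface, not the paper's training code.

```python
import numpy as np

def diffusion_nll_term(x, score_fn, sample_kernel, kernel_score,
                       sigma0=1e-4, sigma1=5.0, n_mc=256, rng=None):
    """Estimate (1/2) * int_0^1 E[ g(t)^2 ||s_theta - grad log p_t(.|x)||^2 ] dt,
    the diffusion term of the likelihood upper bound (Eq. 39)."""
    rng = np.random.default_rng(rng)
    log_ratio = np.log(sigma1 / sigma0)
    total = 0.0
    for _ in range(n_mc):
        t = rng.uniform()
        sigma = sigma0 ** (1 - t) * sigma1 ** t
        g2 = 2 * log_ratio * sigma ** 2
        xt = sample_kernel(x, sigma, rng)               # x_t ~ p_t(. | x)
        err = score_fn(xt, t) - kernel_score(xt, x, sigma)
        total += 0.5 * g2 * np.sum(err ** 2)
    return total / n_mc   # in nats; divide by d * log(2) to report per-dimension bits
```

This is the same quantity minimized during training, which is why the method implicitly optimizes a likelihood bound.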
Image Likelihood Results. We test Reflected Diffusion Models on CIFAR-10 (Krizhevsky, 2009) and ImageNet32 (van den Oord et al., 2016) for likelihoods, both without data augmentation. Our method performs comparably to the state of the art while reducing the number of hyperparameters (in the form of importance sampling and learned noise schedules).² Note that we can compute exact likelihoods through the probability flow ODE, which typically improves results (Song et al., 2021a), but, for a fair comparison with VDM, we report the likelihood bound.

| Model | CIFAR-10 | ImageNet32 |
|---|---|---|
| *Non-diffusion* | | |
| Flow++ (Ho et al., 2019) | 3.08 | – |
| PixelCNN++ (Salimans et al., 2017) | 2.92 | – |
| Sparse Transformer (Child et al., 2019) | 2.80 | – |
| *Diffusion: modified noise schedule* | | |
| Score Flow (Song et al., 2021a) | 2.83 | 3.76 |
| VDM (Kingma et al., 2021) | 2.65 | 3.72 |
| *Diffusion: no noise modifications* | | |
| Score SDE (Song et al., 2021b) | 2.99 | – |
| ARDM (Hoogeboom et al., 2021) | 2.71 | – |
| Score Flow (Song et al., 2021a) | 2.86 | 3.83 |
| VDM (Kingma et al., 2021) | 2.70 | – |
| Reflected Diffusion (ours) | 2.68 | 3.74 |

Table 2. CIFAR-10 and ImageNet32 bits per dimension (BPD). No data augmentation; lower is better. We test the likelihood of Reflected Diffusion Models on CIFAR-10 and downsampled ImageNet32 without data augmentation. Our method is second best, nearly matching the state of the art (VDM), without requiring importance sampling or a learned noise schedule.

² We omit several results which report a better BPD than VDM (Kingma et al., 2021) on ImageNet32 but a much worse CIFAR-10 result, as they test on the ImageNet32 dataset used for classification (Chrabaszcz et al., 2017), which is significantly easier and incomparable due to the use of anti-aliasing.

8. Simplex Diffusion

We also demonstrate that our reflected diffusion model can scale to high-dimensional simplices. We train on softmaxed Inception classifier logits for ImageNet (Szegedy et al., 2014), which take values in a 1000-dimensional simplex. Our training dynamics are reported in Figure 6 (with a 0.99 EMA), showing that our method is able to optimize the loss (and thus maximize the ELBO) even in high dimensions. Our diffusion process is fundamentally different from the simplex diffusion method of Richemond et al. (2022), as we evolve our dynamics directly on the simplex, while the previous method diffuses in a higher-dimensional space (the positive orthant) and projects to the simplex.

Figure 6. Simplex Diffusion training and validation curves. Our method trains stably in high dimensions.

9. Conclusion

We introduced Reflected Diffusion Models, a class of diffusion models that respects natural data constraints through reflected SDEs. Our method scales score matching to general bounded geometries and retains theoretical constructs from standard diffusion models. Our analysis also sheds light on the commonly used thresholding sampling method and provides improvements through correct training. We did not explore architecture or noise scheduling, which are critical for state-of-the-art results; we leave this (and scaling to text-to-image generation) for future work.

Latent Diffusion (LD) (Rombach et al., 2021) is a diffusion model method that also incidentally does not require thresholding. We hypothesize that this is because both our method and LD directly incorporate data-space constraints. Notably, we work over an outer bound of the support of the data distribution, while LD works over a submanifold learned by a VAE. Future work could try to find a middle ground between these two data-support approaches.

10. Acknowledgements

This project was supported by NSF (#1651565), ARO (W911NF-21-1-0125), ONR (N00014-23-1-2159), CZ Biohub, and Stanford HAI GCP Grants. AL is supported by an NSF Graduate Research Fellowship. We would also like to thank Chenlin Meng for helpful discussions.

References

Anderson, B. D. O. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12:313 326, 1982. Ba, J., Kiros, J. R., and Hinton, G. E. Layer normalization. Ar Xiv, abs/1607.06450, 2016. Bortoli, V. D., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., and Doucet, A. Riemannian score-based generative modeling. Ar Xiv, abs/2202.02763, 2022. Bubeck, S., Eldan, R., and Lehec, J. Finite-time analysis of projected langevin monte carlo. In NIPS, 2015. Cattiaux, P. Time reversal of diffusion processes with a boundary condition. Stochastic Processes and their Applications, 28:275 292, 1988. Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations.
In Neural Information Processing Systems, 2018. Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. Ar Xiv, abs/1904.10509, 2019. Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsampled variant of imagenet as an alternative to the cifar datasets. Ar Xiv, abs/1707.08819, 2017. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Ar Xiv, abs/2105.05233, 2021. Dormand, J. R. and Prince, P. J. A family of embedded runge-kutta formulae. Journal of Computational and Applied Mathematics, 6:19 26, 1980. Efron, B. Tweedies formula and selection bias. Journal of the American Statistical Association, 106:1602 1614, 2011. Evans, L. C. Partial differential equations. American Mathematical Society, Providence, R.I., 2010. ISBN 9780821849743 0821849743. Harrison, J. M. and Reiman, M. I. Reflected brownian motion on an orthant. Annals of Probability, 9:302 308, 1981. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. Ho, J. and Salimans, T. Classifier-free diffusion guidance. Ar Xiv, abs/2207.12598, 2022. Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. Ar Xiv, abs/1902.00275, 2019. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Ar Xiv, abs/2006.11239, 2020. Hoogeboom, E., Gritsenko, A. A., Bastings, J., Poole, B., van den Berg, R., and Salimans, T. Autoregressive diffusion models. Ar Xiv, abs/2110.02037, 2021. Huang, C.-W., Lim, J. H., and Courville, A. C. A variational perspective on diffusion-based generative models and score matching. In Neural Information Processing Systems, 2021. Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18:1059 1076, 1989. Hyv arinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695 709, 2005. Hyv arinen, A. Some extensions of score matching. Comput. Stat. Data Anal., 51:2499 2512, 2007. Jing, B., Corso, G., Berlinghieri, R., and Jaakkola, T. Subspace diffusion generative models. In European Conference on Computer Vision, 2022a. Jing, B., Corso, G., Chang, J., Barzilay, R., and Jaakkola, T. Torsional diffusion for molecular conformer generation. Ar Xiv, abs/2206.01729, 2022b. Jolicoeur-Martineau, A., Piche-Taillefer, R., des Combes, R. T., and Mitliagkas, I. Adversarial score matching and improved sampling for image generation. Ar Xiv, abs/2009.05475, 2020. Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. Ar Xiv, abs/2206.00364, 2022. Kim, D., Shin, S.-J., Song, K., Kang, W., and Moon, I.-C. Soft truncation: A universal training technique of scorebased diffusion model for high precision score estimation. In International Conference on Machine Learning, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. Co RR, abs/1412.6980, 2014. Kingma, D. P., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Ar Xiv, abs/2107.00630, 2021. Koehler, F., Heckett, A., and Risteski, A. Statistical efficiency of score matching: The view from isoperimetry. Ar Xiv, abs/2210.00726, 2022. Reflected Diffusion Models Krizhevsky, A. Learning multiple layers of features from tiny images. 2009. 
Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. Ar Xiv, abs/2107.04589, 2021. Li, X. L., Thickstun, J., Gulrajani, I., Liang, P., and Hashimoto, T. Diffusion-lm improves controllable text generation. Ar Xiv, abs/2205.14217, 2022. Liu, Y. Numerical approaches to stochastic differential equations with boundary conditions. 1993. Øksendal, B. Stochastic differential equations : an introduction with applications. Journal of the American Statistical Association, 82:948, 1987. Park, J. and Kim, Y. Styleformer: Transformer based generative adversarial networks with style vector. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8973 8982, 2021. Pilipenko, A. An introduction to stochastic differential equations with reflection. 2014. Ramachandran, P., Zoph, B., and Le, Q. V. Swish: a selfgated activation function. ar Xiv: Neural and Evolutionary Computing, 2017. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. Ar Xiv, abs/2204.06125, 2022. Richemond, P. H., Dieleman, S., and Doucet, A. Categorical sdes with simplex diffusion. Ar Xiv, abs/2210.14784, 2022. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674 10685, 2021. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115:211 252, 2014. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. Ar Xiv, abs/2205.11487, 2022. Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Ar Xiv, abs/1606.03498, 2016. Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. Ar Xiv, abs/1701.05517, 2017. Schuss, Z. Brownian dynamics at boundaries and interfaces. 2013. Skorokhod, A. V. Stochastic equations for diffusion processes in a bounded region. Theory of Probability and Its Applications, 6:264 274, 1961. Sohl-Dickstein, J. N., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. Ar Xiv, abs/1503.03585, 2015. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. Ar Xiv, abs/2010.02502, 2020. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Ar Xiv, abs/1907.05600, 2019a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019b. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Conference on Uncertainty in Artificial Intelligence, 2019. Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum likelihood training of score-based diffusion models. In Neural Information Processing Systems, 2021a. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. 
Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum? id=Px TIG12RRHS. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1 9, 2014. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818 2826, 2015. Reflected Diffusion Models van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. Ar Xiv, abs/1601.06759, 2016. Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661 1674, 2011. Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, 2008. Williams, R. J. On time-reversal of reflected brownian motions. 1988. Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. Geodiff: a geometric diffusion model for molecular conformation generation. Ar Xiv, abs/2203.02923, 2022. Yu, S., Drton, M., and Shojaie, A. Generalized score matching for general domains. Information and inference : a journal of the IMA, 11 2:739 780, 2020. Reflected Diffusion Models A. Theoretical Constructs A.1. Reflected Stochastic Differential Equations We follow Pilipenko (2014) in our presentation. Given a domain Ωand an oblique reflection vector field v that satisfies v(x) n(x) = 1, where n is the inward pointing unit normal vector field, the reflected SDE is defined as dxt = f(xt, t)dt + G(xt, t)d Bt + v(xt)d Lt (40) where Lt is defined recursively as R t 0 1xt Ωd Ls. Here, we see that Lt is a process that determines whether xt hits the boundary and then applies a reflection. For our purposes in the main paper, v = n and we surppress the notation for compactness. Define σ = 1 v(xt, t) = σ(xt, t)n(x) σ(xt, t)n(x) (41) then Equation 11 generalizes (under suitable regularity conditions) (Schuss, 2013): tpt(x) = div( pt(x)f(x, t) + σ(x, t) xpt) (pt(x)f(x, t) σ(x, t) xpt) n(x) = 0 when x Ω, n is normal, t > 0 (42) As such, this induces a reverse process (Williams, 1988; Cattiaux, 1988) that one can easily check has the same marginal probability distributions: dxt = [f(xt, t) σ(xt, t) x log pt(xt)] dt + G(xt, t)d Bt + v(xt)d Lt (43) Here v is a vector field that satisfies the condition v n = 1 and v + v is a positive multiple of n. A.2. Constrained Denoising Score Matching Proposition A.1. Suppose that we perturb an Ω-supported density a(x) with noise b(x| ) (also supported on Ω) to get a new density b(x) := R Ωa(y)b(x|y)dy. Then, under suitable regularity conditions for the smoothness of a and b, the score matching loss for b: 1 2EΩ x b sθ(x) x log b(x) 2 (44) is equal (up to a constant factor that does not depend on s) to the CSDM loss: 1 2EΩ x0 a EΩ x b( |x0) sθ(x) x log b(x|x0) 2 (45) Proof. 
This proof comes down to showing that

E^Ω_{x∼b} ⟨s_θ(x), ∇_x log b(x)⟩ = E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ⟨s_θ(x), ∇_x log b(x|x_0)⟩    (46)

which can be done directly:

E^Ω_{x∼b} ⟨s_θ(x), ∇_x log b(x)⟩
= ∫_Ω ⟨s_θ(x), ∇_x log b(x)⟩ b(x) dx    (47)
= ∫_Ω ⟨s_θ(x), ∇_x b(x)/b(x)⟩ b(x) dx    (48)
= ∫_Ω ⟨s_θ(x), ∇_x b(x)⟩ dx    (49)
= ∫_Ω ⟨s_θ(x), ∇_x ∫_Ω a(y) b(x|y) dy⟩ dx    (50)
= ∫_Ω ⟨s_θ(x), ∫_Ω a(y) ∇_x b(x|y) dy⟩ dx    (51)
= ∫_Ω ⟨s_θ(x), ∫_Ω a(y) b(x|y) ∇_x log b(x|y) dy⟩ dx    (52)
= ∫_Ω ∫_Ω a(y) b(x|y) ⟨s_θ(x), ∇_x log b(x|y)⟩ dy dx    (53)
= E^Ω_{x_0∼a} E^Ω_{x∼b(·|x_0)} ⟨s_θ(x), ∇_x log b(x|x_0)⟩    (54)

This proof is exactly the same as the one presented in Vincent (2011). The only difference is that we replace the domain of integration with Ω. Note that the key property that allows us to complete the proof is the convolution identity, which generalizes, unlike Stokes' theorem for implicit score matching.

A.3. Probability Flow ODE and Connections to DDIM

We now derive the probability flow ODE, show how to use it to sample, and discuss connections with DDIM. For convenience, we will work with the assumptions given in the paper (that the diffusion coefficient is a scalar depending only on time and that reflection is in the normal direction), but our results directly generalize (given sufficient regularity conditions) to general noise schedules and oblique reflections.

Proposition A.2 (Probability Flow ODE). For the reflected SDE

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (55)

the ODE given by

dx_t = (f(x_t, t) − (g(t)²/2) ∇_x log p_t(x_t)) dt    (56)

follows the same probability evolution p_t.

Proof. By the forward Kolmogorov equation, we can see that the ODE follows

∂_t p_t(x) = div(−p_t(x) f(x, t) + (g(t)²/2) ∇_x p_t)    (57)

However, we must confirm that the ODE does not exit Ω. By the Neumann boundary conditions for the SDE, we see that

⟨f(x, t) − (g(t)²/2) ∇_x log p_t(x), n(x)⟩ = 0    (58)

on the boundary, so the flow induced by the ODE is indeed a valid diffeomorphism from Ω → Ω.

Similar to DDIM, we can derive equivalent processes by annealing the noise.

Proposition A.3 (Annealing Noise Level). For the reflected SDE

dx_t = f(x_t, t) dt + g(t) dB_t + dL_t    (59)

the reflected SDE

dx_t = (f(x_t, t) − ((g(t)² − ĝ(t)²)/2) ∇_x log p_t(x_t)) dt + ĝ(t) dB_t + dL_t    (60)

follows the same probability evolution p_t for all noise levels ĝ > 0.

Proof. This follows directly from our Fokker–Planck equation.

Remark A.4. In the above proposition, ĝ > 0 as there is no concept of a reflected ordinary differential equation. However, when the noise is 0, our limiting process yields an ODE.

To sample with our score function s_θ, we simply solve the reversed process, which is

dx_t = (f(x_t, t) − (g(t)²/2) s_θ(x_t, t)) dt    (61)

for our ODE and

dx_t = (f(x_t, t) − ((g(t)² + ĝ(t)²)/2) s_θ(x_t, t)) dt + ĝ(t) dB̄_t + dL̄_t    (62)

for our annealed reflected SDE. When s_θ(x_t, t) is trained with our CDSM objective, the ODE sampler (Equation 61) mimics standard reflected diffusion sampling, which includes thresholding. Conversely, when the score is trained with standard score matching, the sampler simply removes thresholding, causing the process to simulate the diffusion path without thresholding. To mimic the thresholding effect, one must instead turn to the annealed reflected SDE sampler of Equation 62. If we discretize the equation and (with an abuse of notation) set ĝ = 0, then we recover the thresholded DDIM sampler. Unfortunately, changing ĝ necessarily causes the sampled distribution to shift, since s_θ(x_t, t) is not trained to mimic the correct ∇_x log p_t(x), so the reverse process necessarily results in divergent behavior.
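For the drift-free RVE SDE, Equation 61 can be handed to a black-box ODE solver such as the RK45 scheme mentioned in Appendix C.3. The sketch below assumes a numpy-compatible score_fn and illustrative tolerances; it is not the released sampler.

```python
import numpy as np
from scipy.integrate import solve_ivp

def ode_sample(score_fn, x1, sigma0=0.01, sigma1=5.0):
    """Integrate the probability flow ODE (Eq. 61) for the drift-free RVE SDE,
    dx/dt = -(g(t)^2 / 2) * s_theta(x, t), from t = 1 down to t = 0,
    starting from a uniform sample x1 on [0, 1]^d."""
    shape = x1.shape
    log_ratio = np.log(sigma1 / sigma0)

    def rhs(t, x_flat):
        x = x_flat.reshape(shape)
        g2 = 2 * log_ratio * (sigma0 ** (1 - t) * sigma1 ** t) ** 2
        return (-0.5 * g2 * score_fn(x, t)).ravel()

    sol = solve_ivp(rhs, (1.0, 0.0), x1.ravel(), method="RK45", rtol=1e-5, atol=1e-5)
    return sol.y[:, -1].reshape(shape)
```

No explicit reflection appears here: as in Proposition A.2, the boundary behavior is carried entirely by the learned score's Neumann condition.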
A.4. Girsanov Theorem for Reflected SDEs and Likelihood Evaluation

We derive our likelihood bounds. We first recall Girsanov's theorem for SDEs (Øksendal, 1987).

Theorem A.5 (Girsanov Theorem). Let Φ be a bounded functional on the space of continuous functions C([0, T]). For the SDE evolving on [0, T] with

dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t    (63)

we have

E[Φ(X)] = E[Φ(X⁰) exp(∫_0^T ⟨σ(s, X⁰_s)^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ(t, X⁰_t)^{−1} µ(t, X⁰_t)‖² dt)]    (64)

where X⁰ is the corresponding driftless process dX⁰_t = σ(t, X⁰_t) dB_t and the expectation on the right-hand side is taken over its path measure.

We then prove the analogue of this for reflected SDEs:

Theorem A.6 (Girsanov Theorem for Reflected SDEs). Let Φ be a bounded functional on the space of continuous functions C([0, T]). For the reflected SDE evolving on Ω in space and [0, T] in time with

dX_t = µ(t, X_t) dt + σ(t, X_t) dB_t + dL_t    (65)

where L_t is assumed to have normal reflection, we have

E[Φ(X)] = E[Φ(X⁰) exp(∫_0^T ⟨σ(s, X⁰_s)^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ(t, X⁰_t)^{−1} µ(t, X⁰_t)‖² dt)]    (66)

where X⁰ is the corresponding driftless reflected process and the expectation is taken over its path measure.

Proof. We first smoothly extend µ and σ to all of R^d such that their values decay to 0 quickly outside of Ω. We then consider the processes X^i, defined for i ∈ R_+ by

dX^i_t = µ_i(t, X^i_t) dt + σ(t, X^i_t) dB_t    (67)

where µ_i(t, x) = µ(t, x) + i d(x, Ω) v(x), d is the distance function, and v(x) is the unit vector pointing from x to y, where y := argmin_{z∈Ω} d(x, z). It is well known that X^i_t → X_t in measure as i → ∞ (Liu, 1993). Since Φ is a bounded (and thus continuous) functional, we thus have EΦ(X^i) → EΦ(X) as i → ∞. We finalize by noting that, by Theorem A.5,

EΦ(X^i) = E[Φ(X^{i,0}) exp(∫_0^T ⟨σ^{−1} µ_i(s, X^{i,0}_s), dB_s⟩ − (1/2) ∫_0^T ‖σ^{−1} µ_i(t, X^{i,0}_t)‖² dt)]    (68)

As i → ∞, X^i remains in Ω with probability 1 and µ_i = µ on Ω. Therefore, we have the desired convergence to

E[Φ(X⁰) exp(∫_0^T ⟨σ^{−1} µ(s, X⁰_s), dB_s⟩ − (1/2) ∫_0^T ‖σ^{−1} µ(t, X⁰_t)‖² dt)]    (69)

as desired.

Corollary A.7. As a corollary, when Φ is log(dµ/dν), this gives us Theorem 7.1.

Remark A.8. It is possible that our theorem can be generalized to obliquely reflected SDEs, although we did not pursue this line of inquiry.

Remark A.9. Theorem 7.1 recovers the denoising score matching loss if we slice an initial δ_x distribution. In particular, this is the continuous-time diffusion loss L_T that is used to form the ELBO for standard diffusion models (Kingma et al., 2021; Ho et al., 2020).

A.5. Thresholding

On [−1, 1]^D, the dynamic thresholding operator is defined by

dynthresh_p(x) = (proj_{[−1,1]}(x_i / max(s, 1)))_i,   s is the p-th percentile of |x_i|    (70)

which can of course be rescaled to [0, 1]^D (for our setup).

Proposition A.10 (Static Thresholding Solves the Reflected SDE). On domains Ω between times [0, T], the discretization

x_{t+Δt} = proj(x_t + f(x_t, t) Δt) + g(t) B_Δt    (71)

solves the reflected SDE

dX_t = f(X_t, t) dt + g(t) dB_t + n dL_t    (72)

as Δt → 0, when f and g are uniformly Lipschitz in time and space and satisfy the linear growth condition for any Lipschitz extension of f to the whole space R^d.

Proof. This closely mirrors the proof showing that the standard projection scheme

y_{t+Δt} = proj(y_t + f(y_t, t) Δt + g(t) B_Δt)    (73)

converges to the solution of the reflected SDE as Δt → 0 (Skorokhod, 1961; Schuss, 2013; Liu, 1993). The key difference is that the process is not supported on Ω, since the noise is added after each projection. However, since our extension of f is Lipschitz, this error is well behaved and disappears as Δt → 0.

Corollary A.11 (Dynamic Thresholding Solves the Reflected SDE). With the same conditions as given above, on [−1, 1]^D, the discretization

x_{t+Δt} = dynthresh_p(x_t + f(x_t, t) Δt) + g(t) B_Δt    (74)

converges to the solution of the reflected SDE when f(x_t, t) does not point outside of [−1, 1]^D on more than (1 − p)D dimensions.

Proof. Under our conditions, dynthresh_p becomes the projection operator since the p-th percentile of |x_i| will always be at most 1. This replicates the above proposition.

Remark A.12. In practice, we found that learned score networks f satisfy the pointing-inside condition above. In particular, as Δt → 0, dynthresh_p tends to behave exactly like proj for all p < 1.
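For completeness, here is a minimal sketch of the dynamic thresholding operator of Equation 70 for a single sample; the percentile value is an illustrative choice rather than one prescribed by the paper.

```python
import numpy as np

def dynamic_threshold(x, p=0.995):
    """Dynamic thresholding on [-1, 1]^D (Eq. 70): rescale one sample by the
    p-th percentile of its absolute entries (only if that percentile exceeds 1),
    then clip to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    s = np.quantile(np.abs(x), p)   # per-sample percentile of |x_i|
    s = max(s, 1.0)                 # only shrink, never amplify
    return np.clip(x / s, -1.0, 1.0)
```

When the percentile stays below 1, the division is a no-op and the operator reduces to plain clipping, which is exactly the behavior used in Corollary A.11.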
This replicates the above proposition. Remark A.12. In practice, we found that learned score networks f satisfy the pointing inside condition above. In particular, as t 0, dynthreshp tends to behave exactly like proj for all p < 1. B. Practical Implementation B.1. Exact Equations for Reflected Brownian Transition Probabilities For [0, 1], the reflected transition probability for a source x with diffusion value σ (which correspond to the mean and standard deviation for the standard normal distribution) is given by p R(x,σ2)(y) = X z:y+z Z p N(x,σ)(z) = 1 + 2 k=1 e k2π2σ2/2 cos(kπx) cos(kπy) (75) Note that this means that the eigenfunctions of [0, 1] under our Neumann boundary condition are 1 and cos(kπx), with eigenvalues of 0 and πk2 B.2. Mapping Ωto [0, 1]d Our domain Ωhas an interior which maps bijectively to (0, 1)d iff Ωis simply connected. Note that this encompasses a wide variety of domains, notably convex sets. To construct a map f : [0, 1]d t, we use a variant of the common stick breaking procedure: (f(x))i = xi j=i+1 (1 xi) (76) which admits an inverse (f 1(y))i = yi 1 Pd j=i+1 yi (77) B.3. Denoising The Final Proability Distribution We note that Tweedies formula (Efron, 2011) does not hold for general bounded domains Ω. We show this for [0, 1]: given an initial distribution X, a perturbed distribution Y constructed by y R(x, σ2), where x X has a Tweedie denoiser: E[x|y] = Z 1 0 xp(x|y)dx (78) 0 xp(y|x)p X(x) p Y (y) dx (79) 0 xp R(x,σ2)(y)p X(x) p Y (y) dy (80) dy log p Y (y) (81) The reason why this works in the standard case is because the score of the Gaussian distribution is y x σ , which allows us to extract out the desired value. For reflected Gaussians, this does not hold. Reflected Diffusion Models Instead, for our experimental results on CIFAR-10, we denoise by training a denoising autoencoder (Vincent et al., 2008) trained on our reflected Gaussian noise. This follows the same architecture as our score network, and is trained to predict the noise (so that subtracting it recovers the initial sample). In general, this is required to get a decent FID score, but makes little difference in terms of perceptual quality as our images are accurate to within a 2.5 standard deviation noise to begin with. C. Experimental Setup C.1. Image Generation We exactly follow Song et al. (2021b) for both models and training hyperparameters. The only differences are that we set σ1 = 5 instead of 50 (for the VE SDE) since we mix well with σ1 = 5 while VE SDE needs a much larger σ1 to mix. Furthermore, we use the deep DDPM++ architecture, but we rescale the output 1 σ as is done for the NCSN++ architecture (for VE SDE). For sampling, we sample with 1000 predictor (Reflected Euler-Maruyama) steps with 1000 corrector (Reflected Langevin) steps (Song et al., 2021b). We use a signal-to-noise ratio of 0.03. C.2. Image Likelihood We almost exactly follow Kingma et al. (2021) for both models and training hyperparameters, replacing the standard diffusion with our reflected diffusion. We do not train with the noise schedule, instead setting σ0 = 10 4 and σ1 = 5, which causes the reconstruction and prior losses to be (numerically) 0. C.3. Guided Diffusion We exactly follow Ho & Salimans (2022) and train using the ADM architecture (Dhariwal & Nichol, 2021) with the same parameters (for standard diffusion). For reflected diffusion, we train with σ0 = 0.01 and σ1 = 5, following our CIFAR-10 experiments. Furthermore, we scale the output by 1/σ as the neural network outputs the noise vector and not the score. 
For the classifier-free guidance baseline, we retrain an ImageNet64 model following Ho & Salimans (2022). For the classifier guidance baseline, we use the pretrained models from Dhariwal & Nichol (2021). We sample using 1000 steps each. For our diffusion model, we use reflected Euler–Maruyama. For the standard model, we use standard Euler–Maruyama with thresholding after each step. For ODE sampling, we sample using an RK45 solver (Dormand & Prince, 1980).

C.4. Simplex Diffusion

We consider class probabilities output by the Inception-v3 ImageNet classifier (Szegedy et al., 2015). In particular, if {X_i} is a set of images and f is a classifier that outputs a 1000-dimensional vector of class probabilities, then we learn a distribution over {L_i := f(X_i)}. For learning purposes, we clip this to a value in the 999-dimensional projected simplex and apply the transformation given in Appendix B.2.

Our model is a simple MLP autoencoder with 4 intermediate layers of width 512. We use the Swish activation (Ramachandran et al., 2017) and apply LayerNorm (Ba et al., 2016). We train with Adam (Kingma & Ba, 2014) at a 2 × 10^{−4} learning rate. We apply an exponential moving average with a rate of 0.9999 before evaluating/generating our data. We visualize our full training and evaluation curves below, as well as some samples taken from our model. Overall, we seem to be able to match the distribution reasonably well.

Figure 7. Training dynamics for Simplex Diffusion.

To ensure that we are able to generate data, we generated 10,000 samples and compare the generated histograms of the (most likely) classes. Results are shown below:

Figure 8. Generated simplex probabilities for Simplex Diffusion. (a) Ground truth. (b) Generated.

D. Additional Generated Images

Figure 9. CIFAR-10 generated images.

Figure 10. ImageNet64 w = 1 classifier-free guided samples.

Figure 11. ImageNet64 w = 2.5 classifier-free guided samples.

Figure 12. CIFAR-10 generated images (trained for BPD).

Figure 13. ImageNet32 generated images (trained for BPD).

Figure 14. w = 1 baseline classifier-guided images without thresholding. We sample from the pretrained model from Dhariwal & Nichol (2021). Around 75% of samples diverge, while most of the rest have noticeable artifacts such as a glaring white background.

Figure 15. w = 1 baseline classifier-guided images without thresholding and with DDIM sampling. Interestingly, these samples don't diverge.

Figure 16. w = 1 baseline classifier-free guided images sampled with DDIM without thresholding. This corresponds to ODE sampling.

Figure 17. Dynamically thresholded images; matches Figure 4.

Figure 18. w = 0.5 ODE samples, Reflected Diffusion.

Figure 19. ODE number of forward evaluations (NFE) vs. guidance weight for Reflected Diffusion. Increasing the guidance weight tends to increase the number of forward evaluations, but this is still relatively low.