Published as a conference paper at ICLR 2023

LEARNING DIFFUSION BRIDGES ON CONSTRAINED DOMAINS

Xingchao Liu, Lemeng Wu, Mao Ye, Qiang Liu
Department of Computer Science, University of Texas at Austin
{xcliu, lmwu, my21, lqiang}@cs.utexas.edu

Diffusion models have recently achieved promising results in generative learning. However, because diffusion processes are most naturally defined on the unconstrained Euclidean space R^d, key challenges arise in developing diffusion-based models for data on constrained and structured domains. We present a simple and unified framework that addresses this problem and can be easily adapted to various types of domains, including product spaces of any type (bounded or unbounded, continuous or discrete, categorical or ordinal, or their mix). In our model, the diffusion process is driven by a drift force that is a sum of two terms: a singular force designed via Doob's h-transform, which ensures that every outcome of the process belongs to the desired domain, and a non-singular neural force field, which is trained so that the outcome follows the data distribution statistically. Experiments show that our methods perform strongly on generating tabular data, images, semantic segmentation maps, and 3D point clouds. Code is available at https://github.com/gnobitab/ConstrainedDiffusionBridge.

1 INTRODUCTION

Diffusion-based deep generative models, notably score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019, 2020), denoising diffusion probabilistic models (DDPM) (Ho et al., 2020), and their variants (e.g., Song et al., 2020b,a; Kong & Ping, 2021; Song et al., 2021; Nichol & Dhariwal, 2021), have achieved new state-of-the-art results for image synthesis (Dhariwal & Nichol, 2021; Ramesh et al., 2022; Ho et al., 2022; Liu et al., 2021), audio synthesis (Chen et al., 2020; Kong et al., 2020), point cloud synthesis (Luo & Hu, 2021a,b; Zhou et al., 2021), and many other AI tasks. These methods train a deep neural network that serves as the drift force of a data-generating diffusion process, and have been shown to outperform competitors, mainly GANs and VAEs, in stability and sample diversity (Xiao et al., 2021; Ho et al., 2020; Song et al., 2020b).

However, due to the continuous nature of diffusion processes, the standard approaches are restricted to generating unconstrained continuous data in R^d. For generating data constrained to special structured domains, such as discrete or bounded data or mixes thereof, special techniques, e.g., dequantization (Uria et al., 2013; Ho et al., 2019) and multinomial diffusion (Hoogeboom et al., 2021; Austin et al., 2021), need to be developed case by case, and the results still tend to be unsatisfying despite promising recent advances (Hoogeboom et al., 2021; Austin et al., 2021).

Figure 1: An Ω-bridge on the discrete domain Ω = {1, 2, 3, 4}.

This work proposes a simple and unified framework for learning diffusion models on general constrained domains Ω embedded in the Euclidean space R^d. The idea is to learn a continuous R^d-valued diffusion process Z_t on a time interval t ∈ [0, T], with a carefully designed force field, such that the final state Z_T is guaranteed to 1) fall into the desired domain Ω, and 2) follow the data distribution asymptotically.
We achieve both steps by leveraging a key tool in stochastic calculus called Doob's h-transform (Doob, 1984), which provides a formula for deriving diffusion processes whose final states are guaranteed to fall into a specific set or equal a specific value.

Algorithm 1  Learning Diffusion Models on Constrained Domains (a Simple Example)

Input: A dataset D := {x^(i)} drawn from a distribution Π* on a domain Ω = {e_1, e_2, ..., e_K}.
Goal: Learn a diffusion model that terminates at time T to generate samples from Π*.
Learning: Solve the optimization below with stochastic gradient descent (or other optimizers):

  θ̂ = argmin_θ ∫_0^T E_{x∼D}[ ‖f_θ(Z_t, t) − ∇_{Z_t} log ω_Ω(x | Z_t, t)‖² ] dt,   (1)

  where Z_t = (1 − t/T) x_0 + (t/T) x + √(t(T − t)/T) ξ  and

  ω_Ω(x | z, t) = exp( −‖z − x‖² / (2(T − t)) ) / Σ_{e∈Ω} exp( −‖z − e‖² / (2(T − t)) ),   (2)

  with x drawn from the dataset D, ξ ∼ N(0, I), and x_0 any initial point.
Sampling: Generate a sample Z_T from

  dZ_t = [ f_θ̂(Z_t, t) + ∇_{Z_t} log Σ_{e∈Ω} exp( −‖Z_t − e‖² / (2(T − t)) ) ] dt + dW_t,   Z_0 = x_0.

Remark: When the domain Ω is a manifold (e.g., a line or surface) in R^d, simply replace the sum Σ_{e∈Ω} with the corresponding line or surface (in general, Hausdorff) integral ∫_Ω.

Our simple procedure can be applied to any domain Ω once a properly defined summation (for discrete sets) or integration (for continuous domains) can be evaluated. To give a quick overview of the practical intuition without invoking the mathematical theory, we show in Algorithm 1 a simple instance of the framework when the domain is a discrete set Ω = {e_1, ..., e_K}. The idea is to set up the diffusion model in the form

  dZ_t = [ f_θ(Z_t, t) + ∇_{Z_t} ψ_Ω(Z_t, t) ] dt + dW_t,   ψ_Ω(z, t) := log Σ_{e∈Ω} exp( −‖z − e‖² / (2(T − t)) ),   (3)

where the drift is a sum of a non-singular (e.g., bounded) term f_θ(z, t), a trainable neural force field with parameter θ, and a singular term ∇_z ψ_Ω(z, t), which drives Z_t towards the set Ω by gradient ascent on ψ_Ω(z, t). Here ψ_Ω(z, t) measures the closeness of z to the set Ω, as the log-likelihood of a Gaussian mixture model (GMM) centered on the elements of Ω with variance T − t. When t approaches the terminal time T, the variance T − t of the GMM goes to zero and the magnitude of ∇_z ψ_Ω(z, t) grows to infinity, hence ensuring that Z_T must belong to Ω. In particular, note that

  ∇_z ψ_Ω(z, t) = Σ_{e∈Ω} ω_Ω(e | z, t) (e − z)/(T − t),   ω_Ω(e | z, t) = exp( −‖z − e‖² / (2(T − t)) ) / exp( ψ_Ω(z, t) ),

which increases at an O(1/(T − t)) rate as t → T; here ω_Ω(e | z, t) is the softmax probability measuring the relative closeness of z to the elements e of Ω (see also Eq (2)).

As we show in Section 2.3, once f_θ is non-singular in the sense of the mild condition ∫_0^T E[‖f_θ(Z_t, t)‖²] dt < +∞, the diffusion model in (3) is guaranteed to yield a final state Z_T that belongs to Ω, and hence provides a flexible model family on Ω. Moreover, as shown in Eq (1) of Algorithm 1, the neural field f_θ can simply be trained to approximate ∇_z log ω_Ω(e | z, t), with e plugged in as the data point that we expect to reach when starting from z at time t. Intuitively, the fitted f_θ increases the relative probability of the observed data points and hence allows us to fit the data distribution.

Empirically, diffusion models learned through Ω-bridges achieve favorable results in generating mixed discrete/continuous tabular data, point clouds on grids, categorical semantic segmentation maps, and discrete CIFAR10 images.

Outline  The rest of the paper is organized as follows. Section 2 introduces the h-transform, which allows us to derive bridge processes that are guaranteed to enter specific sets at the terminal time, and Section 2.3 specifies the parametric diffusion models for Ω-bridges.
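To make the singular drift in (3) concrete, the snippet below is a minimal NumPy sketch (ours, not the released implementation) of ψ_Ω, the softmax weights ω_Ω, and the gradient-ascent force ∇_z ψ_Ω for a small discrete set; the toy set and function names are our own.

```python
import numpy as np

def psi_omega(z, omega, t, T=1.0):
    """Log-likelihood of a GMM centered on the points of Omega with variance T - t."""
    d2 = np.sum((omega - z) ** 2, axis=-1)           # squared distances ||z - e||^2
    return np.log(np.sum(np.exp(-d2 / (2 * (T - t)))))

def omega_weights(z, omega, t, T=1.0):
    """Softmax weights w(e | z, t): relative closeness of z to each element e of Omega."""
    d2 = np.sum((omega - z) ** 2, axis=-1)
    logits = -d2 / (2 * (T - t))
    logits -= logits.max()                           # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def singular_drift(z, omega, t, T=1.0):
    """Gradient-ascent force grad_z psi = sum_e w(e|z,t) (e - z) / (T - t)."""
    w = omega_weights(z, omega, t, T)
    return (w[:, None] * (omega - z)).sum(axis=0) / (T - t)

# Toy example: Omega = {1, 2, 3, 4} embedded in R^1 (as in Figure 1).
omega = np.array([[1.0], [2.0], [3.0], [4.0]])
z = np.array([2.6])
for t in [0.5, 0.9, 0.99]:
    print(t, singular_drift(z, omega, t))            # magnitude grows like O(1/(T - t))
```

The printed values illustrate the O(1/(T − t)) blow-up of the singular force as t → T discussed above.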
Then, with the learnable diffusion models, Section 3 introduces the general learning framework along with the loss function.

2 BACKGROUND: DIFFUSION PROCESSES AND h-TRANSFORM

A diffusion process Z = {Z_t : t ∈ [0, T]} on R^d follows a stochastic differential equation of the form

  Q : dZ_t = b(Z_t, t) dt + σ(Z_t, t) dW_t,   (4)

where W_t is a Wiener process, σ : R^d × [0, T] → R is a positive diffusion coefficient, and b : R^d × [0, T] → R^d is a drift function. We use Q (or P) to denote the path measure of the stochastic process Z, which is a probability measure on the space of continuous paths. Let Q_t be the marginal distribution of Z_t at time t under Q.

Our framework relies heavily on bridge processes: special stochastic processes that are guaranteed to achieve a deterministic value, or to fall into a given set, at the terminal time T. For a set Ω ⊆ R^d, a process Z in R^d with law Q is called an Ω-bridge if Q(Z_T ∈ Ω) = 1.

One natural approach to constructing bridge processes is to derive the conditioned process of a general unconstrained process given that the desired bridge constraint happens. Specifically, assume that Q is the law of a general unconstrained diffusion process of the form (4), and denote by Q^Ω(·) = Q(· | Z_T ∈ Ω) the conditioned distribution given that the event Z_T ∈ Ω happens. Then Q^Ω is an Ω-bridge by definition. Importantly, a remarkable result of Doob (1984), now known as the h-transform, shows that Q^Ω is the law of a diffusion process with a properly modified drift term. Below, we introduce these results, first for the case of the x-bridge where Ω = {x} is a single point, and then for more general sets Ω. For simplicity, we only state the formulas from the h-transform that are useful for us, without proofs. See, e.g., Oksendal (2013); Rogers & Williams (2000) for more background on the h-transform.

2.1 x-BRIDGES

Let us first consider the x-bridge Q^x(·) := Q(· | Z_T = x), the process Q pinned at a deterministic terminal point Z_T = x. By the h-transform (see, e.g., Oksendal (2013)), the conditioned process Q^x(·) := Q(· | Z_T = x), if it exists, can be shown to be the law of

  dZ_t = [ b(Z_t, t) + σ²(Z_t, t) ∇_z log q_{T|t}(x | Z_t) ] dt + σ(Z_t, t) dW_t,   (5)

where q_{T|t}(x | z) is the density function of the transition probability Q_{T|t}(dx | z) := Q(Z_T ∈ dx | Z_t = z), with dx denoting an infinitesimal volume centered at x. Compared with the diffusion process (4) of Q, the main difference is that the conditioned process has an additional drift force σ²(z, t) ∇_z log q_{T|t}(x | z), which plays the role of steering Z_t towards the target Z_T = x; this is a singular force whose magnitude increases to infinity as t → T, because q_{T|t}(· | z) becomes a delta measure centered at z when t = T. In addition, by Bayes' rule, the distribution of the initial state Z_0 should be given by

  Z_0 ∼ Q_{0|T}(· | x),   Q_{0|T}(dz | x) ∝ Q_0(dz) q_{T|0}(x | z).   (6)

Example 2.1. If Q is the law of dZ_t = σ_t dW_t, we have Q_{T|t}(· | z) = N(z, β_T − β_t), where β_t = ∫_0^t σ_s² ds. Hence, following the formula in (5), Q^x := Q(· | Z_T = x) is the law of

  dZ_t = σ_t² (x − Z_t)/(β_T − β_t) dt + σ_t dW_t,   (7)

with Z_0 ∼ Q_{0|T}, Q_{0|T}(dz) ∝ Q_0(dz) φ(x | z, β_T − β_0), where φ(· | µ, σ²) is the density function of N(µ, σ²). The process in (7) is known as a (time-scaled) Brownian bridge. Note that the drift in (7) grows to infinity in magnitude at a rate of O(1/(β_T − β_t)) as t → T, which ensures that Z_T = x with probability one.
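As a sanity check on Example 2.1, the following sketch (ours, not from the paper) simulates the time-scaled Brownian bridge (7) with an Euler-Maruyama scheme and constant σ_t = 1 (so β_t = t); up to discretization error, the trajectory is pinned at Z_T = x regardless of its starting point.

```python
import numpy as np

def simulate_x_bridge(x, z0, T=1.0, K=1000, sigma=1.0, seed=0):
    """Euler-Maruyama simulation of dZ_t = sigma^2 (x - Z_t)/(beta_T - beta_t) dt + sigma dW_t,
    with constant sigma, so beta_t = sigma^2 * t."""
    rng = np.random.default_rng(seed)
    eps = T / K
    z = np.array(z0, dtype=float)
    for k in range(K):
        t = k * eps
        drift = sigma**2 * (x - z) / (sigma**2 * (T - t))   # beta_T - beta_t = sigma^2 (T - t)
        z = z + eps * drift + np.sqrt(eps) * sigma * rng.standard_normal(z.shape)
    return z   # approximately equal to x

x = np.array([2.0, -1.0])
print(simulate_x_bridge(x, z0=[0.0, 0.0]))   # close to [2, -1], up to O(sqrt(eps)) error
```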
Arbitrary initialization  To make (5) the conditioned process of (4), the initial distribution must follow the Bayes rule in (6). However, thanks to the singular force ∇_z log q_{T|t}(x | z), the process (5) guarantees Z_T = x from an arbitrary initialization, as long as the process is well defined. When the initialization differs from (6), the process in (5) is no longer the conditioned process of (4), but it remains an x-bridge in that Z_T = x is still guaranteed. To see why this is the case, assume that Q is initialized from a deterministic point Z_0 = x_0. Then we would still have Z_0 = x_0 after conditioning on the terminal event, by Bayes' rule. This shows that (5) started from any deterministic initialization is the conditioned process of Q with the same deterministic initialization, and is hence an x-bridge. As a result, (5) started from any stochastic initialization is also an x-bridge, because it can be viewed as a mixture of processes with different deterministic initializations, all of which are x-bridges. See Appendix A.4 for a detailed analysis, where it is shown that (5) with an arbitrary initialization can be viewed as the conditioned process of a special class of non-Markov processes called reciprocal processes.

2.2 Ω-BRIDGES

More generally, for the law Q of (4) and a set Ω ⊆ R^d, the Ω-bridge Q^Ω := Q(· | Z_T ∈ Ω) follows

  Q^Ω : dZ_t = η_Ω(Z_t, t) dt + σ(Z_t, t) dW_t,   (8)
  with η_Ω(z, t) = b(z, t) + σ²(z, t) E_{x∼Q_{T|t,z,Ω}}[ ∇_z log q_{T|t}(x | z) ],   Z_0 ∼ Q_{0|T}(· | Z_T ∈ Ω),

where the drift force η_Ω is similar to that of the x-bridge in (5), except that the final state x is now drawn randomly from an Ω-truncated (or Ω-conditioned) transition probability:

  Q_{T|t,z,Ω}(dx | z) := Q(Z_T ∈ dx | Z_t = z, Z_T ∈ Ω),

which is the transition probability from Z_t to Z_T, conditioned on Z_T ∈ Ω. In practice, its form can be derived using Bayes' rule.

Example 2.2. Assume Q follows dZ_t = σ_t dW_t. Then Q^Ω yields the following Ω-bridge:

  dZ_t = η_Ω(Z_t, t) dt + σ_t dW_t,   η_Ω(z, t) = σ_t² E_{x∼N_Ω(z, β_T−β_t)}[ (x − z)/(β_T − β_t) ],   (9)

where N_Ω(µ, σ²) = Law(Z | Z ∈ Ω) with Z ∼ N(µ, σ²), i.e., the Ω-truncated Gaussian distribution, whose density function is φ_Ω(x) ∝ I(x ∈ Ω) φ(x | µ, σ²), with φ(x | µ, σ²) the density function of N(µ, σ²). Note that η_Ω is tractable once we can evaluate the expectation of N_Ω(z, β_T − β_t). A general case is Ω = I_1 × ··· × I_d, for which the expectation reduces to one-dimensional Gaussian integrals. See Appendix A.6 and A.7 for details and examples of η_Ω. As in the x-bridge case, we can set the initialization to be any distribution supported on the set of points that can reach Ω under Q (precisely, points z_0 that satisfy Ω ∩ supp(Q_T(· | Z_0 = z_0)) ≠ ∅), using the mixture-of-initializations argument.

2.3 A PARAMETRIC FAMILY OF Ω-BRIDGES

The formula in (8) only provides a single fixed process for a given Q. For the purpose of learning generative models, however, we need a rich family of Ω-bridges within which we can search for the best one to fit the data distribution. It turns out we can achieve this by simply adding an extra non-singular drift force, which can be a trainable neural network, on top of the Ω-bridge in (8). Specifically, we construct the following parametric diffusion model P^θ:

  P^θ : dZ_t = ( σ(Z_t, t) f_θ(Z_t, t) + η_Ω(Z_t, t) ) dt + σ(Z_t, t) dW_t,   Z_0 ∼ P_0^θ,   (10)

where f_θ(z, t) is a neural network with input (z, t) and parameter θ, which will be trained on the empirical observations. Adding the neural drift term σ(Z_t, t) f_θ(Z_t, t) does not break the Ω-bridge property, provided it satisfies a very mild regularity condition:

Proposition 2.3. For any Q^Ω following dZ_t = η_Ω(Z_t, t) dt + σ(Z_t, t) dW_t that is an Ω-bridge, the P^θ in (10) is also an Ω-bridge if E_{Z∼Q^Ω}[ ∫_0^T ‖f_θ(Z_t, t)‖²_2 dt ] < +∞ and KL(Q_0^Ω || P_0^θ) < +∞.

The condition on f_θ is very mild, and it is satisfied whenever f_θ is bounded. It can also hold when f_θ is unbounded: for example, if ‖f_θ(x)‖ ≤ a‖x‖^β + b, which holds for ReLU networks with β = 1, we only need the underlying process to have a bounded moment E_{Z∼Q^Ω}[ ∫_0^T ‖Z_t‖^{2β} dt ] < +∞, which is a typical regularity condition.

3 LEARNING Ω-BRIDGE MODELS

Let {x^(i)}_{i=1}^n be an i.i.d. sample from an unknown distribution Π* on a domain Ω ⊆ R^d. Our goal is to learn the parameter θ of the Ω-bridge model P^θ in (10) such that the terminal distribution Z_T ∼ P_T^θ matches that of the data X ∼ Π*. We should distinguish P^θ, the trainable generative model, from Q, a fixed baseline process that helps us derive methods for constructing and learning the model. Q can be the simple Brownian motion of Examples 2.1 and 2.2.

As with other diffusion models, P^θ can be viewed as a model with an infinite-dimensional latent variable, namely the intermediate trajectory of Z. Hence, a canonical learning approach is expectation maximization (EM), which alternates between 1) the E-step: estimating the posterior P^{θ,x} := P^θ(Z | Z_T = x) of the latent trajectory Z given the observation Z_T = x; and 2) the M-step: estimating the parameter θ with Z imputed from P^{θ,x}. A key challenge, however, is that the posterior distribution P^{θ,x} is difficult to compute due to the presence of the neural force field in P^θ (the h-transform formula then has no closed form), and it needs to be updated iteratively as θ changes.

Following DDPM (Ho et al., 2020), we consider a simpler approach that replaces the posterior P^{θ,x} with an arbitrary x-bridge, denoted by Q^x. This yields a simplified EM algorithm without the expensive posterior inference in the E-step. A natural choice is the conditioned process Q^x := Q(Z | Z_T = x), but the method works for any x-bridge. Specifically, let Q^{Π*}(·) = ∫ Q^x(·) Π*(dx) be the mixture of the x-bridges whose endpoint x is drawn randomly from the data distribution, x ∼ Π*. Trajectories from Q^{Π*} can be generated in the following backward way: first draw a data point x ∼ Π*, and then draw Z ∼ Q^x conditioned on the endpoint x. By construction, the terminal distribution of Q^{Π*} equals Π*, that is, Q_T^{Π*} = Π*. The model P^θ can then be estimated by fitting the data drawn from Q^{Π*} using maximum likelihood estimation:

  min_θ { L(θ) := KL(Q^{Π*} || P^θ) }.   (11)

The classical (variational) EM would alternately update θ (M-step) and Q^x (E-step) to make Q^x ≈ P^{θ,x}. Why is it OK to simply drop the E-step? At a high level, this is the benefit of using universal approximators like deep neural networks: if the model space of P^θ is sufficiently rich, then by minimizing the KL divergence in (11), P^θ can approximate the given Q^{Π*} well enough (in a way made precise in Appendix A) that their terminal distributions are close: P_T^θ ≈ Q_T^{Π*} = Π*. Learning latent variable models requires no E-step if the model space is sufficiently rich. Note that in this case the latent variable Z of the learned model P^θ is dictated by the choice of the imputation distribution Q, since P^{θ,x} = Q^x when the KL divergence in (11) is fully minimized to zero; EM also achieves P^{θ,x} = Q^x, but with the imputation distribution Q^x determined by the model P^θ rather than the other way around.
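The "backward" construction of Q^{Π*} above is easy to implement when Q^x is the Brownian bridge of Example 2.1: the marginal of Z_t given Z_0 = x_0 and Z_T = x is Gaussian, so training pairs (Z_t, t) can be drawn in closed form without simulating a whole trajectory. The sketch below (ours, assuming σ_t = 1 so that β_t = t, and a toy dataset) does exactly that.

```python
import numpy as np

def sample_bridge_marginal(x, x0, t, T=1.0, rng=None):
    """Draw Z_t from the Brownian bridge pinned at Z_0 = x0 and Z_T = x (unit diffusion):
    Z_t ~ N((1 - t/T) x0 + (t/T) x, t (T - t) / T)."""
    rng = rng or np.random.default_rng()
    mean = (1.0 - t / T) * x0 + (t / T) * x
    std = np.sqrt(t * (T - t) / T)
    return mean + std * rng.standard_normal(x.shape)

# Training triples (x, t, Z_t): draw a data point, a time, then the bridge marginal.
rng = np.random.default_rng(0)
data = rng.integers(0, 4, size=(32, 2)).astype(float)   # toy dataset on a discrete grid
x0 = np.zeros(2)
x = data[rng.integers(len(data))]
t = rng.uniform(0.0, 0.99)          # avoid t = T, where the regression target blows up
z_t = sample_bridge_marginal(x, x0, t, rng=rng)
```

The exclusion of times very close to T mirrors the variance-explosion caveat discussed in Appendix A.5.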
Loss Function  In its general form, the x-bridge Q^x that we use can be a non-Markov diffusion process

  Q^x : dZ_t = η^x(Z, t) dt + σ(Z_t, t) dW_t,   Z_0 ∼ µ^x,   (12)

which has the same diffusion coefficient σ(Z_t, t) as P^θ in (10), and any x-dependent drift η^x and initialization µ^x for which the x-bridge condition holds. In the general framework, we allow η^x to depend on the whole trajectory Z.

Algorithm 2  Learning Ω-Bridge Diffusion Models

Input: A dataset D := {x^(i)} drawn from a distribution Π* on a domain Ω.
Setup: Specify an x-bridge Q^x and an Ω-bridge Q^Ω:
  Q^x : dZ_t = η^x(Z, t) dt + σ(Z_t, t) dW_t,   Q^Ω : dZ_t = η_Ω(Z_t, t) dt + σ(Z_t, t) dW_t.
Specify the generative model P^θ based on Q^Ω and a neural network f_θ:
  P^θ : dZ_t = ( σ(Z_t, t) f_θ(Z_t, t) + η_Ω(Z_t, t) ) dt + σ(Z_t, t) dW_t,   Z_0 ∼ P_0^θ.
Default: let Q be the law of dZ_t = σ_t dW_t and derive the bridges by the h-transform as Q^x = Q(· | Z_T = x) in Eq (7) and Q^Ω = Q(· | Z_T ∈ Ω) in Eq (9).
Training: Estimate θ by minimizing the loss function (13) using any off-the-shelf optimizer.
Sampling: Generate a sample Z_T from P^θ with the trained parameter θ.

Using the Girsanov theorem (e.g., Oksendal, 2013), with P^θ in (10) and Q^x in (12), the KL divergence in (11) can be shown to equal

  L(θ) = E_{x∼Π*, Z∼Q^x}[ −log p_0^θ(Z_0)  (MLE of initial dist.)  +  (1/2) ∫_0^T ‖σ^{-1}(Z_t, t)( s_θ(Z_t, t) − η^x(Z, t) )‖² dt  (score matching) ] + const,   (13)

where s_θ denotes the overall drift force of P^θ in (10), that is, s_θ(z, t) = σ(z, t) f_θ(z, t) + η_Ω(z, t), and p_0^θ is the probability density function (PDF) of the initial distribution P_0^θ. Therefore, L(θ) is the sum of the negative log-likelihood of the initial distribution, which encourages P_0^θ ≈ Q_0^{Π*}, and a least-squares loss between s_θ and η^x. In practice, we simply fix the initial distribution P_0^θ to be a delta measure at a fixed point (say x_0 = 0), so that we only need to train the drift function f_θ. Algorithm 1 shows a simple instance of the framework obtained when the baseline process Q is the standard Brownian motion dZ_t = dW_t, Q^x = Q(· | Z_T = x), and σ(z, t) = 1. Note that the least-squares term in (13) can be viewed as enforcing f_θ ≈ σ^{-1}(η^x − η_Ω), which reduces to f_θ ≈ ∇ log ω_Ω in the case of Algorithm 1.

Related Works  Bridge processes provide a simple and flexible approach to learning diffusion generative models, which was explored in Peluchetti (2021); Ye et al. (2022); Wu et al. (2022); De Bortoli et al. (2021). Heng et al. (2021) investigates the orthogonal problem of simulating from the bridge Q^x for a given Q. In comparison, our method learns diffusion models on any domain Ω for which an Ω-bridge can be derived (using the h-transform or any other method), and hence provides a highly flexible framework for learning with structured data (including discrete, continuous, and mixed types). This distinguishes it from existing approaches designed for special types of data (e.g., Ho et al. (2020); Hoogeboom et al. (2021); Austin et al. (2021); Li et al. (2022); Dieleman et al. (2022) for discrete data). De Bortoli et al. (2022) discusses how to learn score-based generative models on general Riemannian manifolds. Another closely related work is Ye et al. (2022), which proposes first hitting diffusion models for generating data on both discrete sets and spheres. The advantage of our approach is that it is simpler and easier to derive for more complex types of domains.
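Before turning to the experiments, it may help to make the training step of Algorithm 2 concrete. With the default choices (unit σ and the Brownian x-bridge), the least-squares term of (13) is a regression of f_θ onto ∇_z log ω_Ω(x | Z_t, t), exactly as in Algorithm 1. The PyTorch sketch below is our illustrative reading of one such step; the network architecture, hyperparameters, and toy domain are placeholders, not those of the released code.

```python
import torch

T = 1.0
Omega = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # toy discrete domain

f_theta = torch.nn.Sequential(   # placeholder drift network taking (z, t)
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

def regression_target(z, x, t):
    """grad_z log w(x | z, t) = (x - z)/(T - t) - sum_e w(e|z,t) (e - z)/(T - t)."""
    s = (T - t).unsqueeze(-1)                                     # variance T - t, shape (B, 1)
    logits = -((z.unsqueeze(1) - Omega) ** 2).sum(-1) / (2 * s)   # (B, K)
    w = torch.softmax(logits, dim=-1)                             # softmax weights w(e | z, t)
    grad_psi = (w.unsqueeze(-1) * (Omega - z.unsqueeze(1))).sum(1) / s
    return (x - z) / s - grad_psi

def train_step(x_batch, x0=0.0):
    t = torch.rand(len(x_batch)) * 0.99 * T                       # avoid t = T (target blows up)
    xi = torch.randn_like(x_batch)
    # Brownian-bridge marginal: Z_t | Z_0=x0, Z_T=x ~ N((1 - t/T) x0 + (t/T) x, t(T - t)/T)
    z = (1 - t / T).unsqueeze(-1) * x0 + (t / T).unsqueeze(-1) * x_batch \
        + torch.sqrt(t * (T - t) / T).unsqueeze(-1) * xi
    pred = f_theta(torch.cat([z, t.unsqueeze(-1)], dim=-1))
    loss = ((pred - regression_target(z, x_batch, t)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x_batch = Omega[torch.randint(0, 4, (128,))]                      # toy "data" on the domain
print(train_step(x_batch))
```

In the released code the drift network, noise schedule σ_t, and time discretization of course differ; the point is only that the regression target is available in closed form.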
4 EXPERIMENTS

We evaluate our algorithms on generating mixed-type tabular data, grid-valued point clouds, categorical semantic segmentation maps, and discrete CIFAR10 images. We observe that the Ω-bridge provides a particularly attractive and strong approach to generating data from various constrained domains.

Algorithm Overview  For all experiments, we use Algorithm 2 with the default choice of Q^x in (7) and Q^Ω in (9). The specific form of η_Ω is derived from the specific choice of the domain Ω. By default, we set the initialization to Z_0 = 0 and use the Adam optimizer.

Table 1: Classification accuracy on the Adult Income dataset for different classifiers trained on data synthesized by generative models. "Real Training Data" shows the upper bound of the metrics.

  Method | Logistic (↑) | AdaBoost (↑) | MLP (↑)
  Real Training Data | 0.877 ± 0.021 | 0.912 ± 0.013 | 0.897 ± 0.012
  TVAE (Xu et al., 2019) | 0.825 ± 0.012 | 0.876 ± 0.005 | 0.845 ± 0.008
  CTGAN (Xu et al., 2019) | 0.649 ± 0.014 | 0.841 ± 0.021 | 0.843 ± 0.016
  CopulaGAN (Patki et al., 2016) | 0.683 ± 0.015 | 0.859 ± 0.004 | 0.853 ± 0.009
  Mixed-Bridge | 0.868 ± 0.010 | 0.884 ± 0.005 | 0.877 ± 0.006

4.1 GENERATING MIXED-TYPE TABULAR DATA

Learning to generate tabular data is challenging because tabular data usually contains a mixture of discrete and continuous attributes (Xu et al., 2019; Park et al., 2018). Unlike previous works that carefully design special GANs (Xu et al., 2019; Srivastava et al., 2017), the Ω-bridge can be applied seamlessly to mixed-type tabular data generation without any further modification. In contrast, diffusion processes that work solely on discrete domains (Austin et al., 2021; Hoogeboom et al., 2021) cannot be applied to this task.

In this experiment, we use the Adult Income dataset (Kohavi, 1996), which contains 30,162 training samples. The data points are described by a series of attributes, both continuous (age, capital-gain, etc.) and discrete (sex, race, etc.). We compare with conditional tabular GAN (CTGAN) (Xu et al., 2019), CopulaGAN (Patki et al., 2016), and Table VAE (TVAE) (Xu et al., 2019), which are state-of-the-art GAN- and VAE-based generative models for mixed-type tabular data. Following previous works (Xu et al., 2019; Patki et al., 2016), we measure the real-data accuracy of a logistic regression, an AdaBoost classifier, and an MLP classifier trained on the generated data.

In the Ω-bridge model, we set σ_t = 3 exp(−3t) and let f_θ be a 3-layer MLP. In this case, Ω = I_1 × ··· × I_15, where I_1 to I_9 are discrete domains and I_10 to I_15 are non-negative continuous domains. For a discrete domain I = {e_1, ..., e_d}, we have η_I(z, t) = σ_t² ∇_z log Σ_{e∈I} exp( −‖z − e‖² / (2(β_T − β_t)) ); for a non-negative continuous domain I = [0, +∞), a derivation shows η_I(z, t) = σ_t² ∇_z log F( z / √(β_T − β_t) ), where F is the standard Gaussian CDF. Finally, η_Ω(z, t) = Σ_{i=1}^{15} η_{I_i}(z_i, t) for the whole domain Ω. We set the number of diffusion steps to K = 2000. Results are shown in Table 1.

Result  All three classifiers yield the highest accuracy when trained on data generated by our method, referred to as Mixed-Bridge in this case. This indicates that the data generated by Mixed-Bridge is closer to the real distribution than that of the baseline methods.
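A sketch of the two per-coordinate drifts used above may be useful. The code below (ours, using SciPy, with the remaining variance β_T − β_t passed in directly as s) evaluates η_I for a discrete coordinate and for a non-negative coordinate.

```python
import numpy as np
from scipy.stats import norm

def eta_discrete(z, points, sigma_t2, s):
    """eta_I(z,t) = sigma_t^2 * d/dz log sum_e exp(-(z - e)^2 / (2 s)),  s = beta_T - beta_t."""
    logits = -(points - z) ** 2 / (2 * s)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                    # softmax closeness weights
    return sigma_t2 * np.sum(w * (points - z)) / s  # gradient of the log-sum-exp

def eta_nonnegative(z, sigma_t2, s):
    """eta_I(z,t) = sigma_t^2 * d/dz log F(z / sqrt(s)), with F the standard Gaussian CDF."""
    u = z / np.sqrt(s)
    return sigma_t2 * norm.pdf(u) / (norm.cdf(u) * np.sqrt(s))

# One discrete attribute with categories embedded at {0, 1, 2} and one non-negative attribute.
s = 0.05                                            # remaining variance beta_T - beta_t
print(eta_discrete(z=0.4, points=np.array([0.0, 1.0, 2.0]), sigma_t2=1.0, s=s))
print(eta_nonnegative(z=-0.3, sigma_t2=1.0, s=s))   # strong positive push back toward [0, inf)
```

The full drift for the mixed domain is then just the concatenation of these per-coordinate terms, as in η_Ω(z, t) = Σ_i η_{I_i}(z_i, t).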
4.2 GENERATING INTEGER-VALUED POINT CLOUDS

A feature of point clouds of 3D objects in graphics is that their points tend to be distributed evenly, especially when they are discretized from a mesh. This aspect is ignored by most existing works on point cloud generation; as a result, they tend to generate non-uniform points that are unsuitable for real applications, which often involve converting back to meshes with procedures like Ball-Pivoting (Bernardini et al., 1999). We apply our method to generate point clouds constrained to an integer grid, which we show yields much more uniformly distributed points. To the best of our knowledge, this is the first work on integer-valued 3D point cloud generation.

A point cloud is a set of points {x_i}_{i=1}^m, x_i ∈ R³, in 3D space, where m is the number of points. We apply two variants of our method: R^d-Bridge and Grid-Bridge. R^d-Bridge generates points in the continuous 3D space, i.e., Ω_R = R^{3m}; Grid-Bridge generates points on integer grids, Ω_Grid = {1, ..., 128}^{3m}. We fix the diffusion coefficient σ_t = 1 and set the number of diffusion steps K to 1000. We test our method on ShapeNet (Chang et al., 2015) chair models and compare with Point Cloud Diffusion (PCD) (Luo & Hu, 2021a), a state-of-the-art continuous diffusion-based generative model for point clouds. The neural network f_θ in our methods is the same as that of PCD for a fair comparison. Qualitative and quantitative results are shown in Figure 2 and Table 2. Following common practice (Luo & Hu, 2021a,b), we measure minimum matching distance (MMD), coverage score (COV), and 1-NN accuracy (1-NNA) using the Chamfer Distance (CD) against the test dataset.

Figure 2 & Table 2: Point clouds (upper row) generated by different methods and the meshes reconstructed from them (lower row). Grid-Bridge obtains more uniform points, and hence better meshes, thanks to the integer constraints. Numbers in the table are multiplied by 10³.

  Method | MMD | COV | 1-NNA
  PCD (Luo & Hu, 2021a) | 13.37 | 46.60 | 58.94
  Rd-Bridge | 13.30 | 46.52 | 59.32
  Grid-Bridge | 12.85 | 47.78 | 56.25

Result  Both R^d-Bridge and Grid-Bridge obtain better MMD, COV, and 1-NNA than PCD. Moreover, by constraining the domain of interest to the integer grid, Grid-Bridge yields even better performance than R^d-Bridge. In Figure 2, since the point clouds generated by Grid-Bridge are limited to integer grids, the meshes reconstructed by Ball-Pivoting clearly have higher quality than those of R^d-Bridge and PCD.

4.3 GENERATING SEMANTIC SEGMENTATION MAPS ON CITYSCAPES

We consider unconditionally generating categorical semantic segmentation maps. We represent each pixel as a one-hot categorical vector, so the data domain is Ω = {e_1, ..., e_c}^{h×w}, where c is the number of classes, e_i is the i-th c-dimensional one-hot vector, and h and w are the height and width of the image. In CityScapes (Cordts et al., 2016), h = 32, w = 64, and c = 8.
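One convenient consequence of the one-hot representation, noted here as our own observation rather than a claim from the paper: for one-hot targets e_1, ..., e_c the squared distances ‖z − e_i‖² differ only through −2 z_i, so the closeness weights ω_Ω(e_i | z, t) reduce to a temperature-scaled softmax over the coordinates of z. A small NumPy check:

```python
import numpy as np

def omega_weights_general(z, s):
    """w(e_i | z, t) computed from the distances to the one-hot vectors e_i, variance s."""
    c = len(z)
    E = np.eye(c)                                     # rows are the one-hot vectors e_i
    d2 = np.sum((z - E) ** 2, axis=1)
    w = np.exp(-d2 / (2 * s))
    return w / w.sum()

def omega_weights_softmax(z, s):
    """Equivalent form: ||z - e_i||^2 = ||z||^2 + 1 - 2 z_i, so w = softmax(z / s)."""
    logits = z / s
    w = np.exp(logits - logits.max())
    return w / w.sum()

z = np.array([0.2, -0.4, 0.9, 0.1])                   # a relaxed "pixel" with c = 4 classes
s = 0.1                                               # remaining variance beta_T - beta_t
print(np.allclose(omega_weights_general(z, s), omega_weights_softmax(z, s)))   # True
```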
In this experiment, we test different schedules of the diffusion coefficient σ_t: (Constant Noise) σ_t = 1; (Noise Decay A) σ_t = a exp(−bt); (Noise Decay B) σ_t = a(1 − t); (Noise Decay C) σ_t = a(1 − exp(−b(1 − t))). Here a and b are hyper-parameters. The number of diffusion steps K is set to 500. We measure the negative log-likelihood (NLL) of the test set under the learned models. The NLL (in bits per dimension) is estimated with the evidence lower bound (ELBO) and the importance weighted bound (IWBO) (Burda et al., 2016), respectively, as in Hoogeboom et al. (2021). We compare the Ω-bridge with a state-of-the-art categorical diffusion algorithm, Argmax Flow (and Multinomial Diffusion) (Hoogeboom et al., 2021), and with the traditional methods of uniform dequantization (Uria et al., 2013) and variational dequantization (Ho et al., 2019). The numbers for the baselines are adopted directly from Hoogeboom et al. (2021), and the experiment configuration is kept the same for a fair comparison. The neural network f_θ is the same as in Hoogeboom et al. (2021). The results are shown in Figure 3 and Table 3. Our Ω-bridge is named Categorical-Bridge (Cat.-Bridge) in this experiment.

Figure 3 (panels: Real, Generated, Value Distribution): Results on generating categorical segmentation maps. Each pixel is a one-hot vector. Each dimension of the Ω-bridge starts from a deterministic initialization and evolves along a stochastic trajectory that converges to either 0 or 1. The generated samples have visual quality similar to the training data.

Table 3: Results on the CityScapes dataset. "Cat." refers to "Categorical".

  Method | ELBO (↓) | IWBO (↓)
  Uniform Dequantization (Uria et al., 2013) | 1.010 | 0.930
  Variational Dequantization (Ho et al., 2019) | 0.334 | 0.315
  Argmax Flow (Softplus thres.) (Hoogeboom et al., 2021) | 0.303 | 0.290
  Argmax Flow (Gumbel distr.) (Hoogeboom et al., 2021) | 0.365 | 0.341
  Argmax Flow (Gumbel thres.) (Hoogeboom et al., 2021) | 0.307 | 0.287
  Multinomial Diffusion (Hoogeboom et al., 2021) | 0.305 | -
  Cat.-Bridge (Constant Noise) | 0.844 | 0.707
  Cat.-Bridge (Noise Decay A) | 0.276 | 0.232
  Cat.-Bridge (Noise Decay B) | 0.301 | 0.285
  Cat.-Bridge (Noise Decay C) | 0.363 | 0.302

Result  All four variants of Cat.-Bridge successfully generate categorical semantic segmentation maps, and different noise schedules lead to different empirical performance. Among the four variants, Cat.-Bridge with Noise Decay A yields the best ELBO and IWBO, surpassing all the other algorithms in the comparison.

4.4 GENERATING DISCRETE CIFAR10 IMAGES

Figure 4 (rows: Initialization A, B, C; horizontal axis: time t): Integer-Bridges can generate high-quality discrete samples with different initial distributions.

In this experiment, we apply three types of bridges, all with the same output domain Ω = {0, ..., 255}^{h×w×c}, where h, w, c are the height, width, and number of channels of the images. We set σ_t = 3 exp(−3t) and consider different initial distributions: (Init. A) Z_0 = 128; (Init. B) Z_0 = µ̂_0; (Init. C) Z_0 ∼ N(µ̂_0, σ̂_0), where µ̂_0 and σ̂_0 are the empirical mean and variance of the pixels in the CIFAR10 training set. The number of diffusion steps K is set to 1000. We compare with the variants of a state-of-the-art discrete diffusion model, D3PM (Austin et al., 2021). For a fair comparison, we use the DDPM backbone (Ho et al., 2020) as the neural drift f_θ in our method, similar to D3PM. We report the Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), and the negative log-likelihood (NLL) on the test dataset. We call our method Integer-Bridge in this case. The results are shown in Table 4 and Figure 4.

Table 4: Discrete CIFAR10 image generation.

  Method | IS (↑) | FID (↓) | NLL (↓)
  D3PM uniform L_vb (Austin et al., 2021) | 5.99 | 51.27 | 5.08
  D3PM absorbing L_vb (Austin et al., 2021) | 6.26 | 41.28 | 4.83
  D3PM Gauss L_vb (Austin et al., 2021) | 7.75 | 15.30 | 3.966
  D3PM Gauss L_{λ=0.001} (Austin et al., 2021) | 8.54 | 8.34 | 3.975
  D3PM Gauss + logistic L_{λ=0.001} | 8.56 | 7.34 | 3.435
  Integer-Bridge (Init. A) | 8.77 | 6.77 | 3.46
  Integer-Bridge (Init. B) | 8.68 | 6.91 | 3.35
  Integer-Bridge (Init. C) | 8.72 | 6.94 | 3.40

Result  In Table 4, Integer-Bridge with Initializations A, B, and C all obtain lower FIDs (< 7) than the variants of D3PM. Among the three Integer-Bridges, Integer-Bridge (Init. B) obtains the lowest NLL (3.35); it also beats D3PM Gauss + logistic (3.435), which has the best NLL among the D3PM variants.
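To round off the experimental section, the sketch below illustrates how sampling from a trained model could look on a discrete per-coordinate domain such as {0, ..., 255}: an Euler-Maruyama loop over the drift f_θ + η_Ω, as in the sampling step of Algorithm 2. It is our own illustration with a dummy drift network and unit σ, not the released sampler.

```python
import numpy as np

def eta_grid(z, s, grid):
    """Per-coordinate Omega-bridge drift for a discrete grid, unit sigma (beta_T - beta_t = s).
    eta(z) = sum_e w(e|z) (e - z) / s, computed independently for each coordinate of z."""
    d2 = (z[:, None] - grid[None, :]) ** 2             # (d, G) squared distances
    logits = -d2 / (2 * s)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w * (grid[None, :] - z[:, None])).sum(axis=1) / s

def sample(f_theta, d=4, T=1.0, K=1000, grid=np.arange(256.0), seed=0):
    rng = np.random.default_rng(seed)
    z = np.full(d, 128.0)                              # Init. A style: start from the mid-point
    eps = T / K
    for k in range(K):
        t = k * eps
        drift = f_theta(z, t) + eta_grid(z, T - t, grid)
        z = z + eps * drift + np.sqrt(eps) * rng.standard_normal(d)
    return np.round(z)                                 # the discretized endpoint is snapped to the grid

dummy_f = lambda z, t: np.zeros_like(z)                # stands in for the trained network
print(sample(dummy_f))
```

With a trained f_θ in place of the dummy drift, the singular term η_Ω guarantees that the continuous-time endpoint lies in the grid; the final rounding only absorbs the O(√ε) discretization error discussed in Appendix A.5.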
5 CONCLUSION AND LIMITATIONS We present a framework for learning diffusion generative models on constrained data domains. It leaves a number of directions for further explorations and improvement. For example, the practical impact of the choices of the bridges Q, in terms of initialization, dynamics, and noise schedule, are still not well understood and need more systematical studies. Besides, our current method is limited to Ωthat are factorizable and integrable. Moreover, application of Ω-bridge to many other practical fields also needs investigation in the future. Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS This research is supported by NSF CAREER1846421, Sen SE2037267, EAGER-2041327, Office of Navy Research, and NSF AI Institute for Foundations of Machine Learning (IFML). Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981 17993, 2021. Fausto Bernardini, Joshua Mittleman, Holly Rushmeier, Cl audio Silva, and Gabriel Taubin. The ball-pivoting algorithm for surface reconstruction. IEEE transactions on visualization and computer graphics, 5(4):349 359, 1999. Yuri Burda, Roger B Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR (Poster), 2016. Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. ar Xiv preprint ar Xiv:1512.03012, 2015. Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2020. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schr odinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modeling. ar Xiv preprint ar Xiv:2202.02763, 2022. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021. Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. ar Xiv preprint ar Xiv:2211.15089, 2022. Joseph L Doob. Classical potential theory and its probabilistic counterpart, volume 549. Springer, 1984. Jeremy Heng, Valentin De Bortoli, Arnaud Doucet, and James Thornton. Simulating diffusion bridges with score matching. ar Xiv preprint ar Xiv:2111.07243, 2021. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. 
Flow++: Improving flowbased generative models with variational dequantization and architecture design. In International Conference on Machine Learning, pp. 2722 2730. PMLR, 2019. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Published as a conference paper at ICLR 2023 Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1 33, 2022. Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forr e, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34, 2021. Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd, volume 96, pp. 202 207, 1996. Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2020. Antoine Lejay. The girsanov theorem without (so much) stochastic analysis. In S eminaire de Probabilit es XLIX, pp. 329 361. Springer, 2018. Christian L eonard, Sylvie Rœlly, and Jean-Claude Zambrini. Reciprocal processes. a measuretheoretical point of view. Probability Surveys, 11:237 269, 2014. Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. ar Xiv preprint ar Xiv:2205.14217, 2022. Xingchao Liu, Xin Tong, and Qiang Liu. Sampling with trusthworthy constraints: A variational gradient framework. Advances in Neural Information Processing Systems, 34:23557 23568, 2021. Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2837 2845, 2021a. Shitong Luo and Wei Hu. Score-based point cloud denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4583 4592, 2021b. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162 8171. PMLR, 2021. Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013. Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. Proceedings of the VLDB Endowment, 11(10), 2018. N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 399 410, Oct 2016. doi: 10. 1109/DSAA.2016.49. Stefano Peluchetti. Non-denoising forward-time diffusions. 2021. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. L Chris G Rogers and David Williams. Diffusions, Markov processes and martingales: Volume 2, Itˆo calculus, volume 2. Cambridge university press, 2000. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 
Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020a. Published as a conference paper at ICLR 2023 Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438 12448, 2020. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020b. Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of scorebased diffusion models. Advances in Neural Information Processing Systems, 34, 2021. Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017. Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2013. Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000. Lemeng Wu, Chengyue Gong, Xingchao Liu, Mao Ye, and Qiang Liu. Diffusion-based molecule generation with informative prior bridges. ar Xiv preprint ar Xiv:2209.00865, 2022. Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. ar Xiv preprint ar Xiv:2112.07804, 2021. Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Modeling tabular data using conditional gan. Advances in Neural Information Processing Systems, 32, 2019. Mao Ye, Lemeng Wu, and Qiang Liu. First hitting diffusion models. ar Xiv preprint ar Xiv:2209.01170, 2022. Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5826 5835, 2021. Published as a conference paper at ICLR 2023 Roadmap The appendix is structured as follows: Appendix A provides the theoretical analysis and derivation of diffusion bridges. In particular, Appendix A.1 shows the derivation of the main training loss; Appendix A.2 derives the drift term ηΠ of QΠ ; Appendix A.3 proves that we can actually use a Markov model Pθ to match all time-marginals with QΠ ; Appendix A.4 explains why we can use arbitrary initialization when constructing bridge processes and discusses the reciprocal structure of QΠ ; Appendix A.5 provides analysis on the time-discretization error and statistical error of the practical discretized algorithm. Appendix A.6 and A.7 presents details on the condition and examples of Ω-bridge construction. Appendix B shows additional experiment details and results. A THEORETICAL ANALYSIS ON BRIDGES A.1 DERIVATION OF THE MAIN LOSS IN EQUATION (13) [Proof of Equation (13)] Denote by Qx = Q( |ZT = x). Note that KL(QΠ || Pθ) = Ex Π ,Z Qx log d QΠ = Ex Π ,Z Qx log d Qx d Pθ (Z) + log d QΠ = Ex Π KL(Qx || Pθ) + const, where const denotes a constant that is independent of θ. 
Recall that Qx follows d Zt = ηx(Z[0,t], t)dt + σ(Zt, t)d Wt, and Pθ follows d Zt = sθ(Zt, t)dt + σ(Zt, t)d Wt. By Girsanov theorem (e.g., Lejay, 2018), KL(Qx || Pθ) = KL(Qx 0 || Pθ 0) + 1 sθ(Zt, t) ηx(Z[0,t], t) 2 log pθ 0(Z0) + 1 sθ(Zt, t) ηx(Z[0,t], t) 2 L(θ) = Ex Π ,Z Qx log pθ 0(Z0) + 1 sθ(Zt, t) ηx(Z[0,t], t) 2 log pθ 0(Z0) + 1 sθ(Zt, t) ηZT (Z[0,t], t) 2 A.2 DERIVATION OF THE DRIFT ηΠ OF QΠ Lemma A.1. Let Qx is the law of d Zx t = ηx(Zx [0,t], t)dt + σ(Zx t , t)d Wt, Z0 Qx 0, and QΠ := R Qx(Z)Π (dx) for a distribution Π on Rd. Then QΠ is the law of d Zt = ηΠ (Z[0,t], t)dt + σ(Zt, t)d Wt, Z0 QΠ 0 , ηΠ (z[0,t], t) = Ex Π ,Z Qx[ηx(Z[0,t], t) | Z[0,t] = z[0,t]], QΠ 0 (dz0) = Ex Π [Qx 0(dz0)]. [Proof] QΠ is the solution of the following optimization problem: QΠ = arg min P n KL(QΠ || P) = Ex Π [KL(Qx || P)] + const o . Published as a conference paper at ICLR 2023 By Girsanov s Theorem (e.g., Lejay, 2018), any stochastic process P that has KL(Qx || P) < + (and hence is equivalent to Qx) has a form of d Zt = ηΠ (Z[0,t], t)dt + σ(Zt, t)d Wt for some measurable function ηΠ , and Ex Π [KL(Qx || P)] = Ex Π [KL(Qx 0 || P0)] + Ex Π ,Z Qx σ(Zt, t) 1(ηΠ (Z[0,t], t) ηx(Z[0,t], 0)) 2 It is clear that to achieve the minimum, we need to take P0( ) = Ex Π [Qx 0( )] and ηΠ (z[0,t], t) = Ex Π ,Z Qx[ηx(Z[0,t], t) | Z[0,t] = z[0,t]], which yields the desirable form of QΠ . A.3 Pθ YIELDS A MARKOVIZATION OF QΠ As Pθ is Markov by the model assumption, it can not perfectly fit QΠ which is non-Markov in general. This is a substantial problem because QΠ can be non-Markov even if Qx is Markov for all x Ω(see Section A.4). In fact, using Doob s h-transform method (Doob, 1984), QΠ can be shown to be the law of a diffusion process d Zt = ηΠ (Z[0,t], t)dt + σ(Zt, t)d Wt, ηΠ (z[0,t], t) = EZ QΠ ηZT (z[0,t], t) | Z[0,t] = z[0,t] , where ηΠ is the expectation of ηx when x = ZT is drawn from Q conditioned on Z[0,t]. We resolve this by observing that it is not necessary to match the whole path measure (Pθ QΠ ) to match the terminal (Pθ T QΠ T = Π ). It is enough for Pθ to be the best Markov approximation (a.k.a. Markovization) of QΠ , which matches all (hence terminal) fixed-time marginals with QΠ : Proj(QΠ , M) := arg min P M KL(QΠ || P), M = the set of all Markov processes on [0, T]. Proposition A.2. The global optimum of L(θ) in (11) and (13) is achieved by θ if sθ (z, t) = EZ QΠ ηZT (Z[0,t], t) | Zt = z , µθ (dz0) = QΠ 0 = Ex Π [Qx 0(dz0)] . (14) In this case, Pθ = Proj(QΠ , M) is the Markovization of QΠ , with which it matches all timemarginals: Pθ t = QΠ t for all time t [0, T]. In addition, KL(Π || Pθ T ) KL(Pθ || Pθ) = KL(QΠ || Pθ) KL(QΠ || Pθ ) = L(θ) L(θ ). (15) Note that sθ is a conditional expectation of ηΠ : sθ (z, t) = EZ QΠ [ηΠ (Z[0,t], t) | Zt = z]. Theorem 1 of Peluchetti (2021) gives a related result that the marginals of mixtures of Markov diffusion processes can be matched by another Markov diffusion process, but does not discuss the issue of Markovization nor connect to KL divergence. Theorem 1 of Song et al. (2021) is the special case of (15) when QΠ is Markov. [Proof of Proposition A.2] It is the combined result of Lemma A.3 and Lemma A.4 below. Lemma A.3. Let Q be a non-Markov diffusion process on [0, T] of form Q : d Zt = η(Z[0,t], t)dt + σ(Zt, t)d Wt, Z0 Q0, and M = arg min P M KL(Q || P) be the Markovization of Q, where M is the set of all Markov processes on [0, T]. Then Q is the law of M : d Zt = m(Zt, t)dt + σ(Zt, t)d Wt, Z0 Q0, where m(z, t) = EZ Q[η(Z[0,t], t) | Zt = z]. 
In addition, we have Qt = Mt for all time t [0, T]. Published as a conference paper at ICLR 2023 [Proof] By Girsanov s Theorem (e.g., Lejay, 2018), any process that has KL(Q || M) < + (and hence is equivalent to Q) has a form of d Zt = m(Z[0,t], t)dt + σ(Zt, t)d Wt, where m is a measurable function. Since M is Markov, we have m(Z[0,t], t) = m(Zt, t). Then KL(Q || P) = KL(Q0 || P0) + EZ Q σ(Zt, t) 1(η(Z[0,t], t) m(Zt, 0)) 2 2 It is clear that to achieve the minimum, we need to take M0 = Q0 and m(z, t) = EZ Q[η(Z[0,t], t) | Zt = z]. To prove Qt = Mt, note that by the chain rule of KL divergence: KL(Q || P) = KL(Qt || Pt) + EZt Qt[KL(Q( |Zt) || P( |Zt))], t [0, T]. As the second term P( |Zt) is independent of the choice of the marginal Pt at time t [0, T], the optimum should be achieved by M only if Mt = Qt. Lemma A.4. Let Q : d Zt = η(Z[0,t], t)dt + σ(Zt, t)d Wt, Z0 Q0 M : d Zt = m(Zt, t)dt + σ(Zt, t)d Wt, Z0 Q0, Pθ : d Zt = sθ(Zt, t)dt + σ(Zt, t)d Wt, Z0 Pθ 0, where M is the Markovization of Q (see Lemma A.3). Then KL(Q || Pθ) = KL(Q || M) + KL(M || Pθ). Hence, assume there exists θ such that Pθ = M and write L(θ) := KL(Q || Pθ). We have KL(QT || Pθ T ) = KL(MT || Pθ T ) KL(M || Pθ) = L(θ) L(θ ). [Proof] Note that KL(M || Pθ) = KL(M0 || Pθ 0) + 1 σ(Zt, t) 1(sθ(Zt, t) m(Zt, t)) 2 = KL(M0 || Pθ 0) + 1 0 EZt Mt h σ(Zt, t) 1(sθ(Zt, t) m(Zt, t)) 2 = KL(Q0 || Pθ 0) + 1 0 EZt Qt h σ(Zt, t) 1(sθ(Zt, t) m(Zt, t)) 2 i dt //Qt = Mt t = KL(Q0 || Pθ 0) + EZ Q σ(Zt, t) 1(sθ(Zt, t) m(Zt, t)) 2 = KL(Q0 || Pθ 0) + 1 where we define f 2 Q,σ = EZ Q h 1 2 R T 0 σ(Zt, t) 1f(Zt, t) 2 2 dt i . On the other hand, KL(Q || Pθ) = KL(Q0 || Pθ 0) + EZ Q σ(Zt, t) 1(sθ(Zt, 0)) η(Z[0,t], t) 2 = KL(Q0 || Pθ 0) + 1 KL(Q || M) = EZ Q σ(Zt, t) 1(η(Z[0,t], t) m(Zt, 0)) 2 2 dt 2 η m 2 Q,σ . Published as a conference paper at ICLR 2023 Using Lemma A.5 with a(z) = σ(z, t) 1sθ(z, t), and b(z[0,t]) = σ(z, t) 1η(z[0,t], t), we have the following bias-variance decomposition: η sθ 2 Q,σ = sθ m 2 Q,σ + η m 2 Q,σ . Hence, KL(Q || Pθ) = KL(M || Pθ) + KL(Q || M). Finally, KL(MT || Pθ T ) KL(M || Pθ) is the direct result of the following factorization of KL divergence: KL(M || Pθ) = KL(MT || Pθ T ) + Ex MT KL(MT ( |ZT = x) || Pθ T ( |ZT = x)) . Lemma A.5. Let (X, Y ) be a random variable and a(x), b(x, y) are square integral functions. Let m(x) = E[b(X, Y ) | X = x]. We have E[ a(X) b(X, Y ) 2 2] = E[ a(X) m(X) 2 2] + E[ b(X, Y ) m(X) 2 2]. E[ a(X) b(X, Y ) 2 2] = E[ a(X) m(X) + m(X) b(X, Y ) 2 2] = E[ a(X) m(X) 2 2] + E[ m(X) b(X, Y ) 2 2] + 2 , = E[(a(X) m(X)) (m(X) b(X, Y ))]] = E[(a(X) m(X)) E[(m(X) b(X, Y ))|X]] = E[(a(X) m(X)) (m(X) m(X))] = 0. A.4 MARKOV AND RECIPROCAL PROPERTIES OF QΠ Mixture of Bridges and Initialization It is an immediate observation that the mixtures of a set of bridges are also bridges: let Qz,A be a set of A-bridges indexed by a variable z, then QA := R Qz,Aµ(dz) is an x-bridge for any distribution µ on z. A special case is to take the mixture of the conditional bridges in (5) starting from different deterministic initialization, which shows that we can obtain a valid x-bridge by equipping the same drift in (5) with essentially any initialization. Hence, the choices of the drift force and initialization in Qx can be completely decouple. Proposition A.6. Let Q is a path measure and Ωx is the set of z for which Qz0,x( ) := Q( |ZT = x, Z0 = z0) exists. Then Qx := R Qz0,xµ(dz0 | x) is an x-bridge, for any distribution µ on Ω Ω. [Proof of Proposition A.6] This is an obvious result. 
We have Qz0,x(ZT = x) = 1 by the definition of conditioned processes. Hence Qx(ZT = x) = R Qz0,x(ZT = x)µ(dz0 | x) = R µ(dz0 | x) = 1. Markov and Reciprocal Properties of QΠ If Qx is constructed as Qx = Q( |ZT = x), it is easy to see that QΠ := R Qx( )Π (dx) is Markov iff Q is Markov. Proposition A.7. Assume Qx = Q( | ZT = x) and π (z) := dΠ d QT (z) exists and is positive everywhere. Then QΠ is Markov, iff Q is Markov. [Proof of Proposition A.7] If Qx = Q( | ZT = x), we have from the definition of QΠ : QΠ (Z) = Q(Z|ZT )Π (ZT ) = Q(Z)π (ZT ), Published as a conference paper at ICLR 2023 where π (ZT ) = dΠ d QT (ZT ). Therefore, QΠ is obtained by multiplying a positive factor π (ZT ) on the terminal state ZT of Q. Hence QΠ has the same Markov structure as that of Q. If Qx is constructed from mixtures of bridges as above, the resulting QΠ is more complex. In fact, simply varying the initialization µ in Proposition (A.6) can change the Markov structure of QΠ . Proposition A.8. Take Qx to be the dynamics in (7) initialized from Z0 N(0, v0). Assume σt > 0, t [0, T]. Then QΠ is Markov only when v0 = 0, or v0 = + . [Proof of Proposition A.8] When taking Qx to be the dynamics (7) initialized from Z0 µ0 = N(0, v0), we have Qx = R µ0(dz0) Qz0,x, where Qz0,x = Q( |Z0 = z0, ZT = x) with Q following Brownian motion d Zt = d Wt. Hence, we can write QΠ (d Z) = Q(d Z)r(Z0, ZT ), where r(z0, z T ) = dµ0 Π d Q0,T (z0, z T ). From L eonard et al. (2014), QΠ is Markov iff r(x, z0) = f(x)g(z0) for some f and g, which is not the case except the degenerated case (v0 = 0 and v0 = + ) because Q0,1 is not factorized. On the other hand, when v0 = 0, we have that Qx = Q( |ZT = x) is the standard Brownian bridge and hence QΠ is Markov following Proposition A.7. When v0 = + , as the case of SMLD, QΠ is the law of Zt = ZT t with d Zt = d Wt and Z0 Π , which is also Markov. The right characterization of QΠ from Proposition (A.6) involves reciprocal processes (L eonard et al., 2014). Definition A.9. A process Z with law Q on [0, T] is said to be reciporcal if it can be written into Q = R Qz0,z T µ(dz0, dz T ), where Q is a Markov process and Qz0,z T = Q( |Z0 = z0, ZT = z T ), and µ is a probability measure on Ω Ω. Proposition A.10. QΠ is reciprocal iff Qx = R Qz0,xµ(dz0 | x) for a Markov Q and distribution µ. [Proof of Proposition A.10] Note that QΠ ( ) = Z π(dx)Qx( ) = Z π(dx)µ(dz0 | x) Qz0,x( ). Hence if Q is Markov, QΠ is reciprocal by Definition A.9. On the other hand, if QΠ is reciprocal, we have QΠ ( ) = R Mz0,x( )µ(dz0, dx) for some Markov process M and probability measure µ on Ω Ω. In this case, we have Qx( ) = QΠ ( |ZT = x) = R Mz0,x( )µ(dz0 | x), assuming it exits. Intuitively, a reciprocal process can be viewed as connecting the head and tail of a Markov chain, yielding a single loop structure. A characteristic property is Q(X[s,t] A | Z[0,s], Z[t,T ]) = Q(X[s,t] A | Zs, Zt), where A is any event that occur between time s and t. Solutions of the Schrodinger bridge problems are reciprocal processes (L eonard et al., 2014). A.5 PRACTICAL ALGORITHM AND ERROR ANALYSIS In practice, we need to introduce empirical and numerical approximations in both training and inference phases. Denote by τ = {τi}K+1 i=1 a grid of time points with 0 = τ1 < τ2 . . . < τK+1 = T. 
During training, we minimize an empirical and time-discretized surrogate of L(θ) as follows i=1 ℓ(θ; Z(i), τ (i)), ℓ(θ; Z, τ) := log pθ 0(Z0) + 1 2K k=1 (θ; Z, τk), (16) where (θ; Z, t) := σ 1(Zt, t)(sθ(Zt, t) ηx(Z[0,t], t)) 2, and {Z(i)} is drawn from QΠ , and τ (i) can be either a deterministic uniform grid of [0, T], i.e., τ (i) = {i/K}K i=0, or drawn i.i.d. uniformly on [0, T] (see e.g.,Song et al. (2020b); Ho et al. (2020)). A subtle problem here is that the Published as a conference paper at ICLR 2023 variance of (θ; Z, t) grows to infinite as t T. Hence, we should not include (θ; Z, T) at the end point τ K+1 = T into the sum in the loss ℓ(θ, Z, τ) to avoid variance exploding. In the sampling phase, the continuous-time model Pθ should be approximated numerically. A standard approach is the Euler-Maruyama method, which simulates the trajectory on a time grid τ by ˆZτk+1 = ˆZτk + ϵksθ( ˆZτk, τk) + ϵkσ( ˆZτk, τk)ξk, ϵk = τk+1 τk, ξk N(0, Id), (17) The final output is ˆZT . The following result shows the KL divergence between Π and the distribution of ˆZT can be bounded by the sum of the step size and the expected optimality gap E[ ˆL(θ) ˆL(θ )] of the time-discretized loss in (16). A.5.1 TIME-DISCRETIZATION ERROR ANALYSIS (PROPOSITION A.11) Proposition A.11. Assume Ω= Rd and σ(z, t) = σ(t) is state-independent. Take the uniform time grid τ unif := {iϵ}K i=0 with step size ϵ = T/K in the sampling step (17). Assume σ(t) > c > 0, t and σ(t) is piecewise constant w.r.t. time grid τ unif. Let Lϵ(θ) = EZ QΠ [ℓ(θ; Z, τ unif)]. Let Pθ,ϵ T be the distribution of the resulting sample ˆZT . Let θ be an optimal parameter satisfying (14). Assume C0 := supz,t sθ (z, t) 2 /(1 + z 2), tr(σ2(z, t)), EPθ [ Z0 2] < + , and sθ (z, t) sθ (z , t ) 2 2 L z z 2 + |t t | for z, z Rd and t, t [0, T]. Then q KL(Π || Pθ,ϵ T ) p Lϵ(θ) Lϵ(θ ) + O ϵ . We provide the analysis and proof for Proposition A.11 in the following text. Proposition A.12. Assume Ω= Rd and σ(z, t) = σ(t) is state-independent and σ(t) > c > 0, t [0, T]. Take the uniform time grid τ unif := {iϵ}K i=0 with step size ϵ = T/K in the sampling step (17). Let Lϵ(θ) = EZ QΠ [ℓϵ(θ; Z)] with ℓϵ(θ, Zt) = log pθ 0(Z0) + 1 2K σ 1 k (sθ(Ztk, tk) ηZT (Z[0,tk], tk)) 2 where ϵ > 0 is a step size with T = Kϵ and tk = (k 1)ϵ, and σ2 k := (tk+1 tk) 1 R tk+1 tk σ(t)2dt. Let Pθ,ϵ T be the distribution of the sample ˆZT resulting from the following Euler method: ˆZtk+1 = ˆZtk + ϵsθ(Ztk, tk) + ϵσkξk, where ξk N(0, Id) is the standard Gaussian noise in Rd. Let θ be an optimal parameter satisfy- ing (14). Assume C0 := supz,t sθ (z, t) 2 /(1 + z 2), tr(σ2(z, t)), EPθ [ Z0 2] < + , and sθ satisfies sθ (z, t) sθ (z , t ) 2 2 L z z 2 + |t t | for z, z Rd and t, t [0, T]. Then we have q KL(Π || Pθ,ϵ T ) p Lϵ(θ) Lϵ(θ ) + O ϵ . [Proof of Proposition A.11] This is the result of Lemma A.13 below by noting that the ˆPθ there is equivalent to the Euler method above, and Lϵ(θ) Lϵ(θ ) Lϵ(θ) Lϵ(θ ) (because σ 2 k = ((tk+1 tk) 1 R tk+1 tk σ(t)2) 1 (tk+1 tk) 1 R tk+1 tk σ(t) 2). Lemma A.13. Let h be a step size and ϵ = T/K for a positive integer K. For each t [0, ), denote by t ϵ = max({kϵ: k N} [0, t]). Assume QΠ : d Zt = ηΠ (Z[0,t], t)dt + σ(Zt, t)d Wt, Z0 Q0 Pθ : d Zt = sθ (Zt, t)dt + σ(Zt, t)d Wt, Z0 Q0, Pθ : d Zt = sθ(Zt, t)dt + σ(Zt, t)d Wt, Z0 Pθ 0 ˆPθ : d Zt = sθ(Z t ϵ, t)dt + σ(Zt, t)d Wt, Z0 Pθ 0, Published as a conference paper at ICLR 2023 where Pθ is the Markovianization of QΠ , and ˆPθ is a discretized version of Pθ. 
Define Lϵ(θ) = EQΠ log pθ 0(Z0) + 1 σ 1(Zt, t)(sθ(Z t ϵ, t ϵ) ηZT (Z[0, t ϵ], t ϵ)) 2 dt Assume the conditions of Lemma A.17 holds for Pθ , and σ(z, t) c > 0 for all z, t, and sθ satisfies sθ (z, t) sθ (z , t ) 2 2 L z z 2 + |t t | for z, z Rd and t, t [0, T]. Then q KL(Pθ || ˆPθ) q Lϵ(θ) Lϵ(θ ) + C ϵ, where C is a constant that is independent of ϵ. [Proof] Define f 2 Q,σ = EZ Q[ R T 0 σ(Z, t)f(Z, t) 2] for convenient notation. Let sθ ϵ(Z, t) = sθ(Z t ϵ, t ϵ), and ηϵ = ηZT (Z[0, t ϵ], t ϵ). KL(Pθ || ˆPθ) = KL(Pθ 0 || ˆPθ 0) + 1 KL(Pθ 0 || Pθ 0) + 1 (1 + ω) sθ ϵ sθ ϵ 2 Pθ ,σ + (1 + 1/ω) sθ sθ ϵ 2 := (1 + ω)I1 + (1 + 1/ω)I2, where ω > 0 is any positive number and I1 := 1 1 + ω KL(Pθ 0 || Pθ 0) + 1 sθ ϵ sθ ϵ 2 = 1 1 + ω KL(Pθ 0 || Pθ 0) + 1 sθ ϵ sθ ϵ 2 KL(Pθ 0 || Pθ 0) + 1 sθ ϵ sθ ϵ 2 KL(Pθ 0 || Pθ 0) + 1 sθ ϵ ηZT ϵ 2 QΠ ,σ sθ ϵ ηZT ϵ 2 //Lemma A.5 = Lϵ(θ) Lϵ(θ ), and 0 σ 2(Zt, t) Zt Z t ϵ 2 + (t t ϵ) dt 0 ((Zt Z t ϵ)2 + (t t ϵ))dt 2c2 (CPθ + 1) Z T 0 (t t ϵ)dt //Lemma A.17 2c2 (CPθ + 1) Tϵ 2 . //Lemma A.15, where CPθ is a constant depending on Pθ that comes from Lemma A.17. Hence KL(Pθ || ˆPθ) inf ω 0(1 + ω)I1 + (1 + 1/ω)I2 Lϵ(θ) Lϵ(θ ) + 1 L c2 (CPθ + 1) Tϵ This completes the proof. Published as a conference paper at ICLR 2023 Lemma A.14. For any a, b Rd, and ω 0, a + b 2 2 (1 + ω) a 2 2 + (1 + 1/ω) b 2 2 . (1 + ω) a 2 2 + (1 + 1/ω) b 2 2 a 2 2 + b 2 2 + 2a b = a + b 2 2 Lemma A.15. Assume T 0, ϵ 0 and T/ϵ N. We have Z T 0 (t t ϵ)dt = Tϵ [Proof] Z T 0 (t t ϵ)dt = 0 (hk + x hk)dx = Kϵ2/2 = Tϵ/2. Lemma A.16 (Gr onwall s inequality). Let I denote an interval of the real line of the form [a, ) or [a, b] or [a, b) with a < b. Let α, β and u be real-valued functions defined on I. Assume that β and u are continuous and that the negative part of α is integrable on every closed and bounded subinterval of I. (a) If β is non-negative and if u satisfies the integral inequality u(t) α(t) + Z t a β(s)u(s) ds, t I, which is true if u (t) α (t) + β(t)u(t). then u(t) α(t) + Z t a α(s)β(s) exp Z t s β(r) dr ds, t I. (b) If, in addition, the function α is non-decreasing, then u(t) α(t) exp Z t a β(s) ds , t I. Lemma A.17. Consider d Xt = b(Zt, t)dt + σ(Zt, t)d Wt, X0 = 0, t [0, T]. Assume there exists a finite constant C0, such that b(x, t) 2 2 C0(1 + x 2 2), x Rd, t [0, T], tr(σσ (x, t)) C0, x Rd, t [0, T]. and E[ Z0 2 2] C0. Then for any 0 s t T, we have E[ Zt Zs 2 2] KC0,T (t s), where KC0,T is a finite constant that depends on C0 and T. Published as a conference paper at ICLR 2023 [Proof] Let η = supx,t tr(σσ (x, t)). We have by Ito Lemma, d dt E h Zt Zs 2 2 i = E 2(Zt Zs) (b(Zt, t) + d Wt) + η = E 2(Zt Zs) b(Zt, t) + η = E h Zt Zs 2 2 + b(Zt, t) 2 2 + d i E h Zt Zs 2 2 + C0(1 + Zt 2 2) + η i (1 + 2C0)E h Zt Zs 2 2 i + η + C0(1 + 2E h Zs 2 2 i ). Using Gronwall s inequality, E h Zt Zs 2 2 i (t s)(η + C0(1 + 2E[ Zs 2 2])) exp ((t s)(1 + 2C0)) . Taking s = 0 yields that E h Zt Z0 2 2 i t(η + C0(1 + 2E[ Z0 2 2])) exp (t(1 + 2C0)) . E[ Zt 2 2] 2E h Zt Z0 2 2 i + 2E h X0 2 2 i 2t(η + C0(1 + 2E[ Z0 2 2])) exp (t(1 + 2C0)) + 2E h Z0 2 2 i 4T(C0 + C2 0) exp(T(1 + 2C0)) + 2C0. E h Zt Zs 2 2 i (t s)(η + C0(1 + 2E[ Zs 2 2])) exp ((t s)(1 + 2C0)) C(t s), C = (2C0 + 4C2 0 + 8C0T(C0 + C2 0) exp(T(1 + 2C0))) exp (T(1 + 2C0)) . 
A.5.2 STATISTICAL ERROR ANALYSIS (PROPOSITION A.18)

To provide a simple analysis of the statistical error, we assume that $\hat\theta_n = \arg\min_\theta \hat L(\theta)$ is an asymptotically normal M-estimator of $\theta^*$ in the sense of classical asymptotic statistics (Van der Vaart, 2000), which allows us to estimate the rate of the excess risk $L^\epsilon(\hat\theta_n) - L^\epsilon(\theta^*)$ and hence the KL divergence.

Proposition A.18. Assume the conditions of Proposition A.11. Assume $\hat\theta_n = \arg\min_\theta \hat L^\epsilon(\theta)$ with $\hat L^\epsilon(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta; Z^{(i)}, \tau^{\mathrm{unif}})$ and $Z^{(i)} \sim \mathbb{Q}^{\Pi^*}$. Take $\mathbb{Q}^x$ to be the standard Brownian bridge $\mathrm{d}Z^x_t = \frac{x - Z^x_t}{T - t}\,\mathrm{d}t + \mathrm{d}W_t$ with $Z_0 \sim \mathcal N(0, v_0)$ and $v_0 > 0$. Assume $\sqrt{n}(\hat\theta_n - \theta^*) \to_d \mathcal N(0, \Sigma^*)$ as $n \to +\infty$, where $\Sigma^*$ is the asymptotic covariance matrix of the M-estimator $\hat\theta_n$. Assume $L^\epsilon(\theta)$ is second-order continuously differentiable and strongly convex at $\theta^*$. Assume $\Pi^*$ has a finite covariance matrix and admits a density function $\pi^*$ that satisfies
$$\sup_{t\in[0,T]}\mathbb{E}_{\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_t, t)\|^2\big(1 + \|\nabla\log\pi^*(Z_T)\|^2 + \mathrm{tr}(\nabla^2\log\pi^*(Z_T))\big)\Big] < +\infty.$$
Then
$$\mathbb{E}\big[\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}_T^{\hat\theta_n,\epsilon})\big] = O\Big(\frac{\log(1/\epsilon) + 1}{n} + \epsilon\Big). \qquad (18)$$

The expectation in Eq. (18) is w.r.t. the randomness of $\hat\theta_n$. The $\log(1/\epsilon)$ factor shows up as the sum of a harmonic series, because the variance of $\Delta(\theta; Z, t)$ grows with rate $O(1/(T - t))$ as $t \to T$. Taking $\epsilon = 1/n$ yields $\mathbb{E}[\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}_T^{\hat\theta_n,\epsilon})] = O(\log n / n)$. If we want to achieve a KL divergence of $O(\eta)$, it is sufficient to take $K = T/\epsilon = O(1/\eta)$ steps and $n = O(\log(1/\eta)/\eta)$ data points.

[Proof of Proposition A.18] Let
$$\theta^* = \arg\min_\theta L^\epsilon(\theta) := \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\ell(\theta; Z, \tau^{\mathrm{unif}})], \qquad \hat\theta_n = \arg\min_\theta \hat L^\epsilon(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(\theta; Z^{(i)}, \tau^{\mathrm{unif}}),$$
where $\{Z^{(i)}\}_{i=1}^n$ is drawn i.i.d. from $\mathbb{Q}^{\Pi^*}$. We assume that $\hat\theta_n$ is an asymptotically normal M-estimator, in which case we have
$$\sqrt{n}(\hat\theta_n - \theta^*) \to_d \mathcal N(0, \Sigma^*), \qquad \Sigma^* = H_*^{-1} V_* H_*^{-1}, \qquad H_* = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\nabla^2_{\theta\theta}\ell(\theta^*; Z)\big], \qquad V_* = \mathbb{E}\big[\nabla_\theta\ell(\theta^*; Z)\,\nabla_\theta\ell(\theta^*; Z)^\top\big],$$
and
$$n\,\mathbb{E}\big[L^\epsilon(\hat\theta_n) - L^\epsilon(\theta^*)\big] \simeq \frac12\,\mathbb{E}\big[\sqrt{n}(\theta^* - \hat\theta_n)^\top H_*\,\sqrt{n}(\theta^* - \hat\theta_n)\big] \simeq \frac12\mathrm{tr}(H_*^{-1} V_*),$$
where $f \simeq g$ denotes that $f - g = o(1)$. We now need to bound $\mathrm{tr}(H_*^{-1} V_*)$. Combining the results in Lemma A.19 and Lemma A.23, we have, when $t_k = (k-1)\epsilon$ and $T = K\epsilon$,
$$\mathrm{tr}(H_*^{-1} V_*) = O\Big(1 + \frac{1}{K}\sum_{k=1}^{K}\frac{1}{T - t_k}\Big) = O(1 + \log(1/\epsilon)).$$
Therefore, by Proposition A.11,
$$\mathbb{E}\big[\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}_T^{\hat\theta_n,\epsilon})\big] = O\Big(\mathbb{E}\Big[\Big(\sqrt{L^\epsilon(\hat\theta_n) - L^\epsilon(\theta^*)} + \sqrt{\epsilon}\Big)^2\Big]\Big) = O\Big(\mathbb{E}\big[L^\epsilon(\hat\theta_n) - L^\epsilon(\theta^*)\big] + \epsilon\Big) = O\Big(\frac{\log(1/\epsilon) + 1}{n} + \epsilon\Big).$$

Lemma A.19. Assume the conditions in Proposition A.18. Define
$$I_0 = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta\log p_0^{\theta^*}(Z_0)\|^2\big], \qquad I_k = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|^2\,\mathrm{tr}\big(\mathrm{cov}(\eta^{Z_T}(Z_{[0,t_k]}, t_k)\mid Z_{t_k})\big)\big],$$
for $k = 1, \ldots, K$. Then
$$\mathrm{tr}(H_*^{-1} V_*)^{1/2} \le \frac{1}{\lambda_{\min}(H_*)^{1/2}}\Big(I_0^{1/2} + \frac{1}{K}\sum_{k=1}^K I_k^{1/2}\Big).$$

[Proof] From Lemma A.21, $\mathrm{tr}(H_*^{-1} V_*) \le (\lambda_{\min}(H_*))^{-1}\mathrm{tr}(V_*)$. Hence we just need to bound $\mathrm{tr}(V_*)$. By Minkowski's inequality,
$$\mathrm{tr}(V_*)^{1/2} = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta\ell(\theta^*, Z)\|_2^2\big]^{1/2}
\le \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta\log p_0^{\theta^*}(Z_0)\|_2^2\big]^{1/2} + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\big\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)^\top\big(s^{\theta^*}(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\big)\big\|^2\big]^{1/2}$$
$$\le \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta\log p_0^{\theta^*}(Z_0)\|_2^2\big]^{1/2} + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|^2\,\big\|s^{\theta^*}(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\big\|^2\big]^{1/2}$$
$$= \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta\log p_0^{\theta^*}(Z_0)\|_2^2\big]^{1/2} + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|^2\,\mathrm{tr}\big(\mathrm{cov}(\eta^{Z_T}(Z_{[0,t_k]}, t_k)\mid Z_{t_k})\big)\big]^{1/2}
= I_0^{1/2} + \frac{1}{K}\sum_{k=1}^K I_k^{1/2},$$
where the last equality uses that $s^{\theta^*}(Z_{t_k}, t_k)$ is the conditional expectation of $\eta^{Z_T}(Z_{[0,t_k]}, t_k)$ given $Z_{t_k}$.

Lemma A.20. Assume the results in Lemma A.19 and Lemma A.23 hold, and assume
$$\max_{k=1,\ldots,K}\mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|^2\big(1 + \|\nabla\log\pi^*(Z_T)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(Z_T))\big)\Big] < +\infty.$$
Then for $k = 1, \ldots, K$, we have $I_k = O\big(\frac{1}{T - t_k} + 1\big)$.
[Proof] It is a direct application of (20).

Lemma A.21. Let $A$ and $B$ be two $d\times d$ positive semi-definite matrices. Then $\mathrm{tr}(AB) \le \lambda_{\max}(A)\,\mathrm{tr}(B)$.
[Proof] Write $A = \sum_{i=1}^d \lambda_i u_i u_i^\top$, where $\lambda_i$ and $u_i$ are the $i$-th eigenvalue and eigenvector of $A$, respectively. Then
$$\mathrm{tr}(AB) = \sum_{i=1}^d \lambda_i\, u_i^\top B u_i \le \lambda_{\max}(A)\sum_{i=1}^d u_i^\top B u_i = \lambda_{\max}(A)\,\mathrm{tr}(B).$$

Controlling the Conditional Variance of the Regression Problem. Assume $\mathbb{Q}^x$ is the standard Brownian bridge:
$$\mathbb{Q}^x:\ \mathrm{d}Z^x_t = \frac{x - Z^x_t}{T - t}\,\mathrm{d}t + \mathrm{d}W_t, \qquad Z_0 \sim \mathcal N(0, v_0). \qquad (19)$$
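As an illustrative sanity check of the bridge (19) (not part of our released code), the sketch below simulates it with Euler steps, pins the last step to $x$ to avoid the $1/(T-t)$ singularity, and compares the empirical mean and variance at $t = 0.5$ with the Gaussian marginal derived in Lemma A.22 below.

```python
import numpy as np

def simulate_bridge(x, T=1.0, v0=0.1, K=1000, n_paths=5000, seed=0):
    """Euler simulation of Q^x: dZ_t = (x - Z_t)/(T - t) dt + dW_t, Z_0 ~ N(0, v0)."""
    rng = np.random.default_rng(seed)
    eps = T / K
    z = np.sqrt(v0) * rng.standard_normal(n_paths)
    traj = [z.copy()]
    for k in range(K - 1):                      # stop one step early: the drift blows up at t = T
        t = k * eps
        z = z + eps * (x - z) / (T - t) + np.sqrt(eps) * rng.standard_normal(n_paths)
        traj.append(z.copy())
    traj.append(np.full(n_paths, float(x)))     # the bridge is pinned to x at t = T
    return np.stack(traj)                       # shape (K + 1, n_paths); row k is time k * eps

x, T, v0, K = 2.0, 1.0, 0.1, 1000
paths = simulate_bridge(x, T, v0, K)
k_mid = K // 2                                                                  # t = 0.5
print(paths[k_mid].mean(), 0.5 / T * x)                                         # ~ (t/T) x = 1.0
print(paths[k_mid].var(), 0.5 * (T - 0.5) / T + v0 * (T - 0.5) ** 2 / T ** 2)   # ~ 0.275
```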
In this case, the (ideal) loss function is
$$L(\theta) = \mathbb{E}_{X\sim\Pi^*,\, Z\sim\mathbb{Q}^X}\Big[-\log p_0^\theta(Z_0) + \frac12\int_0^T \|s^\theta(Z_t, t) - Y_t\|^2\,\mathrm{d}t\Big], \qquad \text{where } Y_t = \frac{X - Z_t}{T - t}.$$
The second part of the loss is a least-squares regression for predicting $Y_t = \eta^X(Z_t, t)$ with $s^\theta(Z_t, t)$. The conditional variance $\mathrm{cov}(Y_t \mid Z_t)$ is an important factor that influences the error of this regression problem. We now show that $\mathrm{tr}(\mathrm{cov}(Y_t \mid Z_t)) = O(1/(T - t))$, which means that it explodes to infinity as $t \to T$. First, note that
$$\mathrm{tr}(\mathrm{cov}(Y_t \mid Z_t)) = \frac{1}{(T - t)^2}\,\mathrm{tr}(\mathrm{cov}(X \mid Z_t)).$$
Using the estimate in Lemma A.23, we have
$$\mathrm{tr}(\mathrm{cov}(Y_t \mid Z_t)) = O\Big(\frac{1}{T - t} + \mathbb{E}\big[\|\nabla_x\log\pi^*(X)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(X)) \mid Z_t\big]\Big). \qquad (20)$$

Lemma A.22. For the standard Brownian bridge in (19), we have
$$Z^x_t \sim \mathcal N\Big(\frac{t}{T}x,\ \frac{t(T - t)}{T} + \frac{v_0(T - t)^2}{T^2}\Big).$$
[Proof] Let $Z^{z_0,x}_t$ be the same process initialized from $Z^{z_0,x}_0 = z_0$. From the textbook result on the Brownian bridge, we can write $Z^{z_0,x}_t = \frac{t x + (T - t) z_0}{T} + \sqrt{\frac{t(T - t)}{T}}\,\xi_t$, where $\xi_t$ is a standard Gaussian random variable. The result follows directly since $Z^x_t = Z^{Z_0,x}_t$ with $Z_0 \sim \mathcal N(0, v_0)$.

Lemma A.23. Let $\pi^*$ be the density function of $\Pi^*$ on $\mathbb{R}^d$, whose covariance matrix exists. Let $X \sim \Pi^*$ and $Z \sim \mathbb{Q}^X$ from (19) with $v_0 > 0$. Then the density function $\rho_t(x\,|\,z)$ of $X \mid Z_t = z$ satisfies
$$\rho_t(x\,|\,z) \propto \pi^*(x)\exp\Big(-\frac{\|\frac{T}{t}z - x\|_2^2}{2 w_t}\Big), \qquad w_t = \frac{T(T - t)}{t} + \frac{v_0(T - t)^2}{t^2}. \qquad (21)$$
In addition, there exist a positive constant $c < +\infty$ and $\tau \in (0, T)$ such that
$$\mathrm{tr}(\mathrm{cov}_{\rho_t}(x\,|\,z)) \le \begin{cases} w_t d + w_t^2\,\mathbb{E}_{\rho_t}\big[\|\nabla_x\log\pi^*(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(x)) \mid z\big], & \text{when } \tau \le t \le T,\\[2pt] c, & \text{when } 0 \le t \le \tau.\end{cases}$$
So $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x\,|\,z))$ is bounded and decays to zero with rate $O(T - t)$ as $t \to T$.

[Proof] We know that $X \sim \Pi^*$ and, by Lemma A.22, $Z^X_t \mid X \sim \mathcal N\big(\frac{t}{T}X,\ \frac{t(T-t)}{T} + \frac{v_0(T-t)^2}{T^2}\big)$. Hence, (21) is a direct result of Bayes' rule. Lemma A.25 (applied with $\alpha = 1/w_t$ and $b = \frac{T}{t}z$) then gives
$$\mathrm{tr}(\mathrm{cov}_{\rho_t}(x\,|\,z)) \le w_t d + w_t^2\,\mathbb{E}_{\rho_t}\big[\|\nabla_x\log\pi^*(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(x)) \mid z\big].$$
On the other hand, $\rho_t(x\,|\,z) \propto \pi^*(x)\exp\big(-\frac{1}{2 w_t}\|x\|^2 + \frac{T}{t w_t}z^\top x\big)$. When $t \to 0$, we have $1/w_t \to 0$ and $T/(t w_t) \to 0$. Hence $\rho_t(x\,|\,z)$ converges to $\pi^*(x)$ as $t \to 0$ and, as a result, $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x\,|\,z)) \to \mathrm{tr}(\mathrm{cov}_{\pi^*}(x)) < +\infty$. Therefore, for any $c' > 0$ there exists $\tau > 0$ such that $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x\,|\,z)) \le \mathrm{tr}(\mathrm{cov}_{\pi^*}(x)) + c'$ when $0 \le t \le \tau$.

Remark A.24. We need $v_0 > 0$ to ensure that $T/(t w_t) \to 0$ in the proof of Lemma A.23. This is for a purely technical reason, namely to yield a finite bound on the conditional variance when $t$ is close to $0$. We can establish the same result when $v_0 = 0$ by adding the assumption that
$$\max_{k=1,\ldots,K}\mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|_2^2\,\mathrm{tr}\big(\mathrm{cov}_{\Pi^*_{Z_{t_k}}}(Z_T)\big)\Big] < +\infty,$$
where $\Pi^*_z$ is the distribution with density $\pi^*_z(x) \propto \pi^*(x)\exp(z^\top x/T)$.

Lemma A.25. Let $p(x) \propto \pi(x)\exp\big(-\frac{\alpha}{2}\|x - b\|_2^2\big)$ be a positive probability density function on $\mathbb{R}^d$, where $\alpha > 0$, $b \in \mathbb{R}^d$, and $\log\pi$ is continuously second-order differentiable. Then
$$\mathrm{tr}(\mathrm{cov}_p(x)) \le \alpha^{-1}d + \alpha^{-2}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].$$
[Proof] Let us focus first on the case $b = 0$, for which $\mathrm{tr}(\mathrm{cov}_p(x)) \le \mathbb{E}_p[\|x\|_2^2]$. Stein's identity says that
$$\mathbb{E}_p\big[(\nabla_x\log\pi(x) - \alpha x)\,\phi(x)^\top + \nabla_x\phi(x)\big] = 0$$
for a general continuously differentiable function $\phi$ when the integrals above are finite (note that $\nabla_x\log p(x) = \nabla_x\log\pi(x) - \alpha x$). Taking $\phi(x) = x$ yields
$$\mathbb{E}_p\big[(\nabla_x\log\pi(x) - \alpha x)^\top x + d\big] = 0, \qquad\text{which gives}\qquad \mathbb{E}_p[\|x\|_2^2] = \alpha^{-1}\big(\mathbb{E}_p[\nabla_x\log\pi(x)^\top x] + d\big).$$
On the other hand, taking $\phi(x) = \nabla_x\log\pi(x)$ yields
$$\mathbb{E}_p\big[(\nabla_x\log\pi(x) - \alpha x)^\top\nabla_x\log\pi(x) + \mathrm{tr}(\nabla^2\log\pi(x))\big] = 0, \qquad\text{which gives}\qquad \mathbb{E}_p[\nabla_x\log\pi(x)^\top x] = \alpha^{-1}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].$$
Combining the two identities,
$$\mathbb{E}_p[\|x\|_2^2] = \alpha^{-1}d + \alpha^{-2}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].$$
For $b \neq 0$, define $\tilde p(\tilde x) \propto \pi(\tilde x + b)\exp\big(-\frac{\alpha}{2}\|\tilde x\|_2^2\big)$, which is the distribution of $\tilde x = x - b$ when $x \sim p$. Applying the result above to $\tilde p$ yields
$$\mathrm{tr}(\mathrm{cov}_p(x)) = \mathrm{tr}(\mathrm{cov}_{\tilde p}(\tilde x)) \le \alpha^{-1}d + \alpha^{-2}\,\mathbb{E}_{\tilde x\sim\tilde p}\big[\|\nabla\log\pi(\tilde x + b)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(\tilde x + b))\big] = \alpha^{-1}d + \alpha^{-2}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].$$
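For a Gaussian base density $\pi$, both sides of the bound in Lemma A.25 are available in closed form (and the bound holds with equality when $b = 0$, since $p$ then has zero mean and $\mathrm{tr}(\mathrm{cov}_p(x)) = \mathbb{E}_p[x^2]$). This gives a quick numerical check; the following script is purely illustrative and not part of our released code.

```python
import numpy as np

# Lemma A.25 check in 1D with pi = N(0, s2), b = 0:
# p(x) ∝ pi(x) exp(-alpha x^2 / 2) is the Gaussian N(0, 1 / (1/s2 + alpha)).
s2, alpha = 1.5, 2.0
var_p = 1.0 / (1.0 / s2 + alpha)                 # E_p[x^2] (p has mean zero)

# Right-hand side: alpha^{-1} d + alpha^{-2} E_p[ |d/dx log pi(x)|^2 + d^2/dx^2 log pi(x) ]
# with log pi(x) = -x^2 / (2 s2) + const, so the bracket equals x^2 / s2^2 - 1 / s2.
rhs = 1.0 / alpha + (var_p / s2**2 - 1.0 / s2) / alpha**2
print(var_p, rhs)                                # the two values coincide (0.375 and 0.375)
```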
A.6 CONDITION FOR Ω-BRIDGES

We provide the proof for Proposition 2.3.

[Proof of Proposition 2.3] By the formula for the KL divergence between two diffusion processes, we have
$$\mathrm{KL}(\mathbb{Q}^\Omega \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{Q}^\Omega_0 \,\|\, \mathbb{P}^\theta_0) + \frac12\,\mathbb{E}_{\mathbb{Q}^\Omega}\Big[\int_0^T \|f^\theta(Z_t, t)\|^2\,\mathrm{d}t\Big],$$
which is finite under the assumed non-singularity condition on $f^\theta$. This means that $\mathbb{Q}^\Omega$ and $\mathbb{P}^\theta$ are absolutely continuous with respect to each other, and hence have the same support. Therefore, $\mathbb{Q}^\Omega(Z_T \in \Omega) = 1$ implies that $\mathbb{P}^\theta(Z_T \in \Omega) = 1$.

A.7 EXAMPLES OF Ω-BRIDGES

If $\Omega$ is a product space, the integration can be factorized into one-dimensional integrals. Specifically, assume $\Omega = I_1 \times \cdots \times I_d$; then $\eta^\Omega(z, t) = [\eta^{I_i}(z_i, t)]_{i=1}^d$, where $\eta^{I_i}$ is the drift force of the $I_i$-bridge and $z_i$ is the $i$-th element of $z = [z_i]$. Therefore, it is sufficient to focus on the 1D case below. Consider the bridge process constructed from the Brownian motion in (9).

If $\Omega$ is a discrete set, say $\Omega = \{e_1, \ldots, e_K\}$, we have
$$\eta^\Omega(z, t) = \sigma_t^2\,\frac{1}{\sum_{k=1}^K \omega(e_k, z, t)}\sum_{k=1}^K \omega(e_k, z, t)\,\frac{e_k - z}{\beta_T - \beta_t} = \sigma_t^2\,\nabla_z\log\sum_{k=1}^K \omega(e_k, z, t), \qquad \omega(e_k, z, t) = \exp\Big(-\frac{(z - e_k)^2}{2(\beta_T - \beta_t)}\Big).$$

If $\Omega = [a, b]$, we have
$$\eta^\Omega(z, t) = \sigma_t^2\,\frac{1}{\int_a^b \omega(e, z, t)\,\mathrm{d}e}\int_a^b \omega(e, z, t)\,\frac{e - z}{\beta_T - \beta_t}\,\mathrm{d}e = \sigma_t^2\,\nabla_z\log\int_a^b \omega(e, z, t)\,\mathrm{d}e = \sigma_t^2\,\nabla_z\log\Big(F\Big(\frac{z - a}{\sqrt{\beta_T - \beta_t}}\Big) - F\Big(\frac{z - b}{\sqrt{\beta_T - \beta_t}}\Big)\Big),$$
where $F$ is the standard Gaussian CDF.
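The two one-dimensional cases above are straightforward to implement. The sketch below is illustrative rather than taken from our released code: it assumes that $\beta_t$ is the accumulated noise $\int_0^t \sigma_s^2\,\mathrm{d}s$ of the bridge construction in (9) (consistent with the constant-noise case, where the kernel variance is $T - t$), and the names `sigma2` and `beta` are placeholders passed in as functions.

```python
import numpy as np
from scipy.stats import norm

def eta_discrete(z, t, points, sigma2, beta, T=1.0):
    """Omega-bridge drift for a finite set Omega = {e_1, ..., e_K} (1D case).

    eta(z, t) = sigma2(t) * d/dz log sum_k exp(-(z - e_k)^2 / (2 (beta(T) - beta(t)))).
    """
    s2 = beta(T) - beta(t)                      # remaining accumulated noise
    logits = -(z - points) ** 2 / (2.0 * s2)
    w = np.exp(logits - logits.max())
    w /= w.sum()                                # softmax weights omega(e_k, z, t)
    return sigma2(t) * np.sum(w * (points - z)) / s2

def eta_interval(z, t, a, b, sigma2, beta, T=1.0):
    """Omega-bridge drift for an interval Omega = [a, b] (1D case).

    eta(z, t) = sigma2(t) * d/dz log [ F((z - a)/s) - F((z - b)/s) ],  s = sqrt(beta(T) - beta(t)).
    """
    s = np.sqrt(beta(T) - beta(t))
    mass = norm.cdf((z - a) / s) - norm.cdf((z - b) / s)
    grad = (norm.pdf((z - a) / s) - norm.pdf((z - b) / s)) / s
    return sigma2(t) * grad / mass

# Example with constant noise sigma_t^2 = 1, so beta(t) = t.
sigma2 = lambda t: 1.0
beta = lambda t: t
print(eta_discrete(0.3, 0.9, np.array([0.0, 1.0]), sigma2, beta))  # pulled towards the nearest element
print(eta_interval(1.4, 0.9, 0.0, 1.0, sigma2, beta))              # pushed back into [0, 1]
```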
B ADDITIONAL MATERIALS OF THE EXPERIMENTS

In our experiments, $T = 1$ and $\epsilon = T/K = 1/K$. Moreover, we take the time grid by randomly sampling from $\{i/K\}_{i=0}^{K-1}$ for the training objective in Eq. (13). For evaluation, we calculate the standard evidence lower bound (ELBO) by viewing the resulting time-discretized model as a latent variable model:
$$\mathbb{E}_{X\sim\Pi^*}\big[-\log\hat p^\theta_T(X)\big] \le \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[-\log\frac{\hat p^\theta_0(Z_0)}{q_0(Z_0)} - \sum_{k=1}^{K}\log\frac{\hat p^\theta_{t_{k+1}|t_k}(Z_{t_{k+1}}\,|\,Z_{t_k})}{q_{t_{k+1}|t_k}(Z_{t_{k+1}}\,|\,Z_{t_k})}\Big],$$
where $t_k = (k-1)\epsilon$, $\hat p^\theta$ is the density function of the time-discretized version of $\mathbb{P}^\theta$, and $q$ is the density function of $\mathbb{Q}$. We adopt Monte Carlo sampling to estimate the log-likelihood. As in Song et al. (2020b), we repeat the estimation 5 times on the test set. For categorical/integer/grid generation, the likelihood of the last step should take the rounding into account: in practice, we have $\hat Z_T = \mathrm{rounding}\big(\hat Z_{t_K} + \epsilon\, s^\theta(\hat Z_{t_K}, t_K) + \sqrt{\epsilon}\,\sigma(\hat Z_{t_K}, t_K)\,\xi_K,\ \Omega\big)$, where $\mathrm{rounding}(x, \Omega)$ denotes finding the nearest element to $x$ in $\Omega$; hence the likelihood $\hat p^\theta_{T|t_K}$ of the last step should incorporate the rounding operator as part of the model.

B.1 GENERATING MIXED-TYPE TABULAR DATA

In this experiment, the metrics are measured with the implementation from the Synthetic Data Vault (SDV) (Patki et al., 2016). For the baseline methods, we adopt their open-sourced official implementations.¹ For the machine learning models used for evaluation (logistic regression, AdaBoost and MLP), we directly use their default configurations in SDV. For the results in Table 1, we repeat the experiments with 5 different random seeds and report the standard deviation. We provide additional generated samples from Mixed-Bridge in Figure 5.

Figure 5: Generated tabular data from Mixed-Bridge.

¹https://github.com/sdv-dev/CTGAN

B.2 GENERATING INTEGER-VALUED POINT CLOUDS

In this experiment, we need to process point cloud data on an integer grid. To prepare the data, we first sample 2048 points from the ground-truth mesh. Then, we normalize all the point clouds to a unit bounding box. After this, we project the points onto the grid by rounding each coordinate to an integer. The metrics in the main text, MMD, COV and 1-NNA, are computed with respect to the post-processed integer-valued training point clouds. For the results in Table 2, we repeat the experiments 3 times and report the mean.

B.3 GENERATING SEMANTIC SEGMENTATION MAPS ON CITYSCAPES

In this experiment, we set (Noise Decay A): $\sigma^2_t = 3\exp(-3t)$; (Noise Decay B): $\sigma^2_t = 3(1 - t)$; (Noise Decay C): $\sigma^2_t = 3 - 3\exp(-3(1 - t))$. We visualize the noise schedules in Figure 6. Note that, except for Constant Noise, the other three processes gradually decrease the magnitude of the noise as $t \to 1$.

Figure 6: Visualization of the noise schedules of Noise Decay A, Noise Decay B and Noise Decay C.

For a fair comparison, we use the same neural network as in Hoogeboom et al. (2021). The network is optimized with the Adam optimizer with a learning rate of 0.0002, and the model is trained for 500 epochs. The Cityscapes dataset (Cordts et al., 2016) contains photos captured by cameras on driving cars; a pixel-wise semantic segmentation map is labeled for each photo. As in Hoogeboom et al. (2021), we rescale the segmentation maps from Cityscapes to 32×64 images using nearest-neighbour interpolation. Our training set and test set are exactly the same as those of Hoogeboom et al. (2021) for fair comparison. For the results in Table 3, we repeat the experiments 3 times and report the mean. We provide more samples in Figure 9.

B.4 DISCRETE CIFAR10 GENERATION

The model is trained using the same training strategy as DDPM (Ho et al., 2020) with the code base provided in Song et al. (2020b). Specifically, the neural network is the same U-Net structure as the implementation in Song et al. (2020b). The optimizer is Adam with a learning rate of 0.0002. Following common practice (Song & Ermon, 2020; Song et al., 2020b), training is smoothed by an exponential moving average (EMA) with a factor of 0.999. We use $K = 1000$ and $\mathrm{d}t = 0.001$ for discretizing the SDE. To account for the discretization error, after the final step we apply rounding to the generated images to obtain truly integer-valued images. We compare the value distributions of the generated images in Figure 8.

Figure 7: Diffusion process of one pixel (an 8-dimensional vector) in Cityscapes. As $t \to 1$, 7 of the dimensions reach 0, while 1 of the dimensions reaches 1, turning the vector into a one-hot vector.

Figure 8: Final value distribution of the generated images with Bridge-Continuous (a) and Bridge-Integer (b), before rounding, on CIFAR10. We only show the values in [125.5, 130.5] for visual clarity. Integer-Bridge generates discrete values.

Figure 9: Additional samples from (a) real data, (b) Constant Noise, (c) Noise Decay A, (d) Noise Decay B and (e) Noise Decay C.
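For reference, the rounding operator $\mathrm{rounding}(x, \Omega)$ used in Appendix B (and for the final CIFAR10 step in B.4) can be sketched as follows for a per-coordinate grid; this is a simplified illustration, and `grid_values` is a placeholder rather than an identifier from our released code.

```python
import numpy as np

def round_to_domain(z, grid_values):
    """rounding(z, Omega): map each coordinate of z to its nearest admissible value.

    grid_values: 1D array of admissible values per coordinate (e.g. np.arange(256)
    for 8-bit images, or the integer grid used in the point-cloud experiments).
    """
    grid_values = np.asarray(grid_values, dtype=float)
    idx = np.abs(z[..., None] - grid_values).argmin(axis=-1)   # index of nearest grid value
    return grid_values[idx]

# Example: snap a near-integer output to the 8-bit pixel grid.
z_T = np.array([127.6, 128.2, 130.9])
print(round_to_domain(z_T, np.arange(256)))   # -> [128. 128. 131.]
```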