# Deep Generative Learning via Schrödinger Bridge

Gefei Wang¹, Yuling Jiao², Qian Xu³, Yang Wang¹,⁴, Can Yang¹,⁴

¹Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China. ²School of Mathematics and Statistics, Wuhan University, Wuhan, China. ³AI Group, WeBank Co., Ltd., Shenzhen, China. ⁴Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications, The Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Yuling Jiao, Can Yang.

*Proceedings of the 38th International Conference on Machine Learning*, PMLR 139, 2021. Copyright 2021 by the author(s).

We propose to learn a generative model via entropy interpolation with a Schrödinger Bridge. The generative learning task can be formulated as interpolating between a reference distribution and a target distribution based on the Kullback-Leibler divergence. At the population level, this entropy interpolation is characterized by an SDE on $[0, 1]$ with a time-varying drift term. At the sample level, we derive our Schrödinger Bridge algorithm by plugging the drift term, estimated with a deep score estimator and a deep density ratio estimator, into the Euler-Maruyama method. Under some mild smoothness assumptions on the target distribution, we prove the consistency of both the score estimator and the density ratio estimator, and then establish the consistency of the proposed Schrödinger Bridge approach. Our theoretical results guarantee that the distribution learned by our approach converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and indicate that the generative model via Schrödinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning. We also demonstrate its usefulness in image interpolation and image inpainting.

## 1. Introduction

Deep generative models have achieved enormous success in learning the underlying high-dimensional data distribution from samples. They have various applications in machine learning, such as image-to-image translation (Zhu et al., 2017; Choi et al., 2020), semantic image editing (Zhu et al., 2016; Shen et al., 2020), and audio synthesis (Van Den Oord et al., 2016; Prenger et al., 2019). Most existing generative models seek to learn a nonlinear function that transforms a simple reference distribution into the target distribution as the data-generating mechanism. They can be categorized as either likelihood-based models or implicit generative models. Likelihood-based models, such as variational auto-encoders (VAEs) (Kingma & Welling, 2014) and flow-based methods (Dinh et al., 2015), optimize the negative log-likelihood or a surrogate loss, which is equivalent to minimizing the Kullback-Leibler (KL) divergence between the target distribution and the generated distribution. Although their ability to learn flexible distributions is restricted by the way the probability density is modeled, many works have been proposed to alleviate this problem and have achieved appealing results (Makhzani et al., 2016; Tolstikhin et al., 2018; Razavi et al., 2019; Dinh et al., 2017; Papamakarios et al., 2017; Kingma & Dhariwal, 2018; Behrmann et al., 2019). As a representative of implicit generative models, generative adversarial networks (GANs) use a min-max game objective to learn the target distribution.
It has been shown that the vanilla GAN (Goodfellow et al., 2014) minimizes the Jensen-Shannon (JS) divergence between the target distribution and the generated distribution. To generalize the vanilla GAN, researchers have considered other criteria, including more general f-divergences (Nowozin et al., 2016), the 1-Wasserstein distance (Arjovsky et al., 2017), and the maximum mean discrepancy (MMD) (Bińkowski et al., 2018). Meanwhile, recent progress on network architectures (Radford et al., 2016; Zhang et al., 2019) and training techniques (Karras et al., 2018; Brock et al., 2019) has enabled GANs to produce impressive high-quality images.

Despite the extraordinary performance of generative models (Razavi et al., 2019; Kingma & Dhariwal, 2018; Brock et al., 2019; Karras et al., 2019), there still exists a gap between their empirical success and their theoretical justification. For likelihood-based models, consistency results require that the data distribution lie within the model family, which often fails to hold in practice (Kingma & Welling, 2014). Recently, new generative models have been developed from different perspectives, such as gradient flows in a measure space, of which GAN can be seen as a special case (Gao et al., 2019; Arbel et al., 2019), and stochastic differential equations (SDEs) (Song & Ermon, 2019; 2020; Song et al., 2021). However, to push a simple initial distribution to the target one, these methods (Gao et al., 2019; Arbel et al., 2019; Liutkus et al., 2019; Song & Ermon, 2019; 2020; Block et al., 2020) require the evolution time to go to infinity at the population level. As a consequence, they need a strong assumption to achieve model consistency: the target must be log-concave or satisfy the log-Sobolev inequality.

To fill this gap, we propose a Schrödinger Bridge approach to learning generative models. The Schrödinger Bridge tackles the problem by interpolating from a reference distribution to a target distribution based on the Kullback-Leibler divergence. The Schrödinger Bridge can be formulated via an SDE on the finite time interval $[0, 1]$ with a time-varying drift term. Given the drift term at the population level, the SDE can be solved with the standard Euler-Maruyama method. At the sample level, we derive our Schrödinger Bridge algorithm by plugging an estimate of the drift term, obtained with a deep score network, into the Euler-Maruyama method. The major contributions of this work are as follows.

From the theoretical perspective, we prove the consistency of the Schrödinger Bridge approach under some mild smoothness assumptions on the target distribution. Our theory guarantees that the learned distribution converges to the target. In contrast, to achieve model consistency, existing theories rely on strong assumptions, e.g., the target must be log-concave or satisfy some error bound conditions such as the log-Sobolev inequality; these assumptions may not hold in practice.

From the algorithmic perspective, we develop a novel two-stage approach that makes the theory of the Schrödinger Bridge work in practice, where the first stage effectively learns a smoothed version of the target distribution and the second stage drives the smoothed distribution to the target distribution. Figure 1 gives an overview of our two-stage algorithm.

Through synthetic data, we demonstrate that our Schrödinger Bridge approach can stably learn multimodal distributions, while GANs are often highly unstable and prone to missing modes (Che et al., 2017).
We also show that the proposed approach achieves comparable performance with state-of-the-art GANs on benchmark data. In summary, we believe that our work suggests a new formulation of generative models.

*Figure 1. Overview of our two-stage algorithm. Stage 1 drives samples at 0 (left) to a smoothed data distribution (middle), and stage 2 learns the underlying target data distribution (right) from samples produced by stage 1. The two stages are realized by two different Schrödinger Bridges with theoretically guaranteed performance.*

## 2. Background

Let us first recall some background on the Schrödinger Bridge problem, adapted from Léonard (2014) and Chen et al. (2020). Let $\Omega = C([0, 1], \mathbb{R}^d)$ be the space of $\mathbb{R}^d$-valued continuous functions on the time interval $[0, 1]$. Denote by $X = (X_t)_{t \in [0,1]}$ the canonical process on $\Omega$, where $X_t(\omega) = \omega_t$ for $\omega = (\omega_s)_{s \in [0,1]} \in \Omega$. The canonical $\sigma$-field on $\Omega$ is $\mathcal{F} = \sigma(X_t, t \in [0,1])$, generated by the sets $\{\omega : X_t(\omega) \in H\}$ with $t \in [0,1]$ and $H \in \mathcal{B}(\mathbb{R}^d)$. Denote by $\mathcal{P}(\Omega)$ the space of probability measures on the path space $\Omega$, and by $W_\tau^x \in \mathcal{P}(\Omega)$ the Wiener measure with variance $\tau$ whose initial marginal is $\delta_x$. The law of the reversible Brownian motion is then defined as $P_\tau = \int W_\tau^x \, dx$, which is an unbounded measure on $\Omega$. One can observe that the marginal of $P_\tau$ coincides with the Lebesgue measure $\mathcal{L}$ at each $t$.

Schrödinger (1932) studied the problem of finding the most likely random evolution between two probability distributions $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$; nowadays this is known as the Schrödinger Bridge problem (SBP). SBP can be formulated as seeking a probability law on the path space that interpolates between $\mu$ and $\nu$ and is close to the prior law of the Brownian diffusion in the sense of relative entropy (Jamison, 1975; Léonard, 2014), i.e., finding a path measure $Q^* \in \mathcal{P}(\Omega)$ with marginals $Q_t^* = (X_t)_\# Q^* = Q^* X_t^{-1}$, $t \in [0,1]$, such that

$$Q^* \in \arg\min_{Q \in \mathcal{P}(\Omega)} D_{\mathrm{KL}}(Q \,\|\, P_\tau), \qquad Q_0 = \mu, \quad Q_1 = \nu,$$

where $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, the relative entropy is $D_{\mathrm{KL}}(Q \,\|\, P_\tau) = \int \log\!\big(\tfrac{dQ}{dP_\tau}\big)\, dQ$ if $Q \ll P_\tau$ (i.e., $Q$ is absolutely continuous w.r.t. $P_\tau$), and $D_{\mathrm{KL}}(Q \,\|\, P_\tau) = \infty$ otherwise. The following results characterize the solution to SBP.

**Theorem 1** (Léonard, 2014). If $\mu, \nu \ll \mathcal{L}$, then SBP admits a unique solution
$$Q^* = f^*(X_0)\, g^*(X_1)\, P_\tau,$$
where $f^*, g^*$ are $\mathcal{L}$-measurable nonnegative functions on $\mathbb{R}^d$ satisfying the Schrödinger system
$$f^*(x)\, \mathbb{E}_{P_\tau}[g^*(X_1) \mid X_0 = x] = \frac{d\mu}{d\mathcal{L}}(x), \quad \mathcal{L}\text{-a.e.},$$
$$g^*(y)\, \mathbb{E}_{P_\tau}[f^*(X_0) \mid X_1 = y] = \frac{d\nu}{d\mathcal{L}}(y), \quad \mathcal{L}\text{-a.e.}$$

Besides $Q^*$, we can also characterize the densities of the time-marginals of $Q^*$, i.e., $\frac{dQ_t^*}{d\mathcal{L}}(x)$. Let $q(x)$ and $p(y)$ be the densities of $\mu$ and $\nu$, respectively, and let
$$h_\tau(s, x, t, y) = [2\pi\tau(t-s)]^{-d/2} \exp\Big(-\frac{\|x-y\|^2}{2\tau(t-s)}\Big)$$
be the transition density of the Wiener process. Then
$$\mathbb{E}_{P_\tau}[f^*(X_0) \mid X_1 = y] = \int h_\tau(0, x, 1, y) f_0(x)\, dx, \qquad \mathbb{E}_{P_\tau}[g^*(X_1) \mid X_0 = x] = \int h_\tau(0, x, 1, y) g_1(y)\, dy,$$
so the above Schrödinger system is equivalent to
$$f^*(x) \int h_\tau(0, x, 1, y) g_1(y)\, dy = q(x), \qquad g^*(y) \int h_\tau(0, x, 1, y) f_0(x)\, dx = p(y).$$
Denote $f_0(x) = f^*(x)$, $g_1(y) = g^*(y)$, and
$$f_1(y) = \int h_\tau(0, x, 1, y) f_0(x)\, dx, \qquad g_0(x) = \int h_\tau(0, x, 1, y) g_1(y)\, dy.$$
The Schrödinger system in Theorem 1 can then be characterized by
$$q(x) = f_0(x)\, g_0(x), \qquad p(y) = f_1(y)\, g_1(y),$$
together with the following forward and backward time-harmonic equations (Chen et al., 2020):
$$\partial_t f_t(x) = \frac{\tau}{2} \Delta f_t(x), \qquad \partial_t g_t(x) = -\frac{\tau}{2} \Delta g_t(x), \qquad \text{on } (0,1) \times \mathbb{R}^d.$$
Let $q_t$ denote the marginal density of $Q_t^*$, i.e., $q_t(x) = \frac{dQ_t^*}{d\mathcal{L}}(x)$; then it can be represented (Chen et al., 2020) as the product $q_t(x) = f_t(x)\, g_t(x)$.
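To make the static Schrödinger system concrete, the sketch below discretizes it on a one-dimensional grid and solves for $f_0$ and $g_1$ by alternating the two equations (a Fortet/IPF-style iteration over the discretized transition kernel $K$). This illustration, including the choice of marginals and the iteration itself, is our own addition and not an algorithm proposed in the paper; at convergence, $f_0 \cdot (K g_1)$ recovers $q$ and $g_1 \cdot (K^\top f_0)$ recovers $p$.

```python
# Illustrative only: a discrete Fortet/IPF-style solver for the static
# Schrodinger system of Theorem 1 on a 1D grid. The grid, the choice of
# q and p, and the iteration are assumptions made for illustration.
import numpy as np

tau = 1.0
x = np.linspace(-6, 6, 400)          # grid used for both marginals
dx = x[1] - x[0]

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

q = gaussian(x, -2.0, 0.3)                                        # marginal at t = 0
p = 0.5 * gaussian(x, 1.0, 0.2) + 0.5 * gaussian(x, 3.0, 0.2)     # marginal at t = 1

# Wiener transition density h_tau(0, x, 1, y), discretized as a kernel matrix.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * tau)) / np.sqrt(2 * np.pi * tau)
K = K * dx                            # include the quadrature weight

f0 = np.ones_like(x)
g1 = np.ones_like(x)
for _ in range(200):                  # alternate the two Schrodinger equations
    f0 = q / (K @ g1)                 # f0(x) * \int h_tau g1 dy = q(x)
    g1 = p / (K.T @ f0)               # g1(y) * \int h_tau f0 dx = p(y)

# Check that the system is (approximately) satisfied.
print(np.max(np.abs(f0 * (K @ g1) - q)), np.max(np.abs(g1 * (K.T @ f0) - p)))
```

In the entropic optimal transport literature this alternating update is known as the Sinkhorn iteration; the paper instead works with the dynamic formulation below, which avoids forming the transition kernel explicitly.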
There are also dynamic formulations of SBP. Let $\mathcal{U}$ be the set of admissible Markov controls with finite energy. The following theorem shows that the vector field
$$u_t^*(x) = \tau v_t^*(x) = \tau \nabla_x \log g_t(x) = \tau \nabla_x \log \int h_\tau(t, x, 1, y)\, g_1(y)\, dy \tag{1}$$
solves the following stochastic control problem.

**Theorem 2** (Dai Pra, 1991).
$$u_t^*(x) \in \arg\min_{u \in \mathcal{U}} \mathbb{E}\left[\int_0^1 \frac{1}{2}\|u_t(x_t)\|^2\, dt\right] \quad \text{s.t.} \quad dx_t = u_t\, dt + \sqrt{\tau}\, dw_t, \quad x_0 \sim q(x), \ x_1 \sim p(x). \tag{2}$$

According to Theorem 2, the dynamics determined by the SDE in (2) with the time-varying drift term $u_t^*$ in (1) drive particles sampled from the initial distribution $\mu$ to particles drawn from the target distribution $\nu$ within the unit time interval. This property is exactly what we need in generative learning, since we want to learn the underlying target distribution $\nu$ by pushing forward a simple reference distribution $\mu$. Theorem 2 also indicates that this solution has minimum energy in terms of the quadratic cost.

## 3. Generative Learning via Schrödinger Bridge

In generative learning, we observe i.i.d. data $x_1, \ldots, x_n$ from an unknown distribution $p_{\mathrm{data}} \in \mathcal{P}(\mathbb{R}^d)$. The underlying distribution $p_{\mathrm{data}}$ often has multiple modes or lies on a low-dimensional manifold, which makes it difficult to learn starting from a simple distribution such as a Gaussian or a Dirac measure supported on a single point. To make the generative learning task easier to handle, we first learn a smoothed version of $p_{\mathrm{data}}$ from the simple reference distribution, namely
$$q_\sigma(x) = \int p_{\mathrm{data}}(y)\, \Phi_\sigma(x - y)\, dy,$$
where $\Phi_\sigma(\cdot)$ is the density of $N(0, \sigma^2 I)$ and the variance $\sigma^2$ of the Gaussian noise controls the smoothness of $q_\sigma$. We then learn $p_{\mathrm{data}}$ starting from $q_\sigma$. At the population level, this idea can be realized via Schrödinger Bridges from the stochastic control point of view (Theorem 2). More precisely, we have the following theorem.

**Theorem 3.** Define the density ratio $f(x) = q_\sigma(x)/\Phi_{\sqrt{\tau}}(x)$. Then for the SDE
$$dx_t = \tau \nabla \log \mathbb{E}_{z \sim \Phi_{\sqrt{\tau}}}\big[f(x_t + \sqrt{1-t}\, z)\big]\, dt + \sqrt{\tau}\, dw_t \tag{3}$$
with initial condition $x_0 = 0$, we have $x_1 \sim q_\sigma(x)$. Moreover, for the SDE
$$dx_t = \sigma^2 \nabla \log q_{\sqrt{1-t}\,\sigma}(x_t)\, dt + \sigma\, dw_t \tag{4}$$
with initial condition $x_0 \sim q_\sigma(x)$, we have $x_1 \sim p_{\mathrm{data}}(x)$.

According to Theorem 3, at the population level, the target $p_{\mathrm{data}}$ can be learned from the Dirac mass supported at 0 through the two SDEs (3) and (4) over the unit time interval $[0, 1]$. The main feature of the SDEs (3) and (4) is that both drift terms are time-varying, in contrast to classical Langevin SDEs with time-invariant drift terms (Song & Ermon, 2019; 2020). The benefit of the time-varying drift terms is that the dynamics in (3) and (4) push the initial distributions to the target distributions within a unit time interval, whereas the classical Langevin SDE needs the time to go to infinity.

### 3.1. Estimation of the Drift Terms

Based on Theorem 3, we can run the Euler-Maruyama method to solve the SDEs (3) and (4) and obtain particles approximately drawn from the targets (Higham, 2001). However, the drift terms in Theorem 3 depend on the underlying target. To make the Euler-Maruyama method practical, we need to estimate the two drift terms in (3) and (4). For Eq. (3), some calculation shows that
$$\nabla \log \mathbb{E}_{z \sim \Phi_{\sqrt{\tau}}}\big[f(x + \sqrt{1-t}\, z)\big] = \frac{\mathbb{E}_{z \sim \Phi_{\sqrt{\tau}}}\big[f(x + \sqrt{1-t}\, z)\, \nabla \log f(x + \sqrt{1-t}\, z)\big]}{\mathbb{E}_{z \sim \Phi_{\sqrt{\tau}}}\big[f(x + \sqrt{1-t}\, z)\big]}, \tag{5}$$
and $\nabla \log f(x) = \nabla \log q_\sigma(x) + x/\tau$. Let $\hat{f}$ and $\widehat{\nabla \log q_\sigma}$ be estimators of the density ratio $f$ and of the score of $q_\sigma(x)$, respectively. After plugging them into (5), we obtain an estimator of the drift term in (3) by computing the expectations with a Monte Carlo approximation.
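To make Eq. (5) concrete, the sketch below estimates the stage-1 drift of SDE (3) by Monte Carlo. It is a minimal illustration only: `score_fn` and `ratio_fn` stand for the estimators $\widehat{\nabla \log q_\sigma}$ and $\hat{f}$ (deep networks in the paper, dummy callables here), and, unlike Algorithm 1 in Section 3.2, the same noise samples are reused for the numerator and the denominator.

```python
# A minimal sketch of the Monte Carlo drift estimator for SDE (3), Eq. (5).
# score_fn(x, sigma) ~ estimated score of q_sigma; ratio_fn(x) ~ estimated
# density ratio f = q_sigma / Phi_sqrt(tau). Both are placeholders here.
import numpy as np

def stage1_drift(x, t, score_fn, ratio_fn, tau, sigma, n_mc=32):
    """Estimate tau * grad_x log E_{z~N(0, tau I)}[ f(x + sqrt(1-t) z) ] for a batch x."""
    n, d = x.shape
    z = np.sqrt(tau) * np.random.randn(n_mc, n, d)        # z ~ N(0, tau I)
    x_tilde = x[None] + np.sqrt(1.0 - t) * z              # perturbed particles
    w = ratio_fn(x_tilde.reshape(-1, d)).reshape(n_mc, n, 1)          # f(x_tilde)
    grad_log_f = (score_fn(x_tilde.reshape(-1, d), sigma).reshape(n_mc, n, d)
                  + x_tilde / tau)                        # grad log f = score + x/tau
    num = (w * grad_log_f).mean(axis=0)                   # E[f * grad log f]
    den = w.mean(axis=0)                                  # E[f]
    return tau * num / den

# Toy usage with dummy estimators (shape-checking only, not trained models):
drift = stage1_drift(np.zeros((4, 2)), t=0.3,
                     score_fn=lambda x, s: -x, ratio_fn=lambda x: np.ones(len(x)),
                     tau=1.0, sigma=1.0)
```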
We now obtain the estimator $\hat{f}$ of the density ratio by minimizing the logistic regression loss
$$\mathcal{L}_{\mathrm{logistic}}(r) = \mathbb{E}_{q_\sigma(x)} \log\big(1 + \exp(-r(x))\big) + \mathbb{E}_{\Phi_{\sqrt{\tau}}(x)} \log\big(1 + \exp(r(x))\big).$$
Setting the first variation to zero shows that the optimal solution is
$$r^*(x) = \log \frac{q_\sigma(x)}{\Phi_{\sqrt{\tau}}(x)}.$$
Therefore, given samples $\tilde{x}_1, \ldots, \tilde{x}_n$ from $q_\sigma(x)$, which can be obtained by adding Gaussian noise drawn from $\Phi_\sigma$ to $x_1, \ldots, x_n \sim p_{\mathrm{data}}$, and samples $z_1, \ldots, z_n$ from $\Phi_{\sqrt{\tau}}$, we estimate the density ratio $f(x)$ by
$$\hat{f}(x) = \exp(\hat{r}_\phi(x)), \tag{6}$$
where $\hat{r}_\phi \in \mathcal{NN}_\phi$ is the neural network that minimizes the empirical loss
$$\hat{r}_\phi \in \arg\min_{r_\phi \in \mathcal{NN}_\phi} \frac{1}{n} \sum_{i=1}^n \Big[\log\big(1 + \exp(-r_\phi(\tilde{x}_i))\big) + \log\big(1 + \exp(r_\phi(z_i))\big)\Big]. \tag{7}$$

Next, we estimate the time-varying drift term in (4), i.e., $\nabla \log q_{\sqrt{1-t}\,\sigma}(x)$ for $t \in [0, 1]$. To do so, we build a deep network as the score estimator of $\nabla \log q_{\tilde{\sigma}}(x)$ with $\tilde{\sigma}$ varying in $[0, \sigma]$. Vincent (2011) showed that explicitly matching the score by minimizing the objective
$$\frac{1}{2}\, \mathbb{E}_{q_{\tilde{\sigma}}(x)} \big\|s_\theta(x, \tilde{\sigma}) - \nabla_x \log q_{\tilde{\sigma}}(x)\big\|^2$$
is equivalent to minimizing the denoising score matching objective
$$\frac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{N(\tilde{x}; x, \tilde{\sigma}^2 I)} \big\|s_\theta(\tilde{x}, \tilde{\sigma}) - \nabla_{\tilde{x}} \log q_{\tilde{\sigma}}(\tilde{x} \mid x)\big\|^2 = \frac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{N(\tilde{x}; x, \tilde{\sigma}^2 I)} \Big\|s_\theta(\tilde{x}, \tilde{\sigma}) + \frac{\tilde{x} - x}{\tilde{\sigma}^2}\Big\|^2.$$
Thus, following Song & Ermon (2019; 2020), we build the score estimator as
$$\hat{s}_\theta(\cdot, \cdot) \in \arg\min_{s_\theta \in \mathcal{NN}_\theta} \mathcal{L}(\theta), \tag{8}$$
$$\mathcal{L}(\theta) = \frac{1}{m} \sum_{j=1}^m \lambda(\tilde{\sigma}_j)\, \ell_{\tilde{\sigma}_j}(\theta), \qquad \ell_{\tilde{\sigma}}(\theta) = \frac{1}{2n} \sum_{i=1}^n \Big\|s_\theta(x_i + z_i, \tilde{\sigma}) + \frac{z_i}{\tilde{\sigma}^2}\Big\|^2, \tag{9}$$
where the variance levels $\tilde{\sigma}_j^2$, $j = 1, \ldots, m$, are i.i.d. samples from $\mathrm{Uniform}[0, \sigma^2]$ with sample size $m$, $\lambda(\tilde{\sigma}) = \tilde{\sigma}^2$ is a nonnegative scaling factor that keeps all the summands in (9) on the same scale, and $z_i$, $i = 1, \ldots, n$, are i.i.d. from $\Phi_{\tilde{\sigma}}$.

Finally, we establish the consistency of the deep density ratio estimator $\hat{f}(x) = \exp(\hat{r}_\phi(x))$ and the deep score estimator $\widehat{\nabla \log q_{\tilde{\sigma}}}(x) = \hat{s}_\theta(x, \tilde{\sigma})$ in Theorem 4 and Theorem 5, respectively.

**Theorem 4.** Assume that the support of $p_{\mathrm{data}}(x)$ is contained in a compact set and that $f(x)$ is Lipschitz continuous and bounded. Set the depth $\mathcal{D}$, width $\mathcal{W}$, and size $\mathcal{S}$ of $\mathcal{NN}_\phi$ as $\mathcal{D} = O(\log n)$, $\mathcal{W} = O\big(n^{\frac{d}{2(2+d)}}/\log n\big)$, $\mathcal{S} = O\big(n^{\frac{d-2}{d+2}} (\log n)^3\big)$. Then $\mathbb{E}\big[\|\hat{f}(x) - f(x)\|_{L^2(p_{\mathrm{data}})}\big] \to 0$ as $n \to \infty$.

**Theorem 5.** Assume that $p_{\mathrm{data}}(x)$ is differentiable with bounded support and that $\nabla \log q_{\tilde{\sigma}}(x)$ is Lipschitz continuous and bounded for $(\tilde{\sigma}, x) \in [0, \sigma] \times \mathbb{R}^d$. Set the depth $\mathcal{D}$, width $\mathcal{W}$, and size $\mathcal{S}$ of $\mathcal{NN}_\theta$ as $\mathcal{D} = O(\log n)$, $\mathcal{W} = O\big(\max\{n^{\frac{d}{2(2+d)}}/\log n,\ d\}\big)$, $\mathcal{S} = O\big(d\, n^{\frac{d-2}{d+2}} (\log n)^3\big)$. Then $\mathbb{E}\big[\|\widehat{\nabla \log q_{\tilde{\sigma}}}(x) - \nabla \log q_{\tilde{\sigma}}(x)\|^2_{L^2(q_{\tilde{\sigma}})}\big] \to 0$ as $m, n \to \infty$.

### 3.2. Schrödinger Bridge Algorithm

With the two estimators $\hat{f}$ and $\widehat{\nabla \log q_{\tilde{\sigma}}}$, we can use the Euler-Maruyama method to approximate numerical solutions of the SDEs (3) and (4). Let $N_1$ and $N_2$ be the numbers of uniform grid points on the time interval $[0, 1]$. In stage 1, we start from 0 and run Euler-Maruyama for (3) with the estimated $\hat{f}$ and $\widehat{\nabla \log q_\sigma}$ in the drift term to obtain samples that approximately follow $q_\sigma$. In stage 2, we start from the samples following $q_\sigma$ and run another Euler-Maruyama scheme for (4) with the estimated time-varying drift term $\widehat{\nabla \log q_{\tilde{\sigma}}}$. We summarize our two-stage Schrödinger Bridge algorithm in Algorithm 1.

**Algorithm 1** Sampling
Input: $\hat{f}(\cdot)$, $\hat{s}_\theta(\cdot, \cdot)$, $\tau$, $\sigma$, $N_1$, $N_2$, $N_3$.
Initialize particles as $x_0 = 0$.
Stage 1: for $k = 0$ to $N_1 - 1$:
  sample $\{z_i\}_{i=1}^{2N_3} \sim N(0, I)$ and $\epsilon_k \sim N(0, I)$;
  set $\tilde{x}_i = x_k + \sqrt{(1 - k/N_1)\,\tau}\, z_i$, $i = 1, \ldots, 2N_3$;
  $b(x_k) = \dfrac{\sum_{i=1}^{N_3} \hat{f}(\tilde{x}_i)\big[\hat{s}_\theta(\tilde{x}_i, \sigma) + \sqrt{(1 - k/N_1)/\tau}\, z_i\big]}{\sum_{i=N_3+1}^{2N_3} \hat{f}(\tilde{x}_i)} + \dfrac{x_k}{\tau}$;
  $x_{k+1} = x_k + \dfrac{\tau}{N_1}\, b(x_k) + \sqrt{\dfrac{\tau}{N_1}}\, \epsilon_k$.
Set $x_0 = x_{N_1}$.
Stage 2: for $k = 0$ to $N_2 - 1$:
  sample $\epsilon_k \sim N(0, I)$;
  $b(x_k) = \hat{s}_\theta\big(x_k, \sqrt{1 - k/N_2}\, \sigma\big)$;
  $x_{k+1} = x_k + \dfrac{\sigma^2}{N_2}\, b(x_k) + \dfrac{\sigma}{\sqrt{N_2}}\, \epsilon_k$.
Return $x_{N_2}$.
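Algorithm 1 assumes the two estimators have already been fitted. The following sketch shows one training step for each of the empirical objectives (7) and (9); the tiny MLPs, the concatenation-based noise conditioning, the optimizer, and the batch construction are placeholder choices of ours, not the architectures or settings used in the paper.

```python
# Illustrative training steps for the density-ratio loss (7) and the
# denoising score matching loss (9); network sizes and hyperparameters
# are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, sigma, tau = 2, 1.0, 1.0
ratio_net = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))      # r_phi
score_net = nn.Sequential(nn.Linear(d + 1, 128), nn.ReLU(), nn.Linear(128, d))  # s_theta(x, sigma~)

def density_ratio_loss(x_data):
    # Eq. (7): logistic regression between q_sigma (data + N(0, sigma^2 I)) and N(0, tau I).
    x_q = x_data + sigma * torch.randn_like(x_data)
    x_ref = torch.sqrt(torch.tensor(tau)) * torch.randn_like(x_data)
    return F.softplus(-ratio_net(x_q)).mean() + F.softplus(ratio_net(x_ref)).mean()

def score_matching_loss(x_data):
    # Eq. (9): lambda(sigma~) * (1/2) ||s_theta(x + z, sigma~) + z / sigma~^2||^2
    # with sigma~^2 ~ Uniform(0, sigma^2]. Written in the algebraically equivalent,
    # numerically stable form (1/2) ||sigma~ * s_theta + eps||^2, where z = sigma~ * eps.
    n = x_data.shape[0]
    sigma_t = torch.sqrt(torch.rand(n, 1) * sigma ** 2 + 1e-5)   # small offset avoids sigma~ = 0
    eps = torch.randn_like(x_data)
    x_tilde = x_data + sigma_t * eps
    s = score_net(torch.cat([x_tilde, sigma_t], dim=1))
    return 0.5 * ((sigma_t * s + eps) ** 2).sum(dim=1).mean()

opt = torch.optim.Adam(list(ratio_net.parameters()) + list(score_net.parameters()), lr=1e-3)
x_batch = torch.randn(256, d)          # stand-in for a minibatch of training data
loss = density_ratio_loss(x_batch) + score_matching_loss(x_batch)
opt.zero_grad(); loss.backward(); opt.step()
```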
Interestingly, the second stage of our Schrödinger Bridge algorithm (Algorithm 1) recovers the reverse-time Variance Exploding (VE) SDE algorithm proposed by Song et al. (2021) if their annealing scheme is chosen to be linear, i.e., $\sigma^2(t) = \sigma^2 t$. From this point of view, our Schrödinger Bridge algorithm also provides a deeper understanding of annealed score-based sampling: the reverse-time VE SDE algorithm (with a proper annealing scheme) proposed by Song et al. (2021) is equivalent to the Schrödinger Bridge SDE (4).

### 3.3. Consistency of the Schrödinger Bridge Algorithm

Let
$$D_1(t, x) = \nabla \log \mathbb{E}_{z \sim \Phi_{\sqrt{\tau}}}\big[f(x + \sqrt{1-t}\, z)\big], \qquad D_2(t, x) = \nabla \log q_{\sqrt{1-t}\,\sigma}(x)$$
be the drift terms, and denote
$$h_{\sigma,\tau}(x_1, x_2) = \exp\Big(\frac{\|x_1\|^2}{2\tau}\Big)\, p_{\mathrm{data}}(x_1 + \sigma x_2).$$
We now establish the consistency of our Schrödinger Bridge algorithm, which drives a simple distribution to the target one. To this end, we need the following assumptions.

**Assumption 1.** $\mathrm{supp}(p_{\mathrm{data}})$ is contained in a ball with radius $R$, and $p_{\mathrm{data}} > c > 0$ on its support.

**Assumption 2.** $\|D_i(t, x)\|^2 \le C_1(1 + \|x\|^2)$ for all $x \in \mathrm{supp}(p_{\mathrm{data}})$ and $t \in [0, 1]$, where $C_1 \in \mathbb{R}$ is a constant.

**Assumption 3.** $\|D_i(t_1, x_1) - D_i(t_2, x_2)\| \le C_2\big(\|x_1 - x_2\| + |t_1 - t_2|^{1/2}\big)$ for all $x_1, x_2 \in \mathrm{supp}(p_{\mathrm{data}})$ and $t_1, t_2 \in [0, 1]$, where $C_2 \in \mathbb{R}$ is another constant.

**Assumption 4.** $h_{\sigma,\tau}(x_1, x_2)$, $\nabla_{x_1} h_{\sigma,\tau}(x_1, x_2)$, $p_{\mathrm{data}}$, and $\nabla p_{\mathrm{data}}$ are $L$-Lipschitz functions.

**Theorem 6.** Under Assumptions 1-4,
$$\mathbb{E}\big[W_2(\mathrm{Law}(x_{N_2}),\, p_{\mathrm{data}})\big] \to 0 \quad \text{as } n, N_1, N_2, N_3 \to \infty,$$
where $W_2$ is the 2-Wasserstein distance between two distributions.

The consistency of the proposed Schrödinger Bridge algorithm thus rests on mild assumptions (such as smoothness and boundedness), without the restrictive technical requirements that the target distribution be log-concave or fulfill the log-Sobolev inequality (Gao et al., 2021; Arbel et al., 2019; Liutkus et al., 2019; Block et al., 2020).

## 4. Related Work

We discuss connections and differences between our Schrödinger Bridge approach and existing related works. Most existing generative models, such as VAEs, GANs, and flow-based methods, parameterize a transformation with a neural network $G$ that minimizes an integral probability metric; they are clearly quite different from our proposal. Recently, particle methods derived from the perspective of gradient flows in measure spaces or of SDEs have been studied (Johnson & Zhang, 2018; Gao et al., 2019; Arbel et al., 2019; Song & Ermon, 2019; 2020; Song et al., 2021). Here we clarify the main differences between our Schrödinger Bridge approach and these particle methods. The proposals in (Johnson & Zhang, 2018; Gao et al., 2019; Arbel et al., 2019) are derived from a surrogate of the geodesic interpolation (Gao et al., 2021; Liutkus et al., 2019; Song & Ermon, 2019). They utilize the invariant measure of SDEs to model the generative task, resulting in an iteration scheme that looks similar to our Schrödinger Bridge.
However, the main difference is that the drift terms of the Langevin SDEs in (Song & Ermon, 2019; 2020; Block et al., 2020) are time-invariant, in contrast to the time-varying drift term in our formulation. As shown in Theorem 3, the benefit of the time-varying drift term is essential: the Schrödinger Bridge SDE, run on the unit time interval $[0, 1]$, recovers the target distribution at the terminal time, whereas the evolution measures of the above-mentioned methods (Gao et al., 2019; Arbel et al., 2019; Song & Ermon, 2019; 2020; Block et al., 2020; Gao et al., 2021) only converge to the target as the time goes to infinity. Hence, technical requirements such as log-concavity or the log-Sobolev inequality are imposed on the target distribution to guarantee the consistency of the Euler-Maruyama discretization; these assumptions are often too strong to hold in real data analysis.

We propose a two-stage approach to make the Schrödinger Bridge formulation work in practice: we drive the Dirac distribution to a smoothed version of the underlying distribution $p_{\mathrm{data}}$ in stage 1 and then learn $p_{\mathrm{data}}$ from the smoothed version in stage 2. Interestingly, the second stage of the proposed Schrödinger Bridge algorithm recovers the reverse-time Variance Exploding SDE (VE SDE) algorithm (Song et al., 2021) when their annealing scheme is linear, i.e., $\sigma^2(t) = \sigma^2 t$. Therefore, the analysis developed here also provides a theoretical justification of why the reverse-time VE SDE algorithm works well. Their actual setting, however, is $\sigma^2(t) = (\sigma^2_{\max})^t (\sigma^2_{\min})^{1-t}$, which implies that the end-time distribution of the reverse-time VE SDE is still a smoothed one (with noise level $\sigma_{\min}$), creating a barrier to establishing consistency. Another fundamental difference between our approach and the reverse-time VE SDE is that the reverse-time VE SDE also needs a smoothed data distribution as its input in theory, but in practice it only approximately uses large Gaussian noise to initialize the denoising process. Stage 1 ensures that our algorithm obtains samples from the smoothed data distribution in unit time, which is necessary for model consistency.

## 5. Experiments

In this section, we first use two-dimensional toy examples to show the ability of our algorithm to learn multimodal distributions that may not satisfy the log-Sobolev inequality. Next, we show that our algorithm is able to generate realistic image samples. We also demonstrate the effectiveness of our approach through image interpolation and image inpainting.

We use two benchmark datasets, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). For CelebA, the images are center-cropped and resized to 64×64. Both datasets are normalized by first rescaling the pixel values to $[0, 1]$ and then subtracting a mean vector $\bar{x}$ estimated from 50,000 samples, so that the data distributions are centered at the origin. In our algorithm, the particles start from $\delta_0$, so it is helpful to align the sample mean with the origin; after generation, we add the image mean $\bar{x}$ back to the generated samples. More details on the hyperparameter settings and network architectures, as well as some additional experiments, are provided in the supplementary material. The code for reproducing all our experiments is available at https://github.com/YangLabHKUST/DGLSB.

For the noise level $\sigma$, we set $\sigma = 1.0$ for the generative tasks on both the 2D examples and CIFAR-10. In fact, the performance of our algorithm is insensitive to the choice of $\sigma$ as long as $\sigma$ lies in a reasonable range (results with other values of $\sigma$ are shown in the supplementary material). We find that setting $\sigma = 1.0$ is often among the best choices for 32×32 images: a very small $\sigma$ cannot make $q_\sigma$ smooth enough and harms the performance of stage 1, while a very large $\sigma$ makes it harder for stage 2 to anneal the noise level down. For larger images such as CelebA, as the dimensionality of the samples is higher, we increase the noise level to $\sigma = 2.0$. We also compare results obtained by varying the variance $\tau$ of the Wiener measure for image generation.
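To make the preprocessing described above concrete, the sketch below rescales pixels to $[0, 1]$ and subtracts a mean image estimated from 50,000 training samples, so that particles can start at the origin; the torchvision loading code and the clamping when adding the mean back are our own assumptions for illustration.

```python
# Illustrative preprocessing: rescale pixels to [0, 1] and subtract a mean
# image estimated from 50,000 training samples. Dataset and transform details
# are assumptions for illustration, not the paper's exact pipeline.
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

to_unit = transforms.ToTensor()                      # pixels in [0, 1]
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_unit)
loader = DataLoader(train_set, batch_size=500, shuffle=False)

mean_img = torch.zeros(3, 32, 32)
n_seen = 0
for x, _ in loader:
    if n_seen >= 50000:
        break
    mean_img += x.sum(dim=0)
    n_seen += x.shape[0]
mean_img /= n_seen                                   # the mean vector \bar{x}

def center(x):      # applied to training images so the data are centered at the origin
    return x - mean_img

def uncenter(x):    # add the mean back to generated samples (clamping is our choice)
    return (x + mean_img).clamp(0.0, 1.0)
```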
The numbers of grid points are chosen as $N_1 = N_2 = 1{,}000$ for stage 1 and stage 2. We use sample size $N_3 = 1$ to estimate the drift term in stage 1 for both the 2D toy examples and real images; in general, we find that a larger sample size $N_3$ does not significantly improve sample quality.

### 5.2. Learning 2D Multimodal Distributions

We demonstrate that our algorithm can effectively learn multimodal distributions. The distribution we adopt is a mixture of Gaussians with 6 components. Each component has a mean at distance 5.0 from the origin and a variance of 0.01, as shown in Fig. 2(a), so the components are relatively far away from each other. It is a very challenging task for GANs to learn this multimodal distribution, and the distribution may not satisfy the log-Sobolev inequality. Fig. 2(b) shows the failure of the vanilla GAN, where several modes are missed. In contrast, Figs. 2(c) and 2(d) show that our algorithm stably generates samples from the multimodal distribution without ignoring any of the modes. In Fig. 3, we compare the ground-truth velocity fields induced by the drift terms $D_1(t, x)$ and $D_2(t, x)$ with the estimated velocity fields at the end of each stage. Our estimated drift terms are close to the ground truth except in regions with nearly zero probability density.

*Figure 2. KDE plots for the mixture of Gaussians with 5,000 samples. (a) Ground truth. (b) Distribution learned by the vanilla GAN. (c) Distribution learned by the proposed method after stage 1 (τ = 5.0). (d) Distribution learned by the proposed method after stage 2.*

*Figure 3. Velocity fields. (a), (b) Ground-truth velocity fields at the end of stages 1 and 2. (c), (d) Estimated velocity fields at the end of stages 1 and 2.*

### 5.3. Effectiveness of the Two Stages for Image Generation

Fig. 4 shows the particle evolution of our algorithm on CIFAR-10, where the two stages are annotated with their corresponding colors. It shows that our two-stage approach provides a valid path for the particles to move from the origin to the target distribution. A natural question is: what are the respective roles of stage 1 and stage 2 in the generative modeling? In this subsection, we design experiments to answer this question.

*Figure 4. Particle evolution on CIFAR-10. The column in the center shows the particles obtained after stage 1.*

We first evaluate the role of stage 1. For this purpose, we skip stage 1 and simply run stage 2 with non-informative Gaussian noise as the initial condition. Fig. 5 shows that the approach using only stage 2 generates worse image samples than the proposed two-stage approach. These results indicate that the role of stage 1 is to provide a better initial reference for stage 2. The role of stage 2 is easier to check: it is a Schrödinger Bridge from $q_\sigma(x)$ to the target distribution $p_{\mathrm{data}}(x)$. In Fig. 6, we perturb real images with Gaussian noise of variance $\sigma^2 = 1.0$; our stage 2 anneals the noise level to zero and drives the particles to the data distribution. Moreover, Fig. 6 also indicates that stage 2 not only recovers the original images but also generates images with some degree of diversity.

*Figure 5. Comparison of random image samples. (a) Samples produced by our algorithm with τ = 2.0 (FID = 12.32). (b), (c), (d) Samples produced by stage 2 taking Gaussian noise with variance 1.0 (FID = 32.60), 1.5 (FID = 24.76), and 2.0 (FID = 51.21) as input, respectively.*

*Figure 6. Denoising with stage 2 for perturbed real images.*
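Both experiments amount to handing stage 2 an alternative initial condition: pure Gaussian noise for the ablation in Fig. 5, or noisy real images for the denoising check in Fig. 6. The sketch below isolates the stage-2 Euler-Maruyama loop of Algorithm 1; `score_fn` stands for the trained estimator $\hat{s}_\theta$, and the dummy score in the usage lines is only for shape checking, not a trained model.

```python
# Stage 2 of Algorithm 1 as a stand-alone routine (illustrative sketch):
# anneal the noise level from sigma down to 0 over N2 Euler-Maruyama steps.
import numpy as np

def stage2(x0, score_fn, sigma=1.0, N2=1000):
    x = x0.copy()
    for k in range(N2):
        sigma_k = np.sqrt(1.0 - k / N2) * sigma       # current noise level
        b = score_fn(x, sigma_k)                      # estimated score of q_{sigma_k}
        eps = np.random.randn(*x.shape)
        x = x + (sigma ** 2 / N2) * b + (sigma / np.sqrt(N2)) * eps
    return x

# Fig. 5-style ablation: start from pure Gaussian noise instead of stage-1 output.
x0 = 1.0 * np.random.randn(16, 3 * 32 * 32)
# Fig. 6-style denoising: start from real (centered) images perturbed by N(0, sigma^2 I), e.g.
# x0 = images + 1.0 * np.random.randn(*images.shape)
samples = stage2(x0, score_fn=lambda x, s: -x, sigma=1.0, N2=100)  # dummy score
```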
### 5.4. Results

In this subsection, we evaluate the proposed approach on benchmark datasets. Fig. 7 presents samples generated by our algorithm on CIFAR-10 and CelebA. Visually, our algorithm produces high-fidelity image samples that are competitive with real images. For quantitative evaluation, we use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to compare our method with benchmark methods.

We first compare the FID and IS on CIFAR-10 with $\tau$ increasing from 1.0 to 4.0, using 50,000 generated samples. Note that $\tau$ is the variance of the prior Wiener measure in stage 1, so it controls the behavior of the particle evolution from $\delta_0$ to $q_\sigma$ and has an impact on the numerical results. To make the prior reasonable, we let $\tau_{\min} = \sigma^2 = 1.0$. The reason is that if the particles strictly followed the prior law of the Brownian diffusion with variance $\tau$ in stage 1, the end-time marginal would be $N(0, \tau I)$; a good choice of the prior should make $N(0, \tau I)$ close to the end-time marginal $q_\sigma$ that we are interested in. As shown in Table 1, our algorithm achieves the best performance at $\tau = 2.0$. The results also indicate that our algorithm is stable with respect to the variance $\tau$ of the prior Wiener measure when $\tau \ge 2.0$. In general, reasonable choices of $\tau$ lead to relatively good generation performance.

*Table 1. FID and Inception Score on CIFAR-10 for τ ∈ [1, 4].*

| τ | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 |
|---|-----|-----|-----|-----|-----|-----|-----|
| FID | 37.20 | 20.49 | 12.32 | 12.90 | 13.97 | 14.49 | 14.67 |
| IS | 6.52 | 7.65 | 8.14 | 7.99 | 7.98 | 8.03 | 8.10 |

*Table 2. FID and Inception Score on CIFAR-10.*

| Model | FID | IS |
|-------|-----|-----|
| WGAN-GP | 36.4 | 7.86 ± 0.07 |
| SN-SMMDGAN | 25.0 | 7.3 ± 0.1 |
| SNGAN | 21.7 | 8.22 ± 0.05 |
| NCSN | 25.32 | 8.87 ± 0.12 |
| NCSNv2 | 10.87 | 8.40 ± 0.07 |
| Ours | 12.32 | 8.14 ± 0.07 |

Table 2 presents the FID and IS of our algorithm evaluated with 50,000 samples, together with other state-of-the-art generative models, including WGAN-GP (Gulrajani et al., 2017), SN-SMMDGAN (Arbel et al., 2018), SNGAN (Miyato et al., 2018), NCSN (Song & Ermon, 2019), and NCSNv2 (Song & Ermon, 2020), on CIFAR-10. Our algorithm attains an FID of 12.32 and an Inception Score of 8.14, which are competitive with the baseline methods. These quantitative results demonstrate the effectiveness of our algorithm.

*Figure 7. Random samples on CIFAR-10 (σ = 1.0, τ = 2.0) and CelebA (σ = 2.0, τ = 8.0).*

### 5.5. Image Interpolation and Inpainting with Stage 2

To further demonstrate the usefulness of the proposed algorithm, we consider image interpolation and inpainting tasks. Interpolating images linearly in the data distribution $p_{\mathrm{data}}$ would induce artifacts. However, if we perturb the linear interpolation with Gaussian noise of variance $\sigma^2$ and then use our stage 2 to denoise, we obtain an interpolation without such artifacts. We find $\sigma^2 = 0.4$ suitable for the image interpolation task on CelebA. Fig. 8 shows the image interpolation results: our algorithm produces smooth interpolations by gradually changing facial attributes.

*Figure 8. Image interpolation on CelebA. The first and last columns correspond to real images.*

The second stage can also be utilized for image inpainting with a small modification, inspired by the image inpainting algorithm with annealed Langevin dynamics in (Song & Ermon, 2019). Let $m$ be a mask with entries in $\{0, 1\}$, where 0 corresponds to missing pixels. The idea for inpainting is very similar to that of interpolation: we treat $x \odot m + \sigma\epsilon$ as a sample from $q_\sigma$, where $\epsilon \sim N(0, I)$.
Thus, we can use stage 2 to obtain samples from $p_{\mathrm{data}}$. The image inpainting procedure is given in Algorithm 2, and the results are presented in Fig. 9. Notice that we perturb $y$ with $\sqrt{1 - (k+1)/N_2}\, \sigma z$ at the end of each iteration. This is because the $k$-th iteration in stage 2 can be regarded as a one-step Schrödinger Bridge from $q_{\sqrt{1 - k/N_2}\,\sigma}$ to $q_{\sqrt{1 - (k+1)/N_2}\,\sigma}$; thus, the particles are supposed to follow $q_{\sqrt{1 - (k+1)/N_2}\,\sigma}(x)$ after the $k$-th iteration.

*Figure 9. Image inpainting on CelebA. The leftmost column contains real images. Each occluded image is followed by three inpainting samples.*

**Algorithm 2** Inpainting with stage 2
Input: $y = x \odot m$, $m$.
Sample $z \sim N(0, I)$ and set $x_0 = y + \sigma z$.
For $k = 0$ to $N_2 - 1$:
  sample $\epsilon_k \sim N(0, I)$;
  $b(x_k) = \hat{s}_\theta\big(x_k, \sqrt{1 - k/N_2}\, \sigma\big)$;
  $x_{k+1} = x_k + \dfrac{\sigma^2}{N_2}\, b(x_k) + \dfrac{\sigma}{\sqrt{N_2}}\, \epsilon_k$;
  $x_{k+1} = x_{k+1} \odot (1 - m) + \big(y + \sqrt{1 - (k+1)/N_2}\, \sigma z\big) \odot m$.
Return $x_{N_2}$.

## 6. Conclusion

We propose to learn a generative model via entropy interpolation with a Schrödinger Bridge. At the population level, this entropy interpolation can be characterized via an SDE on $[0, 1]$ with a time-varying drift term. We derive a two-stage Schrödinger Bridge algorithm by plugging the drift term, estimated with a deep score estimator and a deep density ratio estimator, into the Euler-Maruyama method. Under some smoothness assumptions on the target distribution, we prove the consistency of the proposed Schrödinger Bridge approach, guaranteeing that the learned distribution converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and demonstrate that the generative model via Schrödinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning.

## 7. Acknowledgements

We thank the reviewers for their valuable comments. This work is supported in part by the National Key Research and Development Program of China [grant 208AAA0101100], the National Science Foundation of China [grant 11871474], the research fund of KLATASDSMOE, the Hong Kong Research Grants Council [grants 16307818, 16301419, 16308120], the Guangdong-Hong Kong-Macao Joint Laboratory [grant 2020B1212030001], the Hong Kong Innovation and Technology Fund [PRP/029/19FX], the Hong Kong University of Science and Technology (HKUST) [startup grant R9405 and grant Z0428 from the Big Data Institute], and the HKUST-WeBank Joint Lab project. The computational task for this work was partially performed using the X-GPU cluster supported by the RGC Collaborative Research Fund C6021-19EF.

## References

Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In Advances in Neural Information Processing Systems, pp. 6701-6711, 2018.
Arbel, M., Korba, A., Salim, A., and Gretton, A. Maximum mean discrepancy gradient flow. In Advances in Neural Information Processing Systems, pp. 6481-6491, 2019.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223, 2017.
Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In International Conference on Machine Learning, pp. 573-582, 2019.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.
Block, A., Mroueh, Y., and Rakhlin, A. Generative modeling with denoising auto-encoders and Langevin sampling. arXiv preprint arXiv:2002.00107, 2020.
Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. In International Conference on Learning Representations, 2017.
Chen, Y., Georgiou, T. T., and Pavon, M. Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge. arXiv preprint arXiv:2005.10963, 2020.
Choi, Y., Uh, Y., Yoo, J., and Ha, J.-W. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8188-8197, 2020.
Dai Pra, P. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23(1):313-329, 1991.
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In International Conference on Learning Representations, 2015.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.
Gao, Y., Jiao, Y., Wang, Y., Wang, Y., Yang, C., and Zhang, S. Deep generative learning via variational gradient flow. In International Conference on Machine Learning, pp. 2093-2101, 2019.
Gao, Y., Huang, J., Jiao, Y., Liu, J., Lu, X., and Yang, Z. Generative learning with Euler particle transport. In Annual Conference on Mathematical and Scientific Machine Learning, volume 145, pp. 1-33, 2021.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769-5779, 2017.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6629-6640, 2017.
Higham, D. J. An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review, 43(3):525-546, 2001.
Jamison, B. The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32(4):323-331, 1975.
Johnson, R. and Zhang, T. Composite functional gradient learning of generative adversarial models. In International Conference on Machine Learning, pp. 2371-2379, 2018.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236-10245, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
Léonard, C. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34(4):1533-1574, 2014.
Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision, pp. 3730-3738, 2015.
Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., Stöter, F.-R., Chaudhuri, K., and Salakhutdinov, R. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pp. 4104-4113, 2019.
Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In ICLR Workshop, 2016.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.
Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2335-2344, 2017.
Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3617-3621, 2019.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.
Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14837-14847, 2019.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2226-2234, 2016.
Schrödinger, E. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. In Annales de l'institut Henri Poincaré, volume 2, pp. 269-310, 1932.
Shen, Y., Gu, J., Tang, X., and Zhou, B. Interpreting the latent space of GANs for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9243-9252, 2020.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895-11907, 2019.
Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, 2020.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pp. 125, 2016.
Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354-7363, 2019.
Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597-613, 2016.
Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232, 2017.