Restart Sampling for Improving Generative Processes

Yilun Xu (MIT, ylxu@mit.edu), Mingyang Deng (MIT, dengm@mit.edu), Xiang Cheng (MIT, chengx@mit.edu), Yonglong Tian (Google Research, yonglong@google.com), Ziming Liu (MIT, zmliu@mit.edu), Tommi Jaakkola (MIT, tommi@csail.mit.edu)

Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Generative processes that involve solving differential equations, such as diffusion models, frequently necessitate balancing speed and quality. ODE-based samplers are fast but plateau in performance, while SDE-based samplers deliver higher sample quality at the cost of increased sampling time. We attribute this difference to sampling errors: ODE-samplers involve smaller discretization errors, while stochasticity in SDE contracts accumulated errors. Based on these findings, we propose a novel sampling algorithm called Restart in order to better balance discretization errors and contraction. The sampling method alternates between adding substantial noise in additional forward steps and strictly following a backward ODE. Empirically, the Restart sampler surpasses previous SDE and ODE samplers in both speed and accuracy. Restart not only outperforms the previous best SDE results, but also accelerates sampling by 10-fold / 2-fold on CIFAR-10 / ImageNet 64×64. In addition, it attains significantly better sample quality than ODE samplers within comparable sampling times. Moreover, Restart better balances text-image alignment/visual quality versus diversity than previous samplers in the large-scale text-to-image Stable Diffusion model pre-trained on LAION 512×512. Code is available at https://github.com/Newbeeer/diffusion_restart_sampling.

1 Introduction

Deep generative models based on differential equations, such as diffusion models and Poisson flow generative models, have emerged as powerful tools for modeling high-dimensional data, from image synthesis [23, 9, 13, 27, 28] to biological data [10, 26]. These models use iterative backward processes that gradually transform a simple distribution (e.g., Gaussian in diffusion models) into a complex data distribution by solving differential equations. The associated vector fields (or drifts) driving the evolution of the differential equations are predicted by neural networks. The resulting sample quality can often be improved by enhanced simulation techniques, but at the cost of longer sampling times.

Prior samplers for simulating these backward processes can be categorized into two groups: ODE-samplers, whose evolution beyond the initial randomization is deterministic, and SDE-samplers, where the generation trajectories are stochastic. Several works [23, 12, 13] show that these samplers demonstrate their advantages in different regimes, as depicted in Fig. 1(b). ODE solvers [22, 16, 13] result in smaller discretization errors, allowing for decent sample quality even with larger step sizes (i.e., a smaller number of function evaluations (NFE)). However, their generation quality plateaus rapidly. In contrast, SDE achieves better quality in the large NFE regime, albeit at the expense of increased sampling time.

Figure 1: (a) Illustration of the implementation of drift and noise terms in ODE, SDE, and Restart. (b) Sample quality versus number of function evaluations (NFE) for different approaches. ODE (Green) provides fast sampling speed but attains only mediocre quality, even with a large NFE. SDE (Yellow) obtains good sample quality but necessitates substantial sampling time. In contrast to ODE and SDE, which have their own winning regions, Restart (Red) achieves the best quality across all NFEs.

To better understand these differences, we theoretically analyze SDE performance: the
stochasticity in SDE contracts accumulated error, which consists of both the discretization error along the trajectories and the approximation error of the learned neural network relative to the ground-truth drift (e.g., the score function in diffusion models [23]). The approximation error dominates when NFE is large (small discretization steps), explaining the SDE advantage in this regime. Intuitively, the stochastic nature of SDE helps "forget" accumulated errors from previous time steps.

Inspired by these findings, we propose a novel sampling algorithm called Restart, which combines the advantages of ODE and SDE. As illustrated in Fig. 1(a), the Restart sampling algorithm involves K repetitions of two subroutines in a pre-defined time interval: a Restart forward process that adds a substantial amount of noise, akin to "restarting" the original backward process, and a Restart backward process that runs the backward ODE. The Restart algorithm separates the stochasticity from the drifts, and the amount of added noise in the Restart forward process is significantly larger than the small single-step noise interleaved with drifts in previous SDEs such as [23, 13], thus amplifying the contraction effect on accumulated errors. By repeating the forward-backward cycle K times, the contraction effect introduced in each Restart iteration is further strengthened. The deterministic backward processes allow Restart to reduce discretization errors, thereby enabling step sizes comparable to ODE. To maximize the contraction effect in practice, we typically position the Restart interval towards the end of the simulation, where the accumulated error is larger. Additionally, we apply multiple Restart intervals to further reduce the initial errors in more challenging tasks.

Experimentally, Restart consistently surpasses previous ODE and SDE solvers in both quality and speed over a range of NFEs, datasets, and pre-trained models. Specifically, Restart accelerates the previous best-performing SDEs by 10× fewer steps for the same FID score on CIFAR-10 using VP [23] (2× fewer steps on ImageNet 64×64 with EDM [13]), and outperforms ODE solvers even in the small NFE regime. When integrated into previous state-of-the-art pre-trained models, Restart further improves performance, achieving FID scores of 1.88 on unconditional CIFAR-10 with PFGM++ [28], and 1.36 on class-conditional ImageNet 64×64 with EDM. To the best of our knowledge, these are the best FID scores obtained on commonly used UNet architectures for diffusion models without additional training. We also apply Restart to the practical application of the text-to-image Stable Diffusion model [19] pre-trained on LAION 512×512. Restart more effectively balances text-image alignment/visual quality (measured by CLIP/Aesthetic scores) and diversity (measured by FID score) under a varying classifier-free guidance strength, compared to previous samplers.
Our contributions can be summarized as follows: (1) We investigate ODE and SDE solvers and theoretically demonstrate the contraction effect of stochasticity via an upper bound on the Wasserstein distance between generated and data distributions (Sec 3); (2) We introduce the Restart sampling algorithm, which better harnesses the contraction effect of stochasticity while allowing for fast sampling. The sampler results in a smaller Wasserstein upper bound (Sec 4); (3) Our experiments are consistent with the theoretical bounds and highlight Restart's superior performance compared to previous samplers on standard benchmarks, in terms of both quality and speed. Additionally, Restart improves the trade-off between key metrics on the Stable Diffusion model (Sec 5).

2 Background on Generative Models with Differential Equations

Many recent successful generative models have their origin in physical processes, including diffusion models [9, 23, 13] and Poisson flow generative models [27, 28]. These models involve a forward process that transforms the data distribution into a chosen smooth distribution, and a backward process that iteratively reverses the forward process. For instance, in diffusion models, the forward process is the diffusion process with no learned parameters:

$$\mathrm{d}x = \sqrt{2\dot{\sigma}(t)\sigma(t)}\,\mathrm{d}W_t,$$

where $\sigma(t)$ is a predefined noise schedule increasing with $t$, and $W_t \in \mathbb{R}^d$ is the standard Wiener process. For simplicity, we omit an additional scaling function used in other variants of diffusion models, as in EDM [13]. Under this notation, the marginal distribution at time $t$ is the convolution of the data distribution $p_0 = p_{\mathrm{data}}$ and a Gaussian kernel, i.e., $p_t = p_0 * \mathcal{N}(0, \sigma^2(t) I_{d\times d})$. The prior distribution is set to $\mathcal{N}(0, \sigma^2(T) I_{d\times d})$ since $p_T$ is approximately Gaussian with a sufficiently large $T$. Sampling of diffusion models is done via a reverse-time SDE [1] or a marginally-equivalent ODE [23]:

$$\text{(SDE)}\quad \mathrm{d}x = -2\dot{\sigma}(t)\sigma(t)\nabla_x \log p_t(x)\,\mathrm{d}t + \sqrt{2\dot{\sigma}(t)\sigma(t)}\,\mathrm{d}W_t \qquad (1)$$

$$\text{(ODE)}\quad \mathrm{d}x = -\dot{\sigma}(t)\sigma(t)\nabla_x \log p_t(x)\,\mathrm{d}t \qquad (2)$$

where $\nabla_x \log p_t(x)$ in the drift term is the score of the intermediate distribution at time $t$. W.l.o.g. we set $\sigma(t) = t$ in the remaining text, as in [13]. Both processes progressively recover $p_0$ from the prior distribution $p_T$ while sharing the same time-dependent distribution $p_t$. In practice, we train a neural network $s_\theta(x, t)$ to estimate the score field $\nabla_x \log p_t(x)$ by minimizing the denoising score-matching loss [25]. We then substitute the score $\nabla_x \log p_t(x)$ with $s_\theta(x, t)$ in the drift term of the backward SDE (Eq. (1)) / ODE (Eq. (2)) for sampling.

A recent line of work inspired by electrostatics, notably PFGM/PFGM++, extends this family of models and enhances performance in both image and antibody generation [27, 28, 10]. These models interpret data as electric charges in an augmented space, and their generative processes involve simulating differential equations defined by electric field lines. Similar to diffusion models, PFGMs train a neural network to approximate the electric field in the augmented space.

3 Explaining SDE and ODE performance regimes

To sample from the aforementioned generative models, a prevalent approach employs general-purpose numerical solvers to simulate the corresponding differential equations. This includes Euler's method and Heun's 2nd order method [2] for ODEs (e.g., Eq. (2)), and Euler-Maruyama for SDEs (e.g., Eq. (1)).
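To make the two solver families concrete, below is a minimal sketch of a first-order Euler ODE sampler and an Euler-Maruyama SDE sampler for the backward equations above with $\sigma(t) = t$. It is illustrative only: the learned score $s_\theta$ is replaced by the analytic score of a unit-Gaussian data distribution (an assumption made solely so the snippet runs end-to-end), and a uniform time grid is used for simplicity.

```python
import numpy as np

# Stand-in for the learned score s_theta(x, t): if p_data = N(0, I), then
# p_t = N(0, (1 + t^2) I) and the exact score is -x / (1 + t^2).
def score(x, t):
    return -x / (1.0 + t ** 2)

def ode_euler(x, ts):
    """Euler simulation of the backward ODE dx = -sigma'(t) sigma(t) grad log p_t(x) dt, with sigma(t) = t."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):        # ts is decreasing, e.g. T -> ~0
        x = x + (t_next - t_cur) * (-t_cur * score(x, t_cur))
    return x

def sde_euler_maruyama(x, ts, rng):
    """Euler-Maruyama simulation of the backward SDE with drift -2 sigma'(t) sigma(t) grad log p_t(x)."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur                            # negative step
        z = rng.standard_normal(x.shape)
        x = x + dt * (-2.0 * t_cur * score(x, t_cur)) + np.sqrt(2.0 * t_cur * abs(dt)) * z
    return x

rng = np.random.default_rng(0)
T, d, n_steps = 80.0, 16, 100
ts = np.linspace(T, 1e-3, n_steps + 1)
x_T = T * rng.standard_normal((4, d))                  # samples from the prior N(0, T^2 I)
print(ode_euler(x_T.copy(), ts)[0, :4])
print(sde_euler_maruyama(x_T.copy(), ts, rng)[0, :4])
```

The asymmetry in per-step discretization error between these two updates is analyzed next.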
Sampling algorithms typically balance two critical metrics: (1) the quality and diversity of generated samples, often assessed via the Fréchet Inception Distance (FID) between the generated distribution and the data distribution [7] (lower is better), and (2) the sampling time, measured by the number of function evaluations (NFE). Generally, as the NFE decreases, the FID score tends to deteriorate across all samplers. This is attributed to the increased discretization error caused by using a larger step size in numerical solvers.

However, as illustrated in Fig. 1(b) and observed in previous works on diffusion models [23, 22, 13], the typical pattern of the quality-versus-time curves differs between the two groups of samplers, ODE and SDE. When employing standard numerical solvers, ODE samplers attain decent quality with limited NFEs, whereas SDE samplers struggle in the same small NFE regime. However, the performance of ODE samplers quickly reaches a plateau and fails to improve with an increase in NFE, whereas SDE samplers can achieve noticeably better sample quality in the high NFE regime. This dilemma raises an intriguing question: Why do ODE samplers outperform SDE samplers in the small NFE regime, yet fall short in the large NFE regime?

The first part of the question is relatively straightforward to address: given the same order of numerical solvers, simulation of an ODE has significantly smaller discretization error than that of an SDE. For example, the first-order Euler method for ODEs results in a local error of $O(\delta^2)$, whereas the first-order Euler-Maruyama method for SDEs yields a local error of $O(\delta^{3/2})$ (see e.g., Theorem 1 of [4]), where $\delta$ denotes the step size. Since $\delta^{3/2} \gg \delta^2$ for small $\delta$, ODE simulations exhibit lower sampling errors than SDEs, likely causing the better sample quality with larger step sizes in the small NFE regime.

In the large NFE regime, the step size $\delta$ shrinks and discretization errors become less significant for both ODEs and SDEs. In this regime, the approximation error, which arises from the inaccurate estimation of the ground-truth vector field by the neural network $s_\theta$, starts to dominate the sampling error. We denote the discretized ODE and SDE using the learned field $s_\theta$ as $\mathrm{ODE}_\theta$ and $\mathrm{SDE}_\theta$, respectively. In the following theorem, we evaluate the total errors from simulating $\mathrm{ODE}_\theta$ and $\mathrm{SDE}_\theta$ within the time interval $[t_{\min}, t_{\max}] \subset [0, T]$. This is done via an upper bound on the Wasserstein-1 distance between the generated and data distributions at time $t_{\min}$. We characterize the accumulated initial sampling errors up until $t_{\max}$ by total variation distances. Below we show that the inherent stochasticity of SDEs aids in contracting these initial errors, at the cost of a larger additional sampling error in $[t_{\min}, t_{\max}]$. Consequently, SDE results in a smaller upper bound as the step size $\delta$ nears 0 (pertaining to the high NFE regime).

Theorem 1 (Informal). Let $t_{\max}$ be the initial noise level and $p_t$ denote the true distribution at noise level $t$. Let $p^{\mathrm{ODE}_\theta}_t$, $p^{\mathrm{SDE}_\theta}_t$ denote the distributions obtained by simulating $\mathrm{ODE}_\theta$, $\mathrm{SDE}_\theta$ respectively. Assume that for all $t \in [t_{\min}, t_{\max}]$, $\|x_t\| < B/2$ for any $x_t$ in the support of $p_t$, $p^{\mathrm{ODE}_\theta}_t$ or $p^{\mathrm{SDE}_\theta}_t$.
Then

$$W_1(p^{\mathrm{ODE}_\theta}_{t_{\min}}, p_{t_{\min}}) \le B\cdot TV\!\left(p^{\mathrm{ODE}_\theta}_{t_{\max}}, p_{t_{\max}}\right) + O(\delta + \epsilon_{approx})\,(t_{\max} - t_{\min})$$

$$\underbrace{W_1(p^{\mathrm{SDE}_\theta}_{t_{\min}}, p_{t_{\min}})}_{\text{total error}} \le \underbrace{\left(1 - \lambda e^{-U}\right) B\cdot TV\!\left(p^{\mathrm{SDE}_\theta}_{t_{\max}}, p_{t_{\max}}\right)}_{\text{upper bound on contracted error}} + \underbrace{O\!\left(\sqrt{\delta t_{\max}} + \epsilon_{approx}\right)(t_{\max} - t_{\min})}_{\text{upper bound on additional sampling error}}$$

In the above, $U = BL_1/t_{\min} + L_1^2 t_{\max}^2/t_{\min}^2$, $\lambda < 1$ is a contraction factor, and $L_1$ and $\epsilon_{approx}$ are uniform bounds on $\|t s_\theta(x_t, t)\|$ and the approximation error $\|t\nabla_x \log p_t(x) - t s_\theta(x, t)\|$ for all $x_t, t$, respectively. $O(\cdot)$ hides polynomial dependency on various Lipschitz constants and the dimension.

We defer the formal version and proof of Theorem 1 to Appendix A.1. As shown in the theorem, the upper bound on the total error can be decomposed into upper bounds on the contracted error and the additional sampling error. $TV(p^{\mathrm{ODE}_\theta}_{t_{\max}}, p_{t_{\max}})$ and $TV(p^{\mathrm{SDE}_\theta}_{t_{\max}}, p_{t_{\max}})$ correspond to the initial errors accumulated from both approximation and discretization errors during the simulation of the backward process up until time $t_{\max}$. In the case of the SDE, this accumulated error undergoes contraction by a factor of $1 - \lambda e^{-BL_1/t_{\min} - L_1^2 t_{\max}^2/t_{\min}^2}$ within $[t_{\min}, t_{\max}]$, due to the effect of adding noise. Essentially, the minor additive Gaussian noise in each step can drive the generated distribution and the true distribution towards each other, thereby neutralizing a portion of the initial accumulated error.

The other term, related to the additional sampling error, captures the accumulation of discretization and approximation errors in $[t_{\min}, t_{\max}]$. Despite the fact that SDE incurs a higher discretization error than ODE ($O(\sqrt{\delta})$ versus $O(\delta)$), the contraction effect on the initial error is the dominant factor impacting the upper bound in the large NFE regime where $\delta$ is small. Consequently, the upper bound for SDE is significantly lower. This provides insight into why SDE outperforms ODE in the large NFE regime, where the influence of discretization errors diminishes and the contraction effect dominates.

In light of the distinct advantages of SDE and ODE, it is natural to ask whether we can combine their strengths. Specifically, can we devise a sampling algorithm that maintains a comparable level of discretization error to ODE, while also benefiting from, or even amplifying, the contraction effects induced by the stochasticity of SDE? In the next section, we introduce a novel algorithm, termed Restart, designed to achieve these two goals simultaneously.

4 Harnessing stochasticity with Restart

In this section, we present the Restart sampling algorithm, which incorporates stochasticity during sampling while enabling fast generation. We introduce the algorithm in Sec 4.1, followed by a theoretical analysis in Sec 4.2. Our analysis shows that Restart achieves a better Wasserstein upper bound than those of SDE and ODE in Theorem 1, due to greater contraction effects.

In the Restart algorithm, the simulation performs a few repeated back-and-forth steps within a pre-defined time interval $[t_{\min}, t_{\max}] \subset [0, T]$, as depicted in Figure 1(a). This interval is embedded into the simulation of the original backward ODE, referred to as the main backward process, which runs from $T$ to $0$. In addition, we refer to the backward process within the Restart interval $[t_{\min}, t_{\max}]$ as the Restart backward process, to distinguish it from the main backward process. Starting with samples at time $t_{\min}$, which are generated by following the main backward process, the Restart algorithm adds a large amount of noise to transport the samples from $t_{\min}$ to $t_{\max}$ with the help of the forward process.
The forward process does not require any evaluation of the neural network $s_\theta(x, t)$, as it is generally defined by an analytical perturbation kernel capable of transporting distributions from $t_{\min}$ to $t_{\max}$. For instance, in the case of diffusion models, the perturbation kernel is $\mathcal{N}(0, (\sigma(t_{\max})^2 - \sigma(t_{\min})^2) I_{d\times d})$. The added noise in this step induces a more significant contraction compared to the small, interleaved noise in SDE. The step acts as if partially restarting the main backward process by increasing the time. Following this step, Restart simulates the backward ODE from $t_{\max}$ back to $t_{\min}$ using the neural network predictions, as in a regular ODE. We repeat these forward-backward steps within the $[t_{\min}, t_{\max}]$ interval $K$ times in order to further derive the benefit from contraction. Specifically, the forward and backward processes in the $i$-th iteration ($i \in \{0, \dots, K-1\}$) proceed as follows:

$$\text{(Restart forward process)}\quad x^{i+1}_{t_{\max}} = x^{i}_{t_{\min}} + \varepsilon_{t_{\min}\to t_{\max}} \qquad (3)$$

$$\text{(Restart backward process)}\quad x^{i+1}_{t_{\min}} = \mathrm{ODE}_\theta(x^{i+1}_{t_{\max}}, t_{\max}\to t_{\min}) \qquad (4)$$

where the initial $x^{0}_{t_{\min}}$ is obtained by simulating the ODE until $t_{\min}$: $x^{0}_{t_{\min}} = \mathrm{ODE}_\theta(x_T, T\to t_{\min})$, and the noise $\varepsilon_{t_{\min}\to t_{\max}}$ is sampled from the corresponding perturbation kernel from $t_{\min}$ to $t_{\max}$. The Restart algorithm not only adds substantial noise in the Restart forward process (Eq. (3)), but also separates the stochasticity from the ODE, leading to a greater contraction effect, which we will demonstrate theoretically in the next subsection. For example, we set $[t_{\min}, t_{\max}] = [0.05, 0.3]$ for the VP model [13] on CIFAR-10. Repetitive use of the forward noise effectively mitigates errors accumulated from the preceding simulation up until $t_{\max}$. Furthermore, the Restart algorithm does not suffer from large discretization errors, as it is mainly built on following the ODE in the Restart backward process (Eq. (4)). The effect is that the Restart algorithm is able to reduce the total sampling errors even in the small NFE regime. Detailed pseudocode for the Restart sampling process can be found in Algorithm 2, Appendix B.2.

4.2 Analysis

We provide a theoretical analysis of the Restart algorithm under the same setting as Theorem 1. In particular, we prove the following theorem, which shows that Restart achieves a much smaller contracted error in the Wasserstein upper bound than SDE (Theorem 1), thanks to the separation of the noise from the drift, as well as the large added noise in the Restart forward process (Eq. (3)). The repetition of the Restart cycle $K$ times further leads to an enhanced reduction in the initial accumulated error. We denote the intermediate distribution in the $i$-th Restart iteration, following the discretized trajectories and the learned field $s_\theta$, as $p^{\mathrm{Restart}_\theta(i)}_{t}$, $t\in[t_{\min}, t_{\max}]$.

Theorem 2 (Informal). Under the same setting as Theorem 1, assume $K \le \frac{C}{L_2(t_{\max} - t_{\min})}$ for some universal constant $C$. Then

$$\underbrace{W_1(p^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, p_{t_{\min}})}_{\text{total error}} \le \underbrace{B\,(1-\lambda)^K\, TV\!\left(p^{\mathrm{Restart}_\theta(0)}_{t_{\max}}, p_{t_{\max}}\right)}_{\text{upper bound on contracted error}} + \underbrace{(K+1)\cdot O(\delta + \epsilon_{approx})\,(t_{\max} - t_{\min})}_{\text{upper bound on additional sampling error}}$$

where $\lambda < 1$ is the same contraction factor as in Theorem 1. $O(\cdot)$ hides polynomial dependency on various Lipschitz constants and the dimension.

Proof sketch. To bound the total error, we introduce an auxiliary process $q^{\mathrm{Restart}_\theta(i)}_{t}$, $t\in[t_{\min}, t_{\max}]$, which initiates from the true distribution $p_{t_{\max}}$ and performs the Restart iterations. This process differs from $p^{\mathrm{Restart}_\theta(i)}_{t}$ only in its initial distribution at $t_{\max}$ ($p_{t_{\max}}$ versus $p^{\mathrm{Restart}_\theta(0)}_{t_{\max}}$).
We bound the total error by the following triangle inequality:

$$\underbrace{W_1(p^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, p_{t_{\min}})}_{\text{total error}} \le \underbrace{W_1(p^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, q^{\mathrm{Restart}_\theta(K)}_{t_{\min}})}_{\text{contracted error}} + \underbrace{W_1(q^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, p_{t_{\min}})}_{\text{additional sampling error}}$$

To bound the contracted error, we construct a careful coupling process between two individual trajectories sampled from $p^{\mathrm{Restart}_\theta(i)}_{t_{\min}}$ and $q^{\mathrm{Restart}_\theta(i)}_{t_{\min}}$, $i = 0, \dots, K-1$. Before these two trajectories converge, the Gaussian noise added in each Restart iteration is chosen to maximize the probability of the two trajectories mapping to an identical point, thereby maximizing the mixing rate in TV. After converging, the two processes evolve under the same Gaussian noise and will stay converged, as their drifts are the same. Lastly, we convert the TV bound to a $W_1$ bound by multiplying by $B$. The bound on the additional sampling error echoes the ODE analysis in Theorem 1: since the noise-injection and ODE-simulation stages are separate, we do not incur the higher discretization error of SDE.

We defer the formal version and proof of Theorem 2 to Appendix A.1. The first term on the RHS bounds the contraction of the initial error at time $t_{\max}$, and the second term reflects the additional sampling error of the ODE accumulated across repeated Restart iterations. Comparing with the Wasserstein upper bounds of SDE and ODE in Theorem 1, we make the following three observations: (1) Each Restart iteration has a smaller contraction factor $1-\lambda$ compared to the one in SDE, since Restart separates the large additive noise (Eq. (3)) from the ODE (Eq. (4)). (2) The Restart backward process (Eq. (4)) has the same order of discretization error $O(\delta)$ as the ODE, compared to $O(\sqrt{\delta})$ in SDE. Hence, Restart allows for small NFE due to its ODE-level discretization error. (3) The contracted error further diminishes exponentially with the number of repetitions $K$, though the additional error increases linearly with $K$. This suggests that there is a sweet spot of $K$ that strikes a balance between reducing the initial error and increasing the additional sampling error. Ideally, one should pick a larger $K$ when the initial error at time $t_{\max}$ greatly outweighs the error incurred in the repeated backward process from $t_{\max}$ to $t_{\min}$. We provide empirical evidence in Sec 5.2.

While Theorem 1 and Theorem 2 compare the upper bounds on errors of different methods, we provide empirical validation in Section 5.1 by directly calculating these errors, showing that the Restart algorithm indeed yields a smaller total error due to its superior contraction effects. The main goal of Theorem 1 and Theorem 2 is to study how the already accumulated error changes under different samplers, and to understand their ability to self-correct the error through stochasticity. In essence, these theorems differentiate samplers based on their performance after error accumulation. For example, by tracking the change of accumulated error, Theorem 1 sheds light on the distinct "winning regions" of ODE and SDE: ODE samplers have smaller discretization error and hence excel in the small NFE regime. In contrast, SDE performs better in the large NFE regime, where the discretization error is negligible and its capacity to contract accumulated errors comes to the fore.
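As a concrete illustration of Eqs. (3)-(4), the following minimal sketch implements the Restart iterations on a single interval. It is not the reference implementation from the released code: `ode_solver` stands for any deterministic backward solver over $[t_{\min}, t_{\max}]$ (e.g., Euler or Heun on a chosen discretization) and is a hypothetical handle introduced only for this sketch.

```python
import numpy as np

def restart_interval(x_tmin, t_min, t_max, K, ode_solver, rng):
    """K Restart iterations on [t_min, t_max]:
    forward  (Eq. 3): add noise from the perturbation kernel N(0, (t_max^2 - t_min^2) I)
    backward (Eq. 4): simulate the deterministic ODE from t_max back to t_min
    """
    sigma = np.sqrt(t_max ** 2 - t_min ** 2)   # kernel std when sigma(t) = t
    for _ in range(K):
        x_tmax = x_tmin + sigma * rng.standard_normal(x_tmin.shape)   # Restart forward process
        x_tmin = ode_solver(x_tmax, t_max, t_min)                      # Restart backward process
    return x_tmin
```

In a full sampler, the main backward process first runs from $T$ down to $t_{\min}$, this loop is then executed (once per Restart interval in the multi-level variant of Sec 4.3), and the main process continues from the returned samples down to $0$.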
4.3 Practical considerations

The Restart algorithm offers several degrees of freedom, including the time interval $[t_{\min}, t_{\max}]$ and the number of Restart iterations $K$. Here we provide a general recipe of parameter selection for practitioners, taking into account factors such as the complexity of the generative modeling task and the capacity of the network. Additionally, we discuss a stratified, multi-level Restart approach that further aids in reducing simulation errors along the whole trajectory for more challenging tasks.

Where to Restart? Theorem 2 shows that the Restart algorithm effectively reduces the accumulated error at time $t_{\max}$ by a contraction factor in the Wasserstein upper bound. These theoretical findings inspire us to position the Restart interval $[t_{\min}, t_{\max}]$ towards the end of the main backward process, where the accumulated error is more substantial. In addition, our empirical observations suggest that a larger time interval $t_{\max} - t_{\min}$ is more beneficial for weaker/smaller architectures or more challenging datasets. Even though a larger time interval increases the additional sampling error, the benefits of the contraction significantly outweigh the downside, consistent with our theoretical predictions. We leave the development of principled approaches for optimal time interval selection to future work.

Multi-level Restart. For challenging tasks that yield significant approximation errors, the backward trajectories may diverge substantially from the ground truth even at an early stage. To prevent the ODE simulation from quickly deviating from the true trajectory, we propose implementing multiple Restart intervals in the backward process, alongside the interval placed towards the end. Empirically, we observe that a 1-level Restart is sufficient for CIFAR-10, while for more challenging datasets such as ImageNet [5], a multi-level Restart results in enhanced performance.

5 Experiments

In Sec 5.1, we first empirically verify the theoretical analysis relating to the Wasserstein upper bounds. We then evaluate the performance of different sampling algorithms on standard image generation benchmarks, including CIFAR-10 [14] and ImageNet 64×64 [5], in Sec 5.2. Lastly, we employ Restart for text-to-image generation, using the Stable Diffusion model [19] pre-trained on LAION-5B [21] at resolution 512×512, in Sec 5.3.

5.1 Additional sampling error versus contracted error

Our proposed Restart sampling algorithm demonstrates a higher contraction effect and a smaller additional sampling error compared to SDE, according to Theorem 1 and Theorem 2. Although our theoretical analysis compares the upper bounds of the total, contracted and additional sampling errors, we further verify their relative values through a synthetic experiment.

Figure 2: Additional sampling error versus (a) contracted error, where the Pareto frontier is plotted, and (b) total error, where the scatter plot is provided. (c) Pareto frontier of NFE versus total error.

Setup. We construct a 20-dimensional dataset with 2000 points sampled from a Gaussian mixture, and train a four-layer MLP to approximate the score field $\nabla_x \log p_t$. We implement the ODE, SDE, and Restart methods within a predefined time range of $[t_{\min}, t_{\max}] = [1.0, 1.5]$, where the process outside this range is conducted via the first-order ODE.
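For concreteness, a minimal sketch of this setup is given below. The mixture components, network width, noise-level range, and training schedule are assumptions (the text only specifies a 20-dimensional Gaussian mixture with 2000 points and a four-layer MLP); the score model is trained with standard denoising score matching under $\sigma(t) = t$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 20, 2000
# Two-component Gaussian mixture as a stand-in for the synthetic dataset.
means = torch.stack([torch.full((d,), -1.0), torch.full((d,), 1.0)])
data = means[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, d)

class ScoreMLP(nn.Module):
    """Four-layer MLP taking (x_t, t) and predicting the score at noise level t."""
    def __init__(self, d, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d))
    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=1))

model = ScoreMLP(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    x0 = data[torch.randint(0, n, (256,))]
    t = 0.01 + 1.99 * torch.rand(256)                   # noise levels covering [t_min, t_max] = [1.0, 1.5]
    eps = torch.randn_like(x0)
    xt = x0 + t[:, None] * eps
    target = -eps / t[:, None]                           # denoising score matching target for sigma(t) = t
    loss = ((model(xt, t) - target) ** 2 * t[:, None] ** 2).mean()   # sigma^2-weighted loss
    opt.zero_grad(); loss.backward(); opt.step()
```

The trained `model` then plays the role of $s_\theta$ inside the ODE, SDE, and Restart simulations described next.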
To compute the various error types, we define the distributions generated by the three methods as outlined in the proof of Theorem 2, and directly gauge the errors at the end of the simulation ($t = 0$) instead of at $t = t_{\min}$: (1) the generated distribution $p^{\mathrm{Sampler}}_0$, where $\mathrm{Sampler} \in \{\mathrm{ODE}_\theta, \mathrm{SDE}_\theta, \mathrm{Restart}_\theta(K)\}$; (2) an auxiliary distribution $q^{\mathrm{Sampler}}_0$ initiating from the true distribution $p_{t_{\max}}$ at time $t_{\max}$. The only difference between $p^{\mathrm{Sampler}}_0$ and $q^{\mathrm{Sampler}}_0$ is their initial distribution at $t_{\max}$ ($p^{\mathrm{ODE}_\theta}_{t_{\max}}$ versus $p_{t_{\max}}$); and (3) the true data distribution $p_0$. In line with Theorem 2, we use the Wasserstein-1 distances $W_1(p^{\mathrm{Sampler}}_0, q^{\mathrm{Sampler}}_0)$ / $W_1(q^{\mathrm{Sampler}}_0, p_0)$ to measure the contracted error / additional sampling error, respectively. Ultimately, the total error corresponds to $W_1(p^{\mathrm{Sampler}}_0, p_0)$. Detailed information about the dataset, metrics and model can be found in Appendix C.5.

Results. In our experiment, we adjust the parameters of all three processes and calculate the total, contracted, and additional sampling errors across all parameter settings. Figure 2(a) depicts the Pareto frontier of additional sampling error versus contracted error. We can see that Restart consistently achieves a lower contracted error for a given level of additional sampling error, compared to both the ODE and SDE methods, as predicted by theory. In Figure 2(b), we observe that the Restart method obtains a smaller total error within the additional sampling error range of [0.8, 0.85]. Within this range, Restart also displays a strictly reduced contracted error, as illustrated in Figure 2(a). This aligns with our theoretical analysis, suggesting that the Restart method offers a smaller total error due to its enhanced contraction effects. From Figure 2(c), Restart also strikes a better balance between efficiency and quality, as it achieves a lower total error at a given NFE.

5.2 Experiments on standard benchmarks

Figure 3: FID versus NFE on (a) unconditional generation on CIFAR-10 with VP; (b) class-conditional generation on ImageNet 64×64 with EDM.

Figure 4: CIFAR-10, VP, in the low NFE regime. Restart consistently outperforms the DPM-Solver with an NFE ranging from 16 to 36.

To evaluate sample quality and inference speed, we report the FID score [7] (lower is better) on 50K samples, along with the number of function evaluations (NFE). We borrow the pre-trained VP/EDM/PFGM++ models on CIFAR-10 or ImageNet 64×64 from [13, 28]. We also use the EDM discretization scheme [13] (see Appendix B.1 for details) during sampling.

For the proposed Restart sampler, the hyperparameters include the number of steps in the main/Restart backward processes, the number of Restart iterations $K$, and the time interval $[t_{\min}, t_{\max}]$. We pick $t_{\min}$ and $t_{\max}$ from the list of time steps in the EDM discretization scheme with 18 steps. For example, for CIFAR-10 (VP) with NFE=75, we choose $t_{\min}=0.06$, $t_{\max}=0.30$, $K=10$, where 0.30/0.06 is the 12th/14th time step in the EDM scheme. We also adopt the EDM scheme for the Restart backward process in $[t_{\min}, t_{\max}]$. In addition, we apply the multi-level Restart strategy (Sec 4.3) to mitigate the error at early time steps for the more challenging ImageNet 64×64. We provide the detailed Restart configurations in Appendix C.2.
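The time-step selection above can be reproduced in a few lines. The sketch below evaluates the EDM schedule of Appendix B.1 (with $\rho = 7$) and, assuming the usual EDM sampling range $[0.002, 80]$, recovers roughly 0.30 and 0.06 as the 12th and 14th entries (0-indexed) of the 18-step list.

```python
import numpy as np

def edm_time_steps(t_min, t_max, N, rho=7.0):
    """EDM discretization [13]: t_i = (t_max^(1/rho) + i/(N-1) * (t_min^(1/rho) - t_max^(1/rho)))^rho."""
    i = np.arange(N)
    return (t_max ** (1 / rho) + i / (N - 1) * (t_min ** (1 / rho) - t_max ** (1 / rho))) ** rho

steps = edm_time_steps(0.002, 80.0, 18)
print(round(float(steps[12]), 2), round(float(steps[14]), 2))   # ~0.30 and ~0.06: the CIFAR-10 (VP) Restart interval endpoints
```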
For SDE, we compare with the previously best-performing stochastic sampler proposed by [13] (Improved SDE). We use their optimal hyperparameters for each dataset. We also report the FID scores of the adaptive SDE [12] (Gotta Go Fast) on CIFAR-10 (VP). Since the vanilla reverse-diffusion SDE [23] has a significantly higher FID score, we omit its results from the main charts and defer them to Appendix D. For ODE samplers, we compare with Heun's 2nd order method [2] (Heun), which arguably provides an excellent trade-off between discretization errors and NFE [13]. To ensure a fair comparison, we use Heun's method as the sampler in the main/Restart backward processes of Restart.

We report the FID score versus NFE in Figure 3(a) and Table 1 on CIFAR-10, and in Figure 3(b) on ImageNet 64×64 with EDM. Our main findings are: (1) Restart outperforms other SDE or ODE samplers in balancing quality and speed, across datasets and models. As demonstrated in the figures, Restart achieves a 10-fold / 2-fold acceleration compared to the previous best SDE results on CIFAR-10 (VP) / ImageNet 64×64 (EDM) at the same FID score. In comparison to the ODE sampler (Heun), Restart obtains a better FID score, with the gap increasing significantly with NFE. (2) For stronger models such as EDM and PFGM++, Restart further improves over the ODE baseline on CIFAR-10. In contrast, the Improved SDE negatively impacts the performance of EDM, as also observed in [13]. This suggests that Restart incorporates stochasticity more effectively. (3) Restart establishes new state-of-the-art FID scores for UNet architectures without additional training. In particular, Restart achieves FID scores of 1.36 on class-conditional ImageNet 64×64 with EDM, and 1.88 on unconditional CIFAR-10 with PFGM++.

Table 1: Unconditional CIFAR-10 with EDM and PFGM++

Model         Sampler         NFE   FID
EDM-VP [13]   ODE (Heun)      63    1.97
EDM-VP [13]   ODE (Heun)      35    1.97
EDM-VP [13]   Improved SDE    63    2.27
EDM-VP [13]   Improved SDE    35    2.45
EDM-VP [13]   Restart         43    1.90
PFGM++ [28]   ODE (Heun)      63    1.91
PFGM++ [28]   ODE (Heun)      35    1.91
PFGM++ [28]   Restart         43    1.88

Figure 5: FID score with a varying number of Restart iterations K.

To further validate that Restart can be applied in the low NFE regime, we show that one can employ faster ODE solvers such as the DPM-Solver [16] to further accelerate Restart. Fig. 4 shows that Restart consistently outperforms the DPM-Solver with an NFE ranging from 16 to 36. This demonstrates Restart's capability to excel over ODE samplers even in the small NFE regime. It also suggests that Restart can consistently improve other ODE samplers, not limited to DDIM or Heun. Surprisingly, when paired with the DPM-Solver, Restart achieves an FID score of 2.11 in the VP setting when the NFE is 30, which is significantly lower than any previous number (even lower than the SDE sampler with an NFE greater than 1000 in [23]), and makes the VP model on par with more advanced models (such as EDM).

Theorem 4 shows that each Restart iteration reduces the contracted error while increasing the additional sampling error in the backward process. In Fig. 5, we explore the choice of the number of Restart iterations $K$ on CIFAR-10. We find that the FID score initially improves and later worsens with increasing iterations $K$, with a smaller turning point for the stronger EDM model. This supports the theoretical analysis that sampling errors will eventually outweigh the contraction benefits as $K$ increases, and that EDM only permits fewer Restart iterations due to its smaller accumulated errors.
It also suggests that, as a rule of thumb, we should apply greater Restart strength (e.g., a larger $K$) for weaker or smaller architectures, and vice versa.

5.3 Experiments on large-scale text-to-image model

Figure 6: FID score versus (a) CLIP ViT-g/14 score and (b) Aesthetic score for text-to-image generation at 512×512 resolution, using Stable Diffusion v1.5 with a varying classifier-free guidance weight w = 2, 3, 5, 8. Compared samplers: DDIM (50 and 100 steps), DDPM (100 and 200 steps), and Restart (66 steps).

We further apply Restart to the text-to-image Stable Diffusion v1.5 model (https://huggingface.co/runwayml/stable-diffusion-v1-5) pre-trained on LAION-5B [21] at a resolution of 512×512. We employ the commonly used classifier-free guidance [8, 20] for sampling, wherein each sampling step entails two function evaluations: the conditional and unconditional predictions. Following [18, 20], we use the COCO [15] validation set for evaluation. We assess text-image alignment using the CLIP score [6] with the open-sourced ViT-g/14 [11], and measure diversity via the FID score. We also evaluate visual quality through the Aesthetic score, as rated by the LAION-Aesthetics Predictor V2 [24]. Following [17], we compute all evaluation metrics using 5K captions randomly sampled from the validation set and plot the trade-off curves between CLIP/Aesthetic scores and FID score, with the classifier-free guidance weight $w \in \{2, 3, 5, 8\}$.

We compare with the commonly used ODE sampler DDIM [22] and the stochastic sampler DDPM [9]. For Restart, we adopt the DDIM solver with 30 steps in the main backward process, and Heun in the Restart backward process, as we empirically find that Heun performs better than DDIM within the Restart interval. In addition, we select different sets of hyperparameters for each guidance weight. For instance, when $w = 8$, we use $[t_{\min}, t_{\max}]=[0.1, 2]$, $K=2$ and 10 steps in the Restart backward process. We defer the detailed Restart configuration to Appendix C.2, and the results of Heun to Appendix D.1.

As illustrated in Fig. 6(a) and Fig. 6(b), Restart achieves better FID scores in most cases, given the same CLIP/Aesthetic scores, using only 132 function evaluations (i.e., 66 sampling steps). Remarkably, Restart achieves substantially lower FID scores than the other samplers when the CLIP/Aesthetic scores are high (i.e., with larger $w$ values). Conversely, Restart generally obtains better text-image alignment/visual quality given the same FID. We also observe that DDPM generally obtains performance comparable to Restart in FID score when the CLIP/Aesthetic scores are low, with Restart being more time-efficient. These findings suggest that Restart balances diversity (FID score) against text-image alignment (CLIP score) or visual quality (Aesthetic score) more effectively than previous samplers.

In Fig. 7, we visualize the images generated by Restart, DDIM and DDPM with $w = 8$. Compared to DDIM, Restart generates images with superior details (e.g., the rendition of duck legs by DDIM is less accurate) and visual quality. Compared to DDPM, Restart yields more photo-realistic images (e.g., the astronaut). We provide extended text-to-image generation samples in Appendix E.
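For reference, the classifier-free guidance combination used by all the samplers above follows the standard formulation of [8, 20]: each step makes the two function evaluations mentioned above and extrapolates them with weight $w$. The sketch below is generic and illustrative; the function and argument names (and the dummy stand-in model that makes it executable) are not taken from the Stable Diffusion codebase.

```python
import torch

def guided_eps(model, x_t, t, cond_emb, uncond_emb, w):
    """Classifier-free guidance: two function evaluations per step, extrapolated with weight w."""
    eps_cond = model(x_t, t, cond_emb)      # conditional prediction (text-prompt embedding)
    eps_uncond = model(x_t, t, uncond_emb)  # unconditional prediction (empty-prompt embedding)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Dummy stand-in so the sketch runs; a real sampler would call the pre-trained UNet here.
dummy_model = lambda x, t, c: 0.1 * x + 0.0 * c.mean()
x = torch.randn(1, 4, 64, 64)                          # latent at the current noise level
eps = guided_eps(dummy_model, x, torch.tensor([10.0]),
                 torch.ones(1, 77, 768), torch.zeros(1, 77, 768), w=8.0)
print(eps.shape)
```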
Figure 7: Visualization of generated images with classifier-free guidance weight w = 8, using four text prompts ("A photo of an astronaut riding a horse on mars.", "A raccoon playing table tennis", "Intricate origami of a fox in a snowy forest" and "A transparent sculpture of a duck made out of glass") and the same random seeds. Panels: (a) Restart (Steps=66), (b) DDIM (Steps=100), (c) DDPM (Steps=100).

6 Conclusion and future direction

In this paper, we introduce Restart sampling for generative processes involving differential equations, such as diffusion models and PFGMs. By interweaving a forward process that adds a significant amount of noise with a corresponding backward ODE, Restart harnesses and even enhances the individual advantages of both ODE and SDE. Theoretically, Restart provides greater contraction effects of stochasticity while maintaining ODE-level discretization error. Empirically, Restart achieves a superior balance between quality and time, and improves the text-image alignment/visual quality versus diversity trade-off in the text-to-image Stable Diffusion model.

A current limitation of the Restart algorithm is the absence of a principled way to select hyperparameters, including the number of iterations $K$ and the time interval $[t_{\min}, t_{\max}]$. At present, we adjust these parameters based on the heuristic that weaker/smaller models, or more challenging tasks, necessitate a stronger Restart strength. As a future direction, we anticipate developing a more principled approach to automating the selection of optimal hyperparameters for Restart based on the error analysis of models, in order to fully unleash the potential of the Restart framework.

Acknowledgements

YX and TJ acknowledge support from the MIT-DSTA Singapore collaboration, from the NSF Expeditions grant (award 1918839) "Understanding the World Through Code", and from the MIT-IBM Grand Challenge project. Xiang Cheng acknowledges support from NSF CCF-2112665 (TILOS AI Research Institute).

References

[1] Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982.
[2] Uri M. Ascher and Linda R. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. SIAM, 1998.
[3] Andrei N. Borodin and Paavo Salminen. Handbook of Brownian Motion - Facts and Formulae. Springer Science & Business Media, 2015.
[4] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311, 2019.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, K. Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, 2009.
[6] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Joseph Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Conference on Empirical Methods in Natural Language Processing, 2021.
[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
[8] Jonathan Ho. Classifier-free diffusion guidance. arXiv, abs/2207.12598, 2022.
[9] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. arXiv, abs/2006.11239, 2020.
[10] Chutian Huang, Zijing Liu, Shengyuan Bai, Linwei Zhang, Chencheng Xu, Zhe Wang, Yang Xiang, and Yuanpeng Xiong. PF-ABGen: A reliable and efficient antibody generator via Poisson flow. Machine Learning for Drug Discovery Workshop, International Conference on Learning Representations, 2023.
[11] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP. Zenodo, 2021.
[12] Alexia Jolicoeur-Martineau, Ke Li, Remi Piche-Taillefer, Tal Kachman, and Ioannis Mitliagkas. Gotta go fast when generating data with score-based models. arXiv, abs/2105.14080, 2021.
[13] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv, abs/2206.00364, 2022.
[14] Alex Krizhevsky. Learning multiple layers of features from tiny images. Citeseer, 2009.
[15] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[16] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[17] Chenlin Meng, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. arXiv, abs/2210.03142, 2022.
[18] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2021.
[19] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674-10685, 2021.
[20] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv, abs/2205.11487, 2022.
[21] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv, abs/2210.08402, 2022.
[22] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv, abs/2010.02502, 2020.
[23] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv, abs/2011.13456, 2020.
[24] LAION-AI Team. LAION-Aesthetics Predictor V2. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.
[25] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661-1674, 2011.
[26] Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E.
Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, Anna Lauko, Valentin De Bortoli, Emile Mathieu, Regina Barzilay, T. Jaakkola, Frank DiMaio, Minkyung Baek, and David Baker. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bioRxiv, 2022.
[27] Yilun Xu, Ziming Liu, Max Tegmark, and T. Jaakkola. Poisson flow generative models. arXiv, abs/2209.11178, 2022.
[28] Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, and T. Jaakkola. PFGM++: Unlocking the potential of physics-inspired generative models. arXiv, abs/2302.04265, 2023.

A Proofs of Main Theoretical Results

In this section, we provide proofs of our main results. We define below some crucial notation which we will use throughout. We use $\mathrm{ODE}(\dots)$ to denote the backward ODE under the exact score $\nabla\log p_t(x)$. More specifically, given any $x \in \mathbb{R}^d$ and $s > r > 0$, let $x_t$ denote the solution to the following ODE:

$$\mathrm{d}x_t = -t\nabla\log p_t(x_t)\,\mathrm{d}t. \qquad (5)$$

$\mathrm{ODE}(x, s\to r)$ is defined as "the value of $x_r$ when initialized at $x_s = x$". It will also be useful to consider a time-discretized ODE with drift $t s_\theta(x, t)$: let $\delta$ denote the discretization step size and let $k$ denote any integer. Let $x_t$ denote the solution to

$$\mathrm{d}x_t = -t\,s_\theta(x_{k\delta}, k\delta)\,\mathrm{d}t, \qquad (6)$$

where for any $t$, $k$ is the unique integer such that $t \in ((k-1)\delta, k\delta]$. We verify that the dynamics of Eq. (6) is equivalent to the following discrete-time dynamics for $t = k\delta$, $k \in \mathbb{Z}$:

$$x_{(k-1)\delta} = x_{k\delta} - \tfrac{1}{2}\left(((k-1)\delta)^2 - (k\delta)^2\right) s_\theta(x_{k\delta}, k\delta).$$

We similarly denote the value of $x_r$ when initialized at $x_s = x$ as $\mathrm{ODE}_\theta(x, s\to r)$. Analogously, we let $\mathrm{SDE}(x, s\to r)$ and $\mathrm{SDE}_\theta(x, s\to r)$ denote solutions to

$$\mathrm{d}y_t = -2t\nabla\log p_t(y_t)\,\mathrm{d}t + \sqrt{2t}\,\mathrm{d}W_t, \qquad \mathrm{d}y_t = -2t\,s_\theta(y_t, t)\,\mathrm{d}t + \sqrt{2t}\,\mathrm{d}W_t,$$

respectively. Finally, we define the $\mathrm{Restart}_\theta$ process as follows:

$$\text{(Restart}_\theta\text{ forward process)}\quad x^{i+1}_{t_{\max}} = x^{i}_{t_{\min}} + \varepsilon^{i}_{t_{\min}\to t_{\max}}$$
$$\text{(Restart}_\theta\text{ backward process)}\quad x^{i+1}_{t_{\min}} = \mathrm{ODE}_\theta(x^{i+1}_{t_{\max}}, t_{\max}\to t_{\min}), \qquad (7)$$

where $\varepsilon^{i}_{t_{\min}\to t_{\max}} \sim \mathcal{N}\!\left(0, (t_{\max}^2 - t_{\min}^2) I\right)$. We use $\mathrm{Restart}_\theta(x, K)$ to denote $x^{K}_{t_{\min}}$ in the above processes, initialized at $x^{0}_{t_{\min}} = x$. In various theorems, we will refer to a function $Q(r): \mathbb{R}^+ \to [0, 1/2)$, defined as the Gaussian tail probability $Q(r) = \Pr(a \ge r)$ for $a \sim \mathcal{N}(0, 1)$.

A.1 Main Result

Theorem 3. [Formal version of Theorem 1] Let $t_{\max}$ be the initial noise level. Let the initial random variables $x_{t_{\max}} = y_{t_{\max}}$, and

$$x_{t_{\min}} = \mathrm{ODE}_\theta(x_{t_{\max}}, t_{\max}\to t_{\min}), \qquad y_{t_{\min}} = \mathrm{SDE}_\theta(y_{t_{\max}}, t_{\max}\to t_{\min}).$$

Let $p_t$ denote the true population distribution at noise level $t$. Let $p^{\mathrm{ODE}_\theta}_t$, $p^{\mathrm{SDE}_\theta}_t$ denote the distributions of $x_t$, $y_t$ respectively. Assume that for all $x, y, s, t$, $s_\theta(x, t)$ satisfies $\|t s_\theta(x,t) - t s_\theta(x,s)\| \le L_0|s-t|$, $\|t s_\theta(x,t)\| \le L_1$, $\|t s_\theta(x,t) - t s_\theta(y,t)\| \le L_2\|x-y\|$, and the approximation error satisfies $\|t s_\theta(x,t) - t\nabla\log p_t(x)\| \le \epsilon_{approx}$. Assume in addition that $\forall t\in[t_{\min}, t_{\max}]$, $\|x_t\| < B/2$ for any $x_t$ in the support of $p_t$, $p^{\mathrm{ODE}_\theta}_t$ or $p^{\mathrm{SDE}_\theta}_t$, and $K \le \frac{C}{L_2(t_{\max}-t_{\min})}$ for some universal constant $C$. Then

$$W_1(p^{\mathrm{ODE}_\theta}_{t_{\min}}, p_{t_{\min}}) \le B\cdot TV\!\left(p^{\mathrm{ODE}_\theta}_{t_{\max}}, p_{t_{\max}}\right) + e^{L_2(t_{\max}-t_{\min})}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}) \qquad (8)$$

$$W_1(p^{\mathrm{SDE}_\theta}_{t_{\min}}, p_{t_{\min}}) \le B\left(1 - \lambda e^{-BL_1/t_{\min} - L_1^2 t_{\max}^2/t_{\min}^2}\right) TV\!\left(p^{\mathrm{SDE}_\theta}_{t_{\max}}, p_{t_{\max}}\right) + e^{2L_2(t_{\max}-t_{\min})}\left(\epsilon_{approx} + \delta L_0 + L_2\left(\delta L_1 + \sqrt{2\delta d\, t_{\max}}\right)\right)(t_{\max}-t_{\min}) \qquad (9)$$

where $\lambda := 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)$.

Proof. Let us define $\bar x_{t_{\max}} \sim p_{t_{\max}}$, and let $\bar x_{t_{\min}} = \mathrm{ODE}(\bar x_{t_{\max}}, t_{\max}\to t_{\min})$. We verify that $\bar x_{t_{\min}}$ has density $p_{t_{\min}}$. Let us also define $\hat x_{t_{\min}} = \mathrm{ODE}_\theta(\bar x_{t_{\max}}, t_{\max}\to t_{\min})$.
We would like to bound the Wasserstein distance between $x_{t_{\min}}$ and $\bar x_{t_{\min}}$ (i.e., between $p^{\mathrm{ODE}_\theta}_{t_{\min}}$ and $p_{t_{\min}}$) via the following triangle inequality:

$$W_1(\bar x_{t_{\min}}, x_{t_{\min}}) \le W_1(\bar x_{t_{\min}}, \hat x_{t_{\min}}) + W_1(\hat x_{t_{\min}}, x_{t_{\min}}). \qquad (10)$$

By Lemma 2, we know that

$$\|\hat x_{t_{\min}} - \bar x_{t_{\min}}\| \le e^{(t_{\max}-t_{\min})L_2}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}),$$

where we use the fact that $\|\hat x_{t_{\max}} - \bar x_{t_{\max}}\| = 0$. Thus we immediately have

$$W_1(\hat x_{t_{\min}}, \bar x_{t_{\min}}) \le e^{(t_{\max}-t_{\min})L_2}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}). \qquad (11)$$

On the other hand,

$$W_1(\hat x_{t_{\min}}, x_{t_{\min}}) \le B\cdot TV(\hat x_{t_{\min}}, x_{t_{\min}}) \le B\cdot TV(\hat x_{t_{\max}}, x_{t_{\max}}), \qquad (12)$$

where the last inequality is due to the data-processing inequality. Combining Eq. (11), Eq. (12) and the triangle inequality Eq. (10), we arrive at the upper bound for ODE (Eq. (8)).

The upper bound for SDE (Eq. (9)) shares a similar proof approach. First, let $\bar y_{t_{\max}} \sim p_{t_{\max}}$, and let $\hat y_{t_{\min}} = \mathrm{SDE}_\theta(\bar y_{t_{\max}}, t_{\max}\to t_{\min})$ and $\bar y_{t_{\min}} = \mathrm{SDE}(\bar y_{t_{\max}}, t_{\max}\to t_{\min})$, so that $\bar y_{t_{\min}}$ has density $p_{t_{\min}}$. By Lemma 5,

$$TV(\hat y_{t_{\min}}, y_{t_{\min}}) \le \left(1 - 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right) e^{-BL_1/t_{\min} - L_1^2 t_{\max}^2/t_{\min}^2}\right) TV(\hat y_{t_{\max}}, y_{t_{\max}}).$$

On the other hand, by Lemma 4,

$$\mathbb{E}\left[\|\hat y_{t_{\min}} - \bar y_{t_{\min}}\|\right] \le e^{2L_2(t_{\max}-t_{\min})}\left(\epsilon_{approx} + \delta L_0 + L_2\left(\delta L_1 + \sqrt{2\delta d\, t_{\max}}\right)\right)(t_{\max}-t_{\min}).$$

The SDE triangular upper bound on $W_1(\bar y_{t_{\min}}, y_{t_{\min}})$ follows by multiplying the first inequality by $B$ (to bound $W_1(\hat y_{t_{\min}}, y_{t_{\min}})$) and then adding the second inequality (to bound $W_1(\bar y_{t_{\min}}, \hat y_{t_{\min}})$). Notice that by definition, $TV(\hat y_{t_{\max}}, y_{t_{\max}}) = TV(\bar y_{t_{\max}}, y_{t_{\max}})$. Finally, because of the assumption that $K \le \frac{C}{L_2(t_{\max}-t_{\min})}$ for some universal constant $C$, we summarize the second terms in Eq. (8) and Eq. (9) into the big $O$ of the informal version, Theorem 1.

Theorem 4. [Formal version of Theorem 2] Consider the same setting as Theorem 3. Let $p^{\mathrm{Restart}_\theta, i}_{t_{\min}}$ denote the distribution after the $i$-th Restart iteration, i.e., the distribution of $x^{i}_{t_{\min}} = \mathrm{Restart}_\theta(x^{0}_{t_{\min}}, i)$. Given initial $x^{0}_{t_{\max}} \sim p^{\mathrm{Restart}_\theta, 0}_{t_{\max}}$, let $x^{0}_{t_{\min}} = \mathrm{ODE}_\theta(x^{0}_{t_{\max}}, t_{\max}\to t_{\min})$. Then

$$W_1(p^{\mathrm{Restart}_\theta, K}_{t_{\min}}, p_{t_{\min}}) \le \underbrace{B\,(1-\lambda)^K\, TV\!\left(p^{\mathrm{Restart}_\theta, 0}_{t_{\max}}, p_{t_{\max}}\right)}_{\text{upper bound on contracted error}} + \underbrace{e^{(K+1)L_2(t_{\max}-t_{\min})}(K+1)\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min})}_{\text{upper bound on additional sampling error}} \qquad (13)$$

where $\lambda = 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)$.

Proof. Let $\bar x^{0}_{t_{\max}} \sim p_{t_{\max}}$ and $\bar x^{0}_{t_{\min}} = \mathrm{ODE}(\bar x^{0}_{t_{\max}}, t_{\max}\to t_{\min})$. Let $\bar x^{K}_{t_{\min}} = \mathrm{Restart}(\bar x^{0}_{t_{\min}}, K)$. We verify that $\bar x^{K}_{t_{\min}}$ has density $p_{t_{\min}}$. Let us also define $\hat x^{0}_{t_{\min}} = \mathrm{ODE}_\theta(\bar x^{0}_{t_{\max}}, t_{\max}\to t_{\min})$ and $\hat x^{K}_{t_{\min}} = \mathrm{Restart}_\theta(\hat x^{0}_{t_{\min}}, K)$. By Lemma 1,

$$TV\!\left(x^{K}_{t_{\min}}, \hat x^{K}_{t_{\min}}\right) \le (1-\lambda)^K\, TV\!\left(x^{0}_{t_{\min}}, \hat x^{0}_{t_{\min}}\right) \le (1-\lambda)^K\, TV\!\left(x^{0}_{t_{\max}}, \bar x^{0}_{t_{\max}}\right),$$

where the second inequality holds by the data-processing inequality. The above can be used to bound the 1-Wasserstein distance as follows:

$$W_1\!\left(x^{K}_{t_{\min}}, \hat x^{K}_{t_{\min}}\right) \le B\cdot TV\!\left(x^{K}_{t_{\min}}, \hat x^{K}_{t_{\min}}\right) \le B\,(1-\lambda)^K\, TV\!\left(x^{0}_{t_{\max}}, \bar x^{0}_{t_{\max}}\right). \qquad (14)$$

On the other hand, using Lemma 3,

$$W_1\!\left(\bar x^{K}_{t_{\min}}, \hat x^{K}_{t_{\min}}\right) \le \mathbb{E}\left[\|\bar x^{K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\|\right] \le e^{(K+1)L_2(t_{\max}-t_{\min})}(K+1)\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}). \qquad (15)$$

We arrive at the result by combining the two bounds above (Eq. (14), Eq. (15)) with the following triangle inequality:

$$W_1\!\left(\bar x^{K}_{t_{\min}}, x^{K}_{t_{\min}}\right) \le W_1\!\left(\bar x^{K}_{t_{\min}}, \hat x^{K}_{t_{\min}}\right) + W_1\!\left(\hat x^{K}_{t_{\min}}, x^{K}_{t_{\min}}\right).$$

A.2 Mixing under Restart with exact ODE

Lemma 1. Consider the same setup as Theorem 4, and the $\mathrm{Restart}_\theta$ process defined in Eq. (7). Let

$$x^{i}_{t_{\min}} = \mathrm{Restart}_\theta(x^{0}_{t_{\min}}, i), \qquad y^{i}_{t_{\min}} = \mathrm{Restart}_\theta(y^{0}_{t_{\min}}, i).$$

Let $p^{\mathrm{Restart}_\theta(i)}_t$ and $q^{\mathrm{Restart}_\theta(i)}_t$ denote the densities of $x^{i}_t$ and $y^{i}_t$ respectively. Then

$$TV\!\left(p^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, q^{\mathrm{Restart}_\theta(K)}_{t_{\min}}\right) \le (1-\lambda)^K\, TV\!\left(p^{\mathrm{Restart}_\theta(0)}_{t_{\min}}, q^{\mathrm{Restart}_\theta(0)}_{t_{\min}}\right),$$

where $\lambda = 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)$.

Proof. Conditioned on $x^{i}_{t_{\min}}, y^{i}_{t_{\min}}$, let $x^{i+1}_{t_{\max}} = x^{i}_{t_{\min}} + \sqrt{t_{\max}^2 - t_{\min}^2}\,\xi^{x}_i$ and $y^{i+1}_{t_{\max}} = y^{i}_{t_{\min}} + \sqrt{t_{\max}^2 - t_{\min}^2}\,\xi^{y}_i$.
We now define a coupling between $x^{i+1}_{t_{\min}}$ and $y^{i+1}_{t_{\min}}$ by specifying the joint distribution over $\xi^{x}_i$ and $\xi^{y}_i$. If $x^{i}_{t_{\min}} = y^{i}_{t_{\min}}$, let $\xi^{x}_i = \xi^{y}_i$, so that $x^{i+1}_{t_{\min}} = y^{i+1}_{t_{\min}}$. On the other hand, if $x^{i}_{t_{\min}} \neq y^{i}_{t_{\min}}$, let $x^{i+1}_{t_{\max}}$ and $y^{i+1}_{t_{\max}}$ be coupled as described in the proof of Lemma 7, with $x' = x^{i+1}_{t_{\max}}$, $y' = y^{i+1}_{t_{\max}}$, $\sigma = \sqrt{t_{\max}^2 - t_{\min}^2}$. Under this coupling, we verify that

$$\mathbb{E}\left[\mathbb{1}\{x^{i+1}_{t_{\min}} \neq y^{i+1}_{t_{\min}}\}\right] \le \mathbb{E}\left[\mathbb{1}\{x^{i+1}_{t_{\max}} \neq y^{i+1}_{t_{\max}}\}\right] \le \mathbb{E}\left[\left(1 - 2Q\!\left(\frac{\|x^{i}_{t_{\min}} - y^{i}_{t_{\min}}\|}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)\right)\mathbb{1}\{x^{i}_{t_{\min}} \neq y^{i}_{t_{\min}}\}\right] \le \left(1 - 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)\right)\mathbb{E}\left[\mathbb{1}\{x^{i}_{t_{\min}} \neq y^{i}_{t_{\min}}\}\right].$$

Applying the above recursively,

$$\mathbb{E}\left[\mathbb{1}\{x^{K}_{t_{\min}} \neq y^{K}_{t_{\min}}\}\right] \le \left(1 - 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)\right)^{K}\mathbb{E}\left[\mathbb{1}\{x^{0}_{t_{\min}} \neq y^{0}_{t_{\min}}\}\right].$$

The conclusion follows by noticing that $TV\!\left(p^{\mathrm{Restart}_\theta(K)}_{t_{\min}}, q^{\mathrm{Restart}_\theta(K)}_{t_{\min}}\right) \le \Pr\left(x^{K}_{t_{\min}} \neq y^{K}_{t_{\min}}\right) = \mathbb{E}\left[\mathbb{1}\{x^{K}_{t_{\min}} \neq y^{K}_{t_{\min}}\}\right]$, and by selecting the initial coupling so that $\Pr\left(x^{0}_{t_{\min}} \neq y^{0}_{t_{\min}}\right) = TV\!\left(p^{\mathrm{Restart}_\theta(0)}_{t_{\min}}, q^{\mathrm{Restart}_\theta(0)}_{t_{\min}}\right)$.

A.3 W1 discretization bound

Lemma 2 (Discretization bound for ODE). Let $\bar x_{t_{\min}} = \mathrm{ODE}(\bar x_{t_{\max}}, t_{\max}\to t_{\min})$ and let $x_{t_{\min}} = \mathrm{ODE}_\theta(x_{t_{\max}}, t_{\max}\to t_{\min})$. Assume that for all $x, y, s, t$, $s_\theta(x, t)$ satisfies $\|t s_\theta(x,t) - t s_\theta(x,s)\| \le L_0|s-t|$, $\|t s_\theta(x,t)\| \le L_1$ and $\|t s_\theta(x,t) - t s_\theta(y,t)\| \le L_2\|x-y\|$. Then

$$\|\bar x_{t_{\min}} - x_{t_{\min}}\| \le e^{(t_{\max}-t_{\min})L_2}\left(\|\bar x_{t_{\max}} - x_{t_{\max}}\| + \left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min})\right).$$

Proof. Consider some fixed arbitrary $k$, and recall that $\delta$ is the step size. Recall that, by definition of ODE and $\mathrm{ODE}_\theta$, for $t \in ((k-1)\delta, k\delta]$,

$$\mathrm{d}\bar x_t = -t\nabla\log p_t(\bar x_t)\,\mathrm{d}t, \qquad \mathrm{d}x_t = -t\,s_\theta(x_{k\delta}, k\delta)\,\mathrm{d}t.$$

For $t \in [t_{\min}, t_{\max}]$, let us define the time-reversed processes $\bar x^{\leftarrow}_t := \bar x_{-t}$ and $x^{\leftarrow}_t := x_{-t}$, together with $v(x, t) := \nabla\log p_{-t}(x)$ and $\hat v(x, t) := s_\theta(x, -t)$. Then for $t \in [-t_{\max}, -t_{\min}]$,

$$\mathrm{d}\bar x^{\leftarrow}_t = -t\,v(\bar x^{\leftarrow}_t, t)\,\mathrm{d}t, \qquad \mathrm{d}x^{\leftarrow}_t = -t\,\hat v(x^{\leftarrow}_{k\delta}, k\delta)\,\mathrm{d}t,$$

where $k$ is the unique (negative) integer satisfying $t \in [k\delta, (k+1)\delta)$. Following these definitions,

$$\frac{\mathrm{d}}{\mathrm{d}t}\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\| \le \|t v(\bar x^{\leftarrow}_t, t) - t\hat v(\bar x^{\leftarrow}_t, t)\| + \|t\hat v(\bar x^{\leftarrow}_t, t) - t\hat v(x^{\leftarrow}_t, t)\| + \|t\hat v(x^{\leftarrow}_t, t) - t\hat v(x^{\leftarrow}_t, k\delta)\| + \|t\hat v(x^{\leftarrow}_t, k\delta) - t\hat v(x^{\leftarrow}_{k\delta}, k\delta)\| \le \epsilon_{approx} + L_2\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\| + \delta L_0 + L_2\|x^{\leftarrow}_t - x^{\leftarrow}_{k\delta}\| \le \epsilon_{approx} + L_2\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\| + \delta L_0 + \delta L_2 L_1.$$

Applying Gronwall's Lemma over the interval $t \in [-t_{\max}, -t_{\min}]$,

$$\|\bar x_{t_{\min}} - x_{t_{\min}}\| = \|\bar x^{\leftarrow}_{-t_{\min}} - x^{\leftarrow}_{-t_{\min}}\| \le e^{L_2(t_{\max}-t_{\min})}\left(\|\bar x_{t_{\max}} - x_{t_{\max}}\| + (\epsilon_{approx} + \delta L_0 + \delta L_2 L_1)(t_{\max}-t_{\min})\right).$$

Lemma 3. Given initial $x^{0}_{t_{\max}}$, let $x^{0}_{t_{\min}} = \mathrm{ODE}(x^{0}_{t_{\max}}, t_{\max}\to t_{\min})$ and $\hat x^{0}_{t_{\min}} = \mathrm{ODE}_\theta(x^{0}_{t_{\max}}, t_{\max}\to t_{\min})$. We further denote the variables after $K$ Restart iterations as $x^{K}_{t_{\min}} = \mathrm{Restart}(x^{0}_{t_{\min}}, K)$ and $\hat x^{K}_{t_{\min}} = \mathrm{Restart}_\theta(\hat x^{0}_{t_{\min}}, K)$, with the true field and the learned field respectively. Then there exists a coupling between $x^{K}_{t_{\min}}$ and $\hat x^{K}_{t_{\min}}$ such that

$$\|x^{K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\| \le e^{(K+1)L_2(t_{\max}-t_{\min})}(K+1)\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}).$$

Proof. We couple $x^{i}_{t_{\min}}$ and $\hat x^{i}_{t_{\min}}$ by using the same noise $\varepsilon^{i}_{t_{\min}\to t_{\max}}$ in the Restart forward process for $i = 0, \dots, K-1$ (see Eq. (7)). For any $i$, let us also define $y^{i,j}_{t_{\min}} := \mathrm{Restart}_\theta(x^{i}_{t_{\min}}, j - i)$, where this process uses the same noise $\varepsilon^{i}_{t_{\min}\to t_{\max}}$ as the previous ones. From this definition, $y^{K,K}_{t_{\min}} = x^{K}_{t_{\min}}$. We can thus bound

$$\|x^{K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\| \le \|y^{0,K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\| + \sum_{i=0}^{K-1}\|y^{i,K}_{t_{\min}} - y^{i+1,K}_{t_{\min}}\|. \qquad (16)$$

Using the assumption that $t s_\theta(\cdot, t)$ is $L_2$-Lipschitz,

$$\|y^{0,i+1}_{t_{\min}} - \hat x^{i+1}_{t_{\min}}\| = \|\mathrm{ODE}_\theta(y^{0,i}_{t_{\max}}, t_{\max}\to t_{\min}) - \mathrm{ODE}_\theta(\hat x^{i}_{t_{\max}}, t_{\max}\to t_{\min})\| \le e^{L_2(t_{\max}-t_{\min})}\|y^{0,i}_{t_{\max}} - \hat x^{i}_{t_{\max}}\| = e^{L_2(t_{\max}-t_{\min})}\|y^{0,i}_{t_{\min}} - \hat x^{i}_{t_{\min}}\|,$$

where the last equality is because we add the same additive Gaussian noise $\varepsilon^{i}_{t_{\min}\to t_{\max}}$ to $y^{0,i}_{t_{\min}}$ and $\hat x^{i}_{t_{\min}}$ in the Restart forward process.
Applying the above recursively, we get

$$\|y^{0,K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\| \le e^{K L_2(t_{\max}-t_{\min})}\|y^{0,0}_{t_{\min}} - \hat x^{0}_{t_{\min}}\| = e^{K L_2(t_{\max}-t_{\min})}\|x^{0}_{t_{\min}} - \hat x^{0}_{t_{\min}}\| \le e^{(K+1)L_2(t_{\max}-t_{\min})}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}), \qquad (17)$$

where the last line follows by Lemma 2 when the two processes share the same initial point $x^{0}_{t_{\max}}$ at $t_{\max}$. We will now bound $\|y^{i,K}_{t_{\min}} - y^{i+1,K}_{t_{\min}}\|$ for some $i \le K-1$. It follows from the definition that

$$y^{i,i+1}_{t_{\min}} = \mathrm{ODE}_\theta(x^{i}_{t_{\max}}, t_{\max}\to t_{\min}), \qquad y^{i+1,i+1}_{t_{\min}} = x^{i+1}_{t_{\min}} = \mathrm{ODE}(x^{i}_{t_{\max}}, t_{\max}\to t_{\min}).$$

By Lemma 2,

$$\|y^{i,i+1}_{t_{\min}} - y^{i+1,i+1}_{t_{\min}}\| \le e^{L_2(t_{\max}-t_{\min})}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}).$$

For the remaining steps from $i+2$ to $K$, both $y^{i,\cdot}$ and $y^{i+1,\cdot}$ evolve with $\mathrm{ODE}_\theta$ in each step. Again using the assumption that $t s_\theta(\cdot, t)$ is $L_2$-Lipschitz,

$$\|y^{i,K}_{t_{\min}} - y^{i+1,K}_{t_{\min}}\| \le e^{(K-i)L_2(t_{\max}-t_{\min})}\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}).$$

Summing the above for $i = 0, \dots, K-1$, and combining with Eq. (16) and Eq. (17), gives

$$\|x^{K}_{t_{\min}} - \hat x^{K}_{t_{\min}}\| \le e^{(K+1)L_2(t_{\max}-t_{\min})}(K+1)\left(\delta(L_2 L_1 + L_0) + \epsilon_{approx}\right)(t_{\max}-t_{\min}).$$

Lemma 4. Consider the same setup as Theorem 3. Let $\bar x_{t_{\min}} = \mathrm{SDE}(\bar x_{t_{\max}}, t_{\max}\to t_{\min})$ and let $x_{t_{\min}} = \mathrm{SDE}_\theta(x_{t_{\max}}, t_{\max}\to t_{\min})$. Then there exists a coupling between $\bar x_t$ and $x_t$ such that

$$\mathbb{E}\left[\|\bar x_{t_{\min}} - x_{t_{\min}}\|\right] \le e^{2L_2(t_{\max}-t_{\min})}\,\mathbb{E}\left[\|\bar x_{t_{\max}} - x_{t_{\max}}\|\right] + e^{2L_2(t_{\max}-t_{\min})}\left(\epsilon_{approx} + \delta L_0 + L_2\left(\delta L_1 + \sqrt{2\delta d\, t_{\max}}\right)\right)(t_{\max}-t_{\min}).$$

Proof. Consider some fixed arbitrary $k$, and recall that $\delta$ is the step size. By definition of SDE and $\mathrm{SDE}_\theta$, for $t \in ((k-1)\delta, k\delta]$,

$$\mathrm{d}\bar x_t = -2t\nabla\log p_t(\bar x_t)\,\mathrm{d}t + \sqrt{2t}\,\mathrm{d}W_t, \qquad \mathrm{d}x_t = -2t\,s_\theta(x_{k\delta}, k\delta)\,\mathrm{d}t + \sqrt{2t}\,\mathrm{d}W_t.$$

Let us define a coupling between $\bar x_t$ and $x_t$ by identifying their respective Brownian motions. It will be convenient to define the time-reversed processes $\bar x^{\leftarrow}_t := \bar x_{-t}$ and $x^{\leftarrow}_t := x_{-t}$, along with $v(x, t) := \nabla\log p_{-t}(x)$ and $\hat v(x, t) := s_\theta(x, -t)$. Then there exists a Brownian motion $B^{\leftarrow}_t$ such that, for $t \in [-t_{\max}, -t_{\min}]$,

$$\mathrm{d}\bar x^{\leftarrow}_t = -2t\,v(\bar x^{\leftarrow}_t, t)\,\mathrm{d}t + \sqrt{2|t|}\,\mathrm{d}B^{\leftarrow}_t, \qquad \mathrm{d}x^{\leftarrow}_t = -2t\,\hat v(x^{\leftarrow}_{k\delta}, k\delta)\,\mathrm{d}t + \sqrt{2|t|}\,\mathrm{d}B^{\leftarrow}_t,$$

so that

$$\mathrm{d}(\bar x^{\leftarrow}_t - x^{\leftarrow}_t) = -2t\left(v(\bar x^{\leftarrow}_t, t) - \hat v(x^{\leftarrow}_{k\delta}, k\delta)\right)\mathrm{d}t,$$

where $k$ is the unique negative integer such that $t \in [k\delta, (k+1)\delta)$. Thus

$$\frac{\mathrm{d}}{\mathrm{d}t}\mathbb{E}\left[\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\|\right] \le 2\left(\mathbb{E}\left[\|t v(\bar x^{\leftarrow}_t, t) - t\hat v(\bar x^{\leftarrow}_t, t)\|\right] + \mathbb{E}\left[\|t\hat v(\bar x^{\leftarrow}_t, t) - t\hat v(x^{\leftarrow}_t, t)\|\right]\right) + 2\left(\mathbb{E}\left[\|t\hat v(x^{\leftarrow}_t, t) - t\hat v(x^{\leftarrow}_t, k\delta)\|\right] + \mathbb{E}\left[\|t\hat v(x^{\leftarrow}_t, k\delta) - t\hat v(x^{\leftarrow}_{k\delta}, k\delta)\|\right]\right) \le 2\left(\epsilon_{approx} + L_2\,\mathbb{E}\left[\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\|\right] + \delta L_0 + L_2\,\mathbb{E}\left[\|x^{\leftarrow}_t - x^{\leftarrow}_{k\delta}\|\right]\right) \le 2\left(\epsilon_{approx} + L_2\,\mathbb{E}\left[\|\bar x^{\leftarrow}_t - x^{\leftarrow}_t\|\right] + \delta L_0 + L_2\left(\delta L_1 + \sqrt{2\delta d\, t_{\max}}\right)\right).$$

By Gronwall's Lemma,

$$\mathbb{E}\left[\|\bar x_{t_{\min}} - x_{t_{\min}}\|\right] \le e^{2L_2(t_{\max}-t_{\min})}\left(\mathbb{E}\left[\|\bar x_{t_{\max}} - x_{t_{\max}}\|\right] + \left(\epsilon_{approx} + \delta L_0 + L_2\left(\delta L_1 + \sqrt{2\delta d\, t_{\max}}\right)\right)(t_{\max}-t_{\min})\right).$$

A.4 Mixing Bounds

Lemma 5. Consider the same setup as Theorem 3. Assume that $\delta \le t_{\min}$. Let $x_{t_{\min}} = \mathrm{SDE}_\theta(x_{t_{\max}}, t_{\max}\to t_{\min})$ and $y_{t_{\min}} = \mathrm{SDE}_\theta(y_{t_{\max}}, t_{\max}\to t_{\min})$. Then there exists a coupling between $x_s$ and $y_s$ such that

$$TV(x_{t_{\min}}, y_{t_{\min}}) \le \left(1 - 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right) e^{-BL_1/t_{\min} - L_1^2 t_{\max}^2/t_{\min}^2}\right) TV(x_{t_{\max}}, y_{t_{\max}}).$$

Proof. We will construct a coupling between $x_t$ and $y_t$. First, let $(x_{t_{\max}}, y_{t_{\max}})$ be sampled from the optimal TV coupling, i.e., $\Pr(x_{t_{\max}} \neq y_{t_{\max}}) = TV(x_{t_{\max}}, y_{t_{\max}})$. Recall that by definition of $\mathrm{SDE}_\theta$, for $t \in ((k-1)\delta, k\delta]$,

$$\mathrm{d}x_t = -2t\,s_\theta(x_{k\delta}, k\delta)\,\mathrm{d}t + \sqrt{2t}\,\mathrm{d}W_t.$$

Let us define a time-rescaled version of $x_t$: $\tilde x_t := x_{\sqrt{t}}$ (so that $\tilde x_{t^2} = x_t$). We verify that

$$\mathrm{d}\tilde x_t = -s_\theta(\tilde x_{(k\delta)^2}, k\delta)\,\mathrm{d}t + \mathrm{d}B_t,$$

where $k$ is the unique integer satisfying $t \in [((k-1)\delta)^2, k^2\delta^2)$. Next, we define the time-reversed process $\tilde x^{\leftarrow}_t := \tilde x_{-t}$. We verify that there exists a Brownian motion $B^{x}_t$ such that, for $t \in [-t_{\max}^2, -t_{\min}^2]$,

$$\mathrm{d}\tilde x^{\leftarrow}_t = v^{x}_t\,\mathrm{d}t + \mathrm{d}B^{x}_t, \qquad v^{x}_t = s_\theta(\tilde x^{\leftarrow}_{-(k\delta)^2}, k\delta),$$

where $k$ is the unique positive integer satisfying $-t \in (((k-1)\delta)^2, (k\delta)^2]$. Let $\mathrm{d}\tilde y^{\leftarrow}_t = v^{y}_t\,\mathrm{d}t + \mathrm{d}B^{y}_t$ be defined analogously.
For any positive integer $k$ and for any $t \in [-(k\delta)^2, -((k-1)\delta)^2)$, let us define
$$z_t = x^{\leftarrow}_{-k^2\delta^2} - y^{\leftarrow}_{-k^2\delta^2} + (2k-1)\delta^2\left(v^x_{-(k\delta)^2} - v^y_{-(k\delta)^2}\right) + \left(B^x_t - B^x_{-(k\delta)^2}\right) - \left(B^y_t - B^y_{-(k\delta)^2}\right).$$
Let $\gamma_t := \frac{z_t}{\|z_t\|}$. We will now define a coupling between $dB^x_t$ and $dB^y_t$ as
$$dB^y_t = \left(I - 2\cdot\mathbf{1}\{t \le \tau\}\,\gamma_t\gamma_t^T\right)dB^x_t,$$
where $\mathbf{1}\{\cdot\}$ denotes the indicator function, i.e. $\mathbf{1}\{t \le \tau\} = 1$ if $t \le \tau$, and $\tau$ is a stopping time given by the first hitting time of $z_t = 0$. Let $r_t := \|z_t\|$. Consider some $t \in [-(i\delta)^2, -((i-1)\delta)^2)$, and let $j := t_{\max}/\delta$ (assume w.l.o.g. that this is an integer); then
$$r_t \le r_{-t_{\max}^2} + \sum_{k=i}^{j}(2k-1)\delta^2\left\|v^x_{-(k\delta)^2} - v^y_{-(k\delta)^2}\right\| + \int_{-t_{\max}^2}^{t}\mathbf{1}\{s \le \tau\}\,2\,dB^1_s$$
$$\le r_{-t_{\max}^2} + \sum_{k=i}^{j}\left(k^2 - (k-1)^2\right)\delta^2\cdot\frac{2L_1}{t_{\min}} + \int_{-t_{\max}^2}^{t}\mathbf{1}\{s \le \tau\}\,2\,dB^1_s$$
$$= r_{-t_{\max}^2} + \int_{-t_{\max}^2}^{-((i-1)\delta)^2}\frac{2L_1}{t_{\min}}\,ds + \int_{-t_{\max}^2}^{t}\mathbf{1}\{s \le \tau\}\,2\,dB^1_s,$$
where $B^1$ is a 1-dimensional Brownian motion with $2\,dB^1_s = \langle\gamma_s, dB^x_s - dB^y_s\rangle$. We also verify that
$$r_{-t_{\max}^2} = \|z_{-t_{\max}^2}\| = \left\|x^{\leftarrow}_{-t_{\max}^2} - y^{\leftarrow}_{-t_{\max}^2} + (2j-1)\delta^2\left(v^x_{-t_{\max}^2} - v^y_{-t_{\max}^2}\right) + \left(B^x_t - B^x_{-t_{\max}^2}\right) - \left(B^y_t - B^y_{-t_{\max}^2}\right)\right\|$$
$$\le \left\|x^{\leftarrow}_{-t_{\max}^2} + (2j-1)\delta^2 v^x_{-t_{\max}^2} + B^x_{-((j-1)\delta)^2} - B^x_{-t_{\max}^2}\right\| + \left\|y^{\leftarrow}_{-t_{\max}^2} + (2j-1)\delta^2 v^y_{-t_{\max}^2} + B^x_{-((j-1)\delta)^2} - B^x_t + B^y_t - B^y_{-t_{\max}^2}\right\|$$
$$\le B,$$
where the third relation is by adding and subtracting $B^x_{-((j-1)\delta)^2} - B^x_t$ and using the triangle inequality. The fourth relation is by noticing that $x^{\leftarrow}_{-t_{\max}^2} + (2j-1)\delta^2 v^x_{-t_{\max}^2} + B^x_{-((j-1)\delta)^2} - B^x_{-t_{\max}^2} = x^{\leftarrow}_{-((j-1)\delta)^2}$ and that $y^{\leftarrow}_{-t_{\max}^2} + (2j-1)\delta^2 v^y_{-t_{\max}^2} + B^x_{-((j-1)\delta)^2} - B^x_t + B^y_t - B^y_{-t_{\max}^2} \overset{d}{=} y^{\leftarrow}_{-((j-1)\delta)^2}$, and then using our assumption in the theorem statement that all processes are supported on a ball of radius $B/2$.

We now define a process $s_t$ by $ds_t = \frac{2L_1}{t_{\min}}dt + 2\,dB^1_t$, initialized at $s_{-t_{\max}^2} = B \ge r_{-t_{\max}^2}$. We can verify that, up to time $\tau$, $r_t \le s_t$ with probability 1. Let $\tau'$ denote the first-hitting time of $s_t$ to 0; then $\tau \le \tau'$ with probability 1. Thus
$$\Pr(\tau \le -t_{\min}^2) \ge \Pr(\tau' \le -t_{\min}^2) \ge 2Q\!\left(\frac{B}{2\sqrt{t_{\max}^2 - t_{\min}^2}}\right)e^{-BL_1/t_{\min} - L_1^2 t_{\max}^2/t_{\min}^2},$$
where we apply Lemma 6. The proof follows by noticing that, if $\tau \le -t_{\min}^2$, then $x_{t_{\min}} = y_{t_{\min}}$. This is because if $\tau \in [-k^2\delta^2, -(k-1)^2\delta^2]$, then $x^{\leftarrow}_{-(k-1)^2\delta^2} = y^{\leftarrow}_{-(k-1)^2\delta^2}$, and thus $x^{\leftarrow}_t = y^{\leftarrow}_t$ for all $t \ge -(k-1)^2\delta^2$, in particular at $t = -t_{\min}^2$.

Lemma 6. Consider the stochastic process $dr_t = dB^1_t + c\,dt$. Assume that $r_0 \le B/2$. Let $\tau$ denote the hitting time for $r_t = 0$. Then for any $T \in \mathbb{R}^+$,
$$\Pr(\tau \le T) \ge 2Q\!\left(\frac{B}{2\sqrt{T}}\right)e^{-Bc/2 - c^2 T/2},$$
where $Q$ is the tail probability of a standard Gaussian defined in Definition 1.

Proof. We will use the following facts in our proof:
1. For $x \sim N(0, \sigma^2)$, $\Pr(x > r) = Q(r/\sigma)$.
2. $\int_0^T \frac{a}{\sqrt{2\pi t^3}}e^{-\frac{a^2}{2t}}\,dt = \mathrm{erfc}\!\left(\frac{a}{\sqrt{2T}}\right) = 2\Pr\left(N(0, T) > a\right) = 2Q\!\left(\frac{a}{\sqrt{T}}\right)$, by definition of $Q$.

Let $dr_t = dB^1_t + c\,dt$, with $r_0 = a$. The density of the hitting time $\tau$ is given by
$$p(\tau = t) = f(a, c, t) = \frac{a}{\sqrt{2\pi t^3}}\exp\!\left(-\frac{(a + ct)^2}{2t}\right) \qquad (18)$$
(see e.g. [3]). From item 2 above,
$$\int_0^T f(a, 0, t)\,dt = 2Q\!\left(\frac{a}{\sqrt{T}}\right).$$
In the case of a general $c \neq 0$, we can write $\frac{(a + ct)^2}{2t} = \frac{a^2}{2t} + ac + \frac{c^2 t}{2}$. Consequently, $f(a, c, t) = f(a, 0, t)\,e^{-ac - c^2 t/2} \ge f(a, 0, t)\,e^{-ac - c^2 T/2}$ for $t \le T$, so
$$\Pr(\tau \le T) = \int_0^T f(a, c, t)\,dt \ge \int_0^T f(a, 0, t)\,dt\; e^{-ac - c^2 T/2} = 2Q\!\left(\frac{a}{\sqrt{T}}\right)e^{-ac - c^2 T/2} \ge 2Q\!\left(\frac{B}{2\sqrt{T}}\right)e^{-Bc/2 - c^2 T/2}.$$

A.5 TV Overlap

Definition 1. Let $x$ be sampled from the standard normal distribution $N(0, 1)$. We define the Gaussian tail probability $Q(a) := \Pr(x \ge a)$.

Lemma 7. We verify that for any two random vectors $\xi^x \sim N(0, \sigma^2 I)$ and $\xi^y \sim N(0, \sigma^2 I)$, each belonging to $\mathbb{R}^d$, the total variation distance between $x' = x + \xi^x$ and $y' = y + \xi^y$ is given by
$$TV(x', y') = 1 - 2Q(r) \le 1 - \frac{2r}{r^2 + 1}\cdot\frac{1}{\sqrt{2\pi}}e^{-r^2/2},$$
where $r = \frac{\|x - y\|}{2\sigma}$, and $Q(r) = \Pr(\xi \ge r)$ when $\xi \sim N(0, 1)$.

Proof. Let $\gamma := \frac{x - y}{\|x - y\|}$. We decompose $x', y'$ into the subspace/orthogonal space defined by $\gamma$:
$$x' = x_{\parallel} + \xi^x_{\parallel} + x_{\perp} + \xi^x_{\perp}, \qquad y' = y_{\parallel} + \xi^y_{\parallel} + y_{\perp} + \xi^y_{\perp},$$
where we define
$$x_{\parallel} := \gamma\gamma^T x, \quad x_{\perp} := x - x_{\parallel}, \quad y_{\parallel} := \gamma\gamma^T y, \quad y_{\perp} := y - y_{\parallel},$$
$$\xi^x_{\parallel} := \gamma\gamma^T \xi^x, \quad \xi^x_{\perp} := \xi^x - \xi^x_{\parallel}, \quad \xi^y_{\parallel} := \gamma\gamma^T \xi^y, \quad \xi^y_{\perp} := \xi^y - \xi^y_{\parallel}.$$
We verify the independence $\xi^x_{\parallel} \perp \xi^x_{\perp}$ and $\xi^y_{\parallel} \perp \xi^y_{\perp}$, as they are orthogonal decompositions of the standard Gaussian. We will define a coupling between $x'$ and $y'$ by setting $\xi^x_{\perp} = \xi^y_{\perp}$.
Under this coupling, we verify that
$$\left(x_{\perp} + \xi^x_{\perp}\right) - \left(y_{\perp} + \xi^y_{\perp}\right) = x - y - \gamma\gamma^T(x - y) = 0.$$
Therefore, $x' = y'$ if and only if $x_{\parallel} + \xi^x_{\parallel} = y_{\parallel} + \xi^y_{\parallel}$. Next, we draw $(a, b)$ from the optimal coupling between $N(0, 1)$ and $N\!\left(\frac{\|x - y\|}{\sigma}, 1\right)$. We verify that $x_{\parallel} + \xi^x_{\parallel}$ and $y_{\parallel} + \xi^y_{\parallel}$ both lie in the span of $\gamma$. Thus it suffices to compare $\langle\gamma, x_{\parallel} + \xi^x_{\parallel}\rangle$ and $\langle\gamma, y_{\parallel} + \xi^y_{\parallel}\rangle$. We verify that
$$\langle\gamma, x_{\parallel} + \xi^x_{\parallel}\rangle = \langle\gamma, y\rangle + \langle\gamma, x - y\rangle + \langle\gamma, \xi^x_{\parallel}\rangle \sim N\!\left(\langle\gamma, y\rangle + \|x - y\|,\, \sigma^2\right) \overset{d}{=} \langle\gamma, y\rangle + \sigma b.$$
We similarly verify that
$$\langle\gamma, y_{\parallel} + \xi^y_{\parallel}\rangle = \langle\gamma, y\rangle + \langle\gamma, \xi^y_{\parallel}\rangle \sim N\!\left(\langle\gamma, y\rangle,\, \sigma^2\right) \overset{d}{=} \langle\gamma, y\rangle + \sigma a.$$
Thus $TV(x', y') = TV(\sigma a, \sigma b) = 1 - 2Q\!\left(\frac{\|x - y\|}{2\sigma}\right)$. The last inequality in the lemma statement follows from $\Pr(N(0, 1) \ge r) \ge \frac{r}{r^2 + 1}\cdot\frac{1}{\sqrt{2\pi}}e^{-r^2/2}$.

B More on Restart Algorithm

B.1 EDM Discretization Scheme

[13] proposes a discretization scheme for the ODE given the starting time $t_{\max}$ and the end time $t_{\min}$. Denoting the number of steps as $N$, the EDM discretization scheme is
$$t_{i} = \left(t_{\max}^{\frac{1}{\rho}} + \frac{i}{N-1}\left(t_{\min}^{\frac{1}{\rho}} - t_{\max}^{\frac{1}{\rho}}\right)\right)^{\rho}, \quad i = 0, \dots, N-1,$$
with $t_0 = t_{\max}$ and $t_{N-1} = t_{\min}$. $\rho$ is a hyperparameter that determines the extent to which steps near $t_{\min}$ are shortened. We adopt the value $\rho = 7$ suggested by [13] in all of our experiments. We apply the EDM scheme to create a time discretization in each Restart interval $[t_{\min}, t_{\max}]$ in the Restart backward process, as well as in the main backward process between $[0, T]$ (by additionally setting $t_{\min} = 0.002$ and $t_N = 0$ as in [13]). It is important to note that $t_{\min}$ should be included within the list of time steps in the main backward process to seamlessly incorporate the Restart interval into the main backward process. We summarize the scheme as a function in Algorithm 1.

Algorithm 1 EDM_Scheme($t_{\min}$, $t_{\max}$, $N$, $\rho = 7$)
1: return $\left\{\left(t_{\max}^{\frac{1}{\rho}} + \frac{i}{N-1}\left(t_{\min}^{\frac{1}{\rho}} - t_{\max}^{\frac{1}{\rho}}\right)\right)^{\rho}\right\}_{i=0}^{N-1}$
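For concreteness, the discretization in Algorithm 1 can be written as a short NumPy function. The following is a minimal sketch under the conventions above (the function name `edm_scheme` is ours, not the repository implementation); the example evaluates the 3-point schedule on the Restart interval [0.06, 0.30] that appears in Table 3.

```python
import numpy as np

def edm_scheme(t_min, t_max, n, rho=7.0):
    """Algorithm 1: N time points decreasing from t_max to t_min, with
    steps near t_min shortened according to the exponent rho."""
    i = np.arange(n)
    return (t_max ** (1 / rho)
            + i / (n - 1) * (t_min ** (1 / rho) - t_max ** (1 / rho))) ** rho

# Example: the 3-point discretization of the Restart interval [0.06, 0.30]
# used by several CIFAR-10 (VP) configurations in Table 3.
print(edm_scheme(0.06, 0.30, 3))   # decreasing from 0.30 to 0.06
```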
B.2 Restart Algorithm

We present the pseudocode for the Restart algorithm in Algorithm 2. The pseudocode describes the more general case with an $l$-level Restart strategy. Each Restart segment is specified by the number of steps in the Restart backward process $N_{\mathrm{Restart}}$, the Restart interval $[t_{\min}, t_{\max}]$ and the number of Restart iterations $K$. We further denote the number of steps in the main backward process as $N_{\mathrm{main}}$. We use the EDM discretization scheme (Algorithm 1) to construct the time steps for the main backward process ($t_0 = T$, $t_{N_{\mathrm{main}}} = 0$) as well as for the Restart backward process, given the starting/end time and the number of steps. Although Heun's 2nd order method [2] (Algorithm 3) is the default ODE solver in the pseudocode, it can be substituted with other ODE solvers, such as Euler's method or the DPM solver [16]. The provided pseudocode in Algorithm 2 is tailored specifically to diffusion models [13]. To adapt Restart to other generative models such as PFGM++ [28], we only need to modify the Gaussian perturbation kernel in the Restart forward process (line 10 in Algorithm 2) to the one used in PFGM++.

Algorithm 2 Restart sampling
1: Input: Score network $s_\theta$, time steps in the main backward process $\{t_i\}_{i=0}^{N_{\mathrm{main}}}$, Restart parameters $\{(N_{\mathrm{Restart},j}, K_j, t_{\min,j}, t_{\max,j})\}_{j=1}^{l}$
2: Round $\{t_{\min,j}\}_{j=1}^{l}$ to their nearest neighbors in $\{t_i\}_{i=0}^{N_{\mathrm{main}}}$
3: Sample $x_0 \sim N(0, T^2 I)$
4: for $i = 0 \dots N_{\mathrm{main}} - 1$ do    ▷ Main backward process
5:  $x_{t_{i+1}}$ = OneStep_Heun($s_\theta$, $x_{t_i}$, $t_i$, $t_{i+1}$)    ▷ Running single step ODE
6:  if $\exists\, j \in \{1, \dots, l\}$ such that $t_{i+1} = t_{\min,j}$ then
7:   $t_{\min} = t_{\min,j}$, $t_{\max} = t_{\max,j}$, $K = K_j$, $N_{\mathrm{Restart}} = N_{\mathrm{Restart},j}$
8:   $x^0_{t_{\min}} = x_{t_{i+1}}$
9:   for $k = 0 \dots K - 1$ do    ▷ Restart for K iterations
10:    $\varepsilon_{t_{\min}\to t_{\max}} \sim N(0, (t_{\max}^2 - t_{\min}^2) I)$
11:    $x^{k+1}_{t_{\max}} = x^{k}_{t_{\min}} + \varepsilon_{t_{\min}\to t_{\max}}$    ▷ Restart forward process
12:    $\{\bar{t}_m\}_{m=0}^{N_{\mathrm{Restart}}-1}$ = EDM_Scheme($t_{\min}$, $t_{\max}$, $N_{\mathrm{Restart}}$)
13:    for $m = 0 \dots N_{\mathrm{Restart}} - 2$ do    ▷ Restart backward process
14:     $x^{k+1}_{\bar{t}_{m+1}}$ = OneStep_Heun($s_\theta$, $x^{k+1}_{\bar{t}_m}$, $\bar{t}_m$, $\bar{t}_{m+1}$)
15:    end for
16:   end for
17:  end if
18: end for
19: return $x_{t_{N_{\mathrm{main}}}}$

Algorithm 3 OneStep_Heun($s_\theta$, $x_{t_i}$, $t_i$, $t_{i+1}$)
1: $d_i = t_i\, s_\theta(x_{t_i}, t_i)$
2: $x_{t_{i+1}} = x_{t_i} - (t_{i+1} - t_i)\,d_i$
3: if $t_{i+1} \neq 0$ then
4:  $d'_i = t_{i+1}\, s_\theta(x_{t_{i+1}}, t_{i+1})$
5:  $x_{t_{i+1}} = x_{t_i} - (t_{i+1} - t_i)\left(\tfrac{1}{2}d_i + \tfrac{1}{2}d'_i\right)$
6: end if
7: return $x_{t_{i+1}}$

C Experimental Details

In this section, we discuss the configurations of the different samplers in detail. All experiments are conducted on eight NVIDIA A100 GPUs.

C.1 Configurations for Baselines

We select Vanilla SDE [23], Improved SDE [13] and Gonna Go Fast [12] as SDE baselines, and Heun's 2nd order method [2] (Algorithm 3) as the ODE baseline on the standard benchmarks CIFAR-10 and ImageNet 64×64. We choose DDIM [22], Heun's 2nd order method, and DDPM [9] for comparison on the Stable Diffusion model.

Vanilla SDE denotes the reverse-time SDE sampler in [23]. For Improved SDE, we use the recommended dataset-specific hyperparameters (e.g., $S_{\max}$, $S_{\min}$, $S_{\mathrm{churn}}$) in Table 5 of the EDM paper [13]; they obtained these hyperparameters by grid search. Gonna Go Fast [12] applied an adaptive step size technique on top of Vanilla SDE, and we directly report the FID scores listed in [12] for Gonna Go Fast on CIFAR-10 (VP). For a fair comparison, we use the EDM discretization scheme [13] for Vanilla SDE, Improved SDE, Heun, as well as Restart.

For the Stable Diffusion model, we borrow the hyperparameters, such as the discretization scheme and the initial noise scale, from the diffusers code repository (https://github.com/huggingface/diffusers). We directly use the DDIM and DDPM samplers implemented in the repository, and we apply the same set of hyperparameters to Heun and Restart.

C.2 Configurations for Restart

We report the configurations of Restart for different models and NFE on the standard benchmarks CIFAR-10 and ImageNet 64×64. The hyperparameters of Restart include the number of steps in the main backward process $N_{\mathrm{main}}$, the number of steps in the Restart backward process $N_{\mathrm{Restart}}$, the Restart interval $[t_{\min}, t_{\max}]$ and the number of Restart iterations $K$. For CIFAR-10 (VP) we provide the quintuplet $(N_{\mathrm{main}}, N_{\mathrm{Restart}}, K, t_{\min}, t_{\max})$ of each experiment in Table 3. Since we apply the multi-level Restart strategy on ImageNet 64×64, we provide $N_{\mathrm{main}}$ as well as a list of quadruples $\{(N_{\mathrm{Restart},i}, K_i, t_{\min,i}, t_{\max,i})\}_{i=1}^{l}$ ($l$ is the number of Restart intervals, depending on the experiment) in Table 5. In order to integrate the Restart time intervals into the main backward process, we round $t_{\min,i}$ to its nearest neighbor in the time steps of the main backward process, as shown in line 2 of Algorithm 2. We apply the Heun method in both the main and the Restart backward processes. In this case the NFE is
$$\mathrm{NFE} = \underbrace{2\, N_{\mathrm{main}} - 1}_{\text{main backward process}} + \sum_{i=1}^{l}\underbrace{K_i}_{\text{number of repetitions}}\cdot\underbrace{2\,(N_{\mathrm{Restart},i} - 1)}_{\text{per iteration in the } i\text{-th Restart interval}}.$$
Inspired by [13], we inflate the additive noise in the Restart forward process by multiplying it by $S_{\mathrm{noise}} = 1.003$ on ImageNet 64×64, to counteract the over-denoising tendency of the neural network. We also observe that setting $\gamma = 0.05$ in Algorithm 2 of EDM [13] slightly boosts the Restart performance on ImageNet 64×64 when $t \in [0.01, 1]$.
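To make this accounting concrete, the short sketch below reproduces the NFE values reported in Tables 3 and 5. The helper name is ours; it simply encodes the formula above (two evaluations per Heun step, minus one for the final step to $t = 0$ in the main process).

```python
def restart_nfe(n_main, restart_cfgs):
    """NFE when Heun is used in both the main and the Restart backward
    processes: 2*N_main - 1 for the main process (the last step to t=0 skips
    the second-order correction), plus K_i * 2*(N_Restart_i - 1) for the K_i
    repetitions of the i-th Restart interval.

    restart_cfgs: list of (N_Restart_i, K_i) pairs.
    """
    return 2 * n_main - 1 + sum(k * 2 * (n_restart - 1)
                                for n_restart, k in restart_cfgs)

# Single-level CIFAR-10 (VP) configuration from Table 3: (N_main, N_Restart, K) = (18, 3, 2).
assert restart_nfe(18, [(3, 2)]) == 43
# Five-level ImageNet 64x64 configuration from Table 5 with N_main = 36.
assert restart_nfe(36, [(10, 3), (10, 3), (7, 6), (7, 6), (7, 25)]) == 623
```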
We further include the configurations of Restart on the Stable Diffusion model in Table 10, with varying guidance weight $w$. Similar to ImageNet 64×64, we use multi-level Restart with a fixed number of steps $N_{\mathrm{main}} = 30$ in the main backward process. We utilize the Euler method for the main backward process and the Heun method for the Restart backward process, as our empirical observations indicate that the Heun method does not yield significant improvements over the Euler method yet requires double the number of steps. The number of sampling steps equals $N_{\mathrm{main}} + \sum_{i=1}^{l} K_i\cdot 2\,(N_{\mathrm{Restart},i} - 1)$ in this case. We set the total number of steps to 66, including the main backward process and the Restart backward processes.

Given the prohibitively large search space for each Restart quadruple, a comprehensive enumeration of all possibilities is impractical due to computational limitations. Instead, we adjust the configurations manually, guided by the heuristic that weaker/smaller models or more challenging tasks necessitate a stronger Restart strength (e.g., larger $K$, wider Restart intervals). On average, we select the best configuration out of five candidate sets for each experiment; these few trials already empirically outperform previous SDE/ODE samplers. We believe that developing a systematic approach for determining Restart configurations could be of significant value in the future.

C.3 Pre-trained Models

For the CIFAR-10 dataset, we use the pre-trained VP and EDM models from the EDM repository (https://github.com/NVlabs/edm), and the PFGM++ ($D = 2048$) model from the PFGM++ repository (https://github.com/Newbeeer/pfgmpp). For ImageNet 64×64, we borrow the pre-trained EDM model from the EDM repository as well.

C.4 Classifier-free Guidance

We follow the convention in [20], where each step of classifier-free guidance uses
$$\tilde{s}_\theta(x, c, t) = w\, s_\theta(x, c, t) + (1 - w)\, s_\theta(x, t),$$
where $c$ is the condition, and $s_\theta(x, c, t)$ / $s_\theta(x, t)$ are the conditional/unconditional models, sharing parameters. Increasing $w$ strengthens the effect of guidance, usually leading to better text-image alignment [20].

C.5 More on the Synthetic Experiment

C.5.1 Discrete Dataset

We generate the underlying discrete dataset $S$ with $|S| = 2000$ as follows. First, we sample 2000 points, denoted as $S_1$, from a mixture of two Gaussians in $\mathbb{R}^4$. Next, we project these points onto $\mathbb{R}^{20}$. To ensure a variance of 1 in each dimension, we scale the coordinates accordingly. This setup aims to simulate data points that primarily reside on a lower-dimensional manifold with multiple modes. The specific details are as follows: $S_1 \sim 0.3\, N(a, s^2 I) + 0.7\, N(-a, s^2 I)$, where $a = (3, 3, 3, 3) \in \mathbb{R}^4$ and $s = 1$. Then, we randomly select a projection matrix $P \in \mathbb{R}^{20\times 4}$, where each entry is drawn from $N(0, 1)$, and compute $S_2 = P S_1$. Finally, we scale each coordinate by a constant factor to ensure a variance of 1.
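The construction above can be summarized in a few lines of NumPy. This is an illustrative sketch only: the random seed and variable names are ours, and the particular draws will differ from the dataset actually used.

```python
import numpy as np

rng = np.random.default_rng(0)           # hypothetical seed for reproducibility
n, d_low, d_high, s = 2000, 4, 20, 1.0
a = np.full(d_low, 3.0)

# S1: mixture 0.3*N(a, s^2 I) + 0.7*N(-a, s^2 I) in R^4.
mix = rng.random(n) < 0.3
means = np.where(mix[:, None], a, -a)
S1 = means + s * rng.standard_normal((n, d_low))

# Random projection to R^20, then rescale each coordinate to unit variance.
P = rng.standard_normal((d_high, d_low))
S2 = S1 @ P.T
S = S2 / S2.std(axis=0, keepdims=True)
print(S.shape, S.std(axis=0)[:3])        # (2000, 20), ~1.0 per coordinate
```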
Figure 8: Comparison of additional sampling error versus (a) contracted error (plotting the Pareto frontier) and (b) total error (using a scatter plot). (c) Pareto frontier of NFE versus total error.

C.5.2 Model Architecture

We employ a common MLP architecture with a latent size of 64 to learn the score function. The training method is adapted from [13], and includes the preconditioning technique and the denoising score-matching objective [25].

C.5.3 Varying Hyperparameters

To achieve the best trade-off between contracted error and additional sampling error, and to optimize the NFE versus FID (Fréchet Inception Distance) performance, we explore various hyperparameters. [13] shows that the Vanilla SDE can be endowed with additional flexibility by varying the coefficient $\beta(t)$ (Eq. (6) in [13]). Hence, for the SDE, we consider NFE values from $\{20, 40, 80, 160, 320\}$, and multiply the original $\beta(t) = \dot{\sigma}(t)/\sigma(t)$ [13] by values from $\{0, 0.25, 0.5, 1, 1.5, 2, 4, 8\}$. It is important to note that larger NFE values do not lead to further performance improvements. For Restart, we try the following two settings: first, we set the number of steps in the Restart backward process to 40 and vary the number of Restart iterations $K$ in the range $\{0, 5, 10, 15, 20, 25, 30, 35\}$; we also conduct a grid search with the number of Restart iterations $K$ ranging from 5 to 25 and the number of steps in the Restart backward process varying from 2 to 7. For the ODE, we experiment with the number of steps set to $\{20, 40, 80, 160, 320, 640\}$.

Additionally, we conduct an experiment for Improved SDE in EDM. We try different values of $S_{\mathrm{churn}}$ in the range $\{0, 1, 2, 4, 8, 16, 32, 48, 64\}$. We also perform a grid search where the number of steps ranges from 20 to 320 and $S_{\mathrm{churn}}$ takes values in $\{0.2\cdot\mathrm{steps}, 0.5\cdot\mathrm{steps}, 20, 60\}$. The plot combines the results from the SDE and is displayed in Figure 8.

To mitigate the impact of randomness, we collect the data by averaging the results of five runs with the same hyperparameters. To compute the Wasserstein distance between two discrete distributions, we use minimum weight matching.

C.5.4 Plotting the Pareto frontier

We generate the Pareto frontier plots as follows. For the additional sampling error versus contracted error plot, we first sort all the data points based on their additional sampling error and then connect the data points that represent prefix minimums of the contracted error. Similarly, for the NFE versus FID plot, we sort the data points based on their NFE values and connect the points where the FID is a prefix minimum.
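The prefix-minimum construction can be expressed compactly as follows; this is an illustrative sketch (the helper name and the toy input values are ours).

```python
def pareto_frontier(points):
    """Points are (x, y) pairs, e.g. (NFE, FID) or
    (additional sampling error, contracted error). Sort by x and keep the
    points whose y is a running (prefix) minimum."""
    frontier, best = [], float("inf")
    for x, y in sorted(points):
        if y < best:
            best = y
            frontier.append((x, y))
    return frontier

# Toy usage with made-up (NFE, error) pairs.
print(pareto_frontier([(20, 1.2), (40, 0.9), (80, 1.0), (160, 0.7)]))
# -> [(20, 1.2), (40, 0.9), (160, 0.7)]
```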
D Extra Experimental Results

D.1 Numerical Results

In this section, we provide the numerical results corresponding to Fig. 3(a) and Fig. 3(b), in Tables 2 and 3 (CIFAR-10 VP, EDM, PFGM++) and Tables 4 and 5 (ImageNet 64×64 EDM), respectively. We also include the performance of Vanilla SDE in those tables. For the evaluation, we compute the Fréchet distance between 50000 generated samples and the pre-computed statistics of CIFAR-10 and ImageNet 64×64. We follow the evaluation protocol in EDM [13], which calculates each FID score three times with different seeds and reports the minimum.

We also provide the numerical results on the Stable Diffusion model [19], with classifier-free guidance weights $w = 2, 3, 5, 8$, in Tables 6, 7, 8 and 9. As in [17], we report the zero-shot FID score on 5K random prompts sampled from the COCO validation set. We evaluate the CLIP score [6] with the open-source ViT-g/14 model [11], and the Aesthetic score with the more recent LAION-Aesthetics Predictor V2 (https://github.com/christophschuhmann/improved-aesthetic-predictor). We average the CLIP and Aesthetic scores over 5K generated samples. The number of function evaluations is two times the number of sampling steps in the Stable Diffusion model, since each sampling step involves evaluating both the conditional and the unconditional model.

Table 2: CIFAR-10 sample quality (FID score) and number of function evaluations (NFE) on VP [23] for baselines

Method               NFE / FID
ODE (Heun) [13]      1023 / 2.90   511 / 2.90   255 / 2.90   127 / 2.90   63 / 2.89   35 / 2.97
Vanilla SDE [23]     1024 / 2.79   512 / 4.01   256 / 4.79   128 / 12.57
Gonna Go Fast [12]   1000 / 2.55   329 / 2.70   274 / 2.74   179 / 2.59   147 / 2.95   49 / 72.29
Improved SDE [13]    1023 / 2.35   511 / 2.37   255 / 2.40   127 / 2.58   63 / 2.88   35 / 3.45

Table 3: CIFAR-10 sample quality (FID score), number of function evaluations (NFE) and Restart configurations on VP [23], EDM [13] and PFGM++ [28]

Method             NFE   FID    Configuration (N_main, N_Restart, K, t_min, t_max)
Restart (VP)       519   2.11   (20, 9, 30, 0.06, 0.20)
Restart (VP)       115   2.21   (18, 3, 20, 0.06, 0.30)
Restart (VP)       75    2.27   (18, 3, 10, 0.06, 0.30)
Restart (VP)       55    2.45   (18, 3, 5, 0.06, 0.30)
Restart (VP)       43    2.70   (18, 3, 2, 0.06, 0.30)
Restart (EDM)      43    1.90   (18, 3, 2, 0.14, 0.30)
Restart (PFGM++)   43    1.88   (18, 3, 2, 0.14, 0.30)

Table 4: ImageNet 64×64 sample quality (FID score) and number of function evaluations (NFE) on EDM [13] for baselines

Method               NFE / FID (50k)
ODE (Heun) [13]      1023 / 2.24   511 / 2.24   255 / 2.24   127 / 2.25   63 / 2.30   35 / 2.46
Vanilla SDE [23]     1024 / 1.89   512 / 3.38   256 / 11.91  128 / 59.71
Improved SDE [13]    1023 / 1.40   511 / 1.45   255 / 1.50   127 / 1.75   63 / 2.24   35 / 2.97

Table 5: ImageNet 64×64 sample quality (FID score), number of function evaluations (NFE) and Restart configurations on EDM [13]

NFE   FID (50k)   Configuration: N_main, {(N_Restart,i, K_i, t_min,i, t_max,i)}_{i=1}^{l}
623   1.36        36, {(10, 3, 19.35, 40.79), (10, 3, 1.09, 1.92), (7, 6, 0.59, 1.09), (7, 6, 0.30, 0.59), (7, 25, 0.06, 0.30)}
535   1.39        36, {(6, 1, 19.35, 40.79), (6, 1, 1.09, 1.92), (7, 6, 0.59, 1.09), (7, 6, 0.30, 0.59), (7, 25, 0.06, 0.30)}
385   1.41        36, {(3, 1, 19.35, 40.79), (6, 1, 1.09, 1.92), (6, 5, 0.59, 1.09), (6, 5, 0.30, 0.59), (6, 20, 0.06, 0.30)}
203   1.46        36, {(4, 1, 19.35, 40.79), (4, 1, 1.09, 1.92), (4, 5, 0.59, 1.09), (4, 5, 0.30, 0.59), (6, 6, 0.06, 0.30)}
165   1.51        18, {(3, 1, 19.35, 40.79), (4, 1, 1.09, 1.92), (4, 5, 0.59, 1.09), (4, 5, 0.30, 0.59), (4, 10, 0.06, 0.30)}
99    1.71        18, {(3, 1, 19.35, 40.79), (4, 1, 1.09, 1.92), (4, 4, 0.59, 1.09), (4, 1, 0.30, 0.59), (4, 4, 0.06, 0.30)}
67    1.95        18, {(5, 1, 19.35, 40.79), (5, 1, 1.09, 1.92), (5, 1, 0.59, 1.09), (5, 1, 0.06, 0.30)}
39    2.38        14, {(3, 1, 19.35, 40.79), (3, 1, 1.09, 1.92), (3, 1, 0.06, 0.30)}

Table 6: Numerical results on Stable Diffusion v1.5 with a classifier-free guidance weight w = 2

Method      Steps   FID (5k)   CLIP score   Aesthetic score
DDIM [22]   50      16.08      0.2905       5.13
DDIM [22]   100     15.35      0.2920       5.15
Heun        51      18.80      0.2865       5.14
Heun        101     18.21      0.2871       5.15
DDPM [9]    100     13.53      0.3012       5.20
DDPM [9]    200     13.22      0.2999       5.19
Restart     66      13.16      0.2987       5.19

Table 7: Numerical results on Stable Diffusion v1.5 with a classifier-free guidance weight w = 3

Method      Steps   FID (5k)   CLIP score   Aesthetic score
DDIM [22]   50      14.28      0.3056       5.22
DDIM [22]   100     14.30      0.3056       5.22
Heun        51      15.63      0.3022       5.20
Heun        101     15.40      0.3026       5.21
DDPM [9]    100     15.72      0.3129       5.28
DDPM [9]    200     15.13      0.3131       5.28
Restart     66      14.48      0.3079       5.25

Table 8: Numerical results on Stable Diffusion v1.5 with a classifier-free guidance weight w = 5

Method      Steps   FID (5k)   CLIP score   Aesthetic score
DDIM [22]   50      16.60      0.3154       5.31
DDIM [22]   100     16.80      0.3157       5.31
Heun        51      16.26      0.3135       5.28
Heun        101     16.38      0.3136       5.29
DDPM [9]    100     19.62      0.3197       5.36
DDPM [9]    200     18.88      0.3200       5.35
Restart     66      16.21      0.3179       5.33

Table 9: Numerical results on Stable Diffusion v1.5 with a classifier-free guidance weight w = 8

Method      Steps   FID (5k)   CLIP score   Aesthetic score
DDIM [22]   50      19.83      0.3206       5.37
DDIM [22]   100     19.82      0.3200       5.37
Heun        51      18.44      0.3186       5.35
Heun        101     18.72      0.3185       5.36
DDPM [9]    100     22.58      0.3223       5.39
DDPM [9]    200     21.67      0.3212       5.38
Restart     47      18.40      0.3228       5.41
D.2 Sensitivity Analysis of Hyper-parameters

We also investigate the impact of varying $t_{\min}$ when $t_{\max} = t_{\min} + 0.3$, and of the length of the Restart interval when $t_{\min} = 0.06$. Fig. 10(a) reveals that the FID score achieves its minimum at a $t_{\min}$ close to 0 on VP, indicating higher accumulated errors at the end of sampling and poor neural estimations at small $t$. Note that the Restart interval 0.3 is about twice the length of the one in Table 1, and Restart does not outperform the ODE baseline on EDM. This suggests that, as a rule of thumb, we should apply greater Restart strength (e.g., larger $K$, larger $t_{\max} - t_{\min}$) for weaker or smaller architectures, and vice versa. In theory, a longer interval enhances contraction but may add more additional sampling error. Again, the balance between these factors results in a V-shaped trend in our plots (Fig. 10(b)). In practice, selecting $t_{\max}$ close to the dataset's radius usually ensures effective mixing when $t_{\min}$ is small.

Table 10: Restart (Steps = 66) configurations on Stable Diffusion v1.5

w   Configuration: N_main, {(N_Restart,i, K_i, t_min,i, t_max,i)}_{i=1}^{l}
2   30, {(5, 2, 1, 9), (5, 2, 5, 10)}
3   30, {(10, 2, 0.1, 3)}
5   30, {(10, 2, 0.1, 2)}
8   30, {(10, 2, 0.1, 2)}

Figure 9: FID score versus (a) CLIP ViT-g/14 score and (b) Aesthetic score for text-to-image generation at 512×512 resolution, using Stable Diffusion v1.5 with varying classifier-free guidance weight w = 2, 3, 5, 8.

Figure 10: (a) Adjusting $t_{\min}$ in Restart on VP/EDM; (b) adjusting the Restart interval length when $t_{\min} = 0.06$.

E Extended Generated Images

In this section, we provide extended sets of images generated by Restart, DDIM, Heun and DDPM with the text-to-image Stable Diffusion v1.5 model [19]. We showcase samples for four text prompts in Fig. 11, Fig. 12, Fig. 13 and Fig. 14, with a classifier-free guidance weight w = 8.

Figure 11: Generated images with text prompt "A photo of an astronaut riding a horse on mars" and w = 8. Panels: (a) Restart (Steps=66), (b) DDIM (Steps=100), (c) Heun (Steps=101), (d) DDPM (Steps=100).

Figure 12: Generated images with text prompt "A raccoon playing table tennis" and w = 8. Panels: (a) Restart (Steps=66), (b) DDIM (Steps=100), (c) Heun (Steps=101), (d) DDPM (Steps=100).

F Heun's method is DPM-Solver-2 (with r2 = 1)

The first order ODE in DPM-Solver [16] (DPM-Solver-1) has the form
$$\hat{x}_{t_{i-1}} = \frac{\alpha_{t_{i-1}}}{\alpha_{t_i}}\hat{x}_{t_i} - \left(\hat{\sigma}_{t_{i-1}} - \frac{\alpha_{t_{i-1}}}{\alpha_{t_i}}\hat{\sigma}_{t_i}\right)\hat{\sigma}_{t_i}\nabla_x\log p_{\hat{\sigma}_{t_i}}(\hat{x}_{t_i}). \qquad (19)$$
The first order ODE in EDM has the form
$$x_{t_{i-1}} = x_{t_i} - \left(\sigma_{t_{i-1}} - \sigma_{t_i}\right)\sigma_{t_i}\nabla_x\log p_{\sigma_{t_i}}(x_{t_i}). \qquad (20)$$
When $x_t = \hat{x}_t/\alpha_t$ and $\hat{\sigma}_t = \sigma_t\alpha_t$, we can rewrite DPM-Solver-1 (Eq. (19)) as
$$x_{t_{i-1}} = x_{t_i} - \left(\sigma_{t_{i-1}} - \sigma_{t_i}\right)\hat{\sigma}_{t_i}\nabla_x\log p_{\hat{\sigma}_{t_i}}(\hat{x}_{t_i})$$
$$= x_{t_i} - \left(\sigma_{t_{i-1}} - \sigma_{t_i}\right)\hat{\sigma}_{t_i}\nabla_x\log p_{\sigma_{t_i}}(x_{t_i})\cdot\frac{1}{\alpha_{t_i}} \qquad \text{(change-of-variable)}$$
$$= x_{t_i} - \left(\sigma_{t_{i-1}} - \sigma_{t_i}\right)\sigma_{t_i}\nabla_x\log p_{\sigma_{t_i}}(x_{t_i}),$$
which is exactly the same as the ODE step in EDM (Eq. (20)). This indicates that the sampling trajectory of DPM-Solver-1 is equivalent to the one in EDM, up to a time-dependent scaling ($\alpha_t$). As $\lim_{t\to 0}\alpha_t = 1$, the two solvers lead to the same final points when using the same time discretization.
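As a quick sanity check of the change-of-variable argument, the toy script below compares one step of Eq. (20) with one step of Eq. (19) on a 1-D example where the data distribution is a standard Gaussian, so both scores are available in closed form. The data distribution, the $\alpha$ values and all numbers are arbitrary illustrative choices rather than the schedules used in the experiments.

```python
import numpy as np

# Toy data distribution x0 ~ N(0, 1), so the perturbed scores are analytic.
def score_edm(x, sigma):                     # score of p_sigma = N(0, 1 + sigma^2)
    return -x / (1.0 + sigma ** 2)

def score_scaled(x_hat, alpha, sigma_hat):   # score of N(0, alpha^2 + sigma_hat^2)
    return -x_hat / (alpha ** 2 + sigma_hat ** 2)

# One step t_i -> t_{i-1}; sigma and alpha values are arbitrary illustrative choices.
sigma_i, sigma_prev = 2.0, 1.5
alpha_i, alpha_prev = 0.6, 0.7
x_i = 0.8

# EDM step, Eq. (20).
x_prev_edm = x_i - (sigma_prev - sigma_i) * sigma_i * score_edm(x_i, sigma_i)

# DPM-Solver-1 step, Eq. (19), in the scaled variables x_hat = alpha*x, sigma_hat = sigma*alpha.
x_hat_i = alpha_i * x_i
sh_i, sh_prev = sigma_i * alpha_i, sigma_prev * alpha_prev
ratio = alpha_prev / alpha_i
x_hat_prev = ratio * x_hat_i - (sh_prev - ratio * sh_i) * sh_i * score_scaled(x_hat_i, alpha_i, sh_i)

# The two trajectories agree up to the time-dependent scaling alpha_{t_{i-1}}.
assert np.isclose(x_hat_prev / alpha_prev, x_prev_edm)
print(x_prev_edm, x_hat_prev / alpha_prev)   # both 0.64 for these values
```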
Note that DPM-Solver-1 is also equivalent to DDIM (c.f. Section 4.1 in [16]), which is likewise used in this paper. With that, we can further verify that the Heun method used in this paper corresponds to DPM-Solver-2 when setting r1 = 1.

Figure 13: Generated images with text prompt "Intricate origami of a fox in a snowy forest" and w = 8. Panels: (a) Restart (Steps=66), (b) DDIM (Steps=100), (c) Heun (Steps=101), (d) DDPM (Steps=100).

Figure 14: Generated images with text prompt "A transparent sculpture of a duck made out of glass" and w = 8. Panels: (a) Restart (Steps=66), (b) DDIM (Steps=100), (c) Heun (Steps=101), (d) DDPM (Steps=100).

G Broader Impact

The field of deep generative models incorporating differential equations is rapidly evolving and holds significant potential to shape our society. Nowadays, a multitude of photo-realistic images generated by text-to-image Stable Diffusion models populate the internet. Our work introduces Restart, a novel sampling algorithm that outperforms previous samplers for diffusion models and PFGM++. With applications extending across diverse areas, the Restart sampling algorithm is especially suitable for generation tasks demanding high quality and rapid speed. Yet, it is crucial to recognize that the utilization of such algorithms can yield both positive and negative repercussions, contingent on their specific applications. On the one hand, Restart sampling can facilitate the generation of highly realistic images and audio samples, potentially advancing sectors such as entertainment, advertising, and education. On the other hand, it could also be misused in deepfake technology, potentially leading to social scams and misinformation. In light of these potential risks, further research is required to develop robustness guarantees for generative models, ensuring that their use aligns with ethical guidelines and societal interests.