Improving Consistency Models with Generator-Augmented Flows

Thibaut Issenhuth 1, Sangchul Lee 2, Ludovic Dos Santos 1, Jean-Yves Franceschi 1, Chansoo Kim 2 3, Alain Rakotomamonjy 1 4

1 Criteo AI Lab, Paris, France. 2 AI, Information and Reasoning (AI/R) Laboratory, Korea Institute of Science and Technology. 3 AI and Robot Department, University of Science and Technology, Korea. 4 LITIS, Univ Rouen-Normandie. Correspondence to: Thibaut Issenhuth, Chansoo Kim.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model. We prove that this flow reduces the previously identified discrepancy and the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance. The code is available at: github.com/thibautissenhuth/consistency GC.

1. Introduction

A large family of diffusion (Ho et al., 2020), score-based (Song et al., 2021; Karras et al., 2022), and flow models (Liu et al., 2023; Lipman et al., 2023) have emerged as state-of-the-art generative models for image generation. Since they are costly to use at inference time, requiring several neural function evaluations, many distillation techniques have been explored (Salimans and Ho, 2022; Meng et al., 2023; Sauer et al., 2023). One of the most remarkable approaches is consistency models (Song et al., 2023; Song and Dhariwal, 2024). Consistency models lead to high-quality one-step generators that can be trained either by distillation of a pre-trained velocity field (consistency distillation), or as standalone generative models (consistency training) by approximating the velocity field through a one-sample Monte Carlo estimate.

The corresponding estimation error naturally induces a discrepancy between consistency distillation and training. While Song et al. (2023) hinted that it would resolve in the continuous-time limit, we show that this discrepancy persists in both the gradients and the values of the loss functions. Interestingly, this discrepancy vanishes when the difference between the target velocity field and its Monte Carlo approximation approaches zero. However, this is not the case with the independent coupling (IC) between data and noise used to construct the standard estimate, and it is unclear how to improve this one-sample estimate without access to the true underlying diffusion model. The approach we adopt in this paper to alleviate this issue involves altering the velocity field (thereby changing the target flow) to reduce the variance of its one-sample estimator. One possible solution to this problem is to resort to optimal transport (OT) to learn on a deterministic coupling.
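Before turning to coupling-based remedies (discussed next), the toy script below makes the one-sample estimation error concrete in the same setting as Figure 1: a mixture of two Diracs under the EDM schedule σ_t = t. Under these assumptions the true velocity field is available in closed form through the posterior-mean denoiser, so the gap between it and the single-sample estimate ẋ_t = z can be measured directly. This is an illustrative sketch, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([-1.0, 1.0])            # two-Dirac data distribution with equal weights

def velocity(x, t):
    """Closed-form PF-ODE velocity v_t(x) = E[z | x_t = x] = (x - E[x_star | x_t = x]) / t."""
    w = np.exp(-(x[:, None] - data[None, :]) ** 2 / (2 * t**2))
    denoiser = (w * data).sum(axis=1) / w.sum(axis=1)   # posterior mean of the data point
    return (x - denoiser) / t

for t in [0.1, 1.0, 10.0]:
    x_star = rng.choice(data, size=100_000)             # independent coupling (IC)
    z = rng.standard_normal(100_000)
    x_t = x_star + t * z                                # diffusion process x_t = x_star + t * z
    gap = np.mean((z - velocity(x_t, t)) ** 2)          # squared gap between estimate and true velocity
    print(f"t = {t:5.1f}   E[(x_dot - v_t(x_t))^2] = {gap:.3f}")
```

Section 4 replaces the independently drawn x_star by a generator-predicted endpoint precisely to shrink this gap.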
OT has been successfully adopted in diffusion (Li et al., 2024), consistency (Dou et al., 2024), and flow matching (Pooladian et al., 2023) models. However, due to the prohibitive cubic complexity of OT solvers (e.g., the Hungarian matching algorithm), such methods need to be applied at the minibatch level. This incurs an OT approximation error (Fatras et al., 2021; Sommerfeld et al., 2019) and stochasticity of the data-noise coupling, thus not solving the consistency training issue.

In our approach, we propose to use the consistency model, assumed to be an approximation of the target diffusion flow, to construct additional trajectories. The consistency model serves as a proxy to reduce the expected deviation between the velocity field and its estimator. More precisely, from an intermediate point computed from an IC, we let the consistency model predict the corresponding endpoint, supposedly close to the data distribution. This predicted endpoint is coupled to the same original noise vector, defining a generator-augmented coupling (GC).

(a) PF-ODE (IC). (b) Generator-augmented flows (GC).
Figure 1. Comparison of the probability flow ODE (PF-ODE) and generator-augmented flows (GC): target data is a mixture of two Dirac delta functions, and GC is computed with a closed-form generator. In the background, we observe the density of probability paths. White arrows are ODE trajectories associated with the velocity field. Blue lines are sample paths from IC in (a) and from GC in (b). Trajectories start from random intermediate points. On this example, GC sample paths appear more aligned with the velocity field.

We show empirically that the resulting generator-augmented flow presents compelling properties for training consistency models, in particular a reduced deviation between the velocity field and its estimator, and decreased transport costs, as supported by theoretical and empirical evidence. This can be observed in Figure 1. From this, we derive practical algorithms to train consistency models with generator-augmented flows, leading to improved performance and faster convergence compared to standard and OT-based consistency models. Let us summarize our contributions below.

- We prove that, in the continuous-time limit, the consistency training and consistency distillation loss functions converge to different values, and we provide a closed-form expression of this discrepancy.
- We propose a novel type of flow that we denote generator-augmented flows. It relies on a generator-augmented coupling (GC) that can be used to train a consistency model.
- We provide theoretical and empirical insights into the advantages of GC. We show that generator-augmented flows have a smaller discrepancy to consistency distillation than IC consistency training, and that they reduce data-noise transport costs.
- We derive practical ways to train consistency models with GC. Our approach, based on a joint learning strategy, leads to faster convergence and improved performance compared to the base model and OT-based approaches on image generation benchmarks.

Notation. We consider an empirical data distribution p⋆ and a noise distribution p_z (e.g., Gaussian), both defined on R^d. We denote by q a joint distribution of samples from p⋆ and p_z. We equip R^d with the dot product ⟨x, y⟩ = x^⊤ y and write ∥x∥ = ⟨x, x⟩^{1/2} for the Euclidean norm of x. We use a distance function D : R^d × R^d → [0, ∞) to measure the distance between two points of R^d. sg denotes the stop-gradient operator.
In consistency models, we consider diffusion processes of the form xt = x + σtz, where x p , z pz, and σt is monotonically increasing for t [0, T], where T R+. We denote the distribution of xt by p(xt), or simply pt. Conditional distributions or finite-dimensional joint distributions of xt s are denoted similarly. When considering a discrete formulation with N intermediate timesteps, we denote the intermediate points as xti = x + σtiz, where ti is strictly increasing for i {0, . . . , N}, with t0 = 0 and t N = T. The values of σ0 and σT are chosen to be sufficiently small and large, respectively, so that p0 p and p T N(0, σ2 T I). 2. Consistency Distillation Versus Training In this section, we provide the required background on diffusion and consistency models (Sections 2.1 and 2.2), then discuss the discrepancy between consistency distillation and consistency training (Section 2.3) which we theoretically characterize in continuous-time. Improving Consistency Models with Generator-Augmented Flows 2.1. Flow and Score-Based Diffusion Models Score-based diffusion models (Ho et al., 2020; Song et al., 2021) can generate data from noise via a multi-step process consisting in numerically solving either a stochastic differential equation (SDE), or equivalently an ordinary differential equation (ODE). Although SDE solvers generally exhibit superior sampling quality, ODEs have desirable properties. Most notably, they define a deterministic mapping from noise to data. Recently, Liu et al. (2023) and Lipman et al. (2023) generalize diffusion to flow models, which are defined by the following probability flow ODE (PF-ODE): dx = vt(x) dt, (1) where vt(x) = E[ xt|xt = x] is the velocity field. Note that xt is defined as the random variable xt = d(x +σtz) dt = σtz, and is not to be confused with the time-derivative of the ODE, vt. In the context of consistency models (Song et al., 2023; Song and Dhariwal, 2024), the most common choice is vt(x) = σtσt x log pt(x) dt, in particular the EDM formulation (Karras et al., 2022) where σt = t and thus vt(x) = t x log pt(x). Here, x log pt, a.k.a. the score function, can be approximated with a neural network sϕ(x, t) (Vincent, 2011; Song and Ermon, 2019). 2.2. Consistency Models Numerically solving an ODE is costly because it requires multiple expensive evaluations of the velocity function. To alleviate this issue, Song et al. (2023) propose training a consistency model fθ, which learns the output map of the PF-ODE, i.e. its flow, such that: fθ(xt, σt) = x0, (2) for all (xt, σt) Rd [σ0, σT ] that belong to the trajectory of the PF-ODE ending at (x0, σ0). Equation (2) is equivalent to (i) enforcing the boundary condition fθ(x0, σ0) = x0, and (ii) ensuring that fθ has the same output for any two samples of a single PF-ODE trajectory the consistency property. (i) is naturally satisfied by the following model parametrization: fθ(xti, σti) = cskip(σti)xti + cout(σti)Fθ(xti, σti), (3) where cskip(σ) = σ2 d σ2 d+(σ σ0)2 , cout(σ) = σd (σ σ0) σ2 d+σ2 , σ2 d the variance of data, and Fθ is a neural network. This ensures cskip(0) = 1, cout(0) = 0. (ii) is achieved by minimizing the distance between the outputs of two same-trajectory consecutive samples using the consistency loss: LCD(θ) = Eq I(x ,z),p(xti+1|x ,z) h λ(σti)D sg fθ(xΦ ti, σti) , fθ(xti+1, σti+1) i , (4) where (x , z) is sampled from the independent coupling q I(x , z) = p (x )pz(z), i is an index sampled uniformly at random from {0, 1, . . . 
, N − 1}, x_{t_{i+1}} = x⋆ + σ_{t_{i+1}} z, and x^Φ_{t_i} is computed by discretizing the PF-ODE with the Euler scheme as follows:

x^Φ_{t_i} = Φ(x_{t_{i+1}}, t_{i+1}) = x_{t_{i+1}} + (t_i − t_{i+1}) v_{t_{i+1}}(x_{t_{i+1}}).   (5)

This loss can be used to distill a score model into f_θ. In the case of consistency training, Song et al. (2023) circumvent the lack of a score function by noting that v_{t_{i+1}}(x) = E[ẋ_{t_{i+1}} | x_{t_{i+1}} = x]. In light of this, its single-sample Monte Carlo estimate ẋ_{t_{i+1}} is used instead in Equation (5), which replaces the intractable x^Φ_{t_i} by x_{t_i} = x⋆ + σ_{t_i} z in the consistency loss:

L_CT(θ) = E_{q_I(x⋆, z), p(x_{t_i}, x_{t_{i+1}} | x⋆, z)} [ λ(σ_{t_i}) D( sg(f_θ(x_{t_i}, σ_{t_i})), f_θ(x_{t_{i+1}}, σ_{t_{i+1}}) ) ].   (6)

2.3. Discrepancy Between Consistency Training and Distillation, and Velocity Field Estimation

Naturally, replacing v_t by its single-sample estimate ẋ_t makes consistency training deviate from consistency distillation in discrete time. Still, Song et al. (2023, Theorems 2 and 6) suggest that this discrepancy disappears in continuous time, since L_CT(θ) = L_CD(θ) + o(1/N) and the corresponding gradients are equal in some cases. This equality is then used in the work of Lu and Song (2024), concurrent to ours, to train continuous-time consistency models at the cost of an elaborate architectural design. Without disproving these results, we find that scaling issues and a lack of generality soften the claim of a closed gap between consistency training and distillation. Indeed, we provide in the following theorem a thorough theoretical comparison of L_CT and L_CD. We first prove that they converge to different values in the continuous-time limit. The difference is captured by a regularization term that depends on the discrepancy between the velocity field and its estimate. Moreover, we show that the limits of the scaled gradients do not coincide in the general case, except when the (asymptotically) quadratic loss is used. The proof, and further discussion on why this discrepancy did not appear in Song et al. (2023), can be found in Appendix A.1.

Theorem 1 (Discrepancy between consistency distillation and consistency training objectives). Assume that the distance function is given by D(x, y) = φ(∥x − y∥) for a continuous convex function φ : [0, ∞) → [0, ∞) with φ(x) ∼ C x^α as x → 0+ for some C > 0 and α ≥ 1, and that the timesteps are equally spaced, i.e., t_i = iT/N. Furthermore, assume that the Jacobian ∂f_θ/∂x does not vanish identically. Then the following assertions hold:

(i) The scaled consistency losses N^α L_CD(θ) and N^α L_CT(θ) converge as N → ∞. Moreover, the minimization objectives corresponding to these limiting scaled consistency losses are not equivalent, and their difference is given by:

lim_{N→∞} N^α (L_CT(θ) − L_CD(θ)) = C T^{α−1} R(θ),   (7)

where R(θ) is defined by

R(θ) = ∫_0^T λ(σ_t) E[ ∥∆_CT f_θ∥^α − ∥∆_CD f_θ∥^α ] dt   (8)

and satisfies R(θ) > 0, with

∆_CT f_θ = (∂f_θ/∂σ)(x_t, σ_t) σ̇_t + (∂f_θ/∂x)(x_t, σ_t) ẋ_t,   (9)

∆_CD f_θ = (∂f_θ/∂σ)(x_t, σ_t) σ̇_t + (∂f_θ/∂x)(x_t, σ_t) v_t(x_t).   (10)

In particular, if α = 2,

R(θ) = ∫_0^T λ(σ_t) E[ ∥(∂f_θ/∂x)(x_t, σ_t) (ẋ_t − v_t(x_t))∥² ] dt.   (11)

(ii) The scaled gradients N^{α−1} ∇_θ L_CD(θ) and N^{α−1} ∇_θ L_CT(θ) converge as N → ∞. Moreover, if α ≠ 2, then their respective limits are not identical as functions of θ:

lim_{N→∞} N^{α−1} ∇_θ L_CT(θ) ≠ lim_{N→∞} N^{α−1} ∇_θ L_CD(θ).   (12)

This theorem reveals that the optimization problems of consistency training and distillation differ not only in discrete time but also in continuous time. It even highlights a discrepancy between, firstly, the limiting gradients in continuous time (although they are equal for α = 2) and, secondly, the gradients of the limiting losses, which differ because of R(θ) even when α = 2.
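For readers who prefer code, the sketch below contrasts the two targets entering Equations (4) to (6): the distillation target, produced by one Euler step of the PF-ODE with a pre-trained velocity network, and the training target, produced from the same (x⋆, z) pair. The tiny MLPs, the EDM schedule σ_t = t, and the values σ_0 = 0.002 and σ_d = 0.5 are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 2                                                                    # toy data dimension
F_net = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # F_theta
v_net = nn.Sequential(nn.Linear(d + 1, 64), nn.SiLU(), nn.Linear(64, d))  # stand-in pre-trained velocity field

def f_theta(x, sigma, sigma0=0.002, sigma_d=0.5):
    """Boundary-preserving parametrization of Eq. (3): f = c_skip * x + c_out * F_theta."""
    c_skip = sigma_d**2 / (sigma_d**2 + (sigma - sigma0) ** 2)
    c_out = sigma_d * (sigma - sigma0) / (sigma_d**2 + sigma**2) ** 0.5
    inp = torch.cat([x, sigma * torch.ones(x.shape[0], 1)], dim=1)
    return c_skip * x + c_out * F_net(inp)

x_star, z = torch.randn(128, d), torch.randn(128, d)    # independent coupling (x_star, z)
t_i, t_ip1 = torch.tensor(0.5), torch.tensor(0.6)       # two consecutive timesteps (sigma_t = t)
x_ip1 = x_star + t_ip1 * z                              # shared point x_{t_{i+1}}

with torch.no_grad():                                   # distillation target: one Euler step, Eq. (5)
    v = v_net(torch.cat([x_ip1, t_ip1 * torch.ones(128, 1)], dim=1))
    x_phi_i = x_ip1 + (t_i - t_ip1) * v

x_i = x_star + t_i * z                                  # training target: reuse the same (x_star, z), Eq. (6)

pred = f_theta(x_ip1, t_ip1)
loss_cd = F.mse_loss(pred, f_theta(x_phi_i, t_i).detach())   # L_CD, Eq. (4); sg(.) via detach
loss_ct = F.mse_loss(pred, f_theta(x_i, t_i).detach())       # L_CT, Eq. (6)
```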
This analysis shows the importance of employing probability paths whose sample path derivatives xt are aligned with the velocity field vt(xt). In particular, if a diffusion process xt satisfies xt = vt(xt), we have R(θ) = 0 and equal gradients for all α 1. Hence, for such xt, consistency training and consistency distillation would be reconciled both in discrete time and in the continuous-time limit. However, it is unclear how to directly improve the singlesample estimation xt of vt(xt). In particular, increasing the number of samples per point xt to reduce its variance is not tractable, as it requires sampling from the inverse diffusion process p(x |xt). Therefore, we adopt an alternative approach to alleviate the discrepancy identified in this section, which involves altering the velocity field thereby changing the target flow to reduce the variance of its one-sample estimator. This approach is reminiscent of recent work tackling the data-noise coupling that we discuss in the following section. 3. Reducing the Discrepancy with Data-Noise Coupling Beyond independent coupling (IC). From Section 2.2, it appears that xt is computed through an IC q I = p (x )pz(z) of data and noise, in a similar fashion to flow matching (Lipman et al., 2023; Kingma and Gao, 2024). Making correlated choices of data and noise beyond IC could then help align xt and vt(xt), thereby resolving the discrepancy from the previous section. The reliance on IC in consistency and flow models is increasingly recognized as a limiting factor. Recent advancements suggest that improved coupling mechanisms could enhance both training efficiency and the quality of generated samples in flow matching (Liu et al., 2023; Pooladian et al., 2023) and diffusion models (Li et al., 2024). By reducing the variance in gradient estimation, enhanced coupling can accelerate training. Additionally, improved coupling could decrease transport costs and straighten trajectories, yielding better-quality samples. In a different context, Re Flow (Liu et al., 2023) leverages couplings provided by the ODE solver in a flow framework, and demonstrates that it reduced transport costs. Moreover, Lee et al. (2023) propose to learn an encoder from data to noise, and use this encoder as a way to construct a coupling when training a flow model. Couplings based on optimal transport (OT) solvers. OT is a particularly appealing solution for our alignment problem. Indeed, if we consider a quadratic cost and distributions with bounded supports, OT is a no-collision transport map (Nurbekyan et al., 2020), i.e. xt can be sampled by a unique pair of points (x , z). Thus xt = vt(xt), implying R(θ) = 0 in Theorem 1. Several approaches have precisely targeted the reduction of transport cost in flow and consistency models. Pooladian et al. (2023) have more directly explored OT coupling within the framework of flow matching models. They show that deterministic and non-crossing paths enabled by OT with infinite batch size lowers the variance of gradient estimators. Experimentally, they assess the efficacy of OT solvers, such as Hungarian matching and Sinkhorn algorithms, in coupling batches of noise and data points. Dou et al. (2024) have successfully adopted this approach in consistency models, while Li et al. (2024) applied OT to diffusion models. However, due to the prohibitive cubic complexity of OT solvers, OT has to be applied by minibatch for matching samples (x , z). 
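As a concrete picture of the minibatch approximation discussed above, the following sketch re-pairs a batch of data and noise samples with the Hungarian algorithm (scipy.optimize.linear_sum_assignment) under a quadratic cost, in the spirit of Pooladian et al. (2023) and Dou et al. (2024). It is a toy illustration with random stand-in data, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_coupling(x_star, z):
    """Re-pair a batch of data points and noise vectors with the Hungarian algorithm.

    The quadratic cost ||x - z||^2 is computed for every (data, noise) pair, so the
    solver scales cubically with the batch size, which is why the matching is done
    per minibatch rather than over the full dataset.
    """
    cost = ((x_star[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # (b, b) pairwise costs
    rows, cols = linear_sum_assignment(cost)                     # optimal permutation
    return x_star[rows], z[cols]

b, d = 256, 32 * 32 * 3
x_star = np.random.randn(b, d) * 0.5          # stand-in for a batch of data points
z = np.random.randn(b, d)                     # Gaussian noise batch
x_m, z_m = minibatch_ot_coupling(x_star, z)
# The matched coupling has a lower average quadratic transport cost than the independent one.
print(((x_star - z) ** 2).sum(-1).mean(), ((x_m - z_m) ** 2).sum(-1).mean())
```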
Besides an OT approximation error, this minibatch application incurs the loss of the no-collision property, making R(θ) non-zero in real use-cases. Another line of works using OT tools with score-based models relies on the Schrödinger bridge formulation (De Bortoli et al., 2021; Shi et al., 2023; Korotin et al., 2024; Tong et al., 2024), which has mostly proven benefits on transfer tasks.

Our approach. In this paper, we use a consistency model as a proxy of the flow of a diffusion process to reduce transport costs. While not fully solving the alignment issue, we will show that our method presents reduced transport costs and better alignment than dedicated OT-based methods.

4. Consistency Models with Generator-Augmented Flows

Here, we introduce our method, denoted as generator-augmented flows, which relies on a generator-augmented coupling (GC). We capitalize on the true diffusion flow f̃ (i.e., an ideal consistency model) to map noisy points towards the PF-ODE solution. We present theoretical and empirical evidence that GC not only reduces the data-noise transport cost but also narrows the gap between consistency distillation and consistency training. We discuss how to train GC consistency models jointly with f̃ in Section 5.

4.1. Generator-Augmented Coupling (GC): Definition and Training Loss

The solution proposed in this work involves harnessing the diffusion flow, computed from a consistency model, to create a novel form of coupling. The idea is to leverage the properties and accumulated knowledge within an ideal consistency model f̃ to construct pairs of points. To achieve this, we first sample an intermediate point, as usual, by drawing x⋆ ∼ p⋆ and z ∼ p_z with the IC between the two distributions, and then predict the data point x̂_{t_i} via the consistency model:

(x⋆, z) ∼ q_I;   x_{t_i} = x⋆ + σ_{t_i} z;   x̂_{t_i} = sg(f̃(x_{t_i}, σ_{t_i})).   (13)

Although x̂_{t_i} depends on the timestep t_i, it is important to note that it (supposedly) follows the distribution p_0. This x̂_{t_i} is coupled with z, thereby defining our generator-augmented coupling (GC) q̃, which we use to construct the pair of points (x̃_{t_i}, x̃_{t_{i+1}}):

(x̂_{t_i}, z) ∼ q̃;   x̃_{t_i} = x̂_{t_i} + σ_{t_i} z;   x̃_{t_{i+1}} = x̂_{t_i} + σ_{t_{i+1}} z.   (14)

These intermediate points can serve to define a new consistency training loss:

L_GC(θ) = E_{q̃(x̂_{t_i}, z), p(x̃_{t_i}, x̃_{t_{i+1}} | x̂_{t_i}, z)} [ λ(σ_{t_i}) D( sg(f_θ(x̃_{t_i}, σ_{t_i})), f_θ(x̃_{t_{i+1}}, σ_{t_{i+1}}) ) ].   (15)

Generator-augmented trajectories satisfy the boundary conditions of diffusion processes. We note the two following important properties of the distribution of x̃_t:

p(x̃_0) = p(x_0) ≈ p⋆,   p(x̃_T) ≈ p(x_T) ≈ p(σ_T z).   (16)

The first property is achieved thanks to the boundary condition of the consistency model (cf. Section 2.1), and the second by construction of the diffusion process, which ensures that the noise magnitude is significantly larger than x̂_{t_i} for large t. However, for timesteps t ∈ (0, T), the marginal distributions p(x_t) and p(x̃_t) do not necessarily coincide.
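Before analyzing the properties of GC, the short sketch below spells out how a GC pair of Equation (14) could be built from an IC sample; consistency_model stands for f̃ (in practice the model being trained, used under stop-gradient), and the EDM schedule σ_t = t is assumed. This is a minimal sketch, not the released implementation.

```python
import torch

def gc_pair(x_star, z, sigma_i, sigma_ip1, consistency_model):
    """Build a generator-augmented pair (x~_{t_i}, x~_{t_{i+1}}) from an IC sample.

    Eq. (13): noise the data point, then let the consistency model predict the endpoint.
    Eq. (14): re-noise that predicted endpoint with the *same* noise vector z.
    """
    x_i = x_star + sigma_i * z                       # IC intermediate point x_{t_i}
    with torch.no_grad():                            # sg(.): the endpoint predictor gets no gradient
        x_hat = consistency_model(x_i, sigma_i)      # predicted endpoint, ideally distributed as p_0
    x_tilde_i = x_hat + sigma_i * z                  # GC point at t_i
    x_tilde_ip1 = x_hat + sigma_ip1 * z              # GC point at t_{i+1}, coupled to the same z
    return x_tilde_i, x_tilde_ip1

# Example with a dummy predictor that returns its input (the identity, exact at sigma_0).
dummy_model = lambda x, sigma: x
x_star, z = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
xt_i, xt_ip1 = gc_pair(x_star, z, sigma_i=0.5, sigma_ip1=0.6, consistency_model=dummy_model)
```

The two returned points are then plugged into the loss of Equation (15), with the stop-gradient applied to the earlier-time output.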
4.2. Properties of Generator-Augmented Flows

Here, we present some properties of generator-augmented flows that motivate them for training consistency models.

4.2.1. REDUCING R(θ) WITH GC

In Theorem 1, we proved that the continuous-time consistency training objective decomposes into the sum of the consistency distillation objective and a regularizer term: L_CT(θ) = L_CD(θ) + R(θ). Here, we study a proxy term for R(θ) that is easier to compute:

R_t = E[ ∥ẋ_t − v_t(x_t)∥² ].   (17)

This quantity measures the expected distance between the true velocity field and its one-sample Monte Carlo estimate. We study R_t,IC, R_t,batch-OT, and R_t,GC, the respective proxy regularizer terms for each type of probability path. Note that R_t,GC depends on the endpoint predictor, a consistency model, which impacts both probability paths and velocity fields. Our goal is to compare these proxy regularizer terms in order to demonstrate that GC does lead to a smaller discrepancy than IC. We further motivate the use of this proxy, with regard to Theorem 1, in Appendix A.4. In the following theorem, proved in Appendix A.2, we show that R_t decays faster for GC than for IC.

Theorem 2. Assume that the data distribution contains more than a single point. Also, assume that the generator-augmented coupling between the predicted data point x̂_t and noise z is computed via an ideal consistency model f̃, i.e., the flow of the PF-ODE. Then, as t → ∞,

R_t,GC = o(R_t,IC).   (18)

Empirical validation. Evaluating R_t requires computing the difference between the sample path derivative ẋ_t and the velocity field v_t(x_t). In the EDM setting, this difference can be approximated using a denoiser. Indeed, ẋ_t = z and v_t(x_t) = E[ẋ_t | x_t] = E[z | x_t] = E[(x_t − x⋆)/t | x_t] = (1/t)(x_t − D⋆(x_t, t)) with an optimal denoiser D⋆. The optimal denoiser can be approximated by a denoiser network D_ϕ. Finally, we have: ẋ_t − v_t(x_t) ≈ z − (1/t)(x_t − D_ϕ(x_t, t)). Since IC, batch-OT, and GC define different p_t's and v_t's, we train a different denoiser D_ϕ for each coupling.

Figure 2. Comparison of R_t,IC, R_t,batch-OT, and R_t,GC on CIFAR-10. GC exhibits lower values of this quantity for all σ_t.

In Figure 2, we report the results of the comparison of the three proxy terms on CIFAR-10. We observe that R_t,GC < R_t,batch-OT < R_t,IC and that the gap increases with t, corroborating our theoretical findings (Theorem 2).

4.2.2. REDUCING TRANSPORT COST WITH GC

Here, we investigate the average transport cost between the noise z ∼ p_z and the predicted data point x̂ ∼ p⋆ as a measure of the efficiency of the data-noise coupling. Recall that the diffusion process is given by x_t = x⋆ + σ_t z. Then, knowing that the consistency model f̃ satisfies the boundary condition f̃(x_0, σ_0) = x_0, we define the function

c(t) = E_{q_I(x⋆, z)} [ ∥f̃(x_t, σ_t) − z∥² ].   (19)

c(0) = E_{q_I(x⋆, z)}[∥x_0 − z∥²] and c(t) represent the transport costs of, respectively, IC and GC. We show below, with proofs in Appendix A.3, that c(t) is decreasing for σ_t close to zero and for large σ_t.

Figure 3. Comparison of transport costs between IC, batch-OT, and GC on CIFAR-10.

Lemma 1 (Transport cost of GC coupling). Assume that f̃ is a continuously differentiable function representing the ground-truth consistency model, i.e., the flow of the PF-ODE induced by the diffusion process x_t. Define w_t = z − E[z | x_t] = (1/σ̇_t)(ẋ_t − E[ẋ_t | x_t]). Then:

c′(t) = −2 σ̇_t E[ ⟨ (∂f̃/∂x)(x_t, σ_t) w_t, w_t ⟩ ].   (20)

Corollary 1 (Decreasing transport cost of GC coupling as t → 0+). There exists a t⋆ > 0 such that, for all t ∈ [0, t⋆], the derivative of c(t) takes the form c′(t) = −2 σ̇_t a_t with a_t > 0. Hence, for σ̇_t positive, the cost is decreasing. In particular, in the EDM setting where σ_t = t, c(t) is decreasing for small t.

The proof of this corollary proceeds by noting that for t = 0, the consistency model f̃(·, σ_0) is the identity function, its Jacobian is the identity matrix, and thus a_t = E[∥w_t∥²].
Using the continuity of the Jacobian entries and invoking the intermediate value theorem on a_t concludes the proof.

Corollary 2 (Decreasing transport cost of GC coupling as t → t_max). Assume that the consistency model f̃(x, σ) is a scaling function f̃(x, σ_t) = (σ_0/σ_t) x. Then, we have c′(t) = −2 σ̇_t (σ_0/σ_t) E[∥w_t∥²]. In particular, c(t) is decreasing whenever σ_t is increasing.

We note that, while the assumption of the consistency model being a scaling function is strong, it nonetheless bears some degree of truth for t → t_max; see Lemma 3 of Appendix A.

Experimental validation. As stressed in Section 3, a line of work has brought evidence that reducing the transport cost between noise and data distributions can speed up training and help produce better samples. We compare the quadratic transport costs involved in IC, batch-OT (Pooladian et al., 2023; Dou et al., 2024), and GC (resp. c(0), c_OT(0), and c(t)). Results are presented in Figure 3. Interestingly, GC reduces the transport cost more than batch-OT on CIFAR-10 because batch-OT is tied to the data points present in the batch, whereas our computed x̂_t are not.

Figure 4. Performance of GC w.r.t. the performance of the predictor on CIFAR-10.

5. Training With Generator-Augmented Flows for Image Generation

In this section, we present a methodology to train consistency models with GC on unconditional image generation. To construct points drawn from GC trajectories (x̃_{t_i}), our theory requires an optimal predictor f̃ on intermediate points drawn from IC (x_{t_i}). This leaves us with two potential training strategies: (i) pre-train an IC generator and leverage it to construct the GC trajectories that train a GC model; (ii) a joint learning strategy: train a single consistency model from scratch with both types of trajectories. Note that in this second setting, the model is unique: f̃ = f_θ. The second option is more appealing, since it is a simple one-stage training. We demonstrate that the joint learning approach improves performance and accelerates convergence compared to standard consistency models.

Our experiments are done on the following datasets: CIFAR-10 (Krizhevsky, 2009), ImageNet (Deng et al., 2009), CelebA (Liu et al., 2015), and LSUN Church (Yu et al., 2015). For the evaluation metrics, we report the Fréchet Inception Distance (FID, Heusel et al. (2017)), Kernel Inception Distance (KID, Bińkowski et al. (2018)), and Inception Score (IS, Salimans et al. (2016)). Most of our experiments are based on the improved training techniques for consistency models from Song and Dhariwal (2024), denoted iCT-IC. Moreover, we present some results in the setting of Easy Consistency Tuning (ECT, Geng et al. (2024)). Details are provided in Appendix D. The code is shared in the supplementary material and will be open-sourced upon publication for reproducibility.

Figure 5. Consistency models trained with GC with joint learning converge faster and outperform consistency models trained with IC or minibatch-OT on CIFAR-10.

5.1. GC with Pre-Trained Endpoint Predictor

Our theoretical results assume having access to an ideal generator on IC trajectories, meaning that the generator approximates the diffusion flow output. To train a consistency model on GC, we can thus rely on a separate endpoint predictor g_ϕ pre-trained on IC (iCT-IC), i.e., f̃ ≈ g_ϕ (cf. Section 4.1).
This network predicts the endpoint: x̂_{t_i} = g_ϕ(x_{t_i}, σ_{t_i}). During the training of the consistency model on GC, g_ϕ is kept frozen and considered a proxy of the true flow, as in our theoretical results. In Figure 4, we report the performance of consistency models on CIFAR-10 trained with GC using two different g_ϕ: (i) a g_ϕ fully trained as a standard iCT-IC with 100k training steps; (ii) a weak g_ϕ partially trained as iCT-IC with 20k training steps.

Finding 1. Using a partially pre-trained and frozen endpoint predictor, trained on IC trajectories, makes it possible to train a consistency model with GC that converges faster. However, the performance of the GC model depends on the quality of the endpoint predictor on IC trajectories.

It is important to note that this setup is not practical, as it requires pre-training a standard consistency model. We aim for a training methodology that accelerates convergence and improves performance when training from scratch, without doubling the number of required training iterations.

5.2. GC from Scratch with Joint Learning

In this section, we propose to learn a single model simultaneously on IC and GC trajectories from the start of training, i.e., f̃ = sg(f_θ) (cf. Section 4.1). Thereby, we combine the training of the ideal IC predictor with the training of the GC model based on this predictor. We introduce a joint learning factor µ: at each training step, training pairs are drawn from GC with probability µ, while the remaining pairs are drawn from standard IC. The loss can be written on average as:

L_GC-µ(θ) = µ L_GC(θ) + (1 − µ) L_CT(θ).   (21)

Table 1. iCT-IC is the standard improved consistency model with independent coupling (Song and Dhariwal, 2024); iCT-OT is iCT with minibatch optimal transport coupling (Pooladian et al., 2023; Dou et al., 2024); iCT-GC (µ = 0.5) is our proposed GC with joint learning.

| Dataset | Model | FID | KID (×10²) | IS |
|---|---|---|---|---|
| CIFAR-10 | iCT-IC | 7.42 ± 0.04 | 0.44 ± 0.03 | 8.76 ± 0.06 |
| | iCT-OT | 6.75 ± 0.04 | 0.36 ± 0.04 | 8.86 ± 0.09 |
| | iCT-GC (µ = 0.5) | 5.95 ± 0.05 | 0.26 ± 0.02 | 9.10 ± 0.05 |
| ImageNet (32×32) | iCT-IC | 14.89 ± 0.17 | 1.23 ± 0.05 | 9.46 ± 0.06 |
| | iCT-OT | 14.13 ± 0.17 | 1.18 ± 0.05 | 9.62 ± 0.06 |
| | iCT-GC (µ = 0.5) | 13.99 ± 0.28 | 1.13 ± 0.03 | 9.77 ± 0.07 |
| CelebA (64×64) | iCT-IC | 15.82 ± 0.13 | 1.31 ± 0.04 | 2.33 ± 0.00 |
| | iCT-OT | 13.63 ± 0.13 | 1.09 ± 0.03 | 2.40 ± 0.01 |
| | iCT-GC (µ = 0.5) | 11.74 ± 0.08 | 0.91 ± 0.04 | 2.45 ± 0.01 |
| LSUN Church (64×64) | iCT-IC | 10.58 ± 0.11 | 0.73 ± 0.03 | 1.99 ± 0.01 |
| | iCT-OT | 9.71 ± 0.13 | 0.64 ± 0.03 | 2.00 ± 0.01 |
| | iCT-GC (µ = 0.5) | 9.88 ± 0.07 | 0.66 ± 0.04 | 2.14 ± 0.01 |

We denote this joint learning procedure as GC(µ = ·). Hence, GC(µ = 0) corresponds to the standard IC procedure, while GC(µ = 1) corresponds to training only with GC points. Note that GC(µ = 1) is not expected to work, since our theoretical guarantees assume an optimal IC predictor. The detailed algorithm is presented in Algorithm 1 in the Appendix. We apply this joint learning to four image datasets, and include comparisons to iCT with batch-OT (Dou et al., 2024) as an additional baseline. Results across multiple datasets and metrics are presented in Table 1, and visual examples are shown in Figure 8 in the Appendix.

Finding 2. Joint learning of IC and GC trajectories consistently improves results compared to the base IC model and outperforms batch-OT in most cases.

As shown in Figure 5, we observe an interesting interpolation phenomenon between µ = 0 and µ = 1. For µ = 0, we recover the steady FID improvement typical of IC training. As µ increases, the convergence of the generative model accelerates.
For 0.3 ≤ µ ≤ 0.7 on CIFAR-10, convergence speed and final FID are improved compared to IC and batch-OT models. For µ = 1, the FID score decreases faster than in other configurations early in the training process, but it soon increases as training progresses further. This is explained by the poor quality of the predictions on IC trajectories, yielding a deviation from the ideal IC predictor of Section 4. For the other datasets, we simply chose µ = 0.5 and report those results. We provide further detail on the sensitivity of our results to the choice of µ in Appendices C.1 and D.

5.3. GC in the ECT Setting

As an additional experiment, we explore the recent ECT setting (Geng et al., 2024) on CIFAR-10, where consistency models are fine-tuned from a pre-trained diffusion model. This approach enables training high-quality consistency models in one GPU-hour, though it requires an already trained diffusion model. We compare IC and GC trajectories in this setting, with both short (approximately one GPU-hour) and long (100k steps, 1 GPU-day) training times. Using the reference hyperparameters selected by Geng et al. (2024), we observe a consistent advantage for GC, with an optimal µ value of 0.3. These preliminary results, summarized in Table 2, align with our previous findings in the iCT setting, further supporting the effectiveness of GC.

Table 2. Performance (FID) of IC and GC consistency models trained in the ECT setting (Geng et al., 2024). Short training: 4k iterations. Long training: 100k iterations.

| Dataset (setting) | ECT-IC | ECT-GC (µ = 0.3) |
|---|---|---|
| CIFAR-10 (short training) | 7.37 ± 0.05 | 6.41 ± 0.05 |
| CIFAR-10 (long training) | 4.11 ± 0.03 | 3.74 ± 0.04 |
| FFHQ 64×64 (short training) | 13.29 ± 0.10 | 11.73 ± 0.09 |
| FFHQ 64×64 (long training) | 9.68 ± 0.06 | 8.51 ± 0.09 |
| ImageNet 64×64 Cond. (short training) | 10.82 ± 0.18 | 10.31 ± 0.22 |
| ImageNet 64×64 Cond. (long training) | 5.84 ± 0.21 | 6.39 ± 0.20 |

6. Conclusion

In this paper, we identify a discrepancy between consistency training and consistency distillation. Building on this theoretical analysis, we introduce generator-augmented flows and show that they reduce a proxy term measuring this discrepancy. Additionally, generator-augmented flows decrease the data-to-noise transport cost, as demonstrated by theory and experiments. Finally, we derive practical algorithms for training consistency models using generator-augmented flows and demonstrate improved empirical performance.

Impact Statement

If used in large-scale generative models, notably in text-to-image models, this work may increase potential negative impacts of deep generative models such as deepfakes (Fallis, 2021).

Acknowledgements. This research was funded by grant Nos. 2023-00262155, 2024-00460980 and 2025-02304717 (IITP), funded by the Korea government (the Ministry of Science and ICT).

References

Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems, volume 36, pages 49205-49233. Curran Associates, Inc., 2023.

Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling.
In Advances in Neural Information Processing Systems, volume 34, pages 17695 17709. Curran Associates, Inc., 2021. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Image Net: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255, 2009. Hongkun Dou, Junzhe Lu, Jinyang Du, Chengwei Fu, Wen Yao, Xiao qian Chen, and Yue Deng. A unified framework for consistency generative modeling, 2024. URL https: //openreview.net/forum?id=Qfqb8ue Idy. Don Fallis. The epistemic threat of deepfakes. Philosophy & Technology, 34:623 643, 2021. Kilian Fatras, Younes Zine, Szymon Majewski, R emi Flamary, R emi Gribonval, and Nicolas Courty. Minibatch optimal transport distances; analysis and applications. ar Xiv preprint ar Xiv:2101.01792, 2021. Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. ar Xiv preprint ar Xiv:2406.14548, 2024. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, volume 30, pages 6629 6640. Curran Associates, Inc., 2017. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840 6851. Curran Associates, Inc., 2020. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Sanmi Koyejo, Shakir Mohamed, Alekh Agarwal, Danielle Belgrave, Kyunghyun Cho, and Alice Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 26565 26577. Curran Associates, Inc., 2022. Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36, 2024. Alexander Korotin, Nikita Gushchin, and Evgeny Burnaev. Light Schr odinger bridge. In International Conference on Learning Representations, 2024. Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009. URL https://www.cs.toronto.edu/ kriz/ learning-features-2009-TR.pdf. Improving Consistency Models with Generator-Augmented Flows Sangyun Lee, Beomsu Kim, and Jong Chul Ye. Minimizing trajectory curvature of ODE-based generative models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 18957 18973. PMLR, 23 29 Jul 2023. Yiheng Li, Heyang Jiang, Akio Kodaira, Masayoshi Tomizuka, Kurt Keutzer, and Chenfeng Xu. Immiscible diffusion: Accelerating diffusion training with noise assignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2023. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision (ICCV), 2015. Cheng Lu and Yang Song. 
Simplifying, stabilizing and scaling continuous-time consistency models. ar Xiv preprint ar Xiv:2410.11081, 2024. Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. Levon Nurbekyan, Alexander Iannantuono, and Adam M Oberman. No-collision transportation maps. Journal of Scientific Computing, 82(2):45, 2020. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K opf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026 8037. Curran Associates, Inc., 2019. Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky T. Q. Chen. Multisample flow matching: Straightening flows with minibatch couplings. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28100 28127. PMLR, July 2023. Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, volume 29, pages 2234 2242. Curran Associates, Inc., 2016. Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. ar Xiv preprint ar Xiv:2311.17042, 2023. Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schr odinger bridge matching. In Advances in Neural Information Processing Systems, volume 36, pages 62183 62223. Curran Associates, Inc., 2023. Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. Torch Metrics measuring reproducibility in Py Torch, February 2022. URL https://github.com/ Lightning-AI/torchmetrics. Max Sommerfeld, J orn Schrieber, Yoav Zemel, and Axel Munk. Optimal transport: Fast probabilistic approximation with exact solvers. Journal of Machine Learning Research, 20(105):1 23, 2019. Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. In International Conference on Learning Representations, 2024. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, pages 11918 11930. Curran Associates, Inc., 2019. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Scorebased generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32211 32252. PMLR, July 2023. Improving Consistency Models with Generator-Augmented Flows Alexander Y. 
Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free schr odinger bridges via score and flow matching. In International Conference on Artificial Intelligence and Statistics, pages 1279 1287. PMLR, 2024. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23 (7), 2011. Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Improving Consistency Models with Generator-Augmented Flows A.1. Continuous-Time Consistency Objectives Theorem 1 (Discrepancy between consistency distillation and consistency training objectives). Assume that the distance function is given by D(x, y) = φ( x y ) for a continuous convex function φ : [0, ) [0, ) with φ(x) Cxα as x 0+ for some C > 0 and α 1, and that the timesteps are equally spaced, i.e., ti = i T N . Furthermore, assume that the Jacobian fθ x does not vanish identically. Then the following assertions hold: (i) The scaled consistency losses N αLCD(θ) and N αLCT(θ) converge as N . Moreover, the minimization objectives corresponding to these limiting scaled consistency losses are not equivalent, and their difference is given by: lim N N α LCT(θ) LCD(θ) = CT α 1R(θ), (22) where R(θ) is defined by 0 λ(σt)E CTfθ α CDfθ α dt (23) and satisfies R(θ) > 0, with σ (xt, σt) σt + fθ x (xt, σt) xt, (24) σ (xt, σt) σt + fθ x (xt, σt) vt(xt). (25) In particular, if α = 2, x (xt, σt) xt vt(xt) (ii) The scaled gradient N α 1 θLCD(θ) and N α 1 θLCT(θ) converge as N . Moreover, if α = 2, then their respective limits are not identical as functions of θ: lim N N α 1 θLCT(θ) = lim N N α 1 θLCD(θ). (27) Proof. (i) Note that CDfθ and CTfθ satisfy: CTfθ(xt, σt) = tfθ(xt, σt), CDfθ(xt, σt) = E tfθ(xt, σt) xt Here, the second equality follows by noting that vt(xt) = E[ xt|xt] and all the other terms in the expansion of tfθ(xt, σt) are completely determined once the value of xt is known. Now, we use Taylor s theorem to expand the difference between fθ(xti+1, σti+1) and fθ(xΦ ti, σti) in the consistency distillation loss, Equation (4). Together with the definition of xΦ ti, Equation (5), this yields: fθ(xti+1, σti+1) fθ(xΦ ti, σti) σ (xti+1, σti+1) (σti+1 σti) + fθ x (xti+1, σti+1) (xti+1 xΦ ti) + o(ti+1 ti) (29) = CDfθ(xti+1, σti+1) (ti+1 ti) + o(ti+1 ti). (30) Similarly, by expanding the difference between fθ(xti+1, σti+1) and fθ(xti, σti) in Equation (6), fθ(xti+1, σti+1) fθ(xti, σti) σ (xti+1, σti+1) (σti+1 σti) + fθ x (xti+1, σti+1) (xti+1 xti) + o(ti+1 ti) (31) = CTfθ(xti+1, σti+1) (ti+1 ti) + o(ti+1 ti). (32) Improving Consistency Models with Generator-Augmented Flows Therefore, for each {CD, CT}, N αL (θ) = N α 1 i=0 λ(σti)E h C fθ(xti+1, σti+1) α (1 + o(1)) i (ti+1 ti)α (33) = CT α 1 N 1 X i=0 λ(σti)E h fθ(xti+1, σti+1) α (1 + o(1)) i (ti+1 ti) (34) 0 λ(σt)E h fθ(xt, σt) αi dt (35) in the continuous-time limit as N . For simplicity of notation, we write L (θ) = lim N N αL (θ) (36) for each {CD, CT}. Then, from the formula for the limiting losses L (θ), Equation (35), we immediately obtain L CT(θ) L CD(θ) = CT α 1 Z T 0 λ(σt)E h CTfθ(xt, σt) α CDfθ(xt, σt) αi dt. (37) Now, we specialize in the case α = 2 and invoke the general observation that, for any random vectors x and y, the following identity holds: E h x 2 E[x|y] 2i = E h x E[x|y] 2i . 
(38) This can be easily proved by expanding the squared Euclidean norm as the inner product and applying the law of iterated expectations. Plugging in x tfθ(xt, σt) and y xt, and noting that CDfθ(xt, σt) = E CTfθ(xt, σt) | xt by Equation (28), it follows that L CT(θ) L CD(θ) = CT Z T 0 λ(σt)E h CTfθ(xt, σt) CDfθ(xt, σt) 2i dt (39) x (xt, σt) xt vt(xt) Next, we establish the positivity of R(θ). To this end, note that α is a convex function for α 1. By invoking the conditional Jensen s inequality, we find that the expectation inside the limiting scaled consistency training losses, Equation (35) satisfy: E h CTfθ(xt, σt) αi = E " tfθ(xt, σt) E tfθ(xt, σt) tfθ(xt, σt) xt = E h CDfθ(xt, σt) αi . (42) Integrating both sides with respect to λ(σt) dt, we obtain the desired inequality. The Jensen s inequality also tells that the equality holds precisely when tfθ(xt, σt) = E[ tfθ(xt, σt)|xt] holds, or equivalently, fθ x (xt, σt) ( xt E[ xt|xt]) = 0. However, given the value of xt, the quantity xt can assume an arbitrary value in Rd because the conditional density of xt = σtz given xt is strictly positive everywhere. Consequently, the equality condition implies fθ x = 0. Since this contradicts the assumption of the theorem, the strict inequality between the two limiting losses must hold. Finally, recall that the continuous-time consistency distillation loss, L CD(θ), is given by L CD(θ) = CT α 1 Z T σ (xt, σt) σt + fθ x (xt, σt) vt(xt) Improving Consistency Models with Generator-Augmented Flows Similarly, the continuous-time consistency training loss, L CT(θ), is given by L CT(θ) = CT α 1 Z T σ (xt, σt) σt + fθ x (xt, σt) xt Since vt(xt) = E[ xt|xt] and E[ xt E[ xt] 2] > E[ vt(xt) E[ xt] 2], it follows that L CT(θ) penalizes the Jacobian fθ x more strongly than L CD(θ) does. Therefore, the two limiting consistency losses do not define equivalent objectives. (ii) Using the convexity of φ, we can show that φ (x) Cαxα 1 as x 0+. Combining this with the vector calculus formula y y = y y , we get yφ( y ) Cα y α 2y for small y. From this, we can estimate the gradient of the distance between sg fθ(xΦ ti, σti) and fθ(xti+1, σti+1) with respect to the model parameter θ as: θD sg fθ(xΦ ti, σti) , fθ(xti+1, σti+1) = (1 + o(1))Cα CDfθ α 2( CDfθ) fθ (ti+1 ti)α 1 (45) Here, the expression CDfθ α 2( CDfθ) fθ θ in the square bracket is evaluated at (xti+1, σti+1). Similarly, the gradient of the distance between fθ(xti, σti) and fθ(xti+1, σti+1) is estimated as: θD sg fθ(xti, σti) , fθ(xti+1, σti+1) = (1 + o(1))Cα CTfθ α 2( CTfθ) fθ (ti+1 ti)α 1 (46) Combining these two estimates, we can now compute the limit of the scaled gradient N α 1 θL (θ) for each {CD, CT} as: N α 1 θL (θ) = CαT α 2 N 1 X i=0 λ(σti)E (1 + o(1)) fθ α 2 ( fθ) fθ (ti+1 ti) (47) CαT α 2 Z T 0 λ(σt)E fθ(xt, σt) α 2 fθ(xt, σt) fθ θ (xt, σt) dt (48) as N . Finally, if α = 2, then the term fθ α 2 f θ is a nonlinear transformation of fθ. This nonlinearity tells that, in general, E h CTfθ α 2 ( CTfθ) xt i = CDfθ α 2 ( CDfθ) . (49) Therefore, the scaled gradient limits are not identical as functions of θ, and in particular, their zero sets do not coincide. Differences with Song et al. (2023) s results. The previous theorem states a discrepancy between CT and CD objectives. However, Song et al. (2023) provide equivalence results between consistency training and consistency distillation. The differences come from the following reasons. In Song et al. (2023, Theorem 2), it is stated that LCT = LCD + o( T). 
However, in this theorem, the o( T) is actually too large compared to the other term, and consequently the result is uninformative. Indeed it has two the two following problems: (i) if the distance function decays faster than the norm does, i.e., D(x, y) = o( x y ), then the o( T) term is actually too large compared to the magnitude of the two losses as N ; (ii) The C2-regularity assumption on the distance function D is too restrictive, excluding many cases such as the distance function given by a norm. For example, such a breakdown happens when D(x, y) is a metric induced by the norm, i.e., D(x, y) = x y . In this case, its partial derivatives, such as 2D(x, y) = y D(x, y) appearing in the proof, are undefined along x = y. In Song et al. (2023, Theorem 6), the theorem about limiting gradient equality is stated with a general distance function D. However, the requirements on the Hessian of the distance function restrict the theorem s validity where the distance function is an (asymptotic) quadratic loss. Indeed, in their proof, it turns out that the Hessian can define a non-zero value only when D is an (asymptotic) quadratic loss. This coincides with our results in the case α = 2. Improving Consistency Models with Generator-Augmented Flows A.2. Proxy of the Regularizer In this subsection, we establish a theoretical result about the decay rate of the proxy of the regularizer. As preparation for the main result and for future use, we introduce a simple lemma that decomposes the forward flow generated by a vector field into the sum of a scaling term and a correction term that is well-behaved. Lemma 2. Assume that ϕ is the forward flow generated by the vector field vt, meaning that it solves the characteristic equation: tϕ(x, σt) = vt(ϕ(x, σt)), ϕ(x, σ0) = x. (50) Also, assume that vt is defined as σt (x D(x, σt)) (51) for some function D, which we call a denoiser . Then ϕ satisfies the following integral equation: σt ϕ(x, σt) + σ0 σs σ2s D(ϕ(x, σs), σs) ds. (52) Proof. We first compute the derivative of ϕ/σt: σ2 t ϕ(x, σt) + 1 σt (ϕ(x, σt) D(ϕ(x, σt), σt)) (53) σ2 t D(ϕ(x, σt), σt). (54) Integrating both sides with respect to t, it follows that σt ϕ(x, σ0) σs σ2s D(ϕ(x, σs), σs) ds. (55) Rearranging and applying the initial condition ϕ(x, σ0) = x, we obtain the desired equation. As an immediate consequence of this lemma, we obtain the following result about the asymptotic structure of a trained consistency model: Lemma 3. Assume that f is the consistency model generated by a bounded denoiser D, in the sense that f solves the transport equation f σ (x, σt) σt + f x (x, σt) vt(x) = 0 (56) for a vector field vt defined as in Equation (51) with the denoiser D. Then f(x, σt) = σ0 σt x + O(1) (57) uniformly in x and σt. The implicit bound of the error term can be chosen to be the bound of D. Proof. Let ϕ be the forward flow generated by vt as in Lemma 2. This ϕ is precisely the inverse of the consistency model f, in the sense that ϕ( f(x, σ), σ) = x holds. Then, replacing x in the equation of Lemma 2 with f(x, σt), we get f(x, σt) = σ0 σs σ2s D(ϕ( f(x, σt), σs), σs) ds. (58) Now let R be such that D(x, σ) R for any x Rd and noise level σ. Then, the integral term in Equation (58) is bounded as: σ0 σs σ2s D(ϕ( f(x, σt), σs), σs) ds σs σ2s R ds = σ0R 1 This proves the desired claim. Improving Consistency Models with Generator-Augmented Flows Now we turn to the main result, which analyzes the asymptotic behavior of Rt,IC and Rt,GC, as t : Theorem 2. 
Assume that the data distribution contains more than a single point. Also, assume that the generator-augmented coupling between the predicted data point ˆxt and noise z is computed via an ideal consistency model f, i.e., the flow of the PF-ODE. Then, as t , Rt,GC Rt,IC. (60) Proof. We first investigate the asymptotic behavior of Rt,IC in the limit of t . Recall that the diffusion process xt is given by xt = x + σtz for (x , z) q I, and note that xt vt(xt) = σtz E[ σtz|xt] = σt σt (x D(xt, σt)), (61) where D(xt, σt) = E[x |xt] is the denoiser. Plugging this into the definition of Rt,IC, we get 2 E h x D(xt, σt) 2i . (62) Now, we claim that D(xt, σt) = E[x |xt] E[x ] as t . Intuitively, this is because xt σtz for large t, and σtz is independent of x . More formally, note that the conditional distribution of xt given x is p(xt|x ) = N(xt; x , σ2 t I). By Bayes theorem, the conditional distribution of x given xt is p(x |xt) = p(xt|x )p(x ) R Rd p(xt|x )p(x ) dx = exp 1 2σ2 t |xt x |2 p(x ) R 2σ2 t |xt x |2 p(x ) dx . (63) As t , we have σt , so the exponential terms converge to 1. Consequently, p(x |xt) p(x ) and hence E[x |xt] E[x ] as claimed. Thus, 2 E h x E[x ] 2i . (64) Since the data distribution p is assumed to have more than one point, the variance E[ x E[x ] 2] is strictly positive. Therefore, Rt,IC decays at a rate asymptotically proportional to ( σt Next, we investigate the asymptotic behavior of Rt,GC. Recall the consistency training loss for GC, Equation (15). Under the assumptions in Theorem 1, the scaled loss N αLGC(θ) converges to L GC(θ) = CT α 1 Z T σ ( xt, σt) σt + fθ x ( xt, σt) σtz Here, xt = ˆxt + σtz and ˆxt = f(xt, σt), where f is the ideal consistency model for the flow associated with the diffusion process xt. The proof of this claim is similar to that of Theorem 1, so we only highlight the necessary changes. Most importantly, the velocity term is not xt but σtz. This is due to how the discrete-time samples are constructed. Indeed, from Equation (14), we find that xti+1 xti = (σti+1 σti)z, which manifests as the velocity term σtz in Equation (65). Consequently, the associated (average) velocity field vt is given by vt( xt) = E[ σtz| xt] = σt σt ( xt E[ˆxt| xt]). (66) Therefore, Rt,GC reduces to 2 E h ˆxt E[ˆxt| xt] 2i . (67) Now, unlike in the IC case, we claim that E[ˆxt| xt] ˆxt as t . Heuristically, this is because both ˆxt and xt are almost deterministic functions of z; hence, the conditioning has negligible effect in the limit. Improving Consistency Models with Generator-Augmented Flows More precisely, let ϕ be the forward flow generated by the PF-ODE vector field vt. As in the proof of Lemma 2, integrating both sides of Equation (54) from t to u yields σu = ϕ(x, σt) σs σ2s D(ϕ(x, σs), σs) ds. (68) Letting u , we claim that the right-hand side converges. Indeed, the empirical data distribution p has compact support, meaning all the data points are confined in a bounded region of Rd. Since the values of D are weighted averages of the data points, it follows that D is also bounded. Then the integrand σs σ2s D(ϕ(x, σs), σs) is absolutely integrable on [t, ), hence the convergence follows. Moreover, the limit does not depend on t. Denote this limit by ρ(x): ρ(x) = ϕ(x, σt) σs σ2s D(ϕ(x, σs), σs) ds. (69) As shown in the previous part, we know that D(x, t) = c + o(1) as t with c = E[x ]. 
Then, multiplying both sides of Equation (69) by σt and rearranging, we have, for large t, ϕ(x, σt) = σtρ(x) + σt σs σ2s D(ϕ(x, σs), σs) ds (70) = σtρ(x) + (c + o(1))σt σs σ2s ds (71) = σtρ(x) + c + o(1). (72) Since ϕ is a bijection, the above relation tells that ρ(x) is also a bijection. Next, we replace x ˆxt in the equation defining ρ(x), Equation (69), to obtain: ρ(ˆxt) = z + x σs σ2s D(ϕ(ˆxt, σs), σs) ds. (73) Since ρ is invertible, applying ρ 1 to both sides yields ˆxt = ρ 1 z + x σs σ2s D(ϕ(ˆxt, σs), σs) ds (74) σs σ2s D(ϕ(ˆxt, σs), σs) ds (75) Since all of x , ˆxt, and D are bounded by the largest norm of the data point, they are all finite. Hence, the last line shows that ˆxt = ρ 1 xt σt ) , demonstrating that ˆxt is almost a deterministic function of xt. Therefore, E[ˆxt| xt] ˆxt as required. Consequently, Rt,GC satisfies This proves that Rt,GC Rt,IC as required. A.3. Transport Cost As a base for the two corollaries presented in the paper, we will first derive a useful representation of the derivative of the transport cost. The main purpose of the lemma is to provide a more tractable representation of c (t), the time derivative of the expected transport cost. We expect c(t) to decrease with t because the predicted data point f(xt, σt) becomes more dependent on the noise z as t increases. However, directly analyzing f(xt, σt) z is challenging because the dependence of f(xt, σt) on z is not explicit. Therefore, the lemma aims to: identify a quantity that better captures the dependence between z and xt; relate c(t) to this quantity. Improving Consistency Models with Generator-Augmented Flows The proof proceeds by deriving a key property of the ground-truth consistency map f: it satisfies the transport equation, f σ (x, σt) σt + f x (x, σt) vt(x) = 0. (77) This equation is equivalent to saying that the conditional expectation of the time derivative of f(xt, σt) given xt is zero: t f(xt, σt) xt By leveraging this property, we can simplify c (t) into an expression involving wt = z E[z | xt], the residual between the true noise z and its prediction given xt. This residual captures the uncertainty in predicting z based on xt, allowing us to relate c (t) directly to the prediction accuracy of f. Lemma 1 (Transport cost of GC coupling). Assume that f is a continuously differentiable function representing the ground-truth consistency model, i.e. the flow of the PF-ODE induced by the diffusion process xt. Define wt = z E[z|xt] = 1 σt ( xt E[ xt | xt]). Then: c (t) = 2 σt E * f x (xt, σt) wt, wt Proof. Note that the inverse flow f 1(y, σt) transports the initial point y at time t = 0 along the vector field vt up to time t. Consequently, f 1 is a flow with the corresponding vector field vt: t[ f 1(y, σt)] = vt( f 1(y, σt)). (80) By differentiating both sides of the identity y = f( f 1(y, σt), σt) with respect to t and applying the above observation, we get: h f( f 1(y, σt), σt) i (81) = f σ ( f 1(y, σt), σt) σt + f x ( f 1(y, σt), σt) t[ f 1(y, σt)] (82) = f σ (x, σt) σt + f x (x, σt) vt(x), (83) where the substitution x = f 1(y, σt) is used in the last step. Consequently, t[ f(xt, σt)], f(xt, σt) z # * f σ (xt, σt) σt + f x (xt, σt) xt, f(xt, σt) z * f x (xt, σt) ( xt vt(x)), f(xt, σt) z * f x (xt, σt) (z E[z|xt]), f(xt, σt) z Improving Consistency Models with Generator-Augmented Flows where we used the relations xt = x + σtz and vt(x) = E[ xt|xt]. Now, let wt = z E[z | xt]. 
Then E[wt | xt] = 0, hence, by the law of iterated expectations, E[⟨wt, g(xt)⟩] = 0 for essentially any function g: R^d → R^d. Using this, we can further simplify the last line as:

    c′(t) = −2 σ̇t E[ ⟨ (∂f/∂x)(xt, σt) wt, z ⟩ ]
          = −2 σ̇t E[ ⟨ (∂f/∂x)(xt, σt) wt, wt ⟩ ],

proving the desired equality.

An immediate consequence of this lemma is that c(t) decreases for small t:

Corollary 1 (Decreasing transport cost of the GC coupling as t → 0+). There exists a t∗ > 0 such that, for all t ∈ [0, t∗], the derivative of c(t) takes the form c′(t) = −2 σ̇t a_t with a_t > 0. Hence, whenever σ̇t is positive, the cost is decreasing. In particular, in the EDM setting where σt = t, c(t) is decreasing for small t.

Proof. At t = 0, the consistency model f(·, σ0) is the identity function, so its Jacobian is the identity matrix, which gives a_0 = E[‖w_0‖²] > 0. By assumption, all the entries of the Jacobian are continuous, so a_t = E[⟨(∂f/∂x)(xt, σt) wt, wt⟩] is continuous in t. By continuity of a_t, such a t∗ exists, which concludes the proof.

The next result is a statement about the asymptotic behavior of the transport cost c(t) in the large-t regime.

Corollary 2 (Decreasing transport cost of the GC coupling as t → tmax). Assume that the consistency model f(x, σ) is a scaling function, f(x, σt) = (σ0 / σt) x. Then we have c′(t) = −2 (σ̇t σ0 / σt) E[‖wt‖²]. In particular, c(t) is decreasing whenever σt is increasing.

Proof. Under the assumption, we have ∂f/∂x = (σ0 / σt) I. Thus, by Lemma 1,

    c′(t) = −2 σ̇t E[ ⟨ (σ0 / σt) wt, wt ⟩ ] = −2 (σ̇t σ0 / σt) E[‖wt‖²].    (89)

This proves that c′(t) < 0 whenever σ̇t > 0.

Toy example. Let us consider a one-dimensional toy example where x∗ ∼ N(0, σ∗²) with σ∗ > 0 and z ∼ N(0, 1) are independent. Also, we assume σ0 = 0 for the sake of simplicity. In this case, the marginal law of xt is also Gaussian, pt = N(0, σ∗² + σt²), so the vector field of the diffusion process xt is vt(x) = −σ̇t σt ∇x log pt(x) = (σ̇t σt / (σ∗² + σt²)) x. Then, the corresponding target diffusion flow and transport cost function are:

    f(x, σt) = (σ∗ / √(σ∗² + σt²)) x   and   c(t) = σ∗² + 1 − 2 σ∗ σt / √(σ∗² + σt²).    (90)

We note that f(x, σt) is indeed a scaling function, asymptotically proportional to x / σt for large t, and that c(t) is decreasing in t for t > 0.
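The closed-form expressions in Equation (90) are easy to check numerically. Below is a minimal NumPy sketch (our own illustration, not part of the released code; the choice σ∗ = 2 and the noise levels are arbitrary) that Monte-Carlo-estimates the GC transport cost c(t) = E[(f(xt, σt) − z)²], compares it to the closed form, and contrasts it with the IC cost E[(x∗ − z)²] = σ∗² + 1.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_star = 2.0                 # std of the 1-D Gaussian data distribution (arbitrary choice)
n = 1_000_000

x_star = rng.normal(0.0, sigma_star, n)   # data samples x* ~ N(0, sigma_star^2)
z = rng.normal(0.0, 1.0, n)               # noise samples z ~ N(0, 1), independent of x*

ic_cost = np.mean((x_star - z) ** 2)      # independent-coupling cost, = sigma_star^2 + 1 in expectation
print(f"IC cost: {ic_cost:.3f}")

for sigma_t in [0.1, 1.0, 10.0, 80.0]:
    x_t = x_star + sigma_t * z            # diffused sample at noise level sigma_t
    f_xt = sigma_star / np.sqrt(sigma_star**2 + sigma_t**2) * x_t   # closed-form consistency map of Eq. (90)
    mc = np.mean((f_xt - z) ** 2)         # Monte-Carlo estimate of c(t)
    exact = sigma_star**2 + 1 - 2 * sigma_star * sigma_t / np.sqrt(sigma_star**2 + sigma_t**2)
    print(f"sigma_t={sigma_t:5.1f}   c(t) MC={mc:.3f}   closed form={exact:.3f}")
```

As σt grows, c(t) approaches (σ∗ − 1)², strictly below the IC cost σ∗² + 1, in line with Corollary 2.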
Experimental validation. We validate the transport cost decrease in Figure 6, on a toy dataset composed of two 2D Diracs and on CIFAR-10. Interestingly, we observe that, when OT transport plans are computed between batches instead of on the full dataset, GC reduces the transport cost more than batch-OT.

[Figure 6: mean quadratic transport cost as a function of the timestep σ; left panel: two-Dirac toy dataset (IC, GC, OT); right panel: CIFAR-10 (IC, GC, Batch-OT with b = 128 and b = 2048).]
Figure 6. Comparison of transport costs between IC, batch-OT, and GC on two 2D Diracs (left) and CIFAR-10 (right).

A.4. Proxy Term

In this part, we clarify the connection between the proxy term and the regularization term in the case of the quadratic loss (α = 2). Indeed, we can bound the regularization term with the proxy term thanks to the Jacobian's maximum singular value smax(∂fθ/∂x), which is bounded since typical networks are Lipschitz:

    ‖ (∂fθ/∂x)(xt, σt) (ẋt − vt(xt)) ‖² ≤ s²max(∂fθ/∂x) ‖ ẋt − vt(xt) ‖².    (91)

We could also use some assumptions on f, e.g., the fact that it is close to a scaling function for large t (see Corollary 2). If f(x, σt) = (σ0 / σt) x, then we would have:

    ‖ (∂f/∂x)(xt, σt) (ẋt − vt(xt)) ‖² = (σ0 / σt)² ‖ ẋt − vt(xt) ‖².    (92)

B. Algorithm

We present the detailed training algorithm with GC in Algorithm 1.

Algorithm 1: Training of consistency models with generator-augmented trajectories
Input: randomly initialized consistency model fθ, number of timesteps N, noise schedule {σti}, loss weighting λ(·), learning rate η, distance function D, noise distribution pz, joint learning parameter µ.
Output: trained consistency model fθ.
while not converged do
    x∗ ∼ p, z ∼ pz                                            {batch of real data and noise vectors}
    i ∼ multinomial(p(σt0), ..., p(σtN))                      {sampling timesteps}
    m ∼ binomial(µ, size = batch size)                        {mask of dimension (batch size), each mj ∼ binomial(µ)}
    xti ← x∗ + σti z                                          {IC intermediate points}
    x̂ti ← sg(fθ(xti, σti))                                    {endpoint prediction from the model}
    x̂ti ← m ⊙ x̂ti + (1 − m) ⊙ x∗                              {mixing IC and GC trajectories}
    x̃ti ← x̂ti + σti z,  x̃ti+1 ← x̂ti + σti+1 z                 {GC intermediate points}
    L(θ) ← λ(σti) D( sg(fθ(x̃ti, σti)), fθ(x̃ti+1, σti+1) )     {consistency loss}
    θ ← θ − η ∇θ L(θ)                                         {update the model's weights}
end while
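For readers who prefer code, here is a condensed PyTorch-style sketch of one step of Algorithm 1. It is an illustrative re-implementation under our own naming conventions (f_theta, sigmas, timestep_probs, lam, and d are assumed to be provided as in Appendix D); it is not the released training code.

```python
import torch

def gc_training_step(f_theta, optimizer, x_star, sigmas, timestep_probs, lam, d, mu):
    """One optimization step of Algorithm 1 (illustrative sketch, not the official code).

    f_theta(x, sigma): consistency model; sigmas: tensor of N+1 increasing noise levels;
    timestep_probs: tensor of N sampling probabilities p(sigma_{t_i}), i = 0..N-1;
    lam: loss weighting function; d: per-sample distance function; mu: joint learning parameter.
    """
    b = x_star.shape[0]
    expand = [1] * (x_star.dim() - 1)                             # broadcast (b,) tensors over image dims
    z = torch.randn_like(x_star)                                  # noise vectors
    i = torch.multinomial(timestep_probs, b, replacement=True)    # sampled timestep indices
    w = lam(sigmas[i])                                            # per-sample loss weights, shape (b,)
    sig_i = sigmas[i].view(b, *expand)                            # sigma_{t_i}
    sig_ip1 = sigmas[i + 1].view(b, *expand)                      # sigma_{t_{i+1}}

    x_ti = x_star + sig_i * z                                     # IC intermediate points
    with torch.no_grad():
        x_hat = f_theta(x_ti, sig_i)                              # sg(f_theta(x_{t_i})): GC endpoints
    m = (torch.rand(b, *expand, device=x_star.device) < mu).to(x_star.dtype)
    x_hat = m * x_hat + (1 - m) * x_star                          # mix GC endpoints and real data (IC)

    x_tilde_i = x_hat + sig_i * z                                 # GC intermediate points
    x_tilde_ip1 = x_hat + sig_ip1 * z
    target = f_theta(x_tilde_i, sig_i).detach()                   # stop-gradient target
    loss = (w * d(target, f_theta(x_tilde_ip1, sig_ip1))).mean()  # weighted consistency loss

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Setting µ = 0 recovers standard consistency training on IC pairs, while µ = 1 trains exclusively on GC trajectories, the failure mode analyzed in Appendix C.1.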
C. Additional Results

C.1. Ablation Studies

Understanding why GC (µ = 1) fails. This experiment involves training a consistency model with GC (µ = 1). As shown in Figure 7(a), we observe that these models converge quickly but reach saturation early in the training process. When applying the timestep scheduling method with an increasing number of timesteps from Song and Dhariwal (2024), the FID of these models worsens. Using a fixed number of timesteps prevents the FID from diverging, but it still plateaus at a higher value than iCT-IC. In Figure 7(b), we plot the FID per timestep for three model/trajectory pairs: the GC (µ = 1) model on IC trajectories, the GC (µ = 1) model on GC trajectories, and the IC model on IC trajectories. Notably, we observe a distribution shift between IC and GC trajectories: the FID of the GC model on IC trajectories degrades at the intermediate timesteps of the diffusion process. This highlights why deviating from the theory and training a model exclusively on GC trajectories is insufficient: to build x̃ti in Equation (13), the model is inferred on IC trajectories but trained on GC trajectories. If IC and GC differ too much, the model cannot improve on IC.

[Figure 7: (a) FID vs. training iteration (×1e3) for iCT-IC, iCT-GC, and iCT-GC with a fixed 20 timesteps; (b) FID vs. timestep σ for iCT-GC on GC trajectories, iCT-GC on IC trajectories, and iCT-IC on IC trajectories.]
Figure 7. Analysis of consistency models trained only with GC on CIFAR-10. (a) When trained with only GC trajectories, consistency models do not reach the performance of the base model (iCT-IC). (b) We show that this is linked to a distribution-shift problem: GC models are weak on IC trajectories and are thus sub-optimal for predicting the x̂ti required in their own training (Equation (13)).

Table 3. Analysis of performance (FID) with respect to some hyper-parameters of iCT-GC (µ = 0.5) on CIFAR-10.
    iCT-IC                          7.42 ± 0.04
    iCT-GC (µ = 0.5), iso-time      6.38 ± 0.03
    iCT-GC (µ = 0.5)                5.95 ± 0.05
    iCT-GC (µ = 0.5) + dropout      7.77 ± 0.04
    iCT-GC (µ = 0.5) − EMA          6.73 ± 0.05

Iso wall-clock training time. As illustrated above, consistency models trained with GC converge faster than with IC. However, each training step is more time-consuming, as it necessitates a forward evaluation of the consistency model without gradient computation. In terms of wall-clock training time, the computational overhead of iCT-GC is approximately 20% relative to iCT-IC. In the top part of Table 3, we report, under iCT-GC (µ = 0.5) iso-time, the results of iCT-GC (µ = 0.5) trained for the same wall-clock duration as iCT-IC. Even when accounting for wall-clock training time, iCT-GC (µ = 0.5) remains superior to iCT-IC.

Hyper-parameters. We evaluate the influence of two important hyper-parameters: first, the dropout in the learned model; second, whether or not to use the EMA to compute the GC endpoints x̂. The results are presented in the bottom part of Table 3. Interestingly, the results on dropout are opposite to those found by Song and Dhariwal (2024), since using dropout lowers the performance of iCT-GC (µ = 0.5).

Analysis of µ on ImageNet. We present further results of the joint learning procedure with varying µ ∈ {0.3, 0.5, 0.7, 1.0} on ImageNet-32 in Figure 9. For µ ∈ {0.3, 0.5}, iCT-GC outperforms the base model iCT-IC.

C.2. Visual Results

We include in Figure 8 examples of generated images for the considered baselines.

[Figure 8: uncurated samples; panels (a) trained with IC, (b) trained with OT, (c) trained with GC (µ = 0.5).]
Figure 8. Uncurated samples from consistency models trained on CelebA 64×64 for fixed noise vectors. Note that models trained with generator-augmented trajectories tend to generate sharper images.

[Figure 9: minimum FID of consistency training vs. training iteration (×1e3) for iCT-IC and iCT-GC with µ ∈ {0.3, 0.5, 0.7, 1}.]
Figure 9. Results of varying µ for iCT-GC on ImageNet-32.

D. Experimental Details

The code is based on the PyTorch library (Paszke et al., 2019).

Scheduling functions and hyperparameters from Song and Dhariwal (2024). The training of consistency models relies heavily on several scheduling functions. First, there is a noise schedule {σi}_{i=0}^{N}, chosen as in Karras et al. (2022); precisely, σi = ( σ0^{1/ρ} + (i/N) (σN^{1/ρ} − σ0^{1/ρ}) )^ρ with ρ = 7. Second, there is a weighting function that affects the training loss, chosen as λ(σi) = 1 / (σ_{i+1} − σ_i); combined with this choice of noise schedule, it emphasizes consistency on timesteps with low noise. Then, Song et al. (2023) propose to progressively increase the number of timesteps N during training. Song and Dhariwal (2024) argue that a good choice of discretization schedule is an exponential one: N(k) = min(s0 2^{⌊k/K′⌋}, s1) + 1, where K′ = ⌊K / (log2(s1/s0) + 1)⌋, K is the total number of training steps, k is the current training step, and s0 (respectively s1) is the initial (respectively final) number of timesteps. Finally, Song and Dhariwal (2024) propose a discrete probability distribution over the timesteps that mimics the continuous distribution recommended for the continuous-time training of score-based models by Karras et al. (2022). It is defined as p(σi) ∝ erf( (log(σ_{i+1}) − P_mean) / (√2 P_std) ) − erf( (log(σi) − P_mean) / (√2 P_std) ). In practice, Song and Dhariwal (2024) recommend using s0 = 10, s1 = 1280, ρ = 7, P_mean = −1.1, P_std = 2.0. We use the Lion optimizer (Chen et al., 2023) with the implementation from https://github.com/lucidrains/lion-pytorch.
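For concreteness, the following Python sketch transcribes these scheduling functions (our own transcription of the formulas above, with illustrative names such as sigma_min and sigma_max; it is not taken from the released code).

```python
import math

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise schedule sigma_0 <= ... <= sigma_n following Karras et al. (2022)."""
    return [
        (sigma_min ** (1 / rho) + i / n * (sigma_max ** (1 / rho) - sigma_min ** (1 / rho))) ** rho
        for i in range(n + 1)
    ]

def loss_weight(sigmas, i):
    """Weighting lambda(sigma_i) = 1 / (sigma_{i+1} - sigma_i)."""
    return 1.0 / (sigmas[i + 1] - sigmas[i])

def num_timesteps(k, total_steps, s0=10, s1=1280):
    """Exponential discretization schedule N(k) from Song and Dhariwal (2024)."""
    k_prime = math.floor(total_steps / (math.log2(s1 / s0) + 1))
    return min(s0 * 2 ** (k // k_prime), s1) + 1

def timestep_probs(sigmas, p_mean=-1.1, p_std=2.0):
    """Discrete timestep distribution p(sigma_i), proportional to the erf difference."""
    cdf = [math.erf((math.log(s) - p_mean) / (math.sqrt(2) * p_std)) for s in sigmas]
    weights = [cdf[i + 1] - cdf[i] for i in range(len(sigmas) - 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

For instance, with s0 = 10, s1 = 1280, and K = 100 000 training steps, we get K′ = 12 500, so the number of timesteps doubles every 12 500 steps and reaches its maximum value of 1281 after 87 500 steps.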
Selection of the hyper-parameter µ. We selected µ based on the results of Figure 5, which presents a grid search over µ on CIFAR-10. Given the bell-shaped relationship observed between µ and FID, we opted to retain the best-performing value identified on CIFAR-10, µ = 0.5, for all subsequent experiments (Table 1), including those on other datasets, without further tuning. Importantly, even without an exhaustive hyperparameter search, our method consistently outperforms the baseline approaches. This choice is validated by the ablation study presented in Appendix C.1, which shows a similar trend on another dataset, suggesting that the hyper-parameter µ is easy to tune. In the ECT setting, we found that µ < 0.5 leads to improved performance, while µ > 0.5 can degrade final performance. Overall, we recommend setting µ to small values (around 0.3), since this leads to improved performance in all our experiments.

Details on neural network architectures. We use the NCSN++ architecture (Song et al., 2021) and follow the implementation from https://github.com/NVlabs/edm.

Evaluation metrics. We report the FID, KID, and IS. For all three metrics, we rely on the implementation from TorchMetrics (Skafte Detlefsen et al., 2022) and follow the standard practice (e.g., Song and Dhariwal, 2024) of comparing sets of 50 000 generated images against 50 000 training images. Confidence intervals reported in Table 1 are averaged over five runs, obtained by sampling new sets of training images and new sets of generated images from the same model.

Datasets. CIFAR-10 is a dataset introduced in Krizhevsky (2009). ImageNet (Deng et al., 2009), CelebA (Liu et al., 2015), and LSUN Church (Yu et al., 2015) are used at resolutions 32×32, 64×64, and 64×64, respectively. We preprocess these images by resizing the smaller side to the desired value, center cropping, and linearly scaling pixel values to [−1, 1].

Details on computational resources. As mentioned in the paper, the image-dataset experiments were conducted on NVIDIA A100 40GB GPUs.

Table 4. Hyperparameters for CIFAR-10. Arrays indicate quantities per resolution of the UNet model. {·} indicates a hyper-parameter search performed for each type of model (iCT, iCT-OT, iCT-GC (µ = 0.5)).
    Hyperparameter                     Value
    batch size                         512
    image resolution                   32
    training steps                     100 000
    learning rate                      {0.0001, 0.00003}
    optimizer                          lion
    s0                                 10
    s1                                 1280
    ρ                                  7
    σ0                                 0.002
    σ1                                 80
    network architecture               SongUNet (from the Karras et al. (2022) implementation)
    model channels                     128
    dropout                            {0., 0.3}
    num blocks                         3
    embedding type                     positional
    channel multiplicative factor      [1, 2, 2]
    attn resolutions

Table 5. Hyperparameters for CelebA and LSUN Church. Arrays indicate quantities per resolution of the UNet model. {·} indicates a hyper-parameter search performed for each type of model (iCT, iCT-OT, iCT-GC (µ = 0.5)).
    Hyperparameter                     Value
    batch size                         128
    image resolution                   64
    training steps                     150 000
    learning rate                      0.00008
    optimizer                          lion
    s0                                 10
    s1                                 1280
    ρ                                  7
    σ0                                 0.002
    σ1                                 80
    network architecture               SongUNet (from the Karras et al. (2022) implementation)
    model channels                     128
    dropout                            {0., [0., 0., 0.2, 0.2]}
    num blocks                         [3, 3, 4, 5]
    embedding type                     positional
    channel multiplicative factor      [1, 2, 2, 2]
    attn resolutions

Table 6. Hyperparameters for ImageNet-1k. Arrays indicate quantities per resolution of the UNet model. {·} indicates a hyper-parameter search performed for each type of model (iCT, iCT-OT, iCT-GC (µ = 0.5)).
    Hyperparameter                     Value
    batch size                         512
    image resolution                   32
    training steps                     150 000
    learning rate                      0.00008
    optimizer                          lion
    s0                                 10
    s1                                 1280
    ρ                                  7
    σ0                                 0.002
    σ1                                 80
    network architecture               SongUNet (from the Karras et al. (2022) implementation)
    model channels                     128
    dropout                            {0., [0., 0., 0.2, 0.2]}
    num blocks                         [3, 5, 7]
    embedding type                     positional
    channel mult                       [1, 1, 2]
    attn resolutions                   [16]
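As a complement to the evaluation-metrics paragraph above, here is a minimal TorchMetrics-based sketch of the FID computation (illustrative only; the dataloaders and the uint8 image format are assumptions on our side, not details of the released evaluation code).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def compute_fid(real_loader, fake_loader, device="cuda"):
    """FID between 50 000 training images and 50 000 generated images.

    Both loaders are assumed to yield uint8 tensors of shape (B, 3, H, W) in [0, 255].
    """
    fid = FrechetInceptionDistance(feature=2048).to(device)
    for imgs in real_loader:
        fid.update(imgs.to(device), real=True)      # accumulate Inception features of real images
    for imgs in fake_loader:
        fid.update(imgs.to(device), real=False)     # accumulate Inception features of generated images
    return fid.compute().item()
```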