# Improved Techniques for Training Consistency Models

Yang Song & Prafulla Dhariwal
OpenAI
{songyang,prafulla}@openai.com

Published as a conference paper at ICLR 2024.

ABSTRACT

Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training. Current consistency models achieve optimal sample quality by distilling from pre-trained diffusion models and employing learned metrics such as LPIPS. However, distillation limits the quality of consistency models to that of the pre-trained diffusion model, and LPIPS causes undesirable bias in evaluation. To tackle these challenges, we present improved techniques for consistency training, where consistency models learn directly from data without distillation. We delve into the theory behind consistency training and identify a previously overlooked flaw, which we address by eliminating the Exponential Moving Average from the teacher consistency model. To replace learned metrics like LPIPS, we adopt Pseudo-Huber losses from robust statistics. Additionally, we introduce a lognormal noise schedule for the consistency training objective, and propose to double the total number of discretization steps after a set number of training iterations. Combined with better hyperparameter tuning, these modifications enable consistency models to achieve FID scores of 2.51 and 3.25 on CIFAR-10 and ImageNet 64×64 respectively in a single sampling step. These scores mark a 3.5× and 4× improvement compared to prior consistency training approaches. Through two-step sampling, we further reduce FID scores to 2.24 and 2.77 on these two datasets, surpassing those obtained via distillation in both one-step and two-step settings, while narrowing the gap between consistency models and other state-of-the-art generative models.

1 INTRODUCTION

Consistency models (Song et al., 2023) are an emerging family of generative models that produce high-quality samples using a single network evaluation. Unlike GANs (Goodfellow et al., 2014), consistency models are not trained with adversarial optimization and thus sidestep the associated training difficulties. Compared to score-based diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; 2020; Ho et al., 2020; Song et al., 2021), consistency models do not require numerous sampling steps to generate high-quality samples. They are trained to generate samples in a single step, but still retain important advantages of diffusion models, such as the flexibility to exchange compute for sample quality through multistep sampling, and the ability to perform zero-shot data editing.

We can train consistency models using either consistency distillation (CD) or consistency training (CT). The former requires pre-training a diffusion model and distilling the knowledge therein into a consistency model. The latter allows us to train consistency models directly from data, establishing them as an independent family of generative models. Previous work (Song et al., 2023) demonstrates that CD significantly outperforms CT. However, CD adds computational overhead to the training process since it requires learning a separate diffusion model. Additionally, distillation limits the sample quality of the consistency model to that of the diffusion model. To avoid the downsides of CD and to position consistency models as an independent family of generative models, we aim to improve CT to either match or exceed the performance of CD.
For optimal sample quality, both CD and CT rely on learned metrics like the Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018) in previous work (Song et al., 2023). However, depending on LPIPS has two primary downsides. Firstly, it can bias evaluation, since the same ImageNet dataset (Deng et al., 2009) is used to train both LPIPS and the Inception network underlying the Fréchet Inception Distance (FID) (Heusel et al., 2017), which is the predominant metric for image quality. As analyzed in Kynkäänniemi et al. (2023), improvements in FID can come from accidental leakage of ImageNet features through LPIPS, causing inflated FID scores. Secondly, learned metrics require pre-training auxiliary networks for feature extraction, and training with these metrics requires backpropagating through the extra networks, which increases the demand for compute.

To tackle these challenges, we introduce improved techniques for CT that not only surpass CD in sample quality but also eliminate the dependence on learned metrics like LPIPS. Our techniques are motivated by both theoretical analysis and comprehensive experiments on the CIFAR-10 dataset (Krizhevsky et al., 2014). Specifically, we perform an in-depth study of the empirical impact of weighting functions, noise embeddings, and dropout in CT. Additionally, we identify an overlooked flaw in prior theoretical analysis of CT and propose a simple fix by removing the Exponential Moving Average (EMA) from the teacher network. We adopt Pseudo-Huber losses from robust statistics to replace LPIPS. Furthermore, we study how sample quality improves as the number of discretization steps increases, and use the insights to propose a simple but effective curriculum for the total number of discretization steps. Finally, we propose a new schedule for sampling noise levels in the CT objective based on lognormal distributions.

Taken together, these techniques allow CT to attain FID scores of 2.51 and 3.25 for CIFAR-10 and ImageNet 64×64 in one sampling step, respectively. These scores not only surpass CD but also represent improvements of 3.5× and 4× over previous CT methods. Furthermore, they significantly outperform the best few-step distillation techniques for diffusion models, even without the need for distillation. With two-step generation, we achieve improved FID scores of 2.24 and 2.77 on CIFAR-10 and ImageNet 64×64, surpassing the scores from CD in both one-step and two-step settings. These results rival many top-tier diffusion models and GANs, showcasing the strong promise of consistency models as a new independent family of generative models.

2 CONSISTENCY MODELS

Central to the formulation of consistency models is the probability flow ordinary differential equation (ODE) from Song et al. (2021). Let us denote the data distribution by p_data(x). When we add Gaussian noise with mean zero and standard deviation σ to this data, the resulting perturbed distribution is given by

$$p_\sigma(x) = \int p_{\text{data}}(y)\,\mathcal{N}(x \mid y, \sigma^2 I)\,\mathrm{d}y.$$

The probability flow ODE, as presented in Karras et al. (2022), takes the form of

$$\frac{\mathrm{d}x}{\mathrm{d}\sigma} = -\sigma \nabla_x \log p_\sigma(x), \qquad \sigma \in [\sigma_{\min}, \sigma_{\max}], \tag{1}$$

where the term ∇_x log p_σ(x) is known as the score function of p_σ(x) (Song et al., 2019; Song & Ermon, 2019; 2020; Song et al., 2021). Here σ_min is a small positive value such that p_{σ_min}(x) ≈ p_data(x), introduced to avoid numerical issues in ODE solving, while σ_max is sufficiently large so that p_{σ_max}(x) ≈ N(0, σ_max² I). Following Karras et al. (2022) and Song et al. (2023), we adopt σ_min = 0.002 and σ_max = 80 throughout the paper.
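As a concrete illustration of Eq. (1), the sketch below, which is our own and not the paper's code, integrates the probability flow ODE with simple Euler steps; `score` is an assumed callable approximating ∇_x log p_σ(x), which in practice must be estimated.

```python
# Minimal sketch (ours): Euler integration of the probability flow ODE
# dx/dsigma = -sigma * grad_x log p_sigma(x) from Eq. (1).
import numpy as np

def solve_pf_ode_euler(x, sigmas, score):
    """Transform a sample at noise level sigmas[0] into one at sigmas[-1].

    x:      array of shape (batch, dim), a sample from p_{sigmas[0]}
    sigmas: monotone sequence of noise levels to step through
    score:  callable (x, sigma) -> estimate of grad_x log p_sigma(x)
    """
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        dx_dsigma = -sigma_cur * score(x, sigma_cur)      # right-hand side of Eq. (1)
        x = x + (sigma_next - sigma_cur) * dx_dsigma      # Euler step
    return x
```

Integrating from σ_max down to σ_min maps pure Gaussian noise to approximate data, while a single reverse step between two adjacent noise levels is exactly the update used to define the consistency matching loss introduced below.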
Crucially, solving the probability flow ODE from noise level σ_1 to σ_2 allows us to transform a sample x_{σ_1} ~ p_{σ_1}(x) into x_{σ_2} ~ p_{σ_2}(x). The ODE in Eq. (1) establishes a bijective mapping between a noisy data sample x_σ ~ p_σ(x) and x_{σ_min} ~ p_{σ_min}(x) ≈ p_data(x). This mapping, denoted as f: (x_σ, σ) ↦ x_{σ_min}, is termed the consistency function. By its very definition, the consistency function satisfies the boundary condition f(x, σ_min) = x. A consistency model, which we denote by f_θ(x, σ), is a neural network trained to approximate the consistency function f(x, σ). To meet the boundary condition, we follow Song et al. (2023) and parameterize the consistency model as

$$f_\theta(x, \sigma) = c_{\text{skip}}(\sigma)\,x + c_{\text{out}}(\sigma)\,F_\theta(x, \sigma), \tag{2}$$

where F_θ(x, σ) is a free-form neural network, while c_skip(σ) and c_out(σ) are differentiable functions such that c_skip(σ_min) = 1 and c_out(σ_min) = 0.

To train the consistency model, we discretize the probability flow ODE using a sequence of noise levels σ_min = σ_1 < σ_2 < ⋯ < σ_N = σ_max, where we follow Karras et al. (2022) and Song et al. (2023) in setting

$$\sigma_i = \Big(\sigma_{\min}^{1/\rho} + \tfrac{i-1}{N-1}\big(\sigma_{\max}^{1/\rho} - \sigma_{\min}^{1/\rho}\big)\Big)^{\rho} \quad \text{for } i \in [\![1, N]\!], \qquad \rho = 7,$$

where [[a, b]] denotes the set of integers {a, a+1, …, b}. The model is trained by minimizing the following consistency matching (CM) loss over θ:

$$\mathcal{L}^N(\theta, \theta^-) = \mathbb{E}\big[\lambda(\sigma_i)\, d\big(f_\theta(x_{\sigma_{i+1}}, \sigma_{i+1}),\, f_{\theta^-}(\hat{x}_{\sigma_i}, \sigma_i)\big)\big], \tag{3}$$

where $\hat{x}_{\sigma_i} = x_{\sigma_{i+1}} - (\sigma_i - \sigma_{i+1})\,\sigma_{i+1}\,\nabla_x \log p_{\sigma_{i+1}}(x)\big|_{x = x_{\sigma_{i+1}}}$. In Eq. (3), d(x, y) is a metric function comparing vectors x and y, and λ(σ) > 0 is a weighting function. Typical metric functions include the squared ℓ2 metric d(x, y) = ‖x − y‖₂², and the Learned Perceptual Image Patch Similarity (LPIPS) metric introduced in Zhang et al. (2018). The expectation in Eq. (3) is taken over the following sampling process: i ~ U[[1, N−1]], where U[[1, N−1]] represents the uniform distribution over {1, 2, …, N−1}, and x_{σ_{i+1}} ~ p_{σ_{i+1}}(x). Note that x̂_{σ_i} is derived from x_{σ_{i+1}} by solving the probability flow ODE in the reverse direction for a single step.

In Eq. (3), f_θ and f_{θ⁻} are referred to as the student network and the teacher network, respectively. The teacher's parameter θ⁻ is obtained by applying an Exponential Moving Average (EMA) to the student's parameter θ during the course of training as follows:

$$\theta^- \leftarrow \operatorname{stopgrad}\big(\mu\theta^- + (1 - \mu)\theta\big), \tag{4}$$

with 0 ≤ µ < 1 representing the EMA decay rate. Here we explicitly employ the stopgrad operator to highlight that the teacher network remains fixed during each optimization step of the student network. However, in subsequent discussions, we will omit the stopgrad operator when its presence is clear and unambiguous. In practice, we also maintain EMA parameters for the student network to achieve better sample quality at inference time. It is clear that as N increases, the consistency model optimized using Eq. (3) approaches the true consistency function. For faster training, Song et al. (2023) propose a curriculum learning strategy where N is progressively increased and the EMA decay rate µ is adjusted accordingly. This curriculum for N and µ is denoted by N(k) and µ(k), where k ∈ ℕ is a non-negative integer indicating the current training step.

Given that x̂_{σ_i} relies on the unknown score function ∇_x log p_{σ_{i+1}}(x), directly optimizing the consistency matching objective in Eq. (3) is infeasible. To circumvent this challenge, Song et al. (2023) propose two training algorithms: consistency distillation (CD) and consistency training (CT).
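The noise levels σ_1, …, σ_N above can be computed in closed form; the following is a small sketch (ours) of the ρ = 7 discretization used throughout the paper.

```python
# Minimal sketch of the discretization
# sigma_i = (sigma_min^(1/rho) + (i-1)/(N-1) * (sigma_max^(1/rho) - sigma_min^(1/rho)))^rho.
import numpy as np

def karras_sigmas(N, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    i = np.arange(1, N + 1)                      # i in [1, N]
    frac = (i - 1) / (N - 1)
    return (sigma_min ** (1 / rho)
            + frac * (sigma_max ** (1 / rho) - sigma_min ** (1 / rho))) ** rho

sigmas = karras_sigmas(N=1281)                   # e.g. the largest N(k) used in this paper
assert np.isclose(sigmas[0], 0.002) and np.isclose(sigmas[-1], 80.0)
```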
For consistency distillation, we first train a diffusion model s_ϕ(x, σ) to estimate ∇_x log p_σ(x) via score matching (Hyvärinen, 2005; Vincent, 2011; Song et al., 2019; Song & Ermon, 2019), then approximate x̂_{σ_i} with

$$\hat{x}_{\sigma_i} \approx x_{\sigma_{i+1}} - (\sigma_i - \sigma_{i+1})\,\sigma_{i+1}\, s_\phi(x_{\sigma_{i+1}}, \sigma_{i+1}).$$

On the other hand, consistency training employs a different approximation method. Recall that x_{σ_{i+1}} = x + σ_{i+1} z with x ~ p_data(x) and z ~ N(0, I). Using the same x and z, Song et al. (2023) define x̌_{σ_i} = x + σ_i z as an approximation to x̂_{σ_i}, which leads to the consistency training objective below:

$$\mathcal{L}^N_{\text{CT}}(\theta, \theta^-) = \mathbb{E}\big[\lambda(\sigma_i)\, d\big(f_\theta(x + \sigma_{i+1} z, \sigma_{i+1}),\, f_{\theta^-}(x + \sigma_i z, \sigma_i)\big)\big]. \tag{5}$$

As analyzed in Song et al. (2023), this objective is asymptotically equivalent to consistency matching in the limit of N → ∞. We will revisit this analysis in Section 3.2.

After training a consistency model f_θ(x, σ) through CD or CT, we can directly generate a sample x by starting with z ~ N(0, σ_max² I) and computing x = f_θ(z, σ_max). Notably, these models also enable multistep generation. For a sequence of indices 1 = i_1 < i_2 < ⋯ < i_K = N, we start by sampling x_K ~ N(0, σ_max² I) and then iteratively compute

$$x_k \leftarrow f_\theta(x_{k+1}, \sigma_{i_{k+1}}) + \sqrt{\sigma_{i_k}^2 - \sigma_{\min}^2}\, z_k \quad \text{for } k = K-1, K-2, \ldots, 1,$$

where z_k ~ N(0, I). The resulting sample x_1 approximates the distribution p_data(x). In our experiments, setting K = 3 (two-step generation) often enhances the quality of one-step generation considerably, though increasing the number of sampling steps further provides diminishing benefits.
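The multistep sampler described above is easy to state in code. The sketch below is our reading of the procedure, with `f` standing in for a trained consistency model f_θ(x, σ); it is illustrative only.

```python
# Minimal sketch of one-step and multistep generation with a trained
# consistency model `f(x, sigma)` (a placeholder callable).
import numpy as np

def generate(f, shape, mid_sigmas=(), sigma_min=0.002, sigma_max=80.0, seed=0):
    """`mid_sigmas` lists intermediate noise levels sigma_{i_{K-1}} > ... > sigma_{i_2};
    an empty tuple corresponds to one-step generation (K = 2)."""
    rng = np.random.default_rng(seed)
    x = sigma_max * rng.standard_normal(shape)     # x_K ~ N(0, sigma_max^2 I)
    x = f(x, sigma_max)                            # first network evaluation
    for sigma in mid_sigmas:                       # each extra step: re-noise, then denoise
        z = rng.standard_normal(shape)
        x = x + np.sqrt(sigma ** 2 - sigma_min ** 2) * z
        x = f(x, sigma)
    return x
```

Two-step generation corresponds to a single intermediate noise level, e.g., σ_{i_2} ≈ 0.82 for the CIFAR-10 iCT model reported in Appendix B.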
3 IMPROVED TECHNIQUES FOR CONSISTENCY TRAINING

Below we re-examine the design choices of CT in Song et al. (2023) and pinpoint modifications that improve its performance, which we summarize in Table 1. We focus on CT without learned metric functions. For our experiments, we employ the Score SDE architecture in Song et al. (2021) and train the consistency models for 400,000 iterations on the CIFAR-10 dataset (Krizhevsky et al., 2014) without class labels. While our primary focus remains on CIFAR-10 in this section, we observe similar improvements on other datasets, including ImageNet 64×64 (Deng et al., 2009). We measure sample quality using Fréchet Inception Distance (FID) (Heusel et al., 2017).

Table 1: Comparing the design choices for CT in Song et al. (2023) versus our modifications.

| Design choice | Song et al. (2023) | Our modifications |
| --- | --- | --- |
| EMA decay rate for the teacher network | µ(k) = exp(s0 log µ0 / N(k)) | µ(k) = 0 |
| Metric in consistency loss | d(x, y) = LPIPS(x, y) | d(x, y) = √(‖x − y‖₂² + c²) − c |
| Discretization curriculum | N(k) = ⌈√(k/K · ((s1 + 1)² − s0²) + s0²) − 1⌉ + 1 | N(k) = min(s0 2^⌊k/K′⌋, s1) + 1, where K′ = ⌊K / (log₂⌊s1/s0⌋ + 1)⌋ |
| Noise schedule | σ_i, where i ~ U[[1, N(k) − 1]] | σ_i, where i ~ p(i) and p(i) ∝ erf((log σ_{i+1} − P_mean)/(√2 P_std)) − erf((log σ_i − P_mean)/(√2 P_std)) |
| Weighting function | λ(σ_i) = 1 | λ(σ_i) = 1/(σ_{i+1} − σ_i) |
| Hyperparameters | s0 = 2, s1 = 150, µ0 = 0.9 on CIFAR-10; s0 = 2, s1 = 200, µ0 = 0.95 on ImageNet 64×64 | s0 = 10, s1 = 1280; c = 0.00054√d, where d is the data dimensionality; P_mean = −1.1, P_std = 2.0 |

Here k ∈ [[0, K]], where K is the total number of training iterations, and σ_i = (σ_min^{1/ρ} + (i−1)/(N(k)−1) (σ_max^{1/ρ} − σ_min^{1/ρ}))^ρ with i ∈ [[1, N(k)]], ρ = 7, σ_min = 0.002, and σ_max = 80.

3.1 WEIGHTING FUNCTIONS, NOISE EMBEDDINGS, AND DROPOUT

We start by exploring several hyperparameters that are known to be important for diffusion models, including the weighting function λ(σ), the embedding layer for noise levels, and dropout (Ho et al., 2020; Song et al., 2021; Dhariwal & Nichol, 2021; Karras et al., 2022). We find that proper selection of these hyperparameters greatly improves CT when using the squared ℓ2 metric.

The default weighting function in Song et al. (2023) is uniform, i.e., λ(σ) = 1. This assigns equal weights to consistency losses at all noise levels, which we find to be suboptimal. We propose to modify the weighting function so that it decreases as the noise level increases. The rationale is that errors from minimizing consistency losses at smaller noise levels can propagate to larger noise levels, so losses at smaller noise levels should be weighted more heavily. Specifically, our weighting function (cf. Table 1) is λ(σ_i) = 1/(σ_{i+1} − σ_i). With the default choice of σ_i given in Section 2, λ(σ_i) = 1/(σ_{i+1} − σ_i) decreases monotonically as σ_i increases, thus assigning smaller weights to higher noise levels. As shown in Fig. 1c, this refined weighting function notably improves the sample quality of CT with the squared ℓ2 metric.

In Song et al. (2023), Fourier embedding layers (Tancik et al., 2020) and positional embedding layers (Vaswani et al., 2017) are used to embed noise levels for CIFAR-10 and ImageNet 64×64 respectively. It is essential that noise embeddings are sufficiently sensitive to minute differences to offer training signals, yet too much sensitivity can lead to training instability. As shown in Fig. 1b, high sensitivity can lead to the divergence of continuous-time CT (Song et al., 2023). This is a known challenge in Song et al. (2023), which they circumvent by initializing the consistency model with parameters from a pre-trained diffusion model. In Fig. 1b, we show that continuous-time CT on CIFAR-10 converges from random initial parameters, provided we use a less sensitive noise embedding layer with a reduced Fourier scale parameter, as visualized in Fig. 1a. For discrete-time CT, models are less affected by the sensitivity of the noise embedding layers, but as shown in Fig. 1c, reducing the scale parameter in Fourier embedding layers from the default value of 16.0 to a smaller value of 0.02 still leads to slight improvements in FID on CIFAR-10. For ImageNet models, we employ the default positional embedding, as it has similar sensitivity to a Fourier embedding with scale 0.02 (see Fig. 1a).

Previous experiments with consistency models in Song et al. (2023) always employ zero dropout, motivated by the fact that consistency models generate samples in a single step, unlike diffusion models that do so in multiple steps. It is therefore intuitive that consistency models, facing a more challenging task, would be less prone to overfitting and need less regularization than their diffusion counterparts. Contrary to our expectations, we discovered that using larger dropout than diffusion models improves the sample quality of consistency models. Specifically, as shown in Fig. 1c, a dropout rate of 0.3 for consistency models on CIFAR-10 obtains better FID scores. For ImageNet 64×64, we find it beneficial to apply dropout of 0.2 to layers with resolution less than or equal to 16×16, following Hoogeboom et al. (2023). We additionally ensure that the random number generators for dropout share the same states across the student and teacher networks when optimizing the CT objective in Eq. (5).
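To make the noise-embedding discussion above concrete, here is a small sketch (our own, with assumed layer sizes, not the released architecture) of a Fourier noise embedding whose sensitivity is controlled by the scale parameter; the paper lowers this scale from 16.0 to 0.02 on CIFAR-10.

```python
# Minimal sketch of a Fourier noise-level embedding (Tancik et al., 2020).
# A smaller `scale` makes the embedding vary more slowly with the noise level,
# i.e. it becomes less sensitive to small differences between adjacent sigmas.
import torch
import torch.nn as nn

class FourierNoiseEmbedding(nn.Module):
    def __init__(self, dim=256, scale=0.02):
        super().__init__()
        # Fixed random frequencies, drawn once and not trained.
        self.register_buffer("freqs", torch.randn(dim // 2) * scale)

    def forward(self, sigma):
        # sigma: (batch,). Whether the embedding acts on sigma or log(sigma)
        # depends on the parameterization; we assume log(sigma) here.
        angles = 2 * torch.pi * torch.log(sigma)[:, None] * self.freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

emb = FourierNoiseEmbedding(dim=256, scale=0.02)
out = emb(torch.tensor([0.002, 0.5, 80.0]))      # shape (3, 256)
```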
Figure 1: (a) As the Fourier scale parameter decreases, Fourier noise embeddings become less sensitive to minute noise differences. This sensitivity is closest to that of positional embeddings when the Fourier scale is set to 0.02. (b) Continuous-time CT diverges when noise embeddings are overly sensitive to minor noise differences. (c) An ablation study examining the effects of our selections for the weighting function (λ(σ_i) = 1/(σ_{i+1} − σ_i)), noise embedding (Fourier scale 0.02), and dropout (0.3) on CT using the squared ℓ2 metric. Baseline models for both metrics follow the configurations in Song et al. (2023). All models are trained on CIFAR-10 without class labels.

By choosing the appropriate weighting function, noise embedding layers, and dropout, we significantly improve the sample quality of consistency models using the squared ℓ2 metric, closing the gap with the original CT in Song et al. (2023) that relies on LPIPS (see Fig. 1c). Although our modifications do not immediately improve the sample quality of CT with LPIPS, combining them with the additional techniques in Section 3.2 will yield significant improvements for both metrics.

3.2 REMOVING EMA FOR THE TEACHER NETWORK

When training consistency models, we minimize the discrepancy between models evaluated at adjacent noise levels. Recall from Section 2 that the model at the lower noise level is termed the teacher network, and its counterpart the student network. While Song et al. (2023) maintain EMA parameters for both networks with potentially different decay rates, we present a theoretical argument indicating that the EMA decay rate for the teacher network should always be zero for CT, although it can be nonzero for CD. We revisit the theoretical analysis in Song et al. (2023) to support our assertion, and provide empirical evidence that omitting EMA from the teacher network in CT notably improves the sample quality of consistency models.

To support the use of CT, Song et al. (2023) present two theoretical arguments linking the CT and CM objectives as N → ∞. The first line of reasoning, which we call Argument (i), draws upon Theorem 2 from Song et al. (2023) to show that under certain regularity conditions, L^N_CT(θ, θ⁻) = L^N(θ, θ⁻) + o(Δσ). That is, when N → ∞, we have Δσ → 0 and hence L^N_CT(θ, θ⁻) converges to L^N(θ, θ⁻) asymptotically. The second argument, called Argument (ii), is grounded in Theorem 6 from Song et al. (2023), which asserts that when θ⁻ = θ, both lim_{N→∞} (N − 1) ∇_θ L^N(θ, θ) and lim_{N→∞} (N − 1) ∇_θ L^N_CT(θ, θ) are well-defined and identical. This suggests that after scaling by N − 1, gradients of the CT and CM objectives match in the limit of N → ∞, leading to equivalent training dynamics. Unlike Argument (i), Argument (ii) is valid only when θ⁻ = θ, which can be enforced by setting the EMA decay rate µ for the teacher network to zero in Eq. (4). We show that this inconsistency in the requirements for Arguments (i) and (ii) to hold is caused by flawed theoretical analysis of the former. Specifically, Argument (i) fails if lim_{N→∞} L^N(θ, θ⁻) is not a valid objective for learning consistency models, which we show can happen when θ⁻ ≠ θ.
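In code, the fix argued for in this section amounts to a one-parameter change in the teacher update of Eq. (4): with µ = 0 the teacher is simply a gradient-stopped copy of the current student. The sketch below is ours and only illustrates the update rule.

```python
# Minimal sketch of the teacher update
# theta_minus <- stopgrad(mu * theta_minus + (1 - mu) * theta).
# Setting mu = 0 (our modification for CT) makes the teacher an exact copy of the student.
import torch

@torch.no_grad()                      # no_grad plays the role of stopgrad
def update_teacher(teacher, student, mu=0.0):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(mu).add_(p_s, alpha=1.0 - mu)
```

With µ = 0, one can dispense with a separate teacher copy altogether and simply evaluate the student under `torch.no_grad()` for the teacher term of the loss.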
To give a concrete example, consider a data distribution p_data(x) = δ(x − ξ), which leads to p_σ(x) = N(x; ξ, σ²) and a ground truth consistency function

$$f(x, \sigma) = \frac{\sigma_{\min}}{\sigma}\,x + \Big(1 - \frac{\sigma_{\min}}{\sigma}\Big)\xi.$$

Let us define the consistency model as f_θ(x, σ) = (σ_min/σ) x + (1 − σ_min/σ) θ. In addition, let σ_i = σ_min + (i−1)/(N−1) (σ_max − σ_min) for i ∈ [[1, N]] be the noise levels, where we have Δσ = (σ_max − σ_min)/(N − 1). Given z ~ N(0, 1) and x_{σ_{i+1}} = ξ + σ_{i+1} z, it is straightforward to show that

$$\hat{x}_{\sigma_i} = x_{\sigma_{i+1}} - \sigma_{i+1}(\sigma_i - \sigma_{i+1})\,\nabla_x \log p_{\sigma_{i+1}}(x_{\sigma_{i+1}})$$

simplifies to x̌_{σ_i} = ξ + σ_i z. As a result, the objectives for CM and CT align perfectly in this toy example. Building on top of this analysis, the following result proves that lim_{N→∞} L^N(θ, θ⁻) here is not amenable to learning consistency models whenever θ ≠ θ⁻.

Proposition 1. Given the notations introduced earlier, and using the uniform weighting function λ(σ) = 1 along with the squared ℓ2 metric, we have

$$\lim_{N\to\infty} \mathcal{L}^N(\theta, \theta^-) = \lim_{N\to\infty} \mathcal{L}^N_{\text{CT}}(\theta, \theta^-) = \mathbb{E}\Big[\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)^2 (\theta - \theta^-)^2\Big] \quad \text{if } \theta \neq \theta^-, \tag{6}$$

$$\lim_{N\to\infty} \frac{1}{\Delta\sigma}\,\nabla_\theta \mathcal{L}^N(\theta, \theta^-) =
\begin{cases}
\nabla_\theta\, \mathbb{E}\Big[\dfrac{\sigma_{\min}}{\sigma_i^2}\Big(1 - \dfrac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \xi)^2\Big], & \theta = \theta^- \\
-\infty, & \theta < \theta^- \\
+\infty, & \theta > \theta^-.
\end{cases} \tag{7}$$

Proof. See Appendix A.

Recall that typically θ ≠ θ⁻ when µ > 0. In this case, Eq. (6) shows that the CM/CT objective is independent of ξ, thus providing no signal about the data distribution, making it impossible to train correct consistency models. This directly refutes Argument (i). In contrast, when we set µ = 0 to ensure θ⁻ = θ, Eq. (7) indicates that the gradient of the CM/CT objective, when scaled by 1/Δσ, converges to the gradient of a weighted mean squared error between θ and ξ. Following this gradient consequently yields θ = ξ, accurately learning the ground truth consistency function. This analysis is consistent with Argument (ii).

Figure 2: (a) Removing EMA in the teacher network leads to significant improvements in FID. (b, c) Pseudo-Huber metrics significantly improve the sample quality of the squared ℓ2 metric, and catch up with LPIPS when using an overall larger N(k) (panel (b): s0 = 2, s1 = 150; panel (c): s0 = 10, s1 = 1280); the Pseudo-Huber metric with c = 0.03 is optimal. All training runs here employ the improved techniques from Sections 3.1 and 3.2.

As illustrated in Fig. 2a, discarding EMA from the teacher network notably improves sample quality for CT under both LPIPS and squared ℓ2 metrics. The curves labeled "Improved" correspond to CT using the improved design outlined in Section 3.1. Setting µ(k) = 0 for all training iterations k effectively counters the sample quality degradation of LPIPS caused by the modifications in Section 3.1. By combining the strategies from Section 3.1 with a zero EMA decay rate for the teacher, we are able to match the sample quality of the original CT in Song et al. (2023), which necessitates LPIPS, using the simple squared ℓ2 metric.

3.3 PSEUDO-HUBER METRIC FUNCTIONS

Using the methods from Sections 3.1 and 3.2, we are able to improve CT with the squared ℓ2 metric so that it matches the original CT in Song et al. (2023) that utilizes LPIPS. Yet, as shown in Fig. 2a, LPIPS still maintains a significant advantage over traditional metric functions when the same improved techniques are applied to all. To address this disparity, we adopt the Pseudo-Huber metric family (Charbonnier et al., 1997), defined as

$$d(x, y) = \sqrt{\lVert x - y \rVert_2^2 + c^2} - c, \tag{8}$$

where c > 0 is an adjustable constant. As depicted in Fig. 5a, Pseudo-Huber metrics smoothly bridge the ℓ1 and squared ℓ2 metrics, with c determining the breadth of the parabolic section. In contrast to common metrics like ℓ0, ℓ1, and ℓ∞, Pseudo-Huber metrics are continuously twice differentiable, and hence meet the theoretical requirement for CT outlined in Song et al. (2023). Compared to the squared ℓ2 metric, the Pseudo-Huber metric is more robust to outliers, as it imposes a smaller penalty on large errors than the squared ℓ2 metric does, yet behaves similarly for smaller errors. We posit that this added robustness can reduce variance during training.
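The metric in Eq. (8) and the dimension-dependent choice of c are straightforward to implement; the sketch below (ours) can be plugged in as d(·, ·) in the CT objective of Eq. (5).

```python
# Minimal sketch of the Pseudo-Huber metric d(x, y) = sqrt(||x - y||_2^2 + c^2) - c,
# with the heuristic c = 0.00054 * sqrt(d) for d-dimensional data from Section 3.3.
import torch

def pseudo_huber(x, y, c=None):
    """x, y: tensors of shape (batch, ...); returns one loss value per example."""
    d = x[0].numel()
    if c is None:
        c = 0.00054 * d ** 0.5                    # ~0.03 for CIFAR-10 (d = 3 * 32 * 32)
    sq_norm = (x - y).flatten(1).pow(2).sum(dim=1)
    return torch.sqrt(sq_norm + c ** 2) - c
```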
To validate this hypothesis, we examine the ℓ2 norms of parameter updates obtained from the Adam optimizer during training for both the squared ℓ2 and Pseudo-Huber metric functions, and summarize the results in Fig. 5b. Our observations confirm that the Pseudo-Huber metric yields lower-variance updates than the squared ℓ2 metric, aligning with our hypothesis.

We evaluate the effectiveness of Pseudo-Huber metrics by training several consistency models with varying c values on CIFAR-10 and comparing their sample quality with models trained using LPIPS and squared ℓ2 metrics. We incorporate the improved techniques from Sections 3.1 and 3.2 for all metrics. Fig. 2 reveals that Pseudo-Huber metrics yield notably better sample quality than the squared ℓ2 metric. By increasing the overall size of N(k), adjusting s0 and s1 from the standard values of 2 and 150 in Song et al. (2023) to our new values of 10 and 1280 (more in Section 3.4), we for the first time surpass the performance of CT with LPIPS on equal footing, using a traditional metric function that does not rely on learned feature representations. Furthermore, Fig. 2c indicates that c = 0.03 is optimal for CIFAR-10 images. We suggest that c should scale linearly with ‖x − y‖₂, and propose the heuristic c = 0.00054√d for images with d dimensions. Empirically, we find this recommendation to work well on both the CIFAR-10 and ImageNet 64×64 datasets.

3.4 IMPROVED CURRICULUM FOR TOTAL DISCRETIZATION STEPS

As mentioned in Section 3.2, CT's theoretical foundation holds asymptotically as N → ∞. In practice, we have to select a finite N for training consistency models, potentially introducing bias into the learning process. To understand the influence of N on sample quality, we train a consistency model with the improved techniques from Sections 3.1 to 3.3. Unlike Song et al. (2023), we use an exponentially increasing curriculum for the total number of discretization steps N, doubling N after a set number of training iterations. Specifically, the curriculum is described by

$$N(k) = \min(s_0 2^{\lfloor k / K' \rfloor}, s_1) + 1, \qquad K' = \Big\lfloor \frac{K}{\log_2 \lfloor s_1 / s_0 \rfloor + 1} \Big\rfloor, \tag{9}$$

and its shape is labelled "Exp" in Fig. 3b. Here s0 and s1 control the minimum and maximum number of discretization steps, and K is the total number of training iterations.

As revealed in Fig. 3a, the sample quality of consistency models improves predictably as N increases. Importantly, FID scores relative to N adhere to a precise power law until reaching saturation, after which further increases in N yield diminishing benefits. As noted by Song et al. (2023), while larger N can reduce bias in CT, it might increase variance. Conversely, smaller N reduces variance at the cost of higher bias. Based on Fig. 3a, we cap N at 1281 in N(k), which we empirically find to strike a good balance between bias and variance. In our experiments, we therefore change s0 and s1 in the discretization curriculum from their default values of 2 and 150 in Song et al. (2023) to 10 and 1280 respectively.

Aside from the exponential curriculum above, we also explore various shapes for N(k) with the same s0 = 10 and s1 = 1280, including a constant function, the square root function from Song et al. (2023), a linear function, a square function, and a cosine function. The shapes of these curriculums are illustrated in Fig. 3b. As Fig. 3c demonstrates, the exponential curriculum yields the best sample quality for consistency models. Consequently, we adopt the exponential curriculum in Eq. (9) as our standard for setting N(k) going forward.
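The curriculum in Eq. (9) can be written in a few lines; the following sketch (ours) reproduces the schedule with our settings s0 = 10 and s1 = 1280.

```python
# Minimal sketch of the exponential discretization curriculum in Eq. (9).
import math

def discretization_steps(k, K, s0=10, s1=1280):
    K_prime = math.floor(K / (math.log2(math.floor(s1 / s0)) + 1))   # K'
    return min(s0 * 2 ** math.floor(k / K_prime), s1) + 1

# With K = 400,000 training iterations, N(k) starts at 11, doubles every
# K' = 50,000 iterations, and saturates at s1 + 1 = 1281.
```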
Figure 3: (a) FID scores improve predictably as the number of discretization steps N grows. (b) The shapes of various curriculums for the total number of discretization steps N(k). (c) The FID curves of the various discretization curriculums. All models are trained with the improved techniques from Sections 3.1 to 3.3 and differ only in their discretization curriculums.

3.5 IMPROVED NOISE SCHEDULES

Song et al. (2023) propose to sample a random i ~ U[[1, N − 1]] and select σ_i and σ_{i+1} to compute the CT objective. Given that σ_i = (σ_min^{1/ρ} + (i−1)/(N−1)(σ_max^{1/ρ} − σ_min^{1/ρ}))^ρ, this corresponds, as N → ∞, to sampling from the distribution

$$p(\log\sigma) = \sigma\,\frac{\sigma^{1/\rho - 1}}{\rho\big(\sigma_{\max}^{1/\rho} - \sigma_{\min}^{1/\rho}\big)}.$$

As shown in Fig. 4a, this distribution exhibits a higher probability density for larger values of log σ. This is at odds with the intuition that consistency losses at lower noise levels influence subsequent ones and cause error accumulation, so losses at lower noise levels should be given greater emphasis. Inspired by Karras et al. (2022), we address this by adopting a lognormal distribution to sample noise levels, with a mean of −1.1 and a standard deviation of 2.0 for log σ. As illustrated in Fig. 4a, this lognormal distribution assigns significantly less weight to high noise levels. Moreover, it also moderates the emphasis on smaller noise levels. This is helpful because learning is easier at smaller noise levels, due to the inductive bias in our parameterization of the consistency model to meet the boundary condition. For practical implementation, we sample noise levels in the set {σ_1, σ_2, …, σ_N} according to a discretized lognormal distribution defined as

$$p(\sigma_i) \propto \operatorname{erf}\Big(\frac{\log \sigma_{i+1} - P_{\text{mean}}}{\sqrt{2}\,P_{\text{std}}}\Big) - \operatorname{erf}\Big(\frac{\log \sigma_i - P_{\text{mean}}}{\sqrt{2}\,P_{\text{std}}}\Big), \tag{10}$$

where P_mean = −1.1 and P_std = 2.0. As depicted in Fig. 4b, this lognormal noise schedule significantly improves the sample quality of consistency models.
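For reference, the discretized lognormal schedule amounts to sampling the index i with the probabilities in Eq. (10); a small sketch (ours) follows.

```python
# Minimal sketch of sampling noise-level indices from the discretized lognormal
# schedule p(sigma_i) ∝ erf((log sigma_{i+1} - P_mean) / (sqrt(2) P_std))
#                      - erf((log sigma_i   - P_mean) / (sqrt(2) P_std)).
import numpy as np
from scipy.special import erf

def sample_indices(sigmas, batch_size, p_mean=-1.1, p_std=2.0, seed=0):
    rng = np.random.default_rng(seed)
    cdf = erf((np.log(sigmas) - p_mean) / (np.sqrt(2) * p_std))
    probs = cdf[1:] - cdf[:-1]                    # one weight per pair (sigma_i, sigma_{i+1})
    probs = probs / probs.sum()
    # Returns 0-indexed i, i.e. the pair (sigmas[i], sigmas[i + 1]) enters the loss.
    return rng.choice(len(probs), size=batch_size, p=probs)
```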
4 PUTTING IT TOGETHER

Combining all the improved techniques from Sections 3.1 to 3.5, we employ CT to train several consistency models on CIFAR-10 and ImageNet 64×64 and benchmark their performance against competing methods in the literature. We evaluate sample quality using FID (Heusel et al., 2017), Inception score (Salimans et al., 2016), and Precision/Recall (Kynkäänniemi et al., 2019). For best performance, we use a larger batch size and an increased EMA decay rate for the student network in CT across all models. The model architectures are based on Score SDE (Song et al., 2021) for CIFAR-10 and ADM (Dhariwal & Nichol, 2021) for ImageNet 64×64. We also explore deeper variants of these architectures by doubling the model depth. We call our method iCT, which stands for "improved consistency training", and the deeper variants iCT-deep. We summarize our results in Tables 2 and 3 and provide uncurated samples from both iCT and iCT-deep in Figs. 6 to 9. More experimental details and results are provided in Appendix B.

It is important to note that we exclude methods based on FastGAN (Liu et al., 2020; Sauer et al., 2021) or StyleGAN-XL (Sauer et al., 2022) from our comparison, because both utilize ImageNet pre-trained feature extractors in their discriminators. As noted by Kynkäänniemi et al. (2023), this can skew FIDs and lead to inflated sample quality. Methods based on LPIPS suffer from similar issues, as LPIPS is also pre-trained on ImageNet. We include these methods in Tables 2 and 3 for completeness, but we do not consider them direct competitors to iCT or iCT-deep.

Table 2: Comparing the quality of unconditional samples on CIFAR-10.

| METHOD | NFE (↓) | FID (↓) | IS (↑) |
| --- | --- | --- | --- |
| Fast samplers & distillation for diffusion models | | | |
| DDIM (Song et al., 2020) | 10 | 13.36 | |
| DPM-solver-fast (Lu et al., 2022) | 10 | 4.70 | |
| 3-DEIS (Zhang & Chen, 2022) | 10 | 4.17 | |
| UniPC (Zhao et al., 2023) | 10 | 3.87 | |
| Knowledge Distillation (Luhman & Luhman, 2021) | 1 | 9.36 | |
| DFNO (LPIPS) (Zheng et al., 2022) | 1 | 3.78 | |
| 2-Rectified Flow (+distill) (Liu et al., 2022) | 1 | 4.85 | 9.01 |
| TRACT (Berthelot et al., 2023) | 1 | 3.78 | |
| | 2 | 3.32 | |
| Diff-Instruct (Luo et al., 2023) | 1 | 4.53 | 9.89 |
| PD (Salimans & Ho, 2022) | 1 | 8.34 | 8.69 |
| | 2 | 5.58 | 9.05 |
| CD (LPIPS) (Song et al., 2023) | 1 | 3.55 | 9.48 |
| | 2 | 2.93 | 9.75 |
| Direct generation | | | |
| Score SDE (Song et al., 2021) | 2000 | 2.38 | 9.83 |
| Score SDE (deep) (Song et al., 2021) | 2000 | 2.20 | 9.89 |
| DDPM (Ho et al., 2020) | 1000 | 3.17 | 9.46 |
| LSGM (Vahdat et al., 2021) | 147 | 2.10 | |
| PFGM (Xu et al., 2022) | 110 | 2.35 | 9.68 |
| EDM (Karras et al., 2022) | 35 | 2.04 | 9.84 |
| EDM-G++ (Kim et al., 2023) | 35 | 1.77 | |
| IGEBM (Du & Mordatch, 2019) | 60 | 40.6 | 6.02 |
| NVAE (Vahdat & Kautz, 2020) | 1 | 23.5 | 7.18 |
| Glow (Kingma & Dhariwal, 2018) | 1 | 48.9 | 3.92 |
| Residual Flow (Chen et al., 2019) | 1 | 46.4 | |
| BigGAN (Brock et al., 2019) | 1 | 14.7 | 9.22 |
| StyleGAN2 (Karras et al., 2020b) | 1 | 8.32 | 9.21 |
| StyleGAN2-ADA (Karras et al., 2020a) | 1 | 2.92 | 9.83 |
| CT (LPIPS) (Song et al., 2023) | 1 | 8.70 | 8.49 |
| | 2 | 5.83 | 8.85 |
| iCT (ours) | 1 | 2.83 | 9.54 |
| | 2 | 2.46 | 9.80 |
| iCT-deep (ours) | 1 | 2.51 | 9.76 |
| | 2 | 2.24 | 9.89 |

Table 3: Comparing the quality of class-conditional samples on ImageNet 64×64.

| METHOD | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑) |
| --- | --- | --- | --- | --- |
| Fast samplers & distillation for diffusion models | | | | |
| DDIM (Song et al., 2020) | 50 | 13.7 | 0.65 | 0.56 |
| | 10 | 18.3 | 0.60 | 0.49 |
| DPM solver (Lu et al., 2022) | 10 | 7.93 | | |
| | 20 | 3.42 | | |
| DEIS (Zhang & Chen, 2022) | 10 | 6.65 | | |
| | 20 | 3.10 | | |
| DFNO (LPIPS) (Zheng et al., 2022) | 1 | 7.83 | | 0.61 |
| TRACT (Berthelot et al., 2023) | 1 | 7.43 | | |
| | 2 | 4.97 | | |
| BOOT (Gu et al., 2023) | 1 | 16.3 | 0.68 | 0.36 |
| Diff-Instruct (Luo et al., 2023) | 1 | 5.57 | | |
| PD (Salimans & Ho, 2022) | 1 | 15.39 | 0.59 | 0.62 |
| | 2 | 8.95 | 0.63 | 0.65 |
| | 4 | 6.77 | 0.66 | 0.65 |
| PD (LPIPS) (Song et al., 2023) | 1 | 7.88 | 0.66 | 0.63 |
| | 2 | 5.74 | 0.67 | 0.65 |
| | 4 | 4.92 | 0.68 | 0.65 |
| CD (LPIPS) (Song et al., 2023) | 1 | 6.20 | 0.68 | 0.63 |
| | 2 | 4.70 | 0.69 | 0.64 |
| | 3 | 4.32 | 0.70 | 0.64 |
| Direct generation | | | | |
| RIN (Jabri et al., 2023) | 1000 | 1.23 | | |
| DDPM (Ho et al., 2020) | 250 | 11.0 | 0.67 | 0.58 |
| iDDPM (Nichol & Dhariwal, 2021) | 250 | 2.92 | 0.74 | 0.62 |
| ADM (Dhariwal & Nichol, 2021) | 250 | 2.07 | 0.74 | 0.63 |
| EDM (Karras et al., 2022) | 511 | 1.36 | | |
| EDM (Heun) (Karras et al., 2022) | 79 | 2.44 | 0.71 | 0.67 |
| BigGAN-deep (Brock et al., 2019) | 1 | 4.06 | 0.79 | 0.48 |
| CT (LPIPS) (Song et al., 2023) | 1 | 13.0 | 0.71 | 0.47 |
| | 2 | 11.1 | 0.69 | 0.56 |
| iCT (ours) | 1 | 4.02 | 0.70 | 0.63 |
| | 2 | 3.20 | 0.73 | 0.63 |
| iCT-deep (ours) | 1 | 3.25 | 0.72 | 0.63 |
| | 2 | 2.77 | 0.74 | 0.62 |

Most results for existing methods are taken from previous papers, except for those marked with *, which are from our own re-implementation.

Several key observations emerge from Tables 2 and 3. First, iCT methods surpass previous diffusion distillation approaches in both one-step and two-step generation on CIFAR-10 and ImageNet 64×64, all while circumventing the need to train diffusion models. Second, iCT models demonstrate sample quality comparable to many leading generative models, including diffusion models and GANs. For instance, with one-step generation, iCT-deep obtains FIDs of 2.51 and 3.25 for CIFAR-10 and ImageNet respectively, whereas DDPMs (Ho et al., 2020) necessitate thousands of sampling steps to reach FIDs of 3.17 and 11.0 (result taken from Gu et al. (2023)) on these two datasets.
The one-step FID of iCT already exceeds that of StyleGAN2-ADA (Karras et al., 2020a) on CIFAR-10 and that of BigGAN-deep (Brock et al., 2019) on ImageNet 64×64, let alone iCT-deep models. For two-step generation, iCT-deep records an FID of 2.24, matching Score SDE in Song et al. (2021), a diffusion model with an identical architecture that demands 2000 sampling steps for an FID of 2.20. Lastly, iCT methods show better recall than CT (LPIPS) in Song et al. (2023) and BigGAN-deep, indicating better diversity and superior mode coverage.

5 CONCLUSION

Our improved techniques for CT have successfully addressed its previous limitations, surpassing the performance of CD in generating high-quality samples without relying on LPIPS. We examined the impact of weighting functions, noise embeddings, and dropout. By removing EMA from teacher networks, adopting Pseudo-Huber losses in lieu of LPIPS, and combining these with a new curriculum for discretization and a new noise sampling schedule, we have achieved unprecedented FID scores for consistency models on both the CIFAR-10 and ImageNet 64×64 datasets. Remarkably, these results outpace previous CT methods by a considerable margin, surpass previous few-step diffusion distillation techniques, and challenge the sample quality of leading diffusion models and GANs.

ACKNOWLEDGEMENTS

We would like to thank Alex Nichol, Allan Jabri, Ishaan Gulrajani, Jakub Pachocki, Mark Chen and Ilya Sutskever for discussions and support throughout this project.

REFERENCES

David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. TRACT: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248, 2023.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.

Pierre Charbonnier, Laure Blanc-Féraud, Gilles Aubert, and Michel Barlaud. Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on Image Processing, 6(2):298–311, 1997.

Ricky T. Q. Chen, Jens Behrmann, David K. Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp. 9916–9926, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233, 2021.

Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/378a063b8fdb1db941e34f4bde584c7d-Paper.pdf.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, and Joshua M. Susskind. BOOT: Data-free distillation of denoising diffusion models with bootstrapping. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 2020.

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Allan Jabri, David J. Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In Proceedings of the 40th International Conference on Machine Learning, ICML 2023. JMLR.org, 2023.

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020a.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020b.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

Dongjun Kim, Yeongmin Kim, Se Jung Kwon, Wanmo Kang, and Il-Chul Moon. Refining generative process with discriminator guidance in score-based diffusion models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 16567–16598. PMLR, 2023. URL https://proceedings.mlr.press/v202/kim23i.html.

Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet inception distance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4oXTQ6m_ws8.

Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. Towards faster and stabilized GAN training for high-fidelity few-shot image synthesis. In International Conference on Learning Representations, 2020.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. arXiv preprint arXiv:2305.18455, 2023.

Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672, 2021.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI.

Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pp. 2226–2234, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html.

Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. Advances in Neural Information Processing Systems, 34:17480–17492, 2021.

Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems 33, 2020.

Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 204, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 32211–32252. PMLR, 2023. URL https://proceedings.mlr.press/v202/song23a.html.
Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, 33:19667–19679, 2020.

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Yilun Xu, Ziming Liu, Max Tegmark, and Tommi S. Jaakkola. Poisson flow generative models. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=voV_TRqcWh.

Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. arXiv preprint arXiv:2302.04867, 2023.

Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. arXiv preprint arXiv:2211.13449, 2022.

A PROOFS

Proposition 1. Given the notations introduced in Section 3.2, and using the uniform weighting function λ(σ) = 1 along with the squared ℓ2 metric, we have

$$\lim_{N\to\infty} \mathcal{L}^N(\theta, \theta^-) = \lim_{N\to\infty} \mathcal{L}^N_{\text{CT}}(\theta, \theta^-) = \mathbb{E}\Big[\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)^2 (\theta - \theta^-)^2\Big] \quad \text{if } \theta \neq \theta^-, \tag{11}$$

$$\lim_{N\to\infty} \frac{1}{\Delta\sigma}\,\nabla_\theta \mathcal{L}^N(\theta, \theta^-) =
\begin{cases}
\nabla_\theta\, \mathbb{E}\Big[\dfrac{\sigma_{\min}}{\sigma_i^2}\Big(1 - \dfrac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \xi)^2\Big], & \theta = \theta^- \\
-\infty, & \theta < \theta^- \\
+\infty, & \theta > \theta^-.
\end{cases} \tag{12}$$

Proof. Since λ(σ) = 1 and d(x, y) = (x − y)², we can write down the CM and CT objectives as

$$\mathcal{L}^N(\theta, \theta^-) = \mathbb{E}\big[(f_\theta(x_{\sigma_{i+1}}, \sigma_{i+1}) - f_{\theta^-}(\hat{x}_{\sigma_i}, \sigma_i))^2\big] \quad \text{and} \quad \mathcal{L}^N_{\text{CT}}(\theta, \theta^-) = \mathbb{E}\big[(f_\theta(x_{\sigma_{i+1}}, \sigma_{i+1}) - f_{\theta^-}(\check{x}_{\sigma_i}, \sigma_i))^2\big],$$

respectively. Since p_data(x) = δ(x − ξ), we have p_σ(x) = N(x | ξ, σ²), and therefore ∇_x log p_σ(x) = −(x − ξ)/σ². According to the definition of x̂_{σ_i} and x_{σ_{i+1}} = ξ + σ_{i+1} z, we have

$$\hat{x}_{\sigma_i} = x_{\sigma_{i+1}} - (\sigma_i - \sigma_{i+1})\sigma_{i+1}\,\nabla_x \log p_{\sigma_{i+1}}(x_{\sigma_{i+1}}) = x_{\sigma_{i+1}} + (\sigma_i - \sigma_{i+1})\,\frac{x_{\sigma_{i+1}} - \xi}{\sigma_{i+1}} = \xi + \sigma_{i+1} z + (\sigma_i - \sigma_{i+1}) z = \xi + \sigma_i z = \check{x}_{\sigma_i}.$$

As a result, the CM and CT objectives are exactly the same, that is, L^N(θ, θ⁻) = L^N_CT(θ, θ⁻). Recall that the consistency model is defined as f_θ(x, σ) = (σ_min/σ) x + (1 − σ_min/σ)θ, so f_θ(x_σ, σ) = σ_min z + (σ_min/σ)ξ + (1 − σ_min/σ)θ. Now, let us focus on the CM objective:

$$\mathcal{L}^N(\theta, \theta^-) = \mathbb{E}\Big[\Big(\frac{\sigma_{\min}}{\sigma_{i+1}}\xi + \Big(1 - \frac{\sigma_{\min}}{\sigma_{i+1}}\Big)\theta - \frac{\sigma_{\min}}{\sigma_i}\xi - \Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)\theta^-\Big)^2\Big] = \mathbb{E}\Big[\Big(\frac{\sigma_{\min}}{\sigma_i + \Delta\sigma}\xi + \Big(1 - \frac{\sigma_{\min}}{\sigma_i + \Delta\sigma}\Big)\theta - \frac{\sigma_{\min}}{\sigma_i}\xi - \Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)\theta^-\Big)^2\Big],$$

where Δσ = (σ_max − σ_min)/(N − 1) and σ_{i+1} = σ_i + Δσ, because σ_i = σ_min + (i − 1)/(N − 1)(σ_max − σ_min).
By taking the limit N → ∞, we have Δσ → 0. Expanding σ_min/(σ_i + Δσ) = σ_min/σ_i − σ_min Δσ/σ_i² + o(Δσ) gives

$$\mathcal{L}^N(\theta, \theta^-) = \mathbb{E}\Big[\Big(\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \theta^-) + \frac{\sigma_{\min}\,\Delta\sigma}{\sigma_i^2}(\theta - \xi) + o(\Delta\sigma)\Big)^2\Big].$$

Suppose θ ≠ θ⁻. Then

$$\lim_{N\to\infty} \mathcal{L}^N(\theta, \theta^-) = \lim_{\Delta\sigma\to 0} \mathbb{E}\Big[\Big(\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \theta^-) + \frac{\sigma_{\min}\,\Delta\sigma}{\sigma_i^2}(\theta - \xi) + o(\Delta\sigma)\Big)^2\Big] = \mathbb{E}\Big[\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)^2(\theta - \theta^-)^2\Big],$$

which proves our first statement in the proposition. Now, let us consider ∇_θ L^N(θ, θ⁻). It has the following form:

$$\nabla_\theta \mathcal{L}^N(\theta, \theta^-) = 2\,\mathbb{E}\Big[\Big(\frac{\sigma_{\min}}{\sigma_{i+1}}\xi + \Big(1 - \frac{\sigma_{\min}}{\sigma_{i+1}}\Big)\theta - \frac{\sigma_{\min}}{\sigma_i}\xi - \Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)\theta^-\Big)\Big(1 - \frac{\sigma_{\min}}{\sigma_{i+1}}\Big)\Big].$$

As N → ∞ and Δσ → 0, we have

$$\lim_{N\to\infty} \nabla_\theta \mathcal{L}^N(\theta, \theta^-) =
\begin{cases}
\lim_{\Delta\sigma\to 0} 2\,\mathbb{E}\Big[\dfrac{\sigma_{\min}\Delta\sigma}{\sigma_i^2}(\theta - \xi)\Big(1 - \dfrac{\sigma_{\min}}{\sigma_i}\Big)\Big], & \theta = \theta^- \\[2mm]
\lim_{\Delta\sigma\to 0} 2\,\mathbb{E}\Big[\Big(\dfrac{\sigma_{\min}\Delta\sigma}{\sigma_i^2}(\theta - \xi) + \Big(1 - \dfrac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \theta^-)\Big)\Big(1 - \dfrac{\sigma_{\min}}{\sigma_i}\Big)\Big], & \theta \neq \theta^-.
\end{cases} \tag{13}$$

Now it becomes obvious from Eq. (13) that when θ = θ⁻, we have

$$\lim_{N\to\infty} \frac{1}{\Delta\sigma}\,\nabla_\theta \mathcal{L}^N(\theta, \theta^-) = 2\,\mathbb{E}\Big[\frac{\sigma_{\min}}{\sigma_i^2}\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)\Big](\theta - \xi) = \nabla_\theta\,\mathbb{E}\Big[\frac{\sigma_{\min}}{\sigma_i^2}\Big(1 - \frac{\sigma_{\min}}{\sigma_i}\Big)(\theta - \xi)^2\Big].$$

Moreover, we can deduce from Eq. (13) that when θ ≠ θ⁻, the limit of ∇_θ L^N(θ, θ⁻) is finite, nonzero, and has the sign of θ − θ⁻, so

$$\lim_{N\to\infty} \frac{1}{\Delta\sigma}\,\nabla_\theta \mathcal{L}^N(\theta, \theta^-) = \begin{cases} -\infty, & \theta < \theta^- \\ +\infty, & \theta > \theta^-, \end{cases}$$

which concludes the proof.

B ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

Model architecture. Unless otherwise noted, we use the NCSN++ architecture (Song et al., 2021) on CIFAR-10 and the ADM architecture (Dhariwal & Nichol, 2021) on ImageNet 64×64. For the iCT-deep models in Tables 2 and 3, we double the depth of the base architectures by increasing the number of residual blocks per resolution from 4 and 3 to 8 and 6 for CIFAR-10 and ImageNet 64×64 respectively. We use a dropout rate of 0.3 for all consistency models on CIFAR-10. For ImageNet 64×64, we use a dropout rate of 0.2, but only apply it to convolutional layers whose feature map resolution is smaller than or equal to 16×16, following the configuration in Hoogeboom et al. (2023). We also found that AdaGN, introduced in Dhariwal & Nichol (2021), hurts consistency training and opted to remove it for our ImageNet 64×64 experiments. All models on CIFAR-10 are unconditional, and all models on ImageNet 64×64 are conditioned on class labels.

Figure 4: (a) The PDF of log σ indicates that the default noise schedule in Song et al. (2023) assigns more weight to larger values of log σ, which our lognormal schedule corrects. (b) We compare the FID scores of CT using the lognormal noise schedule and the original one, where both models incorporate the improved techniques from Sections 3.1 to 3.4.

Figure 5: (a) The shapes of various metric functions, shown as d(0, x) as a function of x. (b) The ℓ2 norms of parameter updates in the Adam optimizer; curves are rescaled to have the same mean. The Pseudo-Huber metric has lower variance compared to the squared ℓ2 metric.

Training. We train all models with the RAdam optimizer (Liu et al., 2019) using learning rate 0.0001. All CIFAR-10 models are trained for 400,000 iterations, whereas ImageNet 64×64 models are trained for 800,000 iterations. For the CIFAR-10 models in Section 3, we use batch size 512 and an EMA decay rate of 0.9999 for the student network. For the iCT and iCT-deep models in Tables 2 and 3, we use batch size 1024 and an EMA decay rate of 0.99993 for CIFAR-10 models, and batch size 4096 and an EMA decay rate of 0.99997 for ImageNet 64×64 models. All models are trained on a cluster of Nvidia A100 GPUs.
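For concreteness, the optimizer and student-EMA settings described above can be wired up as follows; this is a hypothetical sketch with a placeholder network, not the released training code.

```python
# Minimal sketch of the Appendix B optimization settings: RAdam (Liu et al., 2019)
# with learning rate 1e-4, plus an EMA copy of the student (decay 0.99993 for the
# CIFAR-10 iCT configuration). The network below is a stand-in, not the real model.
import copy
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16))     # placeholder for F_theta
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-4)
ema_model = copy.deepcopy(model)
EMA_DECAY = 0.99993                                       # student EMA used at inference time

@torch.no_grad()
def update_student_ema():
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(EMA_DECAY).add_(p, alpha=1 - EMA_DECAY)
```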
Pseudo-Huber losses and variance reduction. In Fig. 5, we provide additional analysis of the Pseudo-Huber metric proposed in Section 3.3. We show the shapes of the squared ℓ2 metric, as well as Pseudo-Huber losses with various values of c, in Fig. 5a, illustrating that Pseudo-Huber losses smoothly interpolate between the ℓ1 and squared ℓ2 metrics. In Fig. 5b, we plot the ℓ2 norms of parameter updates retrieved from the Adam optimizer for models trained with the squared ℓ2 and Pseudo-Huber metrics. We observe that the Pseudo-Huber metric leads to lower variance than the squared ℓ2 metric, which is consistent with our hypothesis in Section 3.3.

Samples. We provide additional uncurated samples from iCT and iCT-deep models on both CIFAR-10 and ImageNet 64×64; see Figs. 6 to 9. For two-step sampling, the intermediate noise level σ_{i_2} is 0.821 for CIFAR-10 and 1.526 for ImageNet 64×64 when using iCT. When employing iCT-deep, σ_{i_2} is 0.661 for CIFAR-10 and 0.973 for ImageNet 64×64.

Figure 6: Uncurated samples from iCT models on CIFAR-10. (a) One-step samples (FID = 2.83). (b) Two-step samples (FID = 2.46). All corresponding samples use the same initial noise.

Figure 7: Uncurated samples from iCT-deep models on CIFAR-10. (a) One-step samples (FID = 2.51). (b) Two-step samples (FID = 2.24). All corresponding samples use the same initial noise.

Figure 8: Uncurated samples from iCT models on ImageNet 64×64. (a) One-step samples (FID = 4.02). (b) Two-step samples (FID = 3.20). All corresponding samples use the same initial noise.

Figure 9: Uncurated samples from iCT-deep models on ImageNet 64×64. (a) One-step samples (FID = 3.25). (b) Two-step samples (FID = 2.77). All corresponding samples use the same initial noise.