# Consistency Models

Yang Song 1, Prafulla Dhariwal 1, Mark Chen 1, Ilya Sutskever 1

Abstract

Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64×64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64×64 and LSUN 256×256.

1 OpenAI, San Francisco, CA 94110, USA. Correspondence to: Yang Song. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 1: Given a Probability Flow (PF) ODE that smoothly converts data to noise, we learn to map any point (e.g., x_t, x_t', and x_T) on the ODE trajectory to its origin (e.g., x_0) for generative modeling. Models of these mappings are called consistency models, as their outputs are trained to be consistent for points on the same trajectory.

## 1. Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; 2020; Ho et al., 2020; Song et al., 2021), also known as score-based generative models, have achieved unprecedented success across multiple fields, including image generation (Dhariwal & Nichol, 2021; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022), audio synthesis (Kong et al., 2020; Chen et al., 2021; Popov et al., 2021), and video generation (Ho et al., 2022b;a). A key feature of diffusion models is the iterative sampling process which progressively removes noise from random initial vectors. This iterative process provides a flexible trade-off between compute and sample quality, as using extra compute for more iterations usually yields samples of better quality. It is also the crux of many zero-shot data editing capabilities of diffusion models, enabling them to solve challenging inverse problems ranging from image inpainting, colorization, and stroke-guided image editing to Computed Tomography and Magnetic Resonance Imaging (Song & Ermon, 2019; Song et al., 2021; 2022; 2023; Kawar et al., 2021; 2022; Chung et al., 2023; Meng et al., 2021).
However, compared to single-step generative models like GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2014; Rezende et al., 2014), or normalizing flows (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018), the iterative generation procedure of diffusion models typically requires 10–2,000 times more compute for sample generation (Song & Ermon, 2020; Ho et al., 2020; Song et al., 2021; Zhang & Chen, 2022; Lu et al., 2022), causing slow inference and limiting real-time applications.

Our objective is to create generative models that facilitate efficient, single-step generation without sacrificing important advantages of iterative sampling, such as trading compute for sample quality when necessary, as well as performing zero-shot data editing tasks. As illustrated in Fig. 1, we build on top of the probability flow (PF) ordinary differential equation (ODE) in continuous-time diffusion models (Song et al., 2021), whose trajectories smoothly transition the data distribution into a tractable noise distribution. We propose to learn a model that maps any point at any time step to the trajectory's starting point. A notable property of our model is self-consistency: points on the same trajectory map to the same initial point. We therefore refer to such models as consistency models. Consistency models allow us to generate data samples (initial points of ODE trajectories, e.g., x_0 in Fig. 1) by converting random noise vectors (endpoints of ODE trajectories, e.g., x_T in Fig. 1) with only one network evaluation. Importantly, by chaining the outputs of consistency models at multiple time steps, we can improve sample quality and perform zero-shot data editing at the cost of more compute, similar to what iterative sampling enables for diffusion models.

To train a consistency model, we offer two methods based on enforcing the self-consistency property. The first method relies on using numerical ODE solvers and a pre-trained diffusion model to generate pairs of adjacent points on a PF ODE trajectory. By minimizing the difference between model outputs for these pairs, we can effectively distill a diffusion model into a consistency model, which allows generating high-quality samples with one network evaluation. By contrast, our second method eliminates the need for a pre-trained diffusion model altogether, allowing us to train a consistency model in isolation. This approach situates consistency models as an independent family of generative models. Importantly, neither approach necessitates adversarial training, and both place only minor constraints on the architecture, allowing the use of flexible neural networks for parameterizing consistency models.

We demonstrate the efficacy of consistency models on several image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet 64×64 (Deng et al., 2009), and LSUN 256×256 (Yu et al., 2015). Empirically, we observe that as a distillation approach, consistency models outperform existing diffusion distillation methods like progressive distillation (Salimans & Ho, 2022) across a variety of datasets in few-step generation: on CIFAR-10, consistency models reach new state-of-the-art FIDs of 3.55 and 2.93 for one-step and two-step generation; on ImageNet 64×64, they achieve record-breaking FIDs of 6.20 and 4.70 with one and two network evaluations respectively.
When trained as standalone generative models, consistency models can match or surpass the quality of one-step samples from progressive distillation, despite having no access to pre-trained diffusion models. They are also able to outperform many GANs and existing non-adversarial, single-step generative models across multiple datasets. Furthermore, we show that consistency models can be used to perform a wide range of zero-shot data editing tasks, including image denoising, interpolation, inpainting, colorization, super-resolution, and stroke-guided image editing (SDEdit, Meng et al. (2021)).

## 2. Diffusion Models

Consistency models are heavily inspired by the theory of continuous-time diffusion models (Song et al., 2021; Karras et al., 2022). Diffusion models generate data by progressively perturbing data to noise via Gaussian perturbations, then creating samples from noise via sequential denoising steps. Let p_data(x) denote the data distribution. Diffusion models start by diffusing p_data(x) with a stochastic differential equation (SDE) (Song et al., 2021)

$$\mathrm{d}x_t = \mu(x_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}w_t, \qquad (1)$$

where t ∈ [0, T], T > 0 is a fixed constant, μ(·, ·) and σ(·) are the drift and diffusion coefficients respectively, and {w_t}_{t ∈ [0, T]} denotes the standard Brownian motion. We denote the distribution of x_t as p_t(x), so that p_0(x) ≡ p_data(x). A remarkable property of this SDE is the existence of an ordinary differential equation (ODE), dubbed the Probability Flow (PF) ODE by Song et al. (2021), whose solution trajectories sampled at t are distributed according to p_t(x):

$$\mathrm{d}x_t = \Big[\mu(x_t, t) - \tfrac{1}{2}\sigma(t)^2 \nabla \log p_t(x_t)\Big]\,\mathrm{d}t. \qquad (2)$$

Here ∇ log p_t(x) is the score function of p_t(x); hence diffusion models are also known as score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2021).

Typically, the SDE in Eq. (1) is designed such that p_T(x) is close to a tractable Gaussian distribution π(x). We hereafter adopt the settings in Karras et al. (2022), where μ(x, t) = 0 and σ(t) = √(2t). In this case, we have p_t(x) = p_data(x) ⊗ N(0, t²I), where ⊗ denotes the convolution operation, and π(x) = N(0, T²I). For sampling, we first train a score model s_φ(x, t) ≈ ∇ log p_t(x) via score matching (Hyvärinen & Dayan, 2005; Vincent, 2011; Song et al., 2019; Song & Ermon, 2019; Ho et al., 2020), then plug it into Eq. (2) to obtain an empirical estimate of the PF ODE, which takes the form of

$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = -t\, s_\phi(x_t, t). \qquad (3)$$

We call Eq. (3) the empirical PF ODE. Next, we sample x̂_T ~ π = N(0, T²I) to initialize the empirical PF ODE and solve it backwards in time with any numerical ODE solver, such as Euler (Song et al., 2020; 2021) and Heun solvers (Karras et al., 2022), to obtain the solution trajectory {x̂_t}_{t ∈ [0, T]}. The resulting x̂_0 can then be viewed as an approximate sample from the data distribution p_data(x). To avoid numerical instability, one typically stops the solver at t = ε, where ε is a fixed small positive number, and accepts x̂_ε as the approximate sample. Following Karras et al. (2022), we rescale image pixel values to [−1, 1], and set T = 80 and ε = 0.002.

Figure 2: Consistency models are trained to map points on any trajectory of the PF ODE to the trajectory's origin.

Diffusion models are bottlenecked by their slow sampling speed. Clearly, using ODE solvers for sampling requires iterative evaluations of the score model s_φ(x, t), which is computationally costly.
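To make this sampling procedure concrete, the sketch below solves the empirical PF ODE in Eq. (3) backwards in time with Heun's second-order method and the ρ = 7 time discretization of Karras et al. (2022). It is a minimal illustration, not the implementation used in our experiments; `score_model` is a placeholder for a pre-trained s_φ(x, t).

```python
import torch

@torch.no_grad()
def sample_pf_ode_heun(score_model, shape, T=80.0, eps=0.002, N=18, device="cpu"):
    """Solve dx/dt = -t * s_phi(x, t) from t = T down to t = eps with Heun's method."""
    rho = 7.0
    i = torch.arange(N, device=device)
    # t_1 = eps < ... < t_N = T, spaced according to the rho-schedule
    ts = (eps ** (1 / rho) + i / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho
    ts = ts.flip(0)                                   # integrate from T down to eps
    x = torch.randn(shape, device=device) * T         # x_T ~ N(0, T^2 I)
    for n in range(N - 1):
        t_cur, t_next = ts[n], ts[n + 1]
        d_cur = -t_cur * score_model(x, t_cur)        # slope of the PF ODE at (x, t_cur)
        x_euler = x + (t_next - t_cur) * d_cur        # Euler predictor
        d_next = -t_next * score_model(x_euler, t_next)
        x = x + (t_next - t_cur) * 0.5 * (d_cur + d_next)  # Heun corrector
    return x                                          # approximate sample x_eps
```

Each of the N − 1 integration steps requires one or two score-model evaluations, which is exactly the iterative cost that consistency models are designed to remove.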
Existing methods for fast sampling include faster numerical ODE solvers (Song et al., 2020; Zhang & Chen, 2022; Lu et al., 2022; Dockhorn et al., 2022), and distillation techniques (Luhman & Luhman, 2021; Salimans & Ho, 2022; Meng et al., 2022; Zheng et al., 2022). However, ODE solvers still need more than 10 evaluation steps to generate competitive samples. Most distillation methods, like Luhman & Luhman (2021) and Zheng et al. (2022), rely on collecting a large dataset of samples from the diffusion model prior to distillation, which itself is computationally expensive. To the best of our knowledge, the only distillation approach that does not suffer from this drawback is progressive distillation (PD, Salimans & Ho (2022)), with which we compare consistency models extensively in our experiments.

## 3. Consistency Models

We propose consistency models, a new type of model that supports single-step generation at the core of its design, while still allowing iterative generation for trade-offs between sample quality and compute, and for zero-shot data editing. Consistency models can be trained in either the distillation mode or the isolation mode. In the former case, consistency models distill the knowledge of pre-trained diffusion models into a single-step sampler, significantly improving on other distillation approaches in sample quality, while allowing zero-shot image editing applications. In the latter case, consistency models are trained in isolation, with no dependence on pre-trained diffusion models. This makes them an independent new class of generative models. Below we introduce the definition, parameterization, and sampling of consistency models, plus a brief discussion on their applications to zero-shot data editing.

**Definition.** Given a solution trajectory {x_t}_{t ∈ [ε, T]} of the PF ODE in Eq. (2), we define the consistency function as f : (x_t, t) ↦ x_ε. A consistency function has the property of self-consistency: its outputs are consistent for arbitrary pairs of (x_t, t) that belong to the same PF ODE trajectory, i.e., f(x_t, t) = f(x_t', t') for all t, t' ∈ [ε, T]. As illustrated in Fig. 2, the goal of a consistency model, denoted f_θ, is to estimate this consistency function f from data by learning to enforce the self-consistency property (details in Sections 4 and 5). Note that a similar definition is used for neural flows (Biloš et al., 2021) in the context of neural ODEs (Chen et al., 2018). Compared to neural flows, however, we do not require consistency models to be invertible.

**Parameterization.** For any consistency function f(·, ·), we have f(x_ε, ε) = x_ε, i.e., f(·, ε) is an identity function. We call this constraint the boundary condition. All consistency models have to meet this boundary condition, as it plays a crucial role in the successful training of consistency models. This boundary condition is also the most confining architectural constraint on consistency models. For consistency models based on deep neural networks, we discuss two ways to implement this boundary condition almost for free. Suppose we have a free-form deep neural network F_θ(x, t) whose output has the same dimensionality as x. The first way is to simply parameterize the consistency model as

$$f_\theta(x, t) = \begin{cases} x, & t = \epsilon \\ F_\theta(x, t), & t \in (\epsilon, T] \end{cases} \qquad (4)$$

The second method is to parameterize the consistency model using skip connections, that is,

$$f_\theta(x, t) = c_{\mathrm{skip}}(t)\, x + c_{\mathrm{out}}(t)\, F_\theta(x, t), \qquad (5)$$

where c_skip(t) and c_out(t) are differentiable functions such that c_skip(ε) = 1 and c_out(ε) = 0.
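The sketch below illustrates the skip-connection parameterization of Eq. (5) in PyTorch. The specific coefficient functions are one choice that satisfies c_skip(ε) = 1 and c_out(ε) = 0, written in the spirit of EDM-style preconditioning; the exact functions and constants used in our experiments are given in Appendix C, so treat the forms below as illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConsistencyModel(nn.Module):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t), Eq. (5)."""

    def __init__(self, backbone: nn.Module, eps: float = 0.002, sigma_data: float = 0.5):
        super().__init__()
        self.backbone = backbone          # free-form network F_theta(x, t)
        self.eps = eps
        self.sigma_data = sigma_data

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_ = t.view(-1, *([1] * (x.ndim - 1)))        # broadcast t over image dimensions
        # One choice with c_skip(eps) = 1 and c_out(eps) = 0 (illustrative, EDM-style):
        c_skip = self.sigma_data**2 / ((t_ - self.eps) ** 2 + self.sigma_data**2)
        c_out = self.sigma_data * (t_ - self.eps) / (self.sigma_data**2 + t_**2).sqrt()
        return c_skip * x + c_out * self.backbone(x, t)
```

Because the boundary condition is baked into the parameterization, the network F_θ itself can be any architecture whose output matches the shape of x, e.g., a diffusion-model U-Net.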
This way, the consistency model is differentiable at t = ε if F_θ(x, t), c_skip(t), and c_out(t) are all differentiable, which is critical for training continuous-time consistency models (Appendices B.1 and B.2). The parameterization in Eq. (5) bears strong resemblance to many successful diffusion models (Karras et al., 2022; Balaji et al., 2022), making it easier to borrow powerful diffusion model architectures for constructing consistency models. We therefore follow the second parameterization in all experiments.

**Sampling.** With a well-trained consistency model f_θ(·, ·), we can generate samples by sampling from the initial distribution x̂_T ~ N(0, T²I) and then evaluating the consistency model for x̂_ε = f_θ(x̂_T, T). This involves only one forward pass through the consistency model and therefore generates samples in a single step. Importantly, one can also evaluate the consistency model multiple times by alternating denoising and noise injection steps for improved sample quality. Summarized in Algorithm 1, this multistep sampling procedure provides the flexibility to trade compute for sample quality. It also has important applications in zero-shot data editing.

Algorithm 1 Multistep Consistency Sampling
Input: consistency model f_θ(·, ·), sequence of time points τ_1 > τ_2 > ... > τ_{N−1}, initial noise x̂_T
x ← f_θ(x̂_T, T)
for n = 1 to N − 1 do
  Sample z ~ N(0, I)
  x̂_{τ_n} ← x + √(τ_n² − ε²) z
  x ← f_θ(x̂_{τ_n}, τ_n)
end for
Output: x

In practice, we find the time points {τ_1, τ_2, ..., τ_{N−1}} in Algorithm 1 with a greedy algorithm, where the time points are pinpointed one at a time using ternary search to optimize the FID of samples obtained from Algorithm 1. This assumes that, given prior time points, the FID is a unimodal function of the next time point. We find this assumption to hold empirically in our experiments, and leave the exploration of better strategies as future work.

**Zero-Shot Data Editing.** Similar to diffusion models, consistency models enable various data editing and manipulation applications in zero shot; they do not require explicit training to perform these tasks. For example, consistency models define a one-to-one mapping from a Gaussian noise vector to a data sample. Similar to latent variable models like GANs, VAEs, and normalizing flows, consistency models can easily interpolate between samples by traversing the latent space (Fig. 11). As consistency models are trained to recover x_ε from any noisy input x_t where t ∈ [ε, T], they can perform denoising for various noise levels (Fig. 12). Moreover, the multistep generation procedure in Algorithm 1 is useful for solving certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2022b). This enables many applications in the context of image editing, including inpainting (Fig. 10), colorization (Fig. 8), super-resolution (Fig. 6b) and stroke-guided image editing (Fig. 13) as in SDEdit (Meng et al., 2021). In Section 6.3, we empirically demonstrate the power of consistency models on many zero-shot image editing tasks; a minimal code sketch of Algorithm 1 is shown below.
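The following sketch writes out Algorithm 1 in PyTorch. `model` is a placeholder for a trained consistency model f_θ, and the code is illustrative rather than the implementation used in our experiments; the comment inside the loop indicates where an iterative-replacement step for zero-shot editing (e.g., inpainting) would go.

```python
import torch

@torch.no_grad()
def multistep_consistency_sampling(model, shape, taus, T=80.0, eps=0.002, device="cpu"):
    """Algorithm 1: alternate denoising with f_theta and Gaussian noise injection.
    `taus` is a decreasing sequence of time points with T > tau_1 > ... > tau_{N-1} > eps."""
    x_T = torch.randn(shape, device=device) * T                   # initial noise ~ N(0, T^2 I)
    x = model(x_T, torch.full((shape[0],), T, device=device))     # one-step sample
    for tau in taus:
        z = torch.randn_like(x)
        x_tau = x + (tau**2 - eps**2) ** 0.5 * z                  # re-noise the estimate to time tau
        x = model(x_tau, torch.full((shape[0],), tau, device=device))
        # Zero-shot editing hook: overwrite the known pixels of `x` with reference
        # values here (iterative replacement) before the next denoising step.
    return x
```

With an empty `taus` the loop reduces to single-step generation; longer schedules trade extra network evaluations for better sample quality.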
## 4. Training Consistency Models via Distillation

We present our first method for training consistency models based on distilling a pre-trained score model s_φ(x, t). Our discussion revolves around the empirical PF ODE in Eq. (3), obtained by plugging the score model s_φ(x, t) into the PF ODE. Consider discretizing the time horizon [ε, T] into N − 1 sub-intervals, with boundaries t_1 = ε < t_2 < ... < t_N = T. In practice, we follow Karras et al. (2022) to determine the boundaries with the formula

$$t_i = \Big(\epsilon^{1/\rho} + \tfrac{i-1}{N-1}\big(T^{1/\rho} - \epsilon^{1/\rho}\big)\Big)^{\rho}, \qquad \rho = 7.$$

When N is sufficiently large, we can obtain an accurate estimate of x_{t_n} from x_{t_{n+1}} by running one discretization step of a numerical ODE solver. This estimate, which we denote as x̂^φ_{t_n}, is defined by

$$\hat{x}^{\phi}_{t_n} := x_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(x_{t_{n+1}}, t_{n+1}; \phi), \qquad (6)$$

where Φ(·, ·; φ) represents the update function of a one-step ODE solver applied to the empirical PF ODE. For example, when using the Euler solver, we have Φ(x, t; φ) = −t s_φ(x, t), which corresponds to the update rule

$$\hat{x}^{\phi}_{t_n} = x_{t_{n+1}} - (t_n - t_{n+1})\, t_{n+1}\, s_\phi(x_{t_{n+1}}, t_{n+1}).$$

For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.

Due to the connection between the PF ODE in Eq. (2) and the SDE in Eq. (1) (see Section 2), one can sample along the distribution of ODE trajectories by first sampling x ~ p_data, then adding Gaussian noise to x. Specifically, given a data point x, we can generate a pair of adjacent data points (x̂^φ_{t_n}, x_{t_{n+1}}) on the PF ODE trajectory efficiently by sampling x from the dataset, followed by sampling x_{t_{n+1}} from the transition density of the SDE, N(x, t²_{n+1} I), and then computing x̂^φ_{t_n} using one discretization step of the numerical ODE solver according to Eq. (6). Afterwards, we train the consistency model by minimizing its output differences on the pair (x̂^φ_{t_n}, x_{t_{n+1}}). This motivates the following consistency distillation loss for training consistency models.

Definition 1. The consistency distillation loss is defined as

$$\mathcal{L}^{N}_{\mathrm{CD}}(\theta, \theta^{-}; \phi) := \mathbb{E}\big[\lambda(t_n)\, d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^{-}}(\hat{x}^{\phi}_{t_n}, t_n)\big)\big], \qquad (7)$$

where the expectation is taken with respect to x ~ p_data, n ~ U⟦1, N − 1⟧, and x_{t_{n+1}} ~ N(x; t²_{n+1} I). Here U⟦1, N − 1⟧ denotes the uniform distribution over {1, 2, ..., N − 1}, λ(·) ∈ ℝ⁺ is a positive weighting function, x̂^φ_{t_n} is given by Eq. (6), θ⁻ denotes a running average of the past values of θ during the course of optimization, and d(·, ·) is a metric function that satisfies d(x, y) ≥ 0 for all x, y, and d(x, y) = 0 if and only if x = y.

Unless otherwise stated, we adopt the notations in Definition 1 throughout this paper, and use E[·] to denote the expectation over all random variables. In our experiments, we consider the squared ℓ2 distance d(x, y) = ‖x − y‖²₂, the ℓ1 distance d(x, y) = ‖x − y‖₁, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)). We find λ(t_n) ≡ 1 performs well across all tasks and datasets. In practice, we minimize the objective by stochastic gradient descent on the model parameters θ, while updating θ⁻ with an exponential moving average (EMA). That is, given a decay rate 0 ≤ μ < 1, we perform the following update after each optimization step:

$$\theta^{-} \leftarrow \mathrm{stopgrad}\big(\mu \theta^{-} + (1 - \mu)\theta\big). \qquad (8)$$

The overall training procedure is summarized in Algorithm 2. In alignment with the convention in deep reinforcement learning (Mnih et al., 2013; 2015; Lillicrap et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020), we refer to f_θ⁻ as the "target network" and f_θ as the "online network".

Algorithm 2 Consistency Distillation (CD)
Input: dataset D, initial model parameter θ, learning rate η, ODE solver Φ(·, ·; φ), metric d(·, ·), weighting λ(·), and EMA decay μ
θ⁻ ← θ
repeat
  Sample x ~ D and n ~ U⟦1, N − 1⟧
  Sample x_{t_{n+1}} ~ N(x; t²_{n+1} I)
  x̂^φ_{t_n} ← x_{t_{n+1}} + (t_n − t_{n+1}) Φ(x_{t_{n+1}}, t_{n+1}; φ)
  L(θ, θ⁻; φ) ← λ(t_n) d(f_θ(x_{t_{n+1}}, t_{n+1}), f_{θ⁻}(x̂^φ_{t_n}, t_n))
  θ ← θ − η ∇_θ L(θ, θ⁻; φ)
  θ⁻ ← stopgrad(μ θ⁻ + (1 − μ) θ)
until convergence
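A compact sketch of one Algorithm 2 training step is shown below. It is illustrative only: `online`, `target`, and `score_model` stand in for f_θ, its EMA copy f_{θ⁻} (typically initialized as a deep copy of the online network), and the pre-trained s_φ, and the squared ℓ2 metric is used in place of LPIPS.

```python
import torch

def karras_boundaries(N, eps=0.002, T=80.0, rho=7.0):
    """Time boundaries t_1 = eps < ... < t_N = T from Karras et al. (2022)."""
    i = torch.arange(1, N + 1, dtype=torch.float32)
    return (eps ** (1 / rho) + (i - 1) / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho

def cd_step(online, target, score_model, optimizer, x, N=18, mu=0.95):
    """One consistency distillation step (Algorithm 2), using an Euler step for Eq. (6)."""
    ts = karras_boundaries(N).to(x.device)
    n = torch.randint(0, N - 1, (x.shape[0],), device=x.device)      # n ~ U[1, N-1], 0-indexed
    t_n, t_np1 = ts[n], ts[n + 1]
    pad = (-1,) + (1,) * (x.ndim - 1)                                 # broadcast shape for times
    x_tnp1 = x + t_np1.view(pad) * torch.randn_like(x)                # x_{t_{n+1}} ~ N(x, t_{n+1}^2 I)
    with torch.no_grad():
        # One Euler step of the empirical PF ODE gives \hat{x}^phi_{t_n}
        x_tn = x_tnp1 - (t_n - t_np1).view(pad) * t_np1.view(pad) * score_model(x_tnp1, t_np1)
        tgt = target(x_tn, t_n)                                       # f_{theta^-}(\hat{x}^phi_{t_n}, t_n)
    loss = ((online(x_tnp1, t_np1) - tgt) ** 2).mean()                # lambda = 1, squared l2 metric
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                                             # EMA / stopgrad update, Eq. (8)
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(mu).add_(p_o, alpha=1 - mu)
    return loss.item()
```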
We find that, compared to simply setting θ⁻ = θ, the EMA update and stopgrad operator in Eq. (8) can greatly stabilize the training process and improve the final performance of the consistency model. Below we provide a theoretical justification for consistency distillation based on asymptotic analysis.

Theorem 1. Let Δt := max_{n ∈ ⟦1, N−1⟧} {|t_{n+1} − t_n|}, and f(·, ·; φ) be the consistency function of the empirical PF ODE in Eq. (3). Assume f_θ satisfies the Lipschitz condition: there exists L > 0 such that for all t ∈ [ε, T], x, and y, we have ‖f_θ(x, t) − f_θ(y, t)‖₂ ≤ L‖x − y‖₂. Assume further that for all n ∈ ⟦1, N − 1⟧, the ODE solver called at t_{n+1} has local error uniformly bounded by O((t_{n+1} − t_n)^{p+1}) with p ≥ 1. Then, if L^N_CD(θ, θ; φ) = 0, we have

$$\sup_{n, x} \|f_\theta(x, t_n) - f(x, t_n; \phi)\|_2 = O\big((\Delta t)^p\big).$$

Proof. The proof is based on induction and parallels the classic proof of global error bounds for numerical ODE solvers (Süli & Mayers, 2003). We provide the full proof in Appendix A.2.

Since θ⁻ is a running average of the history of θ, we have θ⁻ = θ when the optimization of Algorithm 2 converges. That is, the target and online consistency models will eventually match each other. If the consistency model additionally achieves zero consistency distillation loss, then Theorem 1 implies that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate, as long as the step size of the ODE solver is sufficiently small. Importantly, our boundary condition f_θ(x, ε) = x precludes the trivial solution f_θ(x, t) ≡ 0 from arising in consistency model training.

The consistency distillation loss L^N_CD(θ, θ⁻; φ) can be extended to hold for infinitely many time steps (N → ∞) if θ⁻ = θ or θ⁻ = stopgrad(θ). The resulting continuous-time loss functions do not require specifying N nor the time steps {t_1, t_2, ..., t_N}. Nonetheless, they involve Jacobian-vector products and require forward-mode automatic differentiation for efficient implementation, which may not be well supported in some deep learning frameworks. We provide these continuous-time distillation loss functions in Theorems 3 to 5, and relegate details to Appendix B.1.

## 5. Training Consistency Models in Isolation

Consistency models can be trained without relying on any pre-trained diffusion models. This differs from existing diffusion distillation techniques, making consistency models a new independent family of generative models.

Recall that in consistency distillation, we rely on a pre-trained score model s_φ(x, t) to approximate the ground truth score function ∇ log p_t(x). It turns out that we can avoid this pre-trained score model altogether by leveraging the following unbiased estimator (Lemma 1 in Appendix A):

$$\nabla \log p_t(x_t) = \mathbb{E}\Big[-\frac{x_t - x}{t^2} \,\Big|\, x_t\Big],$$

where x ~ p_data and x_t ~ N(x; t²I). That is, given x and x_t, we can estimate ∇ log p_t(x_t) with −(x_t − x)/t².
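As a quick sanity check on this identity, the NumPy sketch below compares a self-normalized importance-sampling estimate of E[−(x_t − x)/t² | x_t] against the analytic score of p_t in the special case where p_data is a one-dimensional Gaussian, an assumption made purely so the true score is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, t = 1.5, 0.7, 2.0     # p_data = N(m, s^2); perturbation scale t
x_t = 0.3                   # query point at which to estimate the score

# True score of p_t = p_data * N(0, t^2) = N(m, s^2 + t^2)
true_score = -(x_t - m) / (s**2 + t**2)

# Estimate E[-(x_t - x)/t^2 | x_t]: draw x ~ p_data and weight each sample
# by p(x_t | x) = N(x_t; x, t^2), i.e. self-normalized importance sampling.
x = rng.normal(m, s, size=1_000_000)
log_w = -0.5 * ((x_t - x) / t) ** 2
w = np.exp(log_w - log_w.max())
w /= w.sum()
estimate = np.sum(w * (-(x_t - x) / t**2))

print(f"true score: {true_score:.4f}, estimate: {estimate:.4f}")  # the two agree closely
```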
This unbiased estimate suffices to replace the pre-trained diffusion model in consistency distillation when using the Euler method as the ODE solver in the limit of N → ∞, as justified by the following result.

Theorem 2. Let Δt := max_{n ∈ ⟦1, N−1⟧} {|t_{n+1} − t_n|}. Assume d and f_θ⁻ are both twice continuously differentiable with bounded second derivatives, the weighting function λ(·) is bounded, and E[‖∇ log p_{t_n}(x_{t_n})‖²₂] < ∞. Assume further that we use the Euler ODE solver, and the pre-trained score model matches the ground truth, i.e., s_φ(x, t) ≡ ∇ log p_t(x) for all t ∈ [ε, T]. Then,

$$\mathcal{L}^{N}_{\mathrm{CD}}(\theta, \theta^{-}; \phi) = \mathcal{L}^{N}_{\mathrm{CT}}(\theta, \theta^{-}) + o(\Delta t), \qquad (9)$$

where the expectation is taken with respect to x ~ p_data, n ~ U⟦1, N − 1⟧, and x_{t_{n+1}} ~ N(x; t²_{n+1} I). The consistency training objective, denoted by L^N_CT(θ, θ⁻), is defined as

$$\mathbb{E}\big[\lambda(t_n)\, d\big(f_\theta(x + t_{n+1} z, t_{n+1}),\, f_{\theta^{-}}(x + t_n z, t_n)\big)\big], \qquad (10)$$

where z ~ N(0, I). Moreover, L^N_CT(θ, θ⁻) ≥ O(Δt) if inf_N L^N_CD(θ, θ⁻; φ) > 0.

Proof. The proof is based on Taylor series expansion and properties of score functions (Lemma 1). A complete proof is provided in Appendix A.3.

We refer to Eq. (10) as the consistency training (CT) loss. Crucially, L(θ, θ⁻) only depends on the online network f_θ and the target network f_θ⁻, while being completely agnostic to the diffusion model parameters φ. The loss function L(θ, θ⁻) ≥ O(Δt) decreases at a slower rate than the remainder o(Δt), and thus will dominate the loss in Eq. (9) as N → ∞ and Δt → 0.

For improved practical performance, we propose to progressively increase N during training according to a schedule function N(·). The intuition (cf. Fig. 3d) is that the consistency training loss has less variance but more bias with respect to the underlying consistency distillation loss (i.e., the left-hand side of Eq. (9)) when N is small (i.e., Δt is large), which facilitates faster convergence at the beginning of training. On the contrary, it has more variance but less bias when N is large (i.e., Δt is small), which is desirable when closer to the end of training. For best performance, we also find that μ should change along with N, according to a schedule function μ(·). The full algorithm of consistency training is provided in Algorithm 3, and the schedule functions used in our experiments are given in Appendix C.

Algorithm 3 Consistency Training (CT)
Input: dataset D, initial model parameter θ, learning rate η, step schedule N(·), EMA decay rate schedule μ(·), metric d(·, ·), and weighting λ(·)
θ⁻ ← θ and k ← 0
repeat
  Sample x ~ D and n ~ U⟦1, N(k) − 1⟧
  Sample z ~ N(0, I)
  L(θ, θ⁻) ← λ(t_n) d(f_θ(x + t_{n+1} z, t_{n+1}), f_{θ⁻}(x + t_n z, t_n))
  θ ← θ − η ∇_θ L(θ, θ⁻)
  θ⁻ ← stopgrad(μ(k) θ⁻ + (1 − μ(k)) θ)
  k ← k + 1
until convergence

Similar to consistency distillation, the consistency training loss L^N_CT(θ, θ⁻) can be extended to hold in continuous time (i.e., N → ∞) if θ⁻ = stopgrad(θ), as shown in Theorem 6. This continuous-time loss function does not require schedule functions for N or μ, but requires forward-mode automatic differentiation for efficient implementation. Unlike the discrete-time CT loss, there is no undesirable bias associated with the continuous-time objective, as we effectively take Δt → 0 in Theorem 2. We relegate more details to Appendix B.2.
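For concreteness, here is a sketch of one consistency training step (Algorithm 3) in PyTorch. As with the earlier sketches it is illustrative only: `online` and `target` stand in for f_θ and f_{θ⁻}, the squared ℓ2 metric replaces LPIPS, and `N_k` and `mu_k` would be supplied by the schedule functions N(·) and μ(·).

```python
import torch

def ct_step(online, target, optimizer, x, N_k, mu_k, eps=0.002, T=80.0, rho=7.0):
    """One consistency training step (Algorithm 3). No pre-trained score model is needed:
    x + t_n * z and x + t_{n+1} * z share the same data point x and Gaussian noise z."""
    i = torch.arange(1, N_k + 1, dtype=torch.float32, device=x.device)
    ts = (eps ** (1 / rho) + (i - 1) / (N_k - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho
    n = torch.randint(0, N_k - 1, (x.shape[0],), device=x.device)
    pad = (-1,) + (1,) * (x.ndim - 1)
    t_n, t_np1 = ts[n].view(pad), ts[n + 1].view(pad)
    z = torch.randn_like(x)                                  # shared noise for both time points
    with torch.no_grad():
        tgt = target(x + t_n * z, t_n.flatten())             # f_{theta^-}(x + t_n z, t_n)
    loss = ((online(x + t_np1 * z, t_np1.flatten()) - tgt) ** 2).mean()   # lambda = 1, squared l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                                    # EMA target update with decay mu(k)
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(mu_k).add_(p_o, alpha=1 - mu_k)
    return loss.item()
```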
## 6. Experiments

We employ consistency distillation and consistency training to learn consistency models on real image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet 64×64 (Deng et al., 2009), LSUN Bedroom 256×256, and LSUN Cat 256×256 (Yu et al., 2015). Results are compared according to Fréchet Inception Distance (FID, Heusel et al. (2017), lower is better), Inception Score (IS, Salimans et al. (2016), higher is better), Precision (Prec., Kynkäänniemi et al. (2019), higher is better), and Recall (Rec., Kynkäänniemi et al. (2019), higher is better). Additional experimental details are provided in Appendix C.

### 6.1. Training Consistency Models

We perform a series of experiments on CIFAR-10 to understand the effect of various hyperparameters on the performance of consistency models trained by consistency distillation (CD) and consistency training (CT). We first focus on the effect of the metric function d(·, ·), the ODE solver, and the number of discretization steps N in CD, then investigate the effect of the schedule functions N(·) and μ(·) in CT. To set up our experiments for CD, we consider the squared ℓ2 distance d(x, y) = ‖x − y‖²₂, the ℓ1 distance d(x, y) = ‖x − y‖₁, and the Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)) as the metric function. For the ODE solver, we compare Euler's forward method and Heun's second-order method as detailed in Karras et al. (2022). For the number of discretization steps N, we compare N ∈ {9, 12, 18, 36, 50, 60, 80, 120}. All consistency models trained by CD in our experiments are initialized with the corresponding pre-trained diffusion models, whereas models trained by CT are randomly initialized.

As visualized in Fig. 3a, the optimal metric for CD is LPIPS, which outperforms both ℓ1 and ℓ2 by a large margin over all training iterations. This is expected, as the outputs of consistency models are images on CIFAR-10, and LPIPS is specifically designed for measuring the similarity between natural images. Next, we investigate which ODE solver and which discretization step N work best for CD. As shown in Figs. 3b and 3c, the Heun ODE solver and N = 18 are the best choices. Both are in line with the recommendations of Karras et al. (2022), despite the fact that we are training consistency models, not diffusion models. Moreover, Fig. 3b shows that with the same N, Heun's second-order solver uniformly outperforms Euler's first-order solver. This corroborates Theorem 1, which states that the optimal consistency models trained by higher-order ODE solvers have smaller estimation errors with the same N. The results of Fig. 3c also indicate that once N is sufficiently large, the performance of CD becomes insensitive to N. Given these insights, we hereafter use LPIPS and the Heun ODE solver for CD unless otherwise stated. For N in CD, we follow the suggestions in Karras et al. (2022) on CIFAR-10 and ImageNet 64×64. We tune N separately on other datasets (details in Appendix C).

Figure 3: Various factors that affect consistency distillation (CD) and consistency training (CT) on CIFAR-10. (a) Metric functions in CD. (b) Solvers and N in CD. (c) N with Heun solver in CD. (d) Adaptive N and μ in CT. The best configuration for CD is LPIPS, Heun ODE solver, and N = 18. Our adaptive schedule functions for N and μ make CT converge significantly faster than fixing them to be constants during the course of optimization.

Figure 4: Multistep image generation with consistency distillation (CD) on (a) CIFAR-10, (b) ImageNet 64×64, (c) Bedroom 256×256, and (d) Cat 256×256. CD outperforms progressive distillation (PD) across all datasets and sampling steps. The only exception is single-step generation on Bedroom 256×256.

Due to the strong connection between CD and CT, we adopt LPIPS for our CT experiments throughout this paper. Unlike CD, there is no need for using Heun's second-order solver in CT, as the loss function does not rely on any particular numerical ODE solver. As demonstrated in Fig. 3d, the convergence of CT is highly sensitive to N: smaller N leads to faster convergence but worse samples, whereas larger N leads to slower convergence but better samples upon convergence. This matches our analysis in Section 5, and motivates our practical choice of progressively growing N and μ for CT to balance the trade-off between convergence speed and sample quality.
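The sketch below shows one way such adaptive schedules can be implemented. The functional forms and constants (s0, s1, mu0) are illustrative assumptions in the spirit of the schedules described in Appendix C, not the exact functions used for our reported results.

```python
import math

def ct_schedules(k, K, s0=2, s1=150, mu0=0.9):
    """Illustrative adaptive schedules for consistency training: the number of
    discretization steps N(k) grows from roughly s0 to s1 over K training iterations,
    and the EMA decay mu(k) is tied to N(k) so the target network moves faster early on."""
    N_k = math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2) - 1) + 1
    mu_k = math.exp(s0 * math.log(mu0) / N_k)
    return N_k, mu_k

# Example: N grows and mu approaches 1 as training proceeds.
for k in [0, 100_000, 200_000, 400_000]:
    print(k, ct_schedules(k, K=400_000))
```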
As shown in Fig. 3d, adaptive schedules of N and μ significantly improve the convergence speed and sample quality of CT. In our experiments, we tune the schedules N(·) and μ(·) separately for images of different resolutions, with more details in Appendix C.

### 6.2. Few-Step Image Generation

**Distillation.** In the current literature, the most directly comparable approach to our consistency distillation (CD) is progressive distillation (PD, Salimans & Ho (2022)); both are thus far the only distillation approaches that do not construct synthetic data before distillation. In stark contrast, other distillation techniques, such as knowledge distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022), have to prepare a large synthetic dataset by generating numerous samples from the diffusion model with expensive numerical ODE/SDE solvers. We perform a comprehensive comparison of PD and CD on CIFAR-10, ImageNet 64×64, and LSUN 256×256, with all results reported in Fig. 4. All methods distill from an EDM (Karras et al., 2022) model that we pre-trained in-house. We note that across all sampling iterations, using the LPIPS metric uniformly improves PD compared to the squared ℓ2 distance in the original paper of Salimans & Ho (2022). Both PD and CD improve as we take more sampling steps. We find that CD uniformly outperforms PD across all datasets, sampling steps, and metric functions considered, except for single-step generation on Bedroom 256×256, where CD with ℓ2 slightly underperforms PD with ℓ2. As shown in Table 1, CD even outperforms distillation approaches that require synthetic dataset construction, such as Knowledge Distillation (Luhman & Luhman, 2021) and DFNO (Zheng et al., 2022).

**Direct Generation.** In Tables 1 and 2, we compare the sample quality of consistency training (CT) with other generative models using one-step and two-step generation. We also include PD and CD results for reference. Both tables report PD results obtained with the ℓ2 metric function, as this is the default setting used in the original paper of Salimans & Ho (2022).

Table 1: Sample quality on CIFAR-10. *Methods that require synthetic data construction for distillation.
| METHOD | NFE (↓) | FID (↓) | IS (↑) |
|---|---|---|---|
| **Diffusion + Samplers** | | | |
| DDIM (Song et al., 2020) | 50 | 4.67 | |
| DDIM (Song et al., 2020) | 20 | 6.84 | |
| DDIM (Song et al., 2020) | 10 | 8.23 | |
| DPM-solver-2 (Lu et al., 2022) | 10 | 5.94 | |
| DPM-solver-fast (Lu et al., 2022) | 10 | 4.70 | |
| 3-DEIS (Zhang & Chen, 2022) | 10 | 4.17 | |
| **Diffusion + Distillation** | | | |
| Knowledge Distillation* (Luhman & Luhman, 2021) | 1 | 9.36 | |
| DFNO* (Zheng et al., 2022) | 1 | 4.12 | |
| 1-Rectified Flow (+distill) (Liu et al., 2022) | 1 | 6.18 | 9.08 |
| 2-Rectified Flow (+distill) (Liu et al., 2022) | 1 | 4.85 | 9.01 |
| 3-Rectified Flow (+distill) (Liu et al., 2022) | 1 | 5.21 | 8.79 |
| PD (Salimans & Ho, 2022) | 1 | 8.34 | 8.69 |
| CD | 1 | 3.55 | 9.48 |
| PD (Salimans & Ho, 2022) | 2 | 5.58 | 9.05 |
| CD | 2 | 2.93 | 9.75 |
| **Direct Generation** | | | |
| BigGAN (Brock et al., 2019) | 1 | 14.7 | 9.22 |
| Diffusion GAN (Xiao et al., 2022) | 1 | 14.6 | 8.93 |
| AutoGAN (Gong et al., 2019) | 1 | 12.4 | 8.55 |
| E2GAN (Tian et al., 2020) | 1 | 11.3 | 8.51 |
| ViTGAN (Lee et al., 2021) | 1 | 6.66 | 9.30 |
| TransGAN (Jiang et al., 2021) | 1 | 9.26 | 9.05 |
| StyleGAN2-ADA (Karras et al., 2020) | 1 | 2.92 | 9.83 |
| StyleGAN-XL (Sauer et al., 2022) | 1 | 1.85 | |
| Score SDE (Song et al., 2021) | 2000 | 2.20 | 9.89 |
| DDPM (Ho et al., 2020) | 1000 | 3.17 | 9.46 |
| LSGM (Vahdat et al., 2021) | 147 | 2.10 | |
| PFGM (Xu et al., 2022) | 110 | 2.35 | 9.68 |
| EDM (Karras et al., 2022) | 35 | 2.04 | 9.84 |
| 1-Rectified Flow (Liu et al., 2022) | 1 | 378 | 1.13 |
| Glow (Kingma & Dhariwal, 2018) | 1 | 48.9 | 3.92 |
| Residual Flow (Chen et al., 2019) | 1 | 46.4 | |
| GLFlow (Xiao et al., 2019) | 1 | 44.6 | |
| DenseFlow (Grcić et al., 2021) | 1 | 34.9 | |
| DC-VAE (Parmar et al., 2021) | 1 | 17.9 | 8.20 |
| CT | 1 | 8.70 | 8.49 |
| CT | 2 | 5.83 | 8.85 |

Table 2: Sample quality on ImageNet 64×64, and LSUN Bedroom & Cat 256×256. †Distillation techniques.

| METHOD | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑) |
|---|---|---|---|---|
| **ImageNet 64×64** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 15.39 | 0.59 | 0.62 |
| DFNO† (Zheng et al., 2022) | 1 | 8.35 | | |
| CD† | 1 | 6.20 | 0.68 | 0.63 |
| PD† (Salimans & Ho, 2022) | 2 | 8.95 | 0.63 | 0.65 |
| CD† | 2 | 4.70 | 0.69 | 0.64 |
| ADM (Dhariwal & Nichol, 2021) | 250 | 2.07 | 0.74 | 0.63 |
| EDM (Karras et al., 2022) | 79 | 2.44 | 0.71 | 0.67 |
| BigGAN-deep (Brock et al., 2019) | 1 | 4.06 | 0.79 | 0.48 |
| CT | 1 | 13.0 | 0.71 | 0.47 |
| CT | 2 | 11.1 | 0.69 | 0.56 |
| **LSUN Bedroom 256×256** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 16.92 | 0.47 | 0.27 |
| PD† (Salimans & Ho, 2022) | 2 | 8.47 | 0.56 | 0.39 |
| CD† | 1 | 7.80 | 0.66 | 0.34 |
| CD† | 2 | 5.22 | 0.68 | 0.39 |
| DDPM (Ho et al., 2020) | 1000 | 4.89 | 0.60 | 0.45 |
| ADM (Dhariwal & Nichol, 2021) | 1000 | 1.90 | 0.66 | 0.51 |
| EDM (Karras et al., 2022) | 79 | 3.57 | 0.66 | 0.45 |
| PGGAN (Karras et al., 2018) | 1 | 8.34 | | |
| PG-SWGAN (Wu et al., 2019) | 1 | 8.0 | | |
| TDPM (GAN) (Zheng et al., 2023) | 1 | 5.24 | | |
| StyleGAN2 (Karras et al., 2020) | 1 | 2.35 | 0.59 | 0.48 |
| CT | 1 | 16.0 | 0.60 | 0.17 |
| CT | 2 | 7.85 | 0.68 | 0.33 |
| **LSUN Cat 256×256** | | | | |
| PD† (Salimans & Ho, 2022) | 1 | 29.6 | 0.51 | 0.25 |
| PD† (Salimans & Ho, 2022) | 2 | 15.5 | 0.59 | 0.36 |
| CD† | 1 | 11.0 | 0.65 | 0.36 |
| CD† | 2 | 8.84 | 0.66 | 0.40 |
| DDPM (Ho et al., 2020) | 1000 | 17.1 | 0.53 | 0.48 |
| ADM (Dhariwal & Nichol, 2021) | 1000 | 5.57 | 0.63 | 0.52 |
| EDM (Karras et al., 2022) | 79 | 6.69 | 0.70 | 0.43 |
| PGGAN (Karras et al., 2018) | 1 | 37.5 | | |
| StyleGAN2 (Karras et al., 2020) | 1 | 7.25 | 0.58 | 0.43 |
| CT | 1 | 20.7 | 0.56 | 0.23 |
| CT | 2 | 11.7 | 0.63 | 0.36 |

Figure 5: Samples generated by EDM (top), CT + single-step generation (middle), and CT + 2-step generation (bottom). All corresponding images are generated from the same initial noise.

Figure 6: Zero-shot image editing with a consistency model trained by consistency distillation on LSUN Bedroom 256×256. (a) Left: the gray-scale image. Middle: colorized images. Right: the ground-truth image. (b) Left: the downsampled image (32×32). Middle: full-resolution images (256×256). Right: the ground-truth image (256×256). (c) Left: a stroke input provided by users. Right: stroke-guided image generation.
For fair comparison, we ensure PD and CD distill the same EDM models. In Tables 1 and 2, we observe that CT outperforms existing single-step, non-adversarial generative models, i.e., VAEs and normalizing flows, by a significant margin on CIFAR-10. Moreover, CT achieves comparable quality to one-step samples from PD without relying on distillation. In Fig. 5, we provide EDM samples (top), single-step CT samples (middle), and two-step CT samples (bottom). In Appendix E, we show additional samples for both CD and CT in Figs. 14 to 21. Importantly, all samples obtained from the same initial noise vector share significant structural similarity, even though CT and EDM models are trained independently from one another. This indicates that CT is less likely to suffer from mode collapse, as EDMs do not. 6.3. Zero-Shot Image Editing Similar to diffusion models, consistency models allow zeroshot image editing by modifying the multistep sampling process in Algorithm 1. We demonstrate this capability with a consistency model trained on the LSUN bedroom dataset using consistency distillation. In Fig. 6a, we show such a consistency model can colorize gray-scale bedroom images at test time, even though it has never been trained on colorization tasks. In Fig. 6b, we show the same consistency model can generate high-resolution images from low-resolution inputs. In Fig. 6c, we additionally demonstrate that it can generate images based on stroke inputs created by humans, as in SDEdit for diffusion models (Meng et al., 2021). Again, this editing capability is zero-shot, as the model has not been trained on stroke inputs. In Appendix D, we additionally demonstrate the zero-shot capability of consistency models on inpainting (Fig. 10), interpolation (Fig. 11) and denoising (Fig. 12), with more examples on colorization (Fig. 8), super-resolution (Fig. 9) and stroke-guided image generation (Fig. 13). 7. Conclusion We have introduced consistency models, a type of generative models that are specifically designed to support one-step and few-step generation. We have empirically demonstrated that our consistency distillation method outshines the existing distillation techniques for diffusion models on multiple image benchmarks and small sampling iterations. Furthermore, as a standalone generative model, consistency models generate better samples than existing single-step generation models except for GANs. Similar to diffusion models, they also allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation. In addition, consistency models share striking similarities with techniques employed in other fields, including deep Q-learning (Mnih et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020). This offers exciting prospects for cross-pollination of ideas and methods among these diverse fields. Acknowledgements We thank Alex Nichol for reviewing the manuscript and providing valuable feedback, Chenlin Meng for providing stroke inputs needed in our stroke-guided image generation experiments, and the Open AI Algorithms team. Consistency Models Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., Karras, T., and Liu, M.-Y. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. ar Xiv preprint ar Xiv:2211.01324, 2022. Biloˇs, M., Sommer, J., Rangapuram, S. S., Januschowski, T., and G unnemann, S. 
Neural flows: Efficient alternative to neural odes. Advances in Neural Information Processing Systems, 34:21325 21337, 2021. Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=B1xsqj09Fm. Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations (ICLR), 2021. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural Ordinary Differential Equations. In Advances in neural information processing systems, pp. 6571 6583, 2018. Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp. 9916 9926, 2019. Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=On D9z GAGT0k. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (Neur IPS), 2021. Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. International Conference in Learning Representations Workshop Track, 2015. Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview. net/forum?id=Hkpbn H9lx. Dockhorn, T., Vahdat, A., and Kreis, K. Genie: Higherorder denoising diffusion solvers. ar Xiv preprint ar Xiv:2210.05475, 2022. Gong, X., Chang, S., Jiang, Y., and Wang, Z. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3224 3234, 2019. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014. Grci c, M., Grubiˇsi c, I., and ˇSegvi c, S. Densely connected normalizing flows. Advances in Neural Information Processing Systems, 34:23968 23982, 2021. Grill, J.-B., Strub, F., Altch e, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271 21284, 2020. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729 9738, 2020. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626 6637, 2017. Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models. 
Advances in Neural Information Processing Systems, 33, 2020. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. ar Xiv preprint ar Xiv:2210.02303, 2022a. Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022b. URL https://openreview. net/forum?id=BBel R2Nd DZ5. Hyv arinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6(4), 2005. Consistency Models Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34:14745 14758, 2021. Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=Hk99z Ce Ab. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. 2020. Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Proc. Neur IPS, 2022. Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. ar Xiv preprint ar Xiv:2105.14951, 2021. Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022. Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10215 10224. 2018. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diff Wave: A Versatile Diffusion Model for Audio Synthesis. ar Xiv preprint ar Xiv:2009.09761, 2020. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Kynk a anniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019. Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. ar Xiv preprint ar Xiv:2107.04589, 2021. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. ar Xiv preprint ar Xiv:1908.03265, 2019. Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. ar Xiv preprint ar Xiv:2209.03003, 2022. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. ar Xiv preprint ar Xiv:2206.00927, 2022. Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. ar Xiv preprint ar Xiv:2101.02388, 2021. 
Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. ar Xiv preprint ar Xiv:2108.01073, 2021. Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. ar Xiv preprint ar Xiv:2210.03142, 2022. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518(7540): 529 533, 2015. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mc Grew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. ar Xiv preprint ar Xiv:2112.10741, 2021. Parmar, G., Li, D., Lee, K., and Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 823 832, 2021. Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad-TTS: A diffusion probabilistic model for text-to-speech. ar Xiv preprint ar Xiv:2105.06337, 2021. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp. 1278 1286, 2014. Consistency Models Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684 10695, 2022. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. ar Xiv preprint ar Xiv:2205.11487, 2022. Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=TId IXIpzho I. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234 2242, 2016. Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp. 1 10, 2022. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pp. 2256 2265, 2015. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. ar Xiv preprint ar Xiv:2010.02502, 2020. Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/ forum?id=9_gs MA8MRKQ. Song, Y. and Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. In Advances in Neural Information Processing Systems, pp. 11918 11930, 2019. Song, Y. and Ermon, S. 
Improved Techniques for Training Score-Based Generative Models. Advances in Neural Information Processing Systems, 33, 2020. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 204, 2019. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=Px TIG12RRHS. Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=va RCHVj0u GI. S uli, E. and Mayers, D. F. An introduction to numerical analysis. Cambridge university press, 2003. Tian, Y., Wang, Q., Huang, Z., Li, W., Dai, D., Yang, M., Wang, J., and Fink, O. Off-policy reinforcement learning for efficient and effective gan architecture search. In Computer Vision ECCV 2020: 16th European Conference, Glasgow, UK, August 23 28, 2020, Proceedings, Part VII 16, pp. 175 192. Springer, 2020. Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287 11302, 2021. Vincent, P. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7): 1661 1674, 2011. Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel, D. P., and Gool, L. V. Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3713 3722, 2019. Xiao, Z., Yan, Q., and Amit, Y. Generative latent flow. ar Xiv preprint ar Xiv:1905.10485, 2019. Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=Jpr M0p-q0Co. Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. S. Poisson flow generative models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=vo V_TRqc Wh. Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. ar Xiv preprint ar Xiv:1506.03365, 2015. Zhang, Q. and Chen, Y. Fast sampling of diffusion models with exponential integrator. ar Xiv preprint ar Xiv:2204.13902, 2022. Consistency Models Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and Anandkumar, A. Fast sampling of diffusion models via operator learning. ar Xiv preprint ar Xiv:2211.13449, 2022. Zheng, H., He, P., Chen, W., and Zhou, M. Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=HDxga Kk956l. Consistency Models 1 Introduction 1 2 Diffusion Models 2 3 Consistency Models 3 4 Training Consistency Models via Distillation 4 5 Training Consistency Models in Isolation 5 6 Experiments 6 6.1 Training Consistency Models . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6.2 Few-Step Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 6.3 Zero-Shot Image Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7 Conclusion 9 Appendices 15 Appendix A Proofs 15 A.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Consistency Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.3 Consistency Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Appendix B Continuous-Time Extensions 18 B.1 Consistency Distillation in Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Consistency Training in Continuous Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 B.3 Experimental Verifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Appendix C Additional Experimental Details 25 Model Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Parameterization for Consistency Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Schedule Functions for Consistency Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Appendix D Additional Results on Zero-Shot Image Editing 26 Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Colorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Consistency Models Stroke-guided image generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Appendix E Additional Samples from Consistency Models 28 A.1. Notations We use fθpx, tq to denote a consistency model parameterized by θ, and fpx, t; ϕq the consistency function of the empirical PF ODE in Eq. (3). Here ϕ symbolizes its dependency on the pre-trained score model sϕpx, tq. For the consistency function of the PF ODE in Eq. (2), we denote it as fpx, tq. Given a multi-variate function hpx, yq, we let B1hpx, yq denote the Jacobian of h over x, and analogously B2hpx, yq denote the Jacobian of h over y. Unless otherwise stated, x is supposed to be a random variable sampled from the data distribution pdatapxq, n is sampled uniformly at random from J1, N 1K, and xtn is sampled from Npx; t2 n Iq. Here J1, N 1K represents the set of integers t1, 2, , N 1u. Furthermore, recall that we define ˆxϕ tn : xtn 1 ptn tn 1qΦpxtn 1, tn 1; ϕq, where Φp ; ϕq denotes the update function of a one-step ODE solver for the empirical PF ODE defined by the score model sϕpx, tq. By default, Er s denotes the expectation over all relevant random variables in the expression. A.2. Consistency Distillation Theorem 1. Let t : maxn PJ1,N 1Kt|tn 1 tn|u, and fp , ; ϕq be the consistency function of the empirical PF ODE in Eq. (3). 
A.2. Consistency Distillation

Theorem 1. Let $\Delta t := \max_{n\in[\![1,N-1]\!]}\{|t_{n+1}-t_n|\}$, and $f(\cdot,\cdot;\phi)$ be the consistency function of the empirical PF ODE in Eq. (3). Assume $f_\theta$ satisfies the Lipschitz condition: there exists $L > 0$ such that for all $t \in [\epsilon, T]$, $x$, and $y$, we have $\|f_\theta(x, t) - f_\theta(y, t)\|_2 \le L\|x - y\|_2$. Assume further that for all $n \in [\![1, N-1]\!]$, the ODE solver called at $t_{n+1}$ has local error uniformly bounded by $O((t_{n+1}-t_n)^{p+1})$ with $p \ge 1$. Then, if $\mathcal{L}_{CD}^N(\theta,\theta;\phi) = 0$, we have
$$\sup_{n,x}\|f_\theta(x, t_n) - f(x, t_n;\phi)\|_2 = O((\Delta t)^p).$$

Proof. From $\mathcal{L}_{CD}^N(\theta,\theta;\phi)=0$, we have
$$\mathcal{L}_{CD}^N(\theta,\theta;\phi) = \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}^\phi_{t_n}, t_n))] = 0. \tag{11}$$
According to the definition, we have $p_{t_n}(x_{t_n}) = p_\text{data}(x)\otimes\mathcal{N}(0, t_n^2 I)$ where $t_n \ge \epsilon > 0$. It follows that $p_{t_n}(x_{t_n}) > 0$ for every $x_{t_n}$ and $1 \le n \le N$. Therefore, Eq. (11) entails
$$\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}^\phi_{t_n}, t_n)) \equiv 0. \tag{12}$$
Because $\lambda(\cdot) > 0$ and $d(x, y) = 0 \Leftrightarrow x = y$, this further implies that
$$f_\theta(x_{t_{n+1}}, t_{n+1}) \equiv f_\theta(\hat{x}^\phi_{t_n}, t_n). \tag{13}$$
Now let $e_n$ represent the error vector at $t_n$, defined as $e_n := f_\theta(x_{t_n}, t_n) - f(x_{t_n}, t_n;\phi)$. We can easily derive the following recursion relation:
$$e_{n+1} = f_\theta(x_{t_{n+1}}, t_{n+1}) - f(x_{t_{n+1}}, t_{n+1};\phi) \overset{(i)}{=} f_\theta(\hat{x}^\phi_{t_n}, t_n) - f(x_{t_n}, t_n;\phi) = f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_n}, t_n) + e_n, \tag{14}$$
where (i) is due to Eq. (13) and $f(x_{t_{n+1}}, t_{n+1};\phi) = f(x_{t_n}, t_n;\phi)$. Because $f_\theta(\cdot, t_n)$ has Lipschitz constant $L$, we have
$$\|e_{n+1}\|_2 \le \|e_n\|_2 + L\|\hat{x}^\phi_{t_n} - x_{t_n}\|_2 \overset{(i)}{=} \|e_n\|_2 + L\cdot O((t_{n+1}-t_n)^{p+1}) = \|e_n\|_2 + O((t_{n+1}-t_n)^{p+1}),$$
where (i) holds because the ODE solver has local error bounded by $O((t_{n+1}-t_n)^{p+1})$. In addition, we observe that $e_1 = 0$, because
$$e_1 = f_\theta(x_{t_1}, t_1) - f(x_{t_1}, t_1;\phi) \overset{(i)}{=} x_{t_1} - f(x_{t_1}, t_1;\phi) \overset{(ii)}{=} x_{t_1} - x_{t_1} = 0.$$
Here (i) is true because the consistency model is parameterized such that $f_\theta(x_{t_1}, t_1) = x_{t_1}$, and (ii) is entailed by the definition of $f(\cdot,\cdot;\phi)$. This allows us to perform induction on the recursion formula Eq. (14) to obtain
$$\|e_n\|_2 \le \|e_1\|_2 + \sum_{k=1}^{n-1} O((t_{k+1}-t_k)^{p+1}) = \sum_{k=1}^{n-1}(t_{k+1}-t_k)\,O((t_{k+1}-t_k)^{p}) \le \sum_{k=1}^{n-1}(t_{k+1}-t_k)\,O((\Delta t)^{p}) = O((\Delta t)^{p})(t_n - t_1) \le O((\Delta t)^{p})(T - \epsilon),$$
which completes the proof.

A.3. Consistency Training

The following lemma provides an unbiased estimator for the score function, which is crucial to our proof for Theorem 2.

Lemma 1. Let $x \sim p_\text{data}(x)$, $x_t \sim \mathcal{N}(x, t^2 I)$, and $p_t(x_t) = p_\text{data}(x)\otimes\mathcal{N}(0, t^2 I)$. We have
$$\nabla\log p_t(x_t) = -\mathbb{E}\left[\frac{x_t - x}{t^2}\,\Big|\, x_t\right].$$

Proof. According to the definition of $p_t(x_t)$, we have $\nabla\log p_t(x_t) = \nabla_{x_t}\log\int p_\text{data}(x)p(x_t\mid x)\,\mathrm{d}x$, where $p(x_t\mid x) = \mathcal{N}(x_t; x, t^2 I)$. This expression can be further simplified to yield
$$\begin{aligned}
\nabla\log p_t(x_t) &= \frac{\int p_\text{data}(x)\nabla_{x_t}p(x_t\mid x)\,\mathrm{d}x}{\int p_\text{data}(x)p(x_t\mid x)\,\mathrm{d}x} = \frac{\int p_\text{data}(x)p(x_t\mid x)\nabla_{x_t}\log p(x_t\mid x)\,\mathrm{d}x}{\int p_\text{data}(x)p(x_t\mid x)\,\mathrm{d}x}\\
&= \int \frac{p_\text{data}(x)p(x_t\mid x)}{p_t(x_t)}\nabla_{x_t}\log p(x_t\mid x)\,\mathrm{d}x \overset{(i)}{=} \int p(x\mid x_t)\nabla_{x_t}\log p(x_t\mid x)\,\mathrm{d}x\\
&= \mathbb{E}[\nabla_{x_t}\log p(x_t\mid x)\mid x_t] = -\mathbb{E}\left[\frac{x_t - x}{t^2}\,\Big|\, x_t\right],
\end{aligned}$$
where (i) is due to Bayes' rule.

Theorem 2. Let $\Delta t := \max_{n\in[\![1,N-1]\!]}\{|t_{n+1}-t_n|\}$. Assume $d$ and $f_{\theta^-}$ are both twice continuously differentiable with bounded second derivatives, the weighting function $\lambda(\cdot)$ is bounded, and $\mathbb{E}[\|\nabla\log p_{t_n}(x_{t_n})\|_2^2] < \infty$. Assume further that we use the Euler ODE solver, and the pre-trained score model matches the ground truth, i.e., $\forall t\in[\epsilon, T]: s_\phi(x, t) \equiv \nabla\log p_t(x)$. Then,
$$\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = \mathcal{L}_{CT}^N(\theta,\theta^-) + o(\Delta t),$$
where the expectation is taken with respect to $x\sim p_\text{data}$, $n\sim\mathcal{U}[\![1,N-1]\!]$, and $x_{t_{n+1}}\sim\mathcal{N}(x, t_{n+1}^2 I)$. The consistency training objective, denoted by $\mathcal{L}_{CT}^N(\theta,\theta^-)$, is defined as
$$\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))],$$
where $z\sim\mathcal{N}(0, I)$. Moreover, $\mathcal{L}_{CT}^N(\theta,\theta^-) \ge O(\Delta t)$ if $\inf_N \mathcal{L}_{CD}^N(\theta,\theta^-;\phi) > 0$.
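The consistency training objective in Theorem 2 only needs clean data and a single shared Gaussian noise vector evaluated at two adjacent time steps. Below is a minimal sketch of one evaluation of this loss, assuming hypothetical `model` and `target_model` callables that play the roles of $f_\theta$ and $f_{\theta^-}$, and using the squared $\ell_2$ distance as a stand-in for the metric $d$ (the experiments in the paper mostly use LPIPS).

```python
import torch

def consistency_training_loss(model, target_model, x, t_now, t_next, lam=1.0):
    """Discrete consistency training loss of Theorem 2 for one batch.

    model:        f_theta(x, t), differentiable in its parameters
    target_model: f_{theta^-}(x, t), e.g. an EMA copy treated as a constant
    x:            batch of clean data
    t_now, t_next: adjacent time steps t_n < t_{n+1}
    lam:          weighting lambda(t_n)
    """
    z = torch.randn_like(x)                          # shared Gaussian noise
    pred = model(x + t_next * z, t_next)             # f_theta(x + t_{n+1} z, t_{n+1})
    with torch.no_grad():
        target = target_model(x + t_now * z, t_now)  # f_{theta^-}(x + t_n z, t_n)
    return lam * ((pred - target) ** 2).flatten(1).sum(dim=-1).mean()
```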
Proof. With Taylor expansion, we have
$$\begin{aligned}
\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) &= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}} + (t_{n+1}-t_n)t_{n+1}\nabla\log p_{t_{n+1}}(x_{t_{n+1}}), t_n))]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))]\\
&\quad + \mathbb{E}\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_{n+1}-t_n)t_{n+1}\nabla\log p_{t_{n+1}}(x_{t_{n+1}})]\}\\
&\quad + \mathbb{E}\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1}-t_n|)]. \tag{15}
\end{aligned}$$
Then, we apply Lemma 1 to Eq. (15) and use Taylor expansion in the reverse direction to obtain
$$\begin{aligned}
\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) &\overset{(i)}{=} \mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))]\\
&\quad + \mathbb{E}\left\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))\left[\partial_1 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})\frac{x_{t_{n+1}} - x}{t_{n+1}}\right]\right\}\\
&\quad + \mathbb{E}\{\lambda(t_n)\,\partial_2 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(x_{t_{n+1}}, t_{n+1}))[\partial_2 f_{\theta^-}(x_{t_{n+1}}, t_{n+1})(t_n - t_{n+1})]\} + \mathbb{E}[o(|t_{n+1}-t_n|)]\\
&= \mathbb{E}\left[\lambda(t_n)\, d\left(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}\left(x_{t_{n+1}} + (t_n - t_{n+1})\frac{x_{t_{n+1}} - x}{t_{n+1}}, t_n\right)\right)\right] + \mathbb{E}[o(|t_{n+1}-t_n|)]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_{n+1}z + (t_n - t_{n+1})z, t_n))] + \mathbb{E}[o(\Delta t)]\\
&= \mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))] + o(\Delta t) = \mathcal{L}_{CT}^N(\theta,\theta^-) + o(\Delta t), \tag{16}
\end{aligned}$$
where (i) is due to the law of total expectation, and $z := \frac{x_{t_{n+1}} - x}{t_{n+1}} \sim \mathcal{N}(0, I)$. This implies $\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = \mathcal{L}_{CT}^N(\theta,\theta^-) + o(\Delta t)$ and thus completes the proof for Eq. (9). Moreover, we have $\mathcal{L}_{CT}^N(\theta,\theta^-) \ge O(\Delta t)$ whenever $\inf_N \mathcal{L}_{CD}^N(\theta,\theta^-;\phi) > 0$. Otherwise, $\mathcal{L}_{CT}^N(\theta,\theta^-) < O(\Delta t)$ and thus $\lim_{\Delta t\to 0}\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = 0$, which is a clear contradiction to $\inf_N \mathcal{L}_{CD}^N(\theta,\theta^-;\phi) > 0$.

Remark 1. When the condition $\mathcal{L}_{CT}^N(\theta,\theta^-) \ge O(\Delta t)$ is not satisfied, such as in the case where $\theta^- = \mathrm{stopgrad}(\theta)$, the validity of $\mathcal{L}_{CT}^N(\theta,\theta^-)$ as a training objective for consistency models can still be justified by referencing the result provided in Theorem 6.

B. Continuous-Time Extensions

The consistency distillation and consistency training objectives can be generalized to hold for infinitely many time steps ($N\to\infty$) under suitable conditions.

B.1. Consistency Distillation in Continuous Time

Depending on whether $\theta^- = \theta$ or $\theta^- = \mathrm{stopgrad}(\theta)$ (the same as setting $\mu = 0$), there are two possible continuous-time extensions for the consistency distillation objective $\mathcal{L}_{CD}^N(\theta,\theta^-;\phi)$. Given a twice continuously differentiable metric function $d(x, y)$, we define $G(x)$ as a matrix whose $(i, j)$-th entry is given by
$$[G(x)]_{ij} := \frac{\partial^2 d(x, y)}{\partial y_i\,\partial y_j}\Big|_{y=x}.$$
Similarly, we define $H(x)$ as
$$[H(x)]_{ij} := \frac{\partial^2 d(y, x)}{\partial y_i\,\partial y_j}\Big|_{y=x}.$$
The matrices $G$ and $H$ play a crucial role in forming continuous-time objectives for consistency distillation. Additionally, we denote the Jacobian of $f_\theta(x, t)$ with respect to $x$ as $\frac{\partial f_\theta(x, t)}{\partial x}$.
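As a quick numerical sanity check of these definitions, the sketch below evaluates $G$ and $H$ with PyTorch's autograd Hessian for the squared $\ell_2$ metric, where both should come out as (approximately) $2I$. This is an illustrative snippet under that assumption, not code from the paper.

```python
import torch
from torch.autograd.functional import hessian

def metric_G(d, x):
    """[G(x)]_ij = second derivative of d(x, y) in y, evaluated at y = x."""
    return hessian(lambda y: d(x, y), x.clone())

def metric_H(d, x):
    """[H(x)]_ij = second derivative of d(y, x) in y, evaluated at y = x."""
    return hessian(lambda y: d(y, x), x.clone())

if __name__ == "__main__":
    d_l2 = lambda a, b: ((a - b) ** 2).sum()    # squared l2 metric
    x = torch.randn(3)
    print(metric_G(d_l2, x))                    # ~ 2 * I
    print(metric_H(d_l2, x))                    # ~ 2 * I
```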
When $\theta^- = \theta$ (with no stopgrad operator), we have the following theoretical result.

Theorem 3. Let $t_n = \tau(\frac{n-1}{N-1})$, where $n\in[\![1, N]\!]$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t\in[\epsilon,T]}\|s_\phi(x, t)\|_2 < \infty$. Then with the Euler solver in consistency distillation, we have
$$\lim_{N\to\infty}(N-1)^2\mathcal{L}_{CD}^N(\theta,\theta;\phi) = \mathcal{L}_{CD}^\infty(\theta,\theta;\phi), \tag{17}$$
where $\mathcal{L}_{CD}^\infty(\theta,\theta;\phi)$ is defined as
$$\mathbb{E}\left[\frac{\lambda(t)}{2[(\tau^{-1})'(t)]^2}\left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\frac{\partial f_\theta(x_t, t)}{\partial x_t}s_\phi(x_t, t)\right)^{\!T} G(f_\theta(x_t, t))\left(\frac{\partial f_\theta(x_t, t)}{\partial t} - t\frac{\partial f_\theta(x_t, t)}{\partial x_t}s_\phi(x_t, t)\right)\right]. \tag{18}$$
Here the expectation above is taken over $x\sim p_\text{data}$, $u\sim\mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t\sim\mathcal{N}(x, t^2 I)$.

Proof. Let $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. First, we can derive the following equation with Taylor expansion:
$$\begin{aligned}
f_\theta(\hat{x}^\phi_{t_n}, t_n) &= f_\theta(x_{t_{n+1}} + t_{n+1}s_\phi(x_{t_{n+1}}, t_{n+1})\tau'(u_n)\Delta u,\; t_n)\\
&= f_\theta(x_{t_{n+1}}, t_{n+1}) + t_{n+1}\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\tau'(u_n)\Delta u - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\tau'(u_n)\Delta u + O((\Delta u)^2). \tag{19}
\end{aligned}$$
Note that $\tau'(u_n) = \frac{1}{(\tau^{-1})'(t_{n+1})}$. Then, we apply Taylor expansion to the consistency distillation loss, which gives
$$\begin{aligned}
(N-1)^2\mathcal{L}_{CD}^N(\theta,\theta;\phi) &= \frac{1}{(\Delta u)^2}\mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_\theta(\hat{x}^\phi_{t_n}, t_n))]\\
&\overset{(i)}{=} \frac{1}{2(\Delta u)^2}\mathbb{E}\{\lambda(t_n)[f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]^T G(f_\theta(x_{t_{n+1}}, t_{n+1}))[f_\theta(\hat{x}^\phi_{t_n}, t_n) - f_\theta(x_{t_{n+1}}, t_{n+1})]\} + \mathbb{E}[O(|\Delta u|)]\\
&\overset{(ii)}{=} \frac{1}{2}\mathbb{E}\left[\lambda(t_n)\tau'(u_n)^2\left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\right)^{\!T}\right.\\
&\qquad\left. G(f_\theta(x_{t_{n+1}}, t_{n+1}))\left(\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\right)\right] + \mathbb{E}[O(|\Delta u|)], \tag{20}
\end{aligned}$$
where we obtain (i) by expanding $d(f_\theta(x_{t_{n+1}}, t_{n+1}), \cdot)$ to second order and observing $d(x, x) = 0$ and $\nabla_y d(x, y)|_{y=x} = 0$, and we obtain (ii) using Eq. (19). Since $\tau'(u_n)^2 = 1/[(\tau^{-1})'(t_{n+1})]^2$, taking the limit for both sides of Eq. (20) as $\Delta u\to 0$, or equivalently $N\to\infty$, we arrive at Eq. (17), which completes the proof.

Remark 2. Although Theorem 3 assumes the Euler ODE solver for technical simplicity, we believe an analogous result can be derived for more general solvers, since all ODE solvers should perform similarly as $N\to\infty$. We leave a more general version of Theorem 3 as future work.

Remark 3. Theorem 3 implies that consistency models can be trained by minimizing $\mathcal{L}_{CD}^\infty(\theta,\theta;\phi)$. In particular, when $d(x, y) = \|x - y\|_2^2$, we have
$$\mathcal{L}_{CD}^\infty(\theta,\theta;\phi) = \mathbb{E}\left[\frac{\lambda(t)}{[(\tau^{-1})'(t)]^2}\left\|\frac{\partial f_\theta(x_t, t)}{\partial t} - t\frac{\partial f_\theta(x_t, t)}{\partial x_t}s_\phi(x_t, t)\right\|_2^2\right].$$
However, this continuous-time objective requires computing Jacobian-vector products as a subroutine to evaluate the loss function, which can be slow and laborious to implement in deep learning frameworks that do not support forward-mode automatic differentiation.

Remark 4. If $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE of $s_\phi(x, t)$, then
$$\frac{\partial f_\theta(x, t)}{\partial t} - t\frac{\partial f_\theta(x, t)}{\partial x}s_\phi(x, t) \equiv 0,$$
and therefore $\mathcal{L}_{CD}^\infty(\theta,\theta;\phi) = 0$. This can be proved by noting that $f_\theta(x_t, t) \equiv x_\epsilon$ for all $t\in[\epsilon, T]$, and then taking the time-derivative of this identity:
$$f_\theta(x_t, t) \equiv x_\epsilon \iff \frac{\partial f_\theta(x_t, t)}{\partial x_t}\frac{\mathrm{d}x_t}{\mathrm{d}t} + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \iff \frac{\partial f_\theta(x_t, t)}{\partial x_t}[-t\, s_\phi(x_t, t)] + \frac{\partial f_\theta(x_t, t)}{\partial t} \equiv 0 \iff \frac{\partial f_\theta(x_t, t)}{\partial t} - t\frac{\partial f_\theta(x_t, t)}{\partial x_t}s_\phi(x_t, t) \equiv 0.$$
The above observation provides another motivation for $\mathcal{L}_{CD}^\infty(\theta,\theta;\phi)$, as it is minimized if and only if the consistency model matches the ground truth consistency function.

For some metric functions, such as the $\ell_1$ norm, the Hessian $G(x)$ is zero, so Theorem 3 is vacuous.
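The Jacobian-vector product mentioned in Remark 3 is the directional derivative of $f_\theta(x_t, t)$ along the PF ODE direction $(-t\,s_\phi(x_t, t), 1)$, which frameworks without forward-mode AD have to emulate. The snippet below is a minimal sketch of that computation with PyTorch's functional `jvp`, assuming a hypothetical consistency network `f(x, t)` differentiable in both tensor arguments and a `score_fn` standing in for $s_\phi$; it is an illustration under those assumptions, not the paper's implementation.

```python
import torch
from torch.autograd.functional import jvp

def pf_ode_time_derivative(f, score_fn, x, t):
    """Total time derivative of f(x_t, t) along the PF ODE, i.e.
        df/dt + (df/dx) dx/dt  with  dx/dt = -t * score_fn(x, t),
    which equals  df/dt - t (df/dx) s_phi(x, t)  from Remarks 3 and 4.
    Computed with a single Jacobian-vector product.
    """
    v_x = -t * score_fn(x, t)                 # tangent in x
    v_t = torch.ones_like(t)                  # tangent in t
    _, total_dt = jvp(f, (x, t), (v_x, v_t))  # returns (f(x, t), J_f @ tangent)
    return total_dt
```

Squaring and averaging this quantity (with the weighting of Remark 3) would give one Monte Carlo estimate of the $\ell_2$ continuous-time distillation loss.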
Below we show that a non-vacuous statement holds for the $\ell_1$ norm with just a small modification of the proof for Theorem 3.

Theorem 4. Let $t_n = \tau(\frac{n-1}{N-1})$, where $n\in[\![1, N]\!]$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{x, t\in[\epsilon,T]}\|s_\phi(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and set $d(x, y) = \|x - y\|_1$ in consistency distillation. Then we have
$$\lim_{N\to\infty}(N-1)\,\mathcal{L}_{CD}^N(\theta,\theta;\phi) = \mathcal{L}_{CD,\ell_1}^\infty(\theta,\theta;\phi), \tag{22}$$
where
$$\mathcal{L}_{CD,\ell_1}^\infty(\theta,\theta;\phi) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}\left\|t\frac{\partial f_\theta(x_t, t)}{\partial x_t}s_\phi(x_t, t) - \frac{\partial f_\theta(x_t, t)}{\partial t}\right\|_1\right],$$
and the expectation above is taken over $x\sim p_\text{data}$, $u\sim\mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t\sim\mathcal{N}(x, t^2 I)$.

Proof. Let $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. We have
$$\begin{aligned}
(N-1)\,\mathcal{L}_{CD}^N(\theta,\theta;\phi) &= \frac{1}{\Delta u}\mathbb{E}[\lambda(t_n)\|f_\theta(x_{t_{n+1}}, t_{n+1}) - f_\theta(\hat{x}^\phi_{t_n}, t_n)\|_1]\\
&\overset{(i)}{=} \frac{1}{\Delta u}\mathbb{E}\left[\lambda(t_n)\left\|t_{n+1}\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\tau'(u_n)\Delta u - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}}\tau'(u_n)\Delta u + O((\Delta u)^2)\right\|_1\right]\\
&= \mathbb{E}\left[\frac{\lambda(t_n)}{(\tau^{-1})'(t_n)}\left\|t_{n+1}\frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1}) - \frac{\partial f_\theta(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} + O(\Delta u)\right\|_1\right], \tag{23}
\end{aligned}$$
where (i) is obtained by plugging Eq. (19) into the previous equation. Taking the limit for both sides of Eq. (23) as $\Delta u\to 0$, or equivalently $N\to\infty$, leads to Eq. (22), which completes the proof.

Remark 5. According to Theorem 4, consistency models can be trained by minimizing $\mathcal{L}_{CD,\ell_1}^\infty(\theta,\theta;\phi)$. Moreover, the same reasoning in Remark 4 can be applied to show that $\mathcal{L}_{CD,\ell_1}^\infty(\theta,\theta;\phi) = 0$ if and only if $f_\theta(x_t, t) = x_\epsilon$ for all $x_t\in\mathbb{R}^d$ and $t\in[\epsilon, T]$.

In the second case where $\theta^- = \mathrm{stopgrad}(\theta)$, we can derive a so-called "pseudo-objective" whose gradient matches the gradient of $\mathcal{L}_{CD}^N(\theta,\theta^-;\phi)$ in the limit of $N\to\infty$. Minimizing this pseudo-objective with gradient descent gives another way to train consistency models via distillation. This pseudo-objective is provided by the theorem below.

Theorem 5. Let $t_n = \tau(\frac{n-1}{N-1})$, where $n\in[\![1, N]\!]$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\sup_{x, t\in[\epsilon,T]}\|s_\phi(x, t)\|_2 < \infty$, and $\sup_{x, t\in[\epsilon,T]}\|\nabla_\theta f_\theta(x, t)\|_2 < \infty$. Suppose we use the Euler ODE solver, and $\theta^- = \mathrm{stopgrad}(\theta)$ in consistency distillation. Then,
$$\lim_{N\to\infty}(N-1)\nabla_\theta\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = \nabla_\theta\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi), \tag{24}$$
where
$$\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}f_\theta(x_t, t)^T H(f_{\theta^-}(x_t, t))\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}s_\phi(x_t, t)\right)\right]. \tag{25}$$
Here the expectation above is taken over $x\sim p_\text{data}$, $u\sim\mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t\sim\mathcal{N}(x, t^2 I)$.
Proof. We denote $\Delta u = \frac{1}{N-1}$ and $u_n = \frac{n-1}{N-1}$. First, we leverage Taylor series expansion to obtain
$$\begin{aligned}
(N-1)\,\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) &= \frac{1}{\Delta u}\mathbb{E}[\lambda(t_n)\, d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))]\\
&\overset{(i)}{=} \frac{1}{2\Delta u}\mathbb{E}\{\lambda(t_n)[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]^T H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)], \tag{26}
\end{aligned}$$
where (i) is derived by expanding $d(\cdot, f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))$ to second order and leveraging $d(x, x) = 0$ and $\nabla_y d(y, x)|_{y=x} = 0$. Next, we compute the gradient of Eq. (26) with respect to $\theta$ and simplify the result to obtain
$$\begin{aligned}
(N-1)\nabla_\theta\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) &\overset{(i)}{=} \frac{1}{\Delta u}\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^T H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]\\
&\overset{(ii)}{=} \mathbb{E}\left\{\lambda(t_n)\tau'(u_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^T H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\left(\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\right)\right\} + \mathbb{E}[O(|\Delta u|)]\\
&= \nabla_\theta\mathbb{E}\left\{\frac{\lambda(t_n)}{(\tau^{-1})'(t_{n+1})}f_\theta(x_{t_{n+1}}, t_{n+1})^T H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))\left(\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial t_{n+1}} - t_{n+1}\frac{\partial f_{\theta^-}(x_{t_{n+1}}, t_{n+1})}{\partial x_{t_{n+1}}}s_\phi(x_{t_{n+1}}, t_{n+1})\right)\right\} + \mathbb{E}[O(|\Delta u|)]. \tag{27}
\end{aligned}$$
Here (i) results from the chain rule, and (ii) follows from Eq. (19) and $f_\theta(x, t) \equiv f_{\theta^-}(x, t)$, since $\theta^- = \mathrm{stopgrad}(\theta)$. Taking the limit for both sides of Eq. (27) as $\Delta u\to 0$ (or $N\to\infty$) yields Eq. (24), which completes the proof.

Remark 6. When $d(x, y) = \|x - y\|_2^2$, the pseudo-objective $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi)$ can be simplified to
$$\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi) = 2\,\mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}f_\theta(x_t, t)^T\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} - t\frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}s_\phi(x_t, t)\right)\right]. \tag{28}$$

Remark 7. The objective $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi)$ defined in Theorem 5 is only meaningful in terms of its gradient; one cannot measure the progress of training by tracking the value of $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi)$, but can still apply gradient descent to this objective to distill consistency models from pre-trained diffusion models. Because this objective is not a typical loss function, we refer to it as the "pseudo-objective" for consistency distillation.

Remark 8. Following the same reasoning in Remark 4, we can easily derive that $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi) = 0$ and $\nabla_\theta\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the empirical PF ODE that involves $s_\phi(x, t)$. However, the converse does not hold true in general. This distinguishes $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi)$ from $\mathcal{L}_{CD}^\infty(\theta,\theta;\phi)$, the latter of which is a true loss function.

B.2. Consistency Training in Continuous Time

A remarkable observation is that the pseudo-objective in Theorem 5 can be estimated without any pre-trained diffusion models, which enables direct consistency training of consistency models. More precisely, we have the following result.

Theorem 6. Let $t_n = \tau(\frac{n-1}{N-1})$, where $n\in[\![1, N]\!]$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0) = \epsilon$ and $\tau(1) = T$. Assume $\tau$ is continuously differentiable in $[0, 1]$, $d$ is three times continuously differentiable with bounded third derivatives, and $f_\theta$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\mathbb{E}[\|\nabla\log p_{t_n}(x_{t_n})\|_2^2] < \infty$, $\sup_{x, t\in[\epsilon,T]}\|\nabla_\theta f_\theta(x, t)\|_2 < \infty$, and $\phi$ represents diffusion model parameters that satisfy $s_\phi(x, t) \equiv \nabla\log p_t(x)$. Then if $\theta^- = \mathrm{stopgrad}(\theta)$, we have
$$\lim_{N\to\infty}(N-1)\nabla_\theta\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = \lim_{N\to\infty}(N-1)\nabla_\theta\mathcal{L}_{CT}^N(\theta,\theta^-) = \nabla_\theta\mathcal{L}_{CT}^\infty(\theta,\theta^-), \tag{29}$$
where $\mathcal{L}_{CD}^N$ uses the Euler ODE solver, and
$$\mathcal{L}_{CT}^\infty(\theta,\theta^-) := \mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}f_\theta(x_t, t)^T H(f_{\theta^-}(x_t, t))\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\cdot\frac{x_t - x}{t}\right)\right]. \tag{30}$$
Here the expectation above is taken over $x\sim p_\text{data}$, $u\sim\mathcal{U}[0, 1]$, $t = \tau(u)$, and $x_t\sim\mathcal{N}(x, t^2 I)$.
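Referring back to Remark 6, the $\ell_2$ form of the distillation pseudo-objective pairs a differentiable evaluation of $f_\theta$ with a detached PF ODE time derivative of the target network. The snippet below is a minimal sketch of one Monte Carlo estimate under that $\ell_2$ simplification, assuming a hypothetical consistency network `f(x_t, t)` and a `score_fn` standing in for $s_\phi$, with the weighting $\lambda(t)/(\tau^{-1})'(t)$ folded into `lam`; Theorem 6 then shows that replacing the score term with $(x_t - x)/t$ removes the dependence on $\phi$ altogether.

```python
import torch
from torch.autograd.functional import jvp

def cd_pseudo_objective_l2(f, score_fn, x_t, t, lam=1.0):
    """Continuous-time CD pseudo-objective for the squared-l2 metric (Remark 6),
    with theta^- = stopgrad(theta): only f(x_t, t) carries gradients, while the
    PF ODE time derivative of the target branch is treated as a constant.
    """
    v_x = -t * score_fn(x_t, t)                 # tangent in x along the PF ODE
    v_t = torch.ones_like(t)                    # tangent in t
    _, df_dt = jvp(f, (x_t, t), (v_x, v_t))     # df/dt - t (df/dx) s_phi
    df_dt = df_dt.detach()                      # stopgrad on the target branch
    pred = f(x_t, t)                            # differentiable branch f_theta
    return 2.0 * lam * (pred * df_dt).flatten(1).sum(-1).mean()
```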
Proof. The proof mostly follows that of Theorem 5. First, we leverage Taylor series expansion to obtain
$$\begin{aligned}
(N-1)\,\mathcal{L}_{CT}^N(\theta,\theta^-) &= \frac{1}{\Delta u}\mathbb{E}[\lambda(t_n)\, d(f_\theta(x + t_{n+1}z, t_{n+1}), f_{\theta^-}(x + t_n z, t_n))]\\
&\overset{(i)}{=} \frac{1}{2\Delta u}\mathbb{E}\{\lambda(t_n)[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]^T H(f_{\theta^-}(x + t_n z, t_n))[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)], \tag{31}
\end{aligned}$$
where $z\sim\mathcal{N}(0, I)$, and (i) is derived by first expanding $d(\cdot, f_{\theta^-}(x + t_n z, t_n))$ to second order, and then noting that $d(x, x) = 0$ and $\nabla_y d(y, x)|_{y=x} = 0$. Next, we compute the gradient of Eq. (31) with respect to $\theta$ and simplify the result to obtain
$$\begin{aligned}
(N-1)\nabla_\theta\mathcal{L}_{CT}^N(\theta,\theta^-) &\overset{(i)}{=} \frac{1}{\Delta u}\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^T H(f_{\theta^-}(x + t_n z, t_n))[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)] \tag{32}\\
&\overset{(ii)}{=} \mathbb{E}\{\lambda(t_n)\tau'(u_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^T H(f_{\theta^-}(x + t_n z, t_n))[\partial_1 f_{\theta^-}(x + t_n z, t_n)z + \partial_2 f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|)]\\
&= \nabla_\theta\mathbb{E}\left\{\frac{\lambda(t_n)}{(\tau^{-1})'(t_n)}f_\theta(x_{t_{n+1}}, t_{n+1})^T H(f_{\theta^-}(x_{t_n}, t_n))\left[\partial_1 f_{\theta^-}(x_{t_n}, t_n)\frac{x_{t_n} - x}{t_n} + \partial_2 f_{\theta^-}(x_{t_n}, t_n)\right]\right\} + \mathbb{E}[O(|\Delta u|)]. \tag{33}
\end{aligned}$$
Here (i) results from the chain rule, and (ii) follows from Taylor expansion. Taking the limit for both sides of Eq. (33) as $\Delta u\to 0$ or $N\to\infty$ yields the second equality in Eq. (29).

Now we prove the first equality. Applying Taylor expansion again, we obtain
$$\begin{aligned}
(N-1)\nabla_\theta\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) &= \frac{1}{\Delta u}\mathbb{E}[\lambda(t_n)\nabla_\theta d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))]\\
&= \frac{1}{\Delta u}\mathbb{E}[\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^T\,\partial_1 d(f_\theta(x_{t_{n+1}}, t_{n+1}), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))]\\
&= \frac{1}{\Delta u}\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^T[\partial_1 d(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n), f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)) + H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))(f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)) + O(|\Delta u|^2)]\}\\
&= \frac{1}{\Delta u}\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x_{t_{n+1}}, t_{n+1})]^T H(f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n))[f_\theta(x_{t_{n+1}}, t_{n+1}) - f_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)]\\
&\overset{(i)}{=} \frac{1}{\Delta u}\mathbb{E}\{\lambda(t_n)[\nabla_\theta f_\theta(x + t_{n+1}z, t_{n+1})]^T H(f_{\theta^-}(x + t_n z, t_n))[f_\theta(x + t_{n+1}z, t_{n+1}) - f_{\theta^-}(x + t_n z, t_n)]\} + \mathbb{E}[O(|\Delta u|^2)],
\end{aligned}$$
where (i) holds because $x_{t_{n+1}} = x + t_{n+1}z$ and $\hat{x}^\phi_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1})t_{n+1}\frac{x_{t_{n+1}} - x}{t_{n+1}^2} = x_{t_{n+1}} + (t_n - t_{n+1})z = x + t_n z$. Because the last expression matches Eq. (32), we can use the same reasoning procedure from Eq. (32) to Eq. (33) to conclude $\lim_{N\to\infty}(N-1)\nabla_\theta\mathcal{L}_{CD}^N(\theta,\theta^-;\phi) = \lim_{N\to\infty}(N-1)\nabla_\theta\mathcal{L}_{CT}^N(\theta,\theta^-)$, completing the proof.

Remark 9. Note that $\mathcal{L}_{CT}^\infty(\theta,\theta^-)$ does not depend on the diffusion model parameter $\phi$ and hence can be optimized without any pre-trained diffusion models.

Figure 7: Comparing discrete consistency distillation/training algorithms with continuous counterparts. (a) Consistency Distillation; (b) Consistency Training.

Remark 10. When $d(x, y) = \|x - y\|_2^2$, the continuous-time consistency training objective becomes
$$\mathcal{L}_{CT}^\infty(\theta,\theta^-) = 2\,\mathbb{E}\left[\frac{\lambda(t)}{(\tau^{-1})'(t)}f_\theta(x_t, t)^T\left(\frac{\partial f_{\theta^-}(x_t, t)}{\partial t} + \frac{\partial f_{\theta^-}(x_t, t)}{\partial x_t}\cdot\frac{x_t - x}{t}\right)\right].$$

Remark 11. Similar to $\mathcal{L}_{CD}^\infty(\theta,\theta^-;\phi)$ in Theorem 5, $\mathcal{L}_{CT}^\infty(\theta,\theta^-)$ is a pseudo-objective; one cannot track training by monitoring the value of $\mathcal{L}_{CT}^\infty(\theta,\theta^-)$, but can still apply gradient descent on this loss function to train a consistency model $f_\theta(x, t)$ directly from data. Moreover, the same observation in Remark 8 holds true: $\mathcal{L}_{CT}^\infty(\theta,\theta^-) = 0$ and $\nabla_\theta\mathcal{L}_{CT}^\infty(\theta,\theta^-) = 0$ if $f_\theta(x, t)$ matches the ground truth consistency function for the PF ODE.
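For the $\ell_2$ form in Remark 10, the only change relative to the distillation pseudo-objective is that the tangent direction uses the noise $(x_t - x)/t = z$ instead of the score model. The snippet below is a minimal sketch of one estimate under that assumption, with a hypothetical consistency network `f(x_t, t)` and the weighting folded into `lam`; it is illustrative rather than the paper's implementation.

```python
import torch
from torch.autograd.functional import jvp

def ct_pseudo_objective_l2(f, x, t, lam=1.0):
    """Continuous-time consistency training pseudo-objective for the squared-l2
    metric (Remark 10). No pre-trained score model is needed: the tangent in x
    is simply z, the Gaussian noise that produced x_t = x + t * z.
    """
    z = torch.randn_like(x)
    x_t = x + t * z                                   # x_t ~ N(x, t^2 I)
    _, df_dt = jvp(f, (x_t, t), (z, torch.ones_like(t)))
    df_dt = df_dt.detach()                            # theta^- = stopgrad(theta)
    pred = f(x_t, t)
    return 2.0 * lam * (pred * df_dt).flatten(1).sum(-1).mean()
```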
B.3. Experimental Verifications

To experimentally verify the efficacy of our continuous-time CD and CT objectives, we train consistency models with a variety of loss functions on CIFAR-10. All results are provided in Fig. 7. We set $\lambda(t) = (\tau^{-1})'(t)$ for all continuous-time experiments. Other hyperparameters are the same as in Table 3. We occasionally modify some hyperparameters for improved performance. For distillation, we compare the following objectives:

- CD ($\ell_2$): Consistency distillation $\mathcal{L}_{CD}^N$ with $N = 18$ and the $\ell_2$ metric.
- CD ($\ell_1$): Consistency distillation $\mathcal{L}_{CD}^N$ with $N = 18$ and the $\ell_1$ metric. We set the learning rate to 2e-4.
- CD (LPIPS): Consistency distillation $\mathcal{L}_{CD}^N$ with $N = 18$ and the LPIPS metric.
- CD$^\infty$ ($\ell_2$): Consistency distillation $\mathcal{L}_{CD}^\infty$ in Theorem 3 with the $\ell_2$ metric. We set the learning rate to 1e-3 and dropout to 0.13.
- CD$^\infty$ ($\ell_1$): Consistency distillation $\mathcal{L}_{CD}^\infty$ in Theorem 4 with the $\ell_1$ metric. We set the learning rate to 1e-3 and dropout to 0.3.
- CD$^\infty$ (stopgrad, $\ell_2$): Consistency distillation $\mathcal{L}_{CD}^\infty$ in Theorem 5 with the $\ell_2$ metric. We set the learning rate to 5e-6.
- CD$^\infty$ (stopgrad, LPIPS): Consistency distillation $\mathcal{L}_{CD}^\infty$ in Theorem 5 with the LPIPS metric. We set the learning rate to 5e-6.

We did not investigate using the LPIPS metric in Theorem 3 because minimizing the resulting objective would require back-propagating through second order derivatives of the VGG network used in LPIPS, which is computationally expensive and prone to numerical instability. As revealed by Fig. 7a, the stopgrad version of continuous-time distillation (Theorem 5) works better than the non-stopgrad version (Theorem 3) for both the LPIPS and $\ell_2$ metrics, and the LPIPS metric works the best for all distillation approaches. Additionally, discrete-time consistency distillation outperforms continuous-time consistency distillation, possibly due to the larger variance in continuous-time objectives, and the fact that one can use effective higher-order ODE solvers in discrete-time objectives.

Table 3: Hyperparameters used for training CD and CT models.

| Hyperparameter | CIFAR-10 (CD) | CIFAR-10 (CT) | ImageNet 64 × 64 (CD) | ImageNet 64 × 64 (CT) | LSUN 256 × 256 (CD) | LSUN 256 × 256 (CT) |
|---|---|---|---|---|---|---|
| Learning rate | 4e-4 | 4e-4 | 8e-6 | 8e-6 | 1e-5 | 1e-5 |
| Batch size | 512 | 512 | 2048 | 2048 | 2048 | 2048 |
| µ | 0 | - | 0.95 | - | 0.95 | - |
| µ0 | - | 0.9 | - | 0.95 | - | 0.95 |
| s0 | - | 2 | - | 2 | - | 2 |
| s1 | - | 150 | - | 200 | - | 150 |
| N | 18 | - | 40 | - | 40 | - |
| ODE solver | Heun | - | Heun | - | Heun | - |
| EMA decay rate | 0.9999 | 0.9999 | 0.999943 | 0.999943 | 0.999943 | 0.999943 |
| Training iterations | 800k | 800k | 600k | 800k | 600k | 1000k |
| Mixed-precision (FP16) | No | No | Yes | Yes | Yes | Yes |
| Dropout probability | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Number of GPUs | 8 | 8 | 64 | 64 | 64 | 64 |

For consistency training (CT), we find it important to initialize consistency models from a pre-trained EDM model in order to stabilize training when using continuous-time objectives. We hypothesize that this is caused by the large variance in our continuous-time loss functions. For fair comparison, we thus initialize all consistency models from the same pre-trained EDM model on CIFAR-10 for both discrete-time and continuous-time CT, even though the former works well with random initialization. We leave variance reduction techniques for continuous-time CT to future research. We empirically compare the following objectives:

- CT (LPIPS): Consistency training $\mathcal{L}_{CT}^N$ with $N = 120$ and the LPIPS metric. We set the learning rate to 4e-4, and the EMA decay rate for the target network to 0.99. We do not use the schedule functions for $N$ and $\mu$ here because they cause slower learning when the consistency model is initialized from a pre-trained EDM model.
- CT$^\infty$ ($\ell_2$): Consistency training $\mathcal{L}_{CT}^\infty$ with the $\ell_2$ metric. We set the learning rate to 5e-6.
- CT$^\infty$ (LPIPS): Consistency training $\mathcal{L}_{CT}^\infty$ with the LPIPS metric. We set the learning rate to 5e-6.

As shown in Fig. 7b, the LPIPS metric leads to improved performance for continuous-time CT. We also find that continuous-time CT outperforms discrete-time CT with the same LPIPS metric. This is likely due to the bias in discrete-time CT, as $\Delta t > 0$ in Theorem 2 for discrete-time objectives, whereas continuous-time CT has no bias since it implicitly drives $\Delta t$ to 0.

C. Additional Experimental Details

Model Architectures. We follow Song et al. (2021); Dhariwal & Nichol (2021) for model architectures. Specifically, we use the NCSN++ architecture in Song et al. (2021) for all CIFAR-10 experiments, and take the corresponding network architectures from Dhariwal & Nichol (2021) when performing experiments on ImageNet 64 × 64, LSUN Bedroom 256 × 256 and LSUN Cat 256 × 256.

Parameterization for Consistency Models. We use the same architectures for consistency models as those used for EDMs. The only difference is that we slightly modify the skip connections in EDM to ensure the boundary condition holds for consistency models. Recall that in Section 3 we propose to parameterize a consistency model in the following form:
$$f_\theta(x, t) = c_\text{skip}(t)\,x + c_\text{out}(t)\,F_\theta(x, t).$$
In EDM (Karras et al., 2022), the authors choose
$$c_\text{skip}(t) = \frac{\sigma_\text{data}^2}{t^2 + \sigma_\text{data}^2}, \qquad c_\text{out}(t) = \frac{\sigma_\text{data}\,t}{\sqrt{\sigma_\text{data}^2 + t^2}},$$
where $\sigma_\text{data} = 0.5$. However, this choice of $c_\text{skip}$ and $c_\text{out}$ does not satisfy the boundary condition when the smallest time instant $\epsilon \neq 0$. To remedy this issue, we modify them to
$$c_\text{skip}(t) = \frac{\sigma_\text{data}^2}{(t - \epsilon)^2 + \sigma_\text{data}^2}, \qquad c_\text{out}(t) = \frac{\sigma_\text{data}\,(t - \epsilon)}{\sqrt{\sigma_\text{data}^2 + t^2}},$$
which clearly satisfies $c_\text{skip}(\epsilon) = 1$ and $c_\text{out}(\epsilon) = 0$.

Schedule Functions for Consistency Training. As discussed in Section 5, consistency training requires specifying schedule functions $N(\cdot)$ and $\mu(\cdot)$ for best performance. Throughout our experiments, we use schedule functions that take the form below:
$$N(k) = \left\lceil\sqrt{\frac{k}{K}\left((s_1 + 1)^2 - s_0^2\right) + s_0^2} - 1\right\rceil + 1, \qquad \mu(k) = \exp\left(\frac{s_0\log\mu_0}{N(k)}\right),$$
where $K$ denotes the total number of training iterations, $s_0$ denotes the initial discretization steps, $s_1 > s_0$ denotes the target discretization steps at the end of training, and $\mu_0 > 0$ denotes the EMA decay rate at the beginning of model training.
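A minimal sketch of the modified skip connections and the schedule functions is given below. It uses $\sigma_\text{data} = 0.5$ from the text and the CIFAR-10 values of $s_0$, $s_1$, $\mu_0$ from Table 3; the value of $\epsilon$ is an assumption for illustration (it is set in the main text, not reproduced here).

```python
import math

SIGMA_DATA = 0.5    # sigma_data from the EDM parameterization
EPS = 0.002         # assumed smallest time instant epsilon

def c_skip(t):
    """Modified skip coefficient; equals 1 at t = EPS, enforcing f_theta(x, eps) = x."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)

def c_out(t):
    """Modified output coefficient; vanishes at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)

def n_schedule(k, K, s0=2, s1=150):
    """Discretization schedule N(k), growing from s0 toward s1 over training."""
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2) - 1) + 1

def mu_schedule(k, K, s0=2, s1=150, mu0=0.9):
    """EMA decay schedule mu(k) for the target network."""
    return math.exp(s0 * math.log(mu0) / n_schedule(k, K, s0, s1))

if __name__ == "__main__":
    print(c_skip(EPS), c_out(EPS))                       # 1.0 and 0.0 at the boundary
    print([n_schedule(k, 800_000) for k in (0, 400_000, 800_000)])
    print([round(mu_schedule(k, 800_000), 4) for k in (0, 400_000, 800_000)])
```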
Training Details. In both consistency distillation and progressive distillation, we distill EDMs (Karras et al., 2022). We trained these EDMs ourselves according to the specifications given in Karras et al. (2022). The original EDM paper did not provide hyperparameters for the LSUN Bedroom 256 × 256 and Cat 256 × 256 datasets, so we mostly used the same hyperparameters as those for the ImageNet 64 × 64 dataset. The difference is that we trained for 600k and 300k iterations for the LSUN Bedroom and Cat datasets respectively, and reduced the batch size from 4096 to 2048. We used the same EMA decay rate for LSUN 256 × 256 datasets as for the ImageNet 64 × 64 dataset. For progressive distillation, we used the same training settings as those described in Salimans & Ho (2022) for CIFAR-10 and ImageNet 64 × 64. Although the original paper did not test on LSUN 256 × 256 datasets, we used the same settings as for ImageNet 64 × 64 and found them to work well. In all distillation experiments, we initialized the consistency model with pre-trained EDM weights. For consistency training, we initialized the model randomly, just as we did for training the EDMs.

We trained all consistency models with the Rectified Adam optimizer (Liu et al., 2019), with no learning rate decay or warm-up, and no weight decay. We also applied EMA to the weights of the online consistency models in both consistency distillation and consistency training, as well as to the weights of the training online consistency models, following Karras et al. (2022). For LSUN 256 × 256 datasets, we chose the EMA decay rate to be the same as that for ImageNet 64 × 64, except for consistency distillation on LSUN Bedroom 256 × 256, where we found that using zero EMA worked better. When using the LPIPS metric on CIFAR-10 and ImageNet 64 × 64, we rescale images to resolution 224 × 224 with bilinear upsampling before feeding them to the LPIPS network. For LSUN 256 × 256, we evaluated LPIPS without rescaling inputs. In addition, we performed horizontal flips for data augmentation for all models and on all datasets. We trained all models on a cluster of Nvidia A100 GPUs. Additional hyperparameters for consistency training and distillation are listed in Table 3.

D. Additional Results on Zero-Shot Image Editing

With consistency models, we can perform a variety of zero-shot image editing tasks. As an example, we present additional results on colorization (Fig. 8), super-resolution (Fig. 9), inpainting (Fig. 10), interpolation (Fig. 11), denoising (Fig. 12), and stroke-guided image generation (SDEdit, Meng et al. (2021), Fig. 13). The consistency model used here is trained via consistency distillation on the LSUN Bedroom 256 × 256 dataset.

All these image editing tasks, except for image interpolation and denoising, can be performed via a small modification to the multistep sampling algorithm in Algorithm 1. The resulting pseudocode is provided in Algorithm 4. Here $y$ is a reference image that guides sample generation, $\Omega$ is a binary mask, $\odot$ computes element-wise products, and $A$ is an invertible linear transformation that maps images into a latent space where the conditional information in $y$ is infused into the iterative generation procedure by masking with $\Omega$.

Algorithm 4 Zero-Shot Image Editing
1: Input: Consistency model $f_\theta(\cdot,\cdot)$, sequence of time points $t_1 > t_2 > \cdots > t_N$, reference image $y$, invertible linear transformation $A$, and binary image mask $\Omega$
2: $y \leftarrow A^{-1}[(Ay)\odot(1-\Omega) + 0\odot\Omega]$
3: Sample $x\sim\mathcal{N}(y, t_1^2 I)$
4: $x \leftarrow f_\theta(x, t_1)$
5: $x \leftarrow A^{-1}[(Ay)\odot(1-\Omega) + (Ax)\odot\Omega]$
6: for $n = 2$ to $N$ do
7:   Sample $x\sim\mathcal{N}(x, (t_n^2 - \epsilon^2)I)$
8:   $x \leftarrow f_\theta(x, t_n)$
9:   $x \leftarrow A^{-1}[(Ay)\odot(1-\Omega) + (Ax)\odot\Omega]$
10: end for
11: Output: $x$

Unless otherwise stated, we choose
$$t_i = \left(T^{1/\rho} + \frac{i-1}{N-1}\left(\epsilon^{1/\rho} - T^{1/\rho}\right)\right)^\rho$$
in our experiments, where $N = 40$ for LSUN Bedroom 256 × 256. Below we describe how to perform each task using Algorithm 4.

Inpainting. When using Algorithm 4 for inpainting, we let $y$ be an image where missing pixels are masked out, $\Omega$ be a binary mask where 1 indicates the missing pixels, and $A$ be the identity transformation.
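The snippet below is a minimal sketch of Algorithm 4 for the inpainting case just described, where the transform $A$ defaults to the identity. The arguments `f`, `y`, `mask`, `times`, and `eps` are illustrative stand-ins for the consistency model $f_\theta$, the reference image $y$, the mask $\Omega$, the time points $t_1 > \cdots > t_N$, and $\epsilon$; it is a sketch under these assumptions rather than the official implementation.

```python
import torch

def zero_shot_edit(f, y, mask, times, eps=0.002, A=None, A_inv=None):
    """Sketch of Algorithm 4 (zero-shot image editing).

    f:     trained consistency model f(x, t)
    y:     reference image guiding generation
    mask:  1 where pixels should be generated, 0 where copied from y
    times: decreasing sequence of time points t_1 > ... > t_N
    A, A_inv: optional invertible linear transform and its inverse (identity if None)
    """
    A = A or (lambda v: v)
    A_inv = A_inv or (lambda v: v)
    y = A_inv(A(y) * (1 - mask))                               # zero out the unknown region
    x = y + times[0] * torch.randn_like(y)                     # x ~ N(y, t_1^2 I)
    x = f(x, times[0])
    x = A_inv(A(y) * (1 - mask) + A(x) * mask)                 # re-impose known pixels
    for t in times[1:]:
        x = x + (t ** 2 - eps ** 2) ** 0.5 * torch.randn_like(x)   # perturb to level t
        x = f(x, t)
        x = A_inv(A(y) * (1 - mask) + A(x) * mask)
    return x
```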
Colorization. The algorithm for image colorization is similar, as colorization becomes a special case of inpainting once we transform data into a decoupled space. Specifically, let $y\in\mathbb{R}^{h\times w\times 3}$ be a gray-scale image that we aim to colorize, where all channels of $y$ are assumed to be the same, i.e., $y[:, :, 0] = y[:, :, 1] = y[:, :, 2]$ in NumPy notation. In our experiments, each channel of this gray-scale image is obtained from a colorful image by averaging the RGB channels with $0.2989R + 0.5870G + 0.1140B$. We define $\Omega\in\{0, 1\}^{h\times w\times 3}$ to be a binary mask such that
$$\Omega[i, j, k] = \begin{cases} 1, & k = 1 \text{ or } 2\\ 0, & k = 0. \end{cases}$$
Let $Q\in\mathbb{R}^{3\times 3}$ be an orthogonal matrix whose first column is proportional to the vector $(0.2989, 0.5870, 0.1140)$. This orthogonal matrix can be obtained easily via QR decomposition, and we use the following in our experiments:
$$Q = \begin{pmatrix} 0.4471 & -0.8204 & 0.3563\\ 0.8780 & 0.4785 & 0\\ 0.1705 & -0.3129 & -0.9343 \end{pmatrix}.$$
We then define the linear transformation $A: x\in\mathbb{R}^{h\times w\times 3}\mapsto y\in\mathbb{R}^{h\times w\times 3}$, where
$$y[i, j, k] = \sum_{l=0}^{2} x[i, j, l]\,Q[l, k].$$
Because $Q$ is orthogonal, the inversion $A^{-1}: y\in\mathbb{R}^{h\times w\times 3}\mapsto x\in\mathbb{R}^{h\times w\times 3}$ is easy to compute, where
$$x[i, j, k] = \sum_{l=0}^{2} y[i, j, l]\,Q[k, l].$$
With $A$ and $\Omega$ defined as above, we can now use Algorithm 4 for image colorization.

Super-resolution. With a similar strategy, we employ Algorithm 4 for image super-resolution. For simplicity, we assume that the down-sampled image is obtained by averaging non-overlapping patches of size $p\times p$. Suppose the shape of full-resolution images is $h\times w\times 3$. Let $y\in\mathbb{R}^{h\times w\times 3}$ denote a low-resolution image naively up-sampled to full resolution, where pixels in each non-overlapping patch share the same value. Additionally, let $\Omega\in\{0, 1\}^{h/p\times w/p\times p^2\times 3}$ be a binary mask such that
$$\Omega[i, j, k, l] = \begin{cases} 1, & k \ge 1\\ 0, & k = 0. \end{cases}$$
Similar to image colorization, super-resolution requires an orthogonal matrix $Q\in\mathbb{R}^{p^2\times p^2}$ whose first column is $(1/p, 1/p, \cdots, 1/p)$. This orthogonal matrix can be obtained with QR decomposition. To perform super-resolution, we define the linear transformation $A: x\in\mathbb{R}^{h\times w\times 3}\mapsto y\in\mathbb{R}^{h/p\times w/p\times p^2\times 3}$, where
$$y[i, j, k, l] = \sum_{m=0}^{p^2-1} x\!\left[i\times p + \frac{m - (m\bmod p)}{p},\; j\times p + m\bmod p,\; l\right] Q[m, k].$$
The inverse transformation $A^{-1}: y\in\mathbb{R}^{h/p\times w/p\times p^2\times 3}\mapsto x\in\mathbb{R}^{h\times w\times 3}$ is easy to derive, with
$$x\!\left[i\times p + \frac{m - (m\bmod p)}{p},\; j\times p + m\bmod p,\; l\right] = \sum_{k=0}^{p^2-1} y[i, j, k, l]\,Q[m, k].$$
The definitions of $A$ and $\Omega$ above allow us to use Algorithm 4 for image super-resolution.

Stroke-guided image generation. We can also use Algorithm 4 for stroke-guided image generation as introduced in SDEdit (Meng et al., 2021). Specifically, we let $y\in\mathbb{R}^{h\times w\times 3}$ be a stroke painting. We set $A = I$, and define $\Omega\in\mathbb{R}^{h\times w\times 3}$ as a matrix of ones. In our experiments, we set $t_1 = 5.38$ and $t_2 = 2.24$, with $N = 2$.

Denoising. It is possible to denoise images perturbed with various scales of Gaussian noise using a single consistency model. Suppose the input image $x$ is perturbed with $\mathcal{N}(0, \sigma^2 I)$. As long as $\sigma\in[\epsilon, T]$, we can evaluate $f_\theta(x, \sigma)$ to produce the denoised image.

Interpolation. We can interpolate between two images generated by consistency models. Suppose the first sample $x_1$ is produced by noise vector $z_1$, and the second sample $x_2$ is produced by noise vector $z_2$. In other words, $x_1 = f_\theta(z_1, T)$ and $x_2 = f_\theta(z_2, T)$. To interpolate between $x_1$ and $x_2$, we first use spherical linear interpolation to get
$$z = \frac{\sin[(1-\alpha)\psi]}{\sin\psi}\,z_1 + \frac{\sin(\alpha\psi)}{\sin\psi}\,z_2,$$
where $\alpha\in[0, 1]$ and $\psi = \arccos\!\left(\frac{z_1^T z_2}{\|z_1\|_2\|z_2\|_2}\right)$, then evaluate $f_\theta(z, T)$ to produce the interpolated image.
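A minimal sketch of this spherical linear interpolation is given below; the interpolated latent `z_mid` would then be passed through the consistency model as $f_\theta(z, T)$. The function name and usage are illustrative, not from the paper's codebase.

```python
import torch

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two noise vectors z1 and z2.

    alpha in [0, 1] sweeps from z1 (alpha=0) to z2 (alpha=1); the result is fed
    to the consistency model at time T to produce the interpolated image.
    """
    z1_flat, z2_flat = z1.flatten(), z2.flatten()
    cos_psi = torch.dot(z1_flat, z2_flat) / (z1_flat.norm() * z2_flat.norm())
    psi = torch.arccos(cos_psi.clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - alpha) * psi) * z1 + torch.sin(alpha * psi) * z2) / torch.sin(psi)

if __name__ == "__main__":
    z1, z2 = torch.randn(3, 256, 256), torch.randn(3, 256, 256)
    z_mid = slerp(z1, z2, 0.5)
    print(z_mid.shape)
```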
E. Additional Samples from Consistency Models

We provide additional samples from consistency distillation (CD) and consistency training (CT) on CIFAR-10 (Figs. 14 and 18), ImageNet 64 × 64 (Figs. 15 and 19), LSUN Bedroom 256 × 256 (Figs. 16 and 20), and LSUN Cat 256 × 256 (Figs. 17 and 21).

Figure 8: Gray-scale images (left), colorized images by a consistency model (middle), and ground truth (right).

Figure 9: Downsampled images of resolution 32 × 32 (left), full resolution (256 × 256) images generated by a consistency model (middle), and ground truth images of resolution 256 × 256 (right).

Figure 10: Masked images (left), imputed images by a consistency model (middle), and ground truth (right).

Figure 11: Interpolating between leftmost and rightmost images with spherical linear interpolation. All samples are generated by a consistency model trained on LSUN Bedroom 256 × 256.

Figure 12: Single-step denoising with a consistency model. The leftmost images are ground truth. For every two rows, the top row shows noisy images with different noise levels, while the bottom row gives denoised images.

Figure 13: SDEdit with a consistency model. The leftmost images are stroke painting inputs. Images on the right side are the results of stroke-guided image generation (SDEdit).

Figure 14: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise. (a) EDM (FID=2.04); (b) CD with single-step generation (FID=3.55); (c) CD with two-step generation (FID=2.93).

Figure 15: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise. (a) EDM (FID=2.44); (b) CD with single-step generation (FID=6.20); (c) CD with two-step generation (FID=4.70).

Figure 16: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise. (a) EDM (FID=3.57); (b) CD with single-step generation (FID=7.80); (c) CD with two-step generation (FID=5.22).

Figure 17: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise. (a) EDM (FID=6.69); (b) CD with single-step generation (FID=10.99); (c) CD with two-step generation (FID=8.84).

Figure 18: Uncurated samples from CIFAR-10 32 × 32. All corresponding samples use the same initial noise. (a) EDM (FID=2.04); (b) CT with single-step generation (FID=8.73); (c) CT with two-step generation (FID=5.83).

Figure 19: Uncurated samples from ImageNet 64 × 64. All corresponding samples use the same initial noise. (a) EDM (FID=2.44); (b) CT with single-step generation (FID=12.96); (c) CT with two-step generation (FID=11.12).

Figure 20: Uncurated samples from LSUN Bedroom 256 × 256. All corresponding samples use the same initial noise. (a) EDM (FID=3.57); (b) CT with single-step generation (FID=16.00); (c) CT with two-step generation (FID=7.80).

Figure 21: Uncurated samples from LSUN Cat 256 × 256. All corresponding samples use the same initial noise. (a) EDM (FID=6.69); (b) CT with single-step generation (FID=20.70); (c) CT with two-step generation (FID=11.76).