Published as a conference paper at ICLR 2025

ELUCIDATING THE PRECONDITIONING IN CONSISTENCY DISTILLATION

Kaiwen Zheng1, Guande He1, Jianfei Chen1, Fan Bao1,2, Jun Zhu1,2,3
1Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University, Beijing, China
2Shengshu Technology, Beijing
3Pazhou Lab (Huangpu), Guangzhou, China
zkwthu@gmail.com; guande.he17@outlook.com; fan.bao@shengshu.ai; {jianfeic, dcszj}@tsinghua.edu.cn

ABSTRACT

Consistency distillation is a prevalent way to accelerate diffusion models, adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation: it forms the consistency function as a linear combination of the input data and the network output with pre-defined coefficients, thereby imposing the boundary condition of consistency functions without restricting the form or expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and its connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled approach dubbed Analytic-Precond to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE.
We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher's, and achieve 2× to 3× training acceleration of consistency trajectory models in multi-step generation across various datasets.

1 INTRODUCTION

Diffusion models are a class of powerful deep generative models, showcasing cutting-edge performance in diverse domains including image synthesis (Dhariwal & Nichol, 2021; Karras et al., 2022), speech and video generation (Chen et al., 2021; Ho et al., 2022), controllable image manipulation (Nichol et al., 2022; Ramesh et al., 2022; Rombach et al., 2022; Meng et al., 2022b), density estimation (Song et al., 2021b; Kingma et al., 2021; Lu et al., 2022a; Zheng et al., 2023b) and inverse problem solving (Chung et al., 2022; Kawar et al., 2022). Compared to their generative counterparts like variational auto-encoders (VAEs) (Kingma & Welling, 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014), diffusion models excel in high-quality generation while circumventing issues of mode collapse and training instability. Consequently, they serve as the cornerstone of next-generation generative systems like text-to-image (Rombach et al., 2022) and text-to-video (Gupta et al., 2023; Bao et al., 2024) synthesis. The primary bottleneck for integrating diffusion models into downstream tasks lies in their slow inference processes, which gradually remove noise from data with hundreds of network evaluations. The sampling process typically involves simulating the probability flow (PF) ordinary differential equation (ODE) backward in time, starting from noise (Song et al., 2021c).
To accelerate diffusion sampling, various training-free samplers have been proposed as specialized solvers of the PF-ODE (Song et al., 2021a; Zhang & Chen, 2022; Lu et al., 2022b), yet they still require over 10 steps to generate satisfactory samples due to the inherent discretization errors present in all numerical ODE solvers.

Work done during an internship at Shengshu; Equal contribution; The corresponding author.

Figure 1: Consistency distillation with preconditioning coefficients α, β. (Schematic: the student consistency function f_θ(x_t, t, s) = α_{s,t}·D_θ(x_t, t, s) + β_{s,t}·x_t learns to traverse backward on the teacher PF-ODE from data x_t to x_s.)

Recent advancements in few-step or even single-step generation of diffusion models are concentrated on distillation methods (Luhman & Luhman, 2021; Salimans & Ho, 2022; Meng et al., 2022a; Song et al., 2023; Kim et al., 2023; Sauer et al., 2023). Particularly, consistency models (CMs) (Song et al., 2023) have emerged as a prominent method for diffusion distillation and have successfully been applied to various data domains including latent space (Luo et al., 2023), audio (Ye et al., 2023) and video (Wang et al., 2023). CMs train a student network to map arbitrary points on the PF-ODE trajectory to its starting point, thereby enabling one-step generation that directly maps noise to data. A follow-up work named consistency trajectory models (CTMs) (Kim et al., 2023) extends CMs by changing the mapping destination to encompass not only the starting point but also intermediate ones, facilitating unconstrained backward jumps on the PF-ODE trajectory. This design enhances training flexibility and permits the incorporation of auxiliary losses. In both CMs and CTMs, the mapping function (referred to as the consistency function) must adhere to certain constraints. For instance, in CMs, there exists a boundary condition dictating that the starting point maps to itself.
Consequently, the consistency functions are parameterized as a linear combination of the input data and the network output with pre-defined coefficients. This approach ensures that boundary conditions are naturally satisfied without constraining the form or expressiveness of the neural network. We term this parameterization technique preconditioning in consistency distillation (Figure 1), aligning with the terminology in EDM (Karras et al., 2022). The preconditionings in CMs and CTMs are intuitively crafted but may be suboptimal. Besides, despite dedicated efforts, CTMs have struggled to identify any distinct preconditioning that outperforms the original one. In this work, we take the first step towards designing and enhancing preconditioning in consistency distillation to learn better trajectory jumpers. We elucidate the design criteria of preconditioning by linking it to the discretization of the teacher ODE trajectory. We further convert the teacher PF-ODE into a generalized form involving free parameters, which induces a novel family of preconditionings. Through theoretical analyses, we unveil the significance of the consistency gap (referring to the gap between the teacher denoiser and the optimal student denoiser) in achieving good initialization and facilitating learning. By minimizing a derived bound on the consistency gap, we can optimize the preconditioning within our proposed family. We name the optimal preconditioning under our principle Analytic-Precond, as it can be analytically computed from the teacher model without manual design or hyperparameter tuning. Moreover, the computation is efficient, requiring less than 1% of the time cost of the training process. We demonstrate the effectiveness of Analytic-Precond by applying it to CMs and CTMs on standard benchmark datasets, including CIFAR-10, FFHQ 64×64 and ImageNet 64×64.
While the vanilla preconditioning closely approximates Analytic-Precond and yields similar results in CMs, Analytic-Precond exhibits notable distinctions from its original counterpart in CTMs, particularly concerning intermediate jumps on the trajectory. Remarkably, Analytic-Precond achieves 2× to 3× training acceleration in CTMs in multi-step generation.

2 BACKGROUND

2.1 DIFFUSION MODELS

Diffusion models (Song et al., 2021c; Sohl-Dickstein et al., 2015; Ho et al., 2020) transform a d-dimensional data distribution q_0(x_0) into a Gaussian noise distribution through a forward stochastic differential equation (SDE) starting from x_0 ~ q_0:

dx_t = f(t)x_t dt + g(t)dw_t    (1)

where t ∈ [0, T] for some finite horizon T, f, g : [0, T] → R are the scalar-valued drift and diffusion terms, and w_t ∈ R^d is a standard Wiener process. The forward SDE is accompanied by a series of marginal distributions {q_t}_{t=0}^T of {x_t}_{t=0}^T, and f, g are properly designed so that the terminal distribution is approximately a pure Gaussian, i.e., q_T(x_T) ≈ N(0, σ_T² I). An intriguing characteristic of this SDE lies in the presence of the probability flow (PF) ODE (Song et al., 2021c)

dx_t = [f(t)x_t − (1/2)g²(t)∇_{x_t} log q_t(x_t)]dt

whose solution trajectories at time t, when solved backward from time T to time 0, are distributed exactly as q_t. The only unknown term ∇_{x_t} log q_t(x_t) is the score function and can be learned by denoising score matching (DSM) (Vincent, 2011). A prevalent noise schedule f = 0, g = √(2t) is proposed by EDM (Karras et al., 2022) and followed in recent text-to-image generation (Esser et al., 2024), video generation (Blattmann et al., 2023), as well as consistency distillation. In this case, the transition kernel of the forward SDE (Eqn. (1)) owns a simple form q(x_t|x_0) = N(x_t; x_0, t² I), and the terminal distribution q_T ≈ N(0, T² I).
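Concretely, under this schedule the forward perturbation is just x_t = x_0 + tε with ε ~ N(0, I). A minimal NumPy sketch (toy point-mass data, illustrative only) confirming that the terminal marginal at T = 80 is approximately N(0, T²I):

```python
import numpy as np

def edm_perturb(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(x_t; x_0, t^2 I) under the EDM schedule f = 0, g = sqrt(2t)."""
    return x0 + t * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = np.zeros(100_000)               # toy "dataset": a point mass at 0
xT = edm_perturb(x0, t=80.0, rng=rng)
# At the terminal time T = 80, x_T is approximately N(0, T^2 I)
print(abs(xT.mean()) < 1.0 and abs(xT.std() - 80.0) < 1.0)
```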
Besides, the PF-ODE can be represented by the denoiser function D_φ(x_t, t):

dx_t/dt = (x_t − D_φ(x_t, t))/t    (2)

where the denoiser function is trained to predict x_0 given noisy data x_t = x_0 + tε, ε ~ N(0, I) at any time t, i.e., by minimizing E_t E_{p_data(x_0)q(x_t|x_0)}[w(t)‖D_φ(x_t, t) − x_0‖_2²] for some weighting w(t). This denoising loss is equivalent to the DSM loss (Song et al., 2021c). In EDM, another key insight is to employ preconditioning by parameterizing D_φ as D_φ(x, t) = c_skip(t)x + c_out(t)F_φ(x, t)^1, where

c_skip(t) = σ_data² / (σ_data² + t²),  c_out(t) = σ_data·t / √(σ_data² + t²),    (3)

F_φ is a free-form neural network and σ_data² is the variance of the data distribution.

2.2 CONSISTENCY DISTILLATION

Denote φ as the parameters of the teacher diffusion model, and θ as the parameters of the student network. Given a trajectory {x_t}_{t=ε}^T of a teacher PF-ODE with a fixed initial timestep ε^2, consistency models (CMs) (Song et al., 2023) aim to learn a consistency function f_θ : (x_t, t) ↦ x_ε which maps the point x_t at any time t on the trajectory to the initial point x_ε. The consistency function is forced to satisfy the boundary condition f_θ(x, ε) = x. To ensure unrestricted form and expressiveness of the neural network, f_θ is parameterized as

f_θ(x, t) = σ_data²/(σ_data² + (t − ε)²)·x + σ_data(t − ε)/√(σ_data² + t²)·F_θ(x, t)    (4)

which naturally satisfies the boundary condition for any free-form network F_θ(x_t, t). We refer to this technique as preconditioning in consistency distillation, aligning with the terminology in EDM. The student θ can be distilled from the teacher with the training objective:

E_{t∈[ε,T], s∈[ε,t)} E_{q_0(x_0)q(x_t|x_0)}[w(t)·d(f_θ(x_t, t), f_{sg(θ)}(Solver_φ(x_t, t, s), s))],    (5)

where w(·) is a positive weighting function, d(·, ·) is a distance metric, sg is the (exponential moving average) stop-gradient and Solver_φ is any numerical solver for the teacher PF-ODE.
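The EDM coefficients in Eqn. (3) and the CM preconditioning in Eqn. (4) above can be written out directly. A small sketch (using σ_data = 0.5, EDM's value for CIFAR-10, and ε = 0.002) verifying that the boundary condition holds at t = ε:

```python
import math

SIGMA_DATA = 0.5  # EDM's default for CIFAR-10

def cskip(t):
    """EDM skip coefficient (Eqn. 3)."""
    return SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2)

def cout(t):
    """EDM output coefficient (Eqn. 3)."""
    return SIGMA_DATA * t / math.sqrt(SIGMA_DATA**2 + t**2)

def cm_coeffs(t, eps=0.002):
    """CM preconditioning (Eqn. 4): f_theta(x, t) = a*x + b*F_theta(x, t)."""
    a = SIGMA_DATA**2 / (SIGMA_DATA**2 + (t - eps)**2)
    b = SIGMA_DATA * (t - eps) / math.sqrt(SIGMA_DATA**2 + t**2)
    return a, b

a, b = cm_coeffs(0.002)   # at the boundary t = eps
print(a == 1.0 and b == 0.0)  # the boundary condition f_theta(x, eps) = x holds
```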
Consistency trajectory models (CTMs) (Kim et al., 2023) extend CMs by changing the mapping destination to not only the initial point but also any intermediate ones, enabling unconstrained backward jumps on the PF-ODE. Specifically, the consistency function is instead defined as f_θ : (x_t, t, s) ↦ x_s, which maps the point x_t at time t on the trajectory to the point x_s at any previous time s < t. The boundary condition is f_θ(x, t, t) = x, which is enforced by the following preconditioning:

f_θ(x, t, s) = (s/t)·x + (1 − s/t)·D_θ(x, t, s)    (6)

where D_θ(x, t, s) = c_skip(t)x + c_out(t)F_θ(x, t, s) is the student denoiser function, and F_θ(x, t, s) is a free-form network with an extra timestep input s. The student network is trained by minimizing

E_{t∈[ε,T], s∈[ε,t], u∈[s,t)} E_{q_0(x_0)q(x_t|x_0)}[w(t)·d(f_{sg(θ)}(f_θ(x_t, t, s), s, ε), f_{sg(θ)}(f_{sg(θ)}(Solver_φ(x_t, t, u), u, s), s, ε))]    (7)

^1 More precisely, D_φ(x, t) = c_skip(t)x + c_out(t)F_φ(c_in(t)x, c_noise(t)). Since c_in(t) and c_noise(t) take effect inside the network, we absorb them into the definition of F_φ for simplicity.
^2 In EDM, the range of timesteps is typically chosen as ε = 0.002, T = 80.

An important property of CTM's preconditioning is that when s → t, the optimal denoiser satisfies D_θ*(x, t, s) → D_φ(x_t, t), i.e., the diffusion denoiser. Consequently, the DSM loss in diffusion models can be incorporated to regularize the training of θ, which enhances the sample quality as the number of sampling steps increases, enabling a speed–quality trade-off.

Beyond the hand-crafted preconditionings outlined in Eqn. (4) and Eqn. (6), we seek a general paradigm of preconditioning design in consistency distillation. We first analyze their key ingredients and relate them to the discretization of the teacher ODE. Then we derive a generalized ODE form, which can induce a novel family of preconditionings. Finally, we propose a principled way to analytically obtain optimized preconditioning by minimizing the consistency gap.
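As a quick numeric sanity check on the CTM preconditioning in Eqn. (6): the in-place jump s = t returns the input unchanged, and a jump t → s coincides with one Euler step of the teacher ODE dx/dt = (x − D)/t when the denoiser is frozen at time t. A minimal sketch with a constant stand-in denoiser (a real D_θ is a neural network):

```python
def ctm_consistency(x, t, s, denoiser):
    """CTM preconditioning (Eqn. 6): f_theta(x,t,s) = (s/t)x + (1 - s/t)D_theta(x,t,s)."""
    return (s / t) * x + (1.0 - s / t) * denoiser(x, t, s)

# Constant stand-in for the student denoiser (illustrative only).
D = lambda x, t, s: 0.125

x_t, t, s = 3.0, 2.0, 0.5
# Boundary condition: the in-place jump s = t retains the data point.
print(ctm_consistency(x_t, t, t, D) == x_t)
# Jump t -> s equals one Euler step of dx/dt = (x - D)/t with D frozen at time t.
euler = x_t + (s - t) * (x_t - D(x_t, t, s)) / t
print(abs(ctm_consistency(x_t, t, s, D) - euler) < 1e-12)
```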
3.1 ANALYZING THE PRECONDITIONING IN CONSISTENCY DISTILLATION

We examine the form of the consistency function f_θ(x, t, s) in CTMs, which subsumes CMs as a special case by setting the jumping destination s to the initial timestep ε. Assume f_θ is parameterized in the following skip-connection form:

f_θ(x, t, s) = f(t, s)·x + g(t, s)·D_θ(x, t, s)    (8)

where D_θ(x, t, s) = c_skip(t)x + c_out(t)F_θ(x, t, s) represents the student denoiser function in alignment with EDM, and f(t, s), g(t, s) are coefficients that linearly combine x and D_θ. We identify two essential constraints on the coefficients f and g.

Boundary Condition  For any free-form network F_θ or D_θ, the consistency function f_θ must adhere to f_θ(x, t, t) = x (an in-place jump retains the original data point). Therefore, f and g should meet the conditions f(t, t) = 1 and g(t, t) = 0 for any time t.

Alignment with the Denoiser  Denote the optimal consistency function that precisely follows the teacher PF-ODE trajectory as f_θ*(x, t, s), and the optimal denoiser as D_θ*(x, t, s) = (f_θ*(x, t, s) − f(t, s)x)/g(t, s) according to Eqn. (8). In CTMs, f and g are properly designed so that the limit D_θ*(x, t, t) = lim_{s→t} (f_θ*(x, t, s) − f(t, s)x)/g(t, s) = D_φ(x, t) holds. Thus, the student denoiser at s = t, i.e., D_θ(x, t, t), ideally aligns with the teacher denoiser D_φ. This alignment offers two advantages: (1) D_θ(x_t, t, t) acts as a valid diffusion denoiser and is amenable to regularization with the DSM loss. (2) The teacher model D_φ serves as an effective initializer of the student D_θ at s = t, implying that only D_θ at s < t is suboptimal and requires further optimization.

Preconditionings satisfying these constraints can be derived by discretizing the teacher PF-ODE.
Suppose the discretization from time t to time s is expressed as x_s = f(t, s)x_t + g(t, s)D_φ(x_t, t); then f, g naturally satisfy the conditions: the discretization from t to t must be x_t = x_t, and as s → t the discretization error tends to 0, so the optimal student for conducting infinitesimally small jumps is just D_φ(x_t, t). For instance, applying the Euler method to the PF-ODE in Eqn. (2) yields:

x_s = x_t + (s − t)·(x_t − D_φ(x_t, t))/t = (s/t)·x_t + (1 − s/t)·D_φ(x_t, t)    (9)

which exactly matches the preconditioning used in CTMs after replacing D_φ(x_t, t) with D_θ(x_t, t, s). Elucidating preconditioning as ODE discretization also closely approximates CMs' choice in Eqn. (4). For t ≫ ε, we have t − ε ≈ t, therefore f_θ in Eqn. (4) approximately equals the denoiser D_θ. On the other hand, as ε → 0, CTM's choice in Eqn. (6) also indicates f_θ ≈ D_θ. Therefore, CMs' preconditioning is only distinct from ODE discretization when t is close to ε, which is not the case in one-step or few-step generation.

3.2 INDUCED PRECONDITIONING BY GENERALIZED ODE

Based on the analyses above, the preconditioning can be induced from ODE discretization. Drawing inspiration from the dedicated ODE solvers for diffusion models (Lu et al., 2022b; Zheng et al., 2023a), we consider a generalized representation of the teacher ODE in Eqn. (2), which can give rise to alternative preconditionings that satisfy the restrictions. Firstly, we modulate the ODE with a continuous function L_t to transform it into an ODE with respect to L_t x_t rather than x_t. Leveraging the chain rule of derivatives, we obtain d(L_t x_t)/dt = L_t·(dx_t/dt) + (dL_t/dt)·x_t, where dx_t/dt can be substituted by the original teacher ODE, resulting in

d(L_t x_t)/dt = (L_t/t)·[(1 + t·(d log L_t/dt))x_t − D_φ(x_t, t)]    (10)

By changing the time variable from t to λ_t = −log t, the ODE can be further simplified to

d(L_t x_t)/dλ_t = L_t·g_φ(x_t, t),  g_φ(x_t, t) := D_φ(x_t, t) − (1 − l_t)x_t    (11)

where we denote l_t := d log L_{t_λ}/dλ, and t_λ = e^{−λ} is the inverse function of λ_t. Moreover, L_t can be represented by l_t as L_t = exp(∫_{λ_T}^{λ_t} l_{t_λ} dλ).
Secondly, instead of using t or λ_t as the time variable in the ODE (i.e., formulating the ODE as d(·)/dλ_t), we can employ a generalized time representation η_t = ∫_{λ_T}^{λ_t} L_{t_λ}S_{t_λ} dλ, where S_t is any positive continuous function. This transformation ensures that η monotonically increases with respect to λ, enabling one-to-one inverse mappings t_η, λ_η. To align with L_t, we express S_t as exp(∫_{λ_T}^{λ_t} s_{t_λ} dλ), where we denote s_t := d log S_{t_λ}/dλ. Using η_t as the new time variable, we have dη = L_{t_λ}S_{t_λ} dλ, and the ODE in Eqn. (11) is further generalized to

d(L_t x_t)/dη_t = g_φ(x_t, t)/S_t    (12)

The final generalized ODE in Eqn. (12) is theoretically equivalent to the original teacher PF-ODE in Eqn. (2), albeit with a set of introduced free parameters {l_t, s_t}_{t=ε}^T. Applying the Euler method leads to a discretization different from Eqn. (9):

L_s x_s − L_t x_t = (η_s − η_t)·g_φ(x_t, t)/S_t    (13)

which can be rearranged as

x_s = (L_t S_t + (l_t − 1)(η_s − η_t))/(L_s S_t)·x_t + (η_s − η_t)/(L_s S_t)·D_φ(x_t, t)    (14)

Hence, the induced preconditioning can be expressed by Eqn. (8) with a novel set of coefficients f(t, s) = (L_t S_t + (l_t − 1)(η_s − η_t))/(L_s S_t), g(t, s) = (η_s − η_t)/(L_s S_t). Originating from the Euler discretization of an equivalent teacher ODE, these coefficients adhere to the constraints outlined in Section 3.1 under any parameters {l_t, s_t}_{t=ε}^T, thus opening avenues for further optimization. The induced preconditioning also degenerates to CTM's case f(t, s) = s/t, g(t, s) = 1 − s/t under the specific selection l_t = 1, s_t = 0 for t ∈ [ε, T] (i.e., L_t = T/t, S_t = 1).

3.3 PRINCIPLES FOR OPTIMIZING THE PRECONDITIONING

Derived from the generalized teacher ODE presented in Eqn. (12), a range of preconditionings is now at our disposal, with coefficients f, g from Eqn. (14) governed by the free parameters {l_t, s_t}_{t=ε}^T. Our aim is to establish guiding principles for discerning the optimal sets of {l_t, s_t}_{t=ε}^T, thereby attaining superior preconditioning compared to the original one in Eqn. (6).
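The induced coefficients of Eqn. (14) can be checked numerically. The sketch below uses arbitrary placeholder schedules (not learned quantities) to verify that the boundary condition f(t, t) = 1, g(t, t) = 0 holds for any choice of L, S, η, l, and that the particular schedules L_t = T/t, S_t = 1, η_t = T/t − 1 (for which the (l_t − 1) term vanishes) recover CTM's coefficients s/t and 1 − s/t:

```python
import math

def induced_coeffs(t, s, L, S, eta, l):
    """Coefficients induced by Euler discretization of the generalized ODE (Eqn. 14):
    f(t,s) = (L_t S_t + (l_t - 1)(eta_s - eta_t)) / (L_s S_t),
    g(t,s) = (eta_s - eta_t) / (L_s S_t)."""
    f = (L(t) * S(t) + (l(t) - 1.0) * (eta(s) - eta(t))) / (L(s) * S(t))
    g = (eta(s) - eta(t)) / (L(s) * S(t))
    return f, g

# 1) Boundary condition holds for arbitrary smooth placeholder schedules:
L_any, S_any = (lambda t: math.exp(0.3 * t)), (lambda t: 1.0 + 0.1 * t)
eta_any, l_any = (lambda t: t + 0.05 * t * t), (lambda t: 0.3)
f_tt, g_tt = induced_coeffs(1.7, 1.7, L_any, S_any, eta_any, l_any)
print(f_tt == 1.0 and g_tt == 0.0)

# 2) L_t = T/t, S_t = 1, eta_t = T/t - 1 (with l_t = 1) recovers CTM's s/t, 1 - s/t:
T = 80.0
t, s = 2.0, 0.5
f, g = induced_coeffs(t, s, lambda u: T / u, lambda u: 1.0,
                      lambda u: T / u - 1.0, lambda u: 1.0)
print(abs(f - s / t) < 1e-12 and abs(g - (1.0 - s / t)) < 1e-12)
```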
Table 1: Comparison between different preconditionings used in consistency distillation.

| | CM (Song et al., 2023) | BCM (Li & He, 2024) | CTM (Kim et al., 2023) | Analytic-Precond (Ours) |
|---|---|---|---|---|
| Free-form network | F_θ(x_t, t) | F_θ(x_t, t, s) | F_θ(x_t, t, s) | F_θ(x_t, t, s) |
| Denoiser function | D_θ(x, t) = c_skip(t)x + c_out(t)F_θ(x, t) | — | D_θ(x, t, s) = c_skip(t)x + c_out(t)F_θ(x, t, s) | D_θ(x, t, s) = c_skip(t)x + c_out(t)F_θ(x, t, s) |
| Consistency function | f_θ(x, t) = f(t, ε)x + g(t, ε)F_θ(x, t) | f_θ(x, t, s) = f(t, s)x + g(t, s)F_θ(x, t, s) | f_θ(x, t, s) = f(t, s)x + g(t, s)D_θ(x, t, s) | f_θ(x, t, s) = f(t, s)x + g(t, s)D_θ(x, t, s) |
| f(t, s) | σ_data² / (σ_data² + (t − s)²) | (σ_data² + ts) / (σ_data² + t²) | s/t | L_t S_s / (L_s S_s + (1 − l_s)(η_s − η_t)) |
| g(t, s) | σ_data(t − s) / √(σ_data² + t²) | σ_data(t − s) / √(σ_data² + t²) | 1 − s/t | (η_s − η_t) / (L_s S_s + (1 − l_s)(η_s − η_t)) |

Firstly, drawing from the insights of Rosenbrock-type exponential integrators and their relevance to diffusion models (Hochbruck & Ostermann, 2010; Hochbruck et al., 2009; Zheng et al., 2023a), the parameter l_t should be chosen to restrict the gradient of the right-hand side of Eqn. (12) with respect to x_t. This choice ensures the robustness of the resulting ODE against errors in x_t. An analytical solution for l_t is derived as follows:

l_t = argmin_l E_{q(x_t)}[‖∇_{x_t}g_φ(x_t, t)‖_F²]  ⟹  l_t = 1 − (1/d)·E_{q(x_t)}[tr(∇_{x_t}D_φ(x_t, t))]    (15)

where d is the data dimensionality, ‖·‖_F denotes the Frobenius norm and tr(·) represents the trace of a matrix. Secondly, to determine the optimal value of s_t, we dive deeper into the relationship between the teacher denoiser D_φ(x_t, t) and the student denoiser D_θ(x_t, t, s). As elucidated in Section 3.1, the preconditioning is properly crafted to ensure that the optimal student denoiser satisfies D_θ*(x_t, t, t) = D_φ(x_t, t). We further explore the scenario where s < t by examining the gap ‖D_θ*(x_t, t, s) − D_φ(x_t, t)‖_2, which we refer to as the consistency gap. Minimizing this gap extends the alignment of D_φ and D_θ to cases where s < t, ensuring that the teacher denoiser also serves as a good trajectory jumper.
In the subsequent proposition, we derive a bound depicting the asymptotic behavior of the consistency gap:

Proposition 3.1 (Bound for the Consistency Gap, proof in Appendix A.1). Suppose there exists some constant C > 0 so that the parameters {l_t, s_t}_{t=ε}^T are bounded by |l_t|, |s_t| ≤ C. Then the optimal student denoiser D_θ* under the preconditioning f(t, s) = (L_t S_t + (l_t − 1)(η_s − η_t))/(L_s S_t), g(t, s) = (η_s − η_t)/(L_s S_t) satisfies

‖D_θ*(x_t, t, s) − D_φ(x_t, t)‖_2 ≤ ((t/s)^{3C} − 1)/(3C) · max_{s≤τ≤t} ‖dg_φ(x_τ, τ)/dλ_τ − s_τ·g_φ(x_τ, τ)‖_2    (16)

The proposition conforms to the constraint D_θ*(x_t, t, t) = D_φ(x_t, t) when s = t. Moreover, considering s in a local neighborhood of t, by Taylor expansion we have ((t/s)^{3C} − 1)/(3C) = (e^{3C(log t − log s)} − 1)/(3C) = (log t − log s)(1 + O(log t − log s)). Therefore, the consistency gap for s ∈ (t − δ, t), when δ is small, is roughly proportional to max_{s≤τ≤t} ‖dg_φ(x_τ, τ)/dλ_τ − s_τ·g_φ(x_τ, τ)‖_2. Minimizing this yields an analytic solution for s_t:

s_t = argmin_s E_{q(x_t)}[‖dg_φ(x_t, t)/dλ_t − s·g_φ(x_t, t)‖_2²]  ⟹  s_t = E_{q(x_t)}[g_φ(x_t, t)ᵀ(dg_φ(x_t, t)/dλ_t)] / E_{q(x_t)}[‖g_φ(x_t, t)‖_2²]    (17)

We term the resulting preconditioning Analytic-Precond, as l_t, s_t are analytically determined by the teacher φ using Eqn. (15) and Eqn. (17). Though l_t, s_t are defined over continuous timesteps, we can compute them on hundreds of discretized ones, while obtaining reasonable estimations of their related terms L_t, S_t, η_t. The computation is highly efficient utilizing automatic differentiation in modern deep learning frameworks, requiring less than 1% of the total training time (Appendix B.1).
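Note that Eqn. (15) only requires the average Jacobian trace of the teacher denoiser, which can be estimated from Jacobian–vector products alone. A toy sketch with a linear stand-in denoiser, using a Hutchinson-style randomized trace estimator (one standard choice; the paper only states that automatic differentiation is used):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d)) / d          # toy linear "denoiser": D(x) = A @ x
denoiser_jvp = lambda x, v: A @ v            # Jacobian-vector product of D at x

def estimate_lt(x, n_probes=4000):
    """Hutchinson estimate of l_t = 1 - tr(grad_x D)/d (Eqn. 15), using only JVPs."""
    tr = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe: E[v^T J v] = tr(J)
        tr += v @ denoiser_jvp(x, v)
    return 1.0 - tr / (n_probes * d)

x = rng.standard_normal(d)
exact = 1.0 - np.trace(A) / d                # ground truth for the linear toy denoiser
print(abs(estimate_lt(x) - exact) < 0.05)
```

In practice the expectation over q(x_t) would be a Monte Carlo average over noised data samples, with the JVP supplied by the deep learning framework's autodiff.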
Backward Euler Method for Training Stability  Despite the approximation ((t/s)^{3C} − 1)/(3C) ≈ log(t/s) holding true in local neighborhoods of t, the coefficient ((t/s)^{3C} − 1)/(3C) in the bound exhibits exponential behavior when t/s ≫ 1. In practice, directly applying the preconditioning derived from Eqn. (14) may cause training instability, especially on long jumps with large step sizes. Drawing inspiration from the stability of the backward Euler method, known for its efficacy in handling stiff equations without step-size restrictions, we propose rewriting Eqn. (14) backward from s to t as x_t = f̂(s, t)x_s + ĝ(s, t)D_φ, where f̂, ĝ are the original coefficients from Eqn. (14). Rearranging this equation yields x_s = (1/f̂(s, t))x_t − (ĝ(s, t)/f̂(s, t))D_φ, giving rise to the backward coefficients f(t, s) = 1/f̂(s, t), g(t, s) = −ĝ(s, t)/f̂(s, t).

Figure 2: Training curves for single-step generation ((a) CM vs. CM + Ours, (b) CTM vs. CTM + Ours), and (c) visualization of the preconditioning coefficients f(t, ε), g(t, ε) for single-step jumps on CIFAR-10 (conditional).

We summarize the different preconditionings in Table 1, where we also include a concurrent work called bidirectional consistency models (BCMs) (Li & He, 2024), which proposed an alternative preconditioning to CTMs.

4 RELATED WORK

Fast Diffusion Sampling  Fast sampling of diffusion models can be categorized into training-free and training-based methods.
The former typically seek implicit sampling processes (Song et al., 2021a; Zheng et al., 2024a;b) or dedicated numerical solvers for the differential equations corresponding to diffusion generation, including Heun's method (Karras et al., 2022), splitting numerical methods (Wizadwongsa & Suwajanakorn, 2022), pseudo-numerical methods (Liu et al., 2021) and exponential integrators (Zhang & Chen, 2022; Lu et al., 2022b; Zheng et al., 2023a; Gonzalez et al., 2023). They typically require around 10 steps for high-quality generation. In contrast, training-based methods, particularly adversarial distillation (Sauer et al., 2023) and consistency distillation (Song et al., 2023; Kim et al., 2023), have gained prominence for their ability to achieve high-quality generation with just one or two steps. While adversarial distillation proves effective for one-step generation of text-to-image diffusion models (Sauer et al., 2024), it is theoretically less transparent than consistency distillation due to its reliance on adversarial training. Diffusion models can also be accelerated using quantized or sparse attention (Zhang et al., 2025a; 2024; 2025b).

Parameterization in Diffusion Models  Parameterization is vital for efficient training and sampling of diffusion models. Initially, the practice involved parameterizing a noise prediction network (Ho et al., 2020; Song et al., 2021c), which outperformed direct data prediction. A notable subsequent enhancement is the introduction of v-prediction (Salimans & Ho, 2022), which predicts the velocity along the diffusion trajectory and has proven effective in applications like text-to-image generation (Esser et al., 2024) and density estimation (Zheng et al., 2023b). EDM (Karras et al., 2022) further advances the field by proposing a preconditioning technique that expresses the denoiser function as a linear combination of data and network output, yielding state-of-the-art sample quality alongside other techniques.
However, the parameterization in consistency distillation remains unexplored.

5 EXPERIMENTS

In this section, we demonstrate the impact of Analytic-Precond when applied to consistency distillation. Our experiments encompass various image datasets, including CIFAR-10 (Krizhevsky, 2009), FFHQ (Karras et al., 2019) 64×64, and ImageNet (Deng et al., 2009) 64×64, under both unconditional and class-conditional settings.

Figure 3: Training curves for two-step generation, comparing CTM and CTM + Ours on (a) CIFAR-10 (unconditional), (b) CIFAR-10 (conditional, 2.8× speedup), (c) FFHQ 64×64 (unconditional), and (d) ImageNet 64×64 (conditional).

We deploy Analytic-Precond across two paradigms: consistency models (CMs) (Song et al., 2023) and consistency trajectory models (CTMs) (Kim et al., 2023), wherein we solely substitute the preconditioning while retaining other training procedures. For further experimental details, please refer to Appendix B. Our investigation aims to address two primary questions: Can Analytic-Precond yield improvements over the original preconditioning of CMs and CTMs, across both single-step and multi-step generation? How does Analytic-Precond differ from prior preconditionings across datasets, concerning the coefficients f(t, s) and g(t, s)?

5.1 TRAINING ACCELERATION

Effects on CMs and Single-Step CTMs  We first apply Analytic-Precond to CMs, where the consistency function f_θ(x_t, t) is defined to map x_t on the teacher ODE trajectory to the starting point x_ε at the fixed time ε. The models are trained with the consistency loss defined in Eqn. (5) on the CIFAR-10 dataset, with class labels as conditions. As depicted in Figure 2 (a), we observe that Analytic-Precond yields training curves similar to the original CM, measured by FID.
Since multi-step consistency sampling in CMs only involves evaluating f_θ(x_t, t) multiple times, the results remain comparable even with an increase in sampling steps. Similar phenomena emerge in CTMs with single-step generation, as illustrated in Figure 2 (b). The commonality between these two scenarios lies in the utilization of only the jumping destination at ε. To investigate further, we plot the preconditioning coefficients f(t, ε) and g(t, ε) of CMs, CTMs and Analytic-Precond as functions of log t, as illustrated in Figure 2 (c). It is evident that across varying t, the different preconditioning coefficients f and g exhibit negligible discrepancies when s is fixed to ε. This elucidates the rationale behind the comparable performance, suggesting that the original preconditionings for jumps t → ε are already quite optimal, with minimal room for further optimization.

Effects on Two-Step CTMs  We further track sample quality during the training process of CTMs, particularly focusing on two-step generation where an intermediate jump is involved (T → t_0 → ε). The models are trained with both the consistency trajectory loss in Eqn. (7) and the denoising score matching (DSM) loss E_t E_{p_data(x_0)q(x_t|x_0)}[w(t)‖D_θ(x_t, t, t) − x_0‖_2²], following CTMs^3. As shown in Figure 3, across diverse datasets, Analytic-Precond enjoys superior initialization and up to 3× training acceleration compared to CTM's preconditioning. This observation indicates the suboptimality of the original preconditioning for intermediate trajectory jumps t → s > ε. We provide generated samples in Appendix C.

5.2 GENERATION WITH MORE STEPS

Apart from the superiority of CTMs over CMs in single-step generation (Figure 2), another notable advantage of CTMs is the regularization effect of the DSM loss. This ensures that D_θ(x_t, t, t) functions as a valid denoiser in diffusion models, facilitating sample-quality enhancement with additional sampling steps.
To evaluate the effectiveness of Analytic-Precond with more steps, we employ the deterministic procedure in CTMs, which uses the consistency function to jump along consecutively decreasing timesteps from T to ε. As shown in Table 2, Analytic-Precond brings consistent improvement over CTMs as the number of steps increases, indicating better alignment between the consistency function and the denoiser function.

^3 CTMs also propose to combine the GAN loss for further enhancing quality, which we will discuss later.

Figure 4: Visualizations of the preconditioning coefficient g(t, s) for CTM and for Analytic-Precond under different datasets ((b) CIFAR-10, (c) FFHQ 64×64, (d) ImageNet 64×64).

Table 2: FID results in multi-step generation with different numbers of function evaluations (NFEs).

CIFAR-10 (Unconditional) / CIFAR-10 (Conditional)

| NFE | 2 | 3 | 5 | 8 | 10 | 2 | 3 | 5 | 8 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| CTM | 3.83 | 3.58 | 3.43 | 3.33 | 3.22 | 3.00 | 2.82 | 2.59 | 2.67 | 2.56 |
| CTM + Ours | 3.77 | 3.54 | 3.38 | 3.30 | 3.25 | 2.92 | 2.75 | 2.62 | 2.60 | 2.65 |

FFHQ 64×64 (Unconditional) / ImageNet 64×64 (Conditional)

| NFE | 2 | 3 | 5 | 8 | 10 | 2 | 3 | 5 | 8 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| CTM | 5.96 | 5.80 | 5.53 | 5.39 | 5.23 | 5.95 | 6.16 | 5.43 | 5.44 | 5.98 |
| CTM + Ours | 5.71 | 5.56 | 5.47 | 5.31 | 5.12 | 5.73 | 5.67 | 5.34 | 5.43 | 5.70 |

5.3 ANALYSES AND DISCUSSIONS

Figure 5: Effects of BCM's preconditioning on CTMs (one-step and two-step generation).

Visualizations  To intuitively understand the distinctions between Analytic-Precond and the original preconditioning in CTMs, we investigate the variations in the coefficients f(t, s), g(t, s). We find that Analytic-Precond yields f(t, s) close to that of CTMs, denoted as f_CTM(t, s), with |f_CTM(t, s) − f(t, s)| < 0.03 across various t and s. However, the g(t, s) produced by Analytic-Precond tends to be smaller, with disparities of up to 0.25 compared to g_CTM(t, s).
This distinction is visually demonstrated in Figure 4, where we depict g(t, s) as a bivariate function of log t and log s. Notably, the distinction is more pronounced for short jumps t → s where |t − s|/t is small.

Comparison to BCMs  In a concurrent work called bidirectional consistency models (BCMs) (Li & He, 2024), a novel preconditioning is derived from EDM's first principles (specified in Table 1). BCM's preconditioning also accommodates flexible transitions from t to s along the trajectory. However, as shown in Figure 5, replacing CTM's preconditioning with BCM's fails to bring improvements in both one-step and two-step generation.

Compatibility with GAN Loss  CTMs introduce a GAN loss to further enhance one-step generation quality, employing a discriminator and adopting an alternating optimization approach akin to GANs. As shown in Figure 6, when the GAN loss is incorporated on CIFAR-10, Analytic-Precond demonstrates comparable performance. However, in this scenario, the consistency function no longer faithfully adheres to the teacher ODE trajectory, and one-step generation is even better than two-step, deviating from our theoretical foundations. Nevertheless, the utilization of Analytic-Precond does not lead to performance degradation.

Figure 6: Effects of Analytic-Precond with the GAN loss.

Figure 7: Visualizations of the trajectory alignment, comparing the teacher and the 3-step student.

Enhancement of the Trajectory Alignment  We observe that our method also leads to lower mean squared error (MSE) in the multi-step generation of CTM when compared to the teacher diffusion model under the same initial noise, indicating enhanced fidelity to the teacher's trajectory.
To better illustrate the effect of Analytic-Precond in improving trajectory alignment, we adopt a toy example where the data distribution is a simple 1-D Gaussian mixture (1/3)N(−2, 1) + (2/3)N(1, 0.25). In this case, we can analytically derive the optimal denoiser and visualize the ground-truth teacher trajectory. We initialize the consistency function with the optimal denoiser and apply different preconditionings. As shown in Figure 7, our preconditioning produces few-step trajectories that better align with the teacher's and yields a more accurate final distribution.

6 CONCLUSION

In this work, we elucidate the design criteria of the preconditioning in consistency distillation for the first time and propose a novel and principled preconditioning that accelerates the training of CTMs in multi-step generation by 2× to 3×. The crux of our approach lies in our theoretical insights, which connect preconditioning to ODE discretization and emphasize the alignment between the consistency function and the denoiser function. Minimizing the consistency gap fosters coordination between the consistency loss and the denoising score-matching loss, thereby facilitating speed-quality trade-offs. Our method provides the first guidelines for designing improved trajectory jumpers on the diffusion ODE, with potential applications to other types of ODE trajectories such as the dynamics of control systems or robotic path planning.

Limitations and Broader Impact. Despite the notable training acceleration in multi-step generation, the final FID improvement is relatively insignificant. Besides, Analytic-Precond barely differs from previous preconditionings on long jumps, resulting in comparable performance in single-step generation. Achieving accelerated distillation in generative modeling may also raise concerns about the potential misuse for generating fake and malicious media content. Furthermore, it may amplify undesirable social biases that could already exist in the training dataset.
ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Nos. 62350080, 62106120, 92270001), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University; J. Zhu was also supported by the XPlorer Prize.

REFERENCES

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024.

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021.

Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2022.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pp. 8780-8794, 2021.

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
Martin Gonzalez, Nelson Fernandez, Thuy Tran, Elies Gherbi, Hatem Hajri, and Nader Masmoudi. Seeds: Exponential SDE solvers for fast high-quality sampling from diffusion models. arXiv preprint arXiv:2305.14267, 2023.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pp. 2672-2680, 2014.

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840-6851, 2020.

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209-286, 2010.

Marlis Hochbruck, Alexander Ostermann, and Julia Schweitzer. Exponential Rosenbrock-type methods. SIAM Journal on Numerical Analysis, 47(1):786-803, 2009.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 2022.

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2023.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems, 2021.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.

Liangchen Li and Jiajun He. Bidirectional consistency models. arXiv preprint arXiv:2403.18035, 2024.

Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations, 2021.

Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion ODEs by high order denoising score matching. In International Conference on Machine Learning, pp. 14429-14460. PMLR, 2022a.

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022b.

Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022a.

Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022b.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp. 16784-16804. PMLR, 2022.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. arXiv preprint arXiv:2403.12015, 2024.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.
In International Conference on Learning Representations, 2021a.

Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems, volume 34, pp. 1415-1428, 2021b.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pp. 32211-32252. PMLR, 2023.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, and Nong Sang. VideoLCM: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023.

Suttisak Wizadwongsa and Supasorn Suwajanakorn. Accelerating guided diffusion sampling with splitting numerical methods. In The Eleventh International Conference on Learning Representations, 2022.

Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, and Yike Guo. CoMoSpeech: One-step speech and singing voice synthesis via consistency model. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 1831-1839, 2023.

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. arXiv preprint arXiv:2411.10958, 2024.

Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. SageAttention: Accurate 8-bit attention for plug-and-play inference acceleration. In International Conference on Learning Representations (ICLR), 2025a.

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen.
SpargeAttn: Accurate sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025b.

Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2022.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-595, 2018.

Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion ODEs. In International Conference on Machine Learning, pp. 42363-42389. PMLR, 2023b.

Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908, 2024a.

Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, and Jun Zhu. Diffusion bridge implicit models. arXiv preprint arXiv:2405.15885, 2024b.

A.1 PROOF OF PROPOSITION 3.1

Proof. Denote $\{x_\tau\}_{\tau=s}^{t}$ as data points on the same teacher ODE trajectory. The generalized ODE in Eqn. (12) can be reformulated as an integral:

$L_s x_s - L_t x_t = \int_{\eta_t}^{\eta_s} h_\phi(x_{t_{\lambda_\eta}}, t_{\lambda_\eta})\,\mathrm{d}\eta$  (18)

where $h_\phi(x_t, t) := \frac{g_\phi(x_t, t)}{S_t}$, and $g_\phi$ is defined by the teacher denoiser $D_\phi$ in Eqn. (11). On the other hand, by replacing the teacher denoiser $D_\phi$ with the student denoiser $D_\theta$ in the Euler discretization (Eqn. (13)), the optimal student $\theta^*$ should satisfy

$L_s x_s - L_t x_t = (\eta_s - \eta_t) h_{\theta^*}(x_t, t, s)$  (19)

where $h_\theta, g_\theta$ are defined similarly to $h_\phi, g_\phi$ as

$h_\theta(x_t, t, s) = \frac{g_\theta(x_t, t, s)}{S_t}, \quad g_\theta(x_t, t, s) = D_\theta(x_t, t, s) - (1 - l_t)x_t$  (20)

Combining the above equations, we have

$D_{\theta^*}(x_t, t, s) - D_\phi(x_t, t) = S_t\big(h_{\theta^*}(x_t, t, s) - h_\phi(x_t, t)\big) = \frac{S_t}{\eta_s - \eta_t}\int_{\eta_t}^{\eta_s}\big[h_\phi(x_{t_{\lambda_\eta}}, t_{\lambda_\eta}) - h_\phi(x_t, t)\big]\mathrm{d}\eta$

According to the mean value theorem, there exists some $\tau \in [t, t_{\lambda_\eta}]$ satisfying

$\big\|h_\phi(x_{t_{\lambda_\eta}}, t_{\lambda_\eta}) - h_\phi(x_t, t)\big\|_2 \le (\eta - \eta_t)\Big\|\frac{\mathrm{d}h_\phi(x_\tau, \tau)}{\mathrm{d}\eta_\tau}\Big\|_2$

Besides, the derivative $\frac{\mathrm{d}h_\phi}{\mathrm{d}\eta}$ can be calculated as

$\frac{\mathrm{d}h_\phi(x_\tau, \tau)}{\mathrm{d}\eta_\tau} = \frac{1}{L_\tau S_\tau}\frac{\mathrm{d}h_\phi(x_\tau, \tau)}{\mathrm{d}\lambda_\tau} = \frac{1}{L_\tau S_\tau^2}\Big(\frac{\mathrm{d}g_\phi(x_\tau, \tau)}{\mathrm{d}\lambda_\tau} - s_\tau g_\phi(x_\tau, \tau)\Big)$

where we have used $\frac{\mathrm{d}\log S_\tau}{\mathrm{d}\lambda_\tau} = s_\tau$ and $\frac{\mathrm{d}\eta_\tau}{\mathrm{d}\lambda_\tau} = L_\tau S_\tau$. Therefore,

$\big\|D_{\theta^*}(x_t, t, s) - D_\phi(x_t, t)\big\|_2 \le \frac{S_t}{\eta_s - \eta_t}\int_{\eta_t}^{\eta_s}\frac{\eta - \eta_t}{L_\tau S_\tau^2}\,\mathrm{d}\eta \cdot \max_{s \le \tau \le t}\Big\|\frac{\mathrm{d}g_\phi(x_\tau, \tau)}{\mathrm{d}\lambda_\tau} - s_\tau g_\phi(x_\tau, \tau)\Big\|_2$  (24)

Since we assumed $|l_t|, |s_t| \le c$, for $\tau \in [t, t_\lambda]$ we have

$\frac{L_{t_\lambda}}{L_\tau} = \exp\Big(\int_{\lambda_\tau}^{\lambda} l_{t_{\lambda'}}\,\mathrm{d}\lambda'\Big) \le e^{c(\lambda - \lambda_t)}, \quad \frac{S_{t_\lambda}}{S_\tau} \le e^{c(\lambda - \lambda_t)}, \quad \frac{S_t}{S_\tau} \le e^{c(\lambda - \lambda_t)}$  (25)

Therefore,

$\int_{\lambda_t}^{\lambda_s}\frac{L_t S_t^2}{L_\tau S_\tau^2}\,\mathrm{d}\lambda \le \int_{\lambda_t}^{\lambda_s} e^{3c(\lambda - \lambda_t)}\,\mathrm{d}\lambda = \frac{e^{3c(\lambda_s - \lambda_t)} - 1}{3c} = \frac{(t/s)^{3c} - 1}{3c}$  (26)

Substituting Eqn. (26) into Eqn. (24) completes the proof.

Table 3: Experimental configurations.

Configuration                    CIFAR-10    CIFAR-10   FFHQ 64×64   ImageNet 64×64
                                 (Uncond)    (Cond)     (Uncond)     (Cond)
Learning rate                    0.0004      0.0004     0.0004       0.0004
Student's stop-grad EMA param.   0.999       0.999      0.999        0.999
N                                18          18         18           40
ODE solver                       Heun        Heun       Heun         Heun
Max. ODE steps                   17          17         17           20
EMA decay rate                   0.999       0.999      0.999        0.999
Training iterations              200K        150K       150K         60K
Mixed precision (FP16)           True        True       True         True
Batch size                       256         512        256          2048
Number of GPUs                   4           8          8            32
Training time (A800 hours)       490         735        900          6400

B EXPERIMENT DETAILS

B.1 COEFFICIENTS COMPUTING

At every time t, the parameters $l_t$ and $s_t$ can be directly computed according to Eqn. (15) and Eqn. (17), relying solely on the teacher denoiser model $D_\phi$. The computation of $l_t$ involves evaluating $\mathrm{tr}(\nabla_{x_t} D_\phi(x_t, t))$, which is the trace of a Jacobian matrix.
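This trace can be estimated stochastically. Below is a minimal, framework-agnostic sketch of Hutchinson's trace estimator: `vjp_fn` is a hypothetical callable returning a vector-Jacobian product (in practice it would come from automatic differentiation, e.g. `torch.autograd.grad` with `grad_outputs`), and the probe distribution is Rademacher, which has zero mean and identity covariance.

```python
import random

def hutchinson_trace(vjp_fn, dim, n_probes=16, seed=0):
    """Hutchinson's estimator for tr(J), given only vector-Jacobian products.

    vjp_fn(v) returns v^T J as a length-`dim` list; here J stands for the
    Jacobian of the denoiser w.r.t. its input. Since E[v v^T] = I for
    Rademacher probes v, E[v^T J v] = tr(J), so averaging v^T J v over
    probes gives an unbiased estimate at O(d) cost per probe.
    """
    rng = random.Random(seed)
    est = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        jv = vjp_fn(v)                          # v^T J via one VJP
        est += sum(a * b for a, b in zip(jv, v))  # v^T J v
    return est / n_probes
```

For a diagonal Jacobian the estimate is exact for every Rademacher probe, since $v^T J v = \sum_i J_{ii} v_i^2 = \mathrm{tr}(J)$; for general Jacobians the variance decreases with the number of probes.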
Utilizing Hutchinson's trace estimator, it can be unbiasedly estimated as $\frac{1}{N}\sum_{n=1}^{N} v_n^\top \nabla_{x_t} D_\phi(x_t, t) v_n$, where $v_n$ follows a d-dimensional distribution with zero mean and unit covariance. Thus, only the Jacobian-vector product $\nabla_{x_t} D_\phi(x_t, t) v$ is required, achievable at O(d) computational cost via automatic differentiation. Once $l_t$ is obtained, the function $g_\phi(x_t, t) = D_\phi(x_t, t) - (1 - l_t)x_t$ is determined. The computation of $s_t$ involves evaluating $\frac{\mathrm{d}g_\phi(x_t, t)}{\mathrm{d}\lambda_t}$, which expands as follows:

$\frac{\mathrm{d}g_\phi(x_t, t)}{\mathrm{d}\lambda_t} = \frac{\mathrm{d}D_\phi(x_t, t)}{\mathrm{d}\lambda_t} - (1 - l_t)\frac{\mathrm{d}x_t}{\mathrm{d}\lambda_t} = \frac{\mathrm{d}D_\phi(x_t, t)}{\mathrm{d}\lambda_t} - (1 - l_t)(D_\phi(x_t, t) - x_t)$  (27)

where $\frac{\mathrm{d}D_\phi(x_t, t)}{\mathrm{d}\lambda_t}$ can also be calculated in O(d) time by automatic differentiation.

For the CIFAR-10 and FFHQ 64×64 datasets, we compute $l_t$ and $s_t$ across 120 discrete timesteps uniformly distributed in log space, with 4096 samples used to estimate the expectation $\mathbb{E}_{q(x_t)}$. For ImageNet 64×64, computations are performed across 160 discretized timesteps following EDM's scheduling $(t_{\max}^{1/\rho} + \frac{i}{N}(t_{\min}^{1/\rho} - t_{\max}^{1/\rho}))^\rho$, using 1024 samples to estimate the expectation $\mathbb{E}_{q(x_t)}$. The total computation times for CIFAR-10, FFHQ 64×64, and ImageNet 64×64 on 8 NVIDIA A800 GPUs are approximately 38 minutes, 54 minutes, and 38 minutes, respectively.

B.2 TRAINING DETAILS

Throughout the experiments, we follow the training procedures of CTMs. The teacher models are the pretrained diffusion models on the corresponding datasets, provided by EDM. The network architecture of the student models mirrors that of their respective teachers, with the addition of a time-conditioning variable s as input. Training of the student models involves minimizing the consistency loss outlined in Eqn. (7) and the denoising score-matching loss $\mathbb{E}_t \mathbb{E}_{p_{\mathrm{data}}(x_0) q(x_t|x_0)}[w(t)\|D_\theta(x_t, t, t) - x_0\|_2^2]$. For the consistency loss, we use LPIPS (Zhang et al., 2018) as the distance metric $d(\cdot, \cdot)$, which is also the choice of CMs.
t and s in the consistency loss are chosen from N discretized timesteps determined by EDM's scheduling $(t_{\max}^{1/\rho} + \frac{i}{N}(t_{\min}^{1/\rho} - t_{\max}^{1/\rho}))^\rho$. The Heun sampler in EDM is employed as the solver in Eqn. (7). The number of sampling steps, determined by the gap between t and s, is restricted to avoid excessive training time. For CIFAR-10 and FFHQ 64×64, we select N = 18 and the maximum number of sampling steps as 17, i.e., not restricting the range of jumping from t to s. For ImageNet 64×64, we set N = 40 and the maximum number of sampling steps to 20, so that the jumping range is at most half of the trajectory length. $\mathrm{sg}(\theta)$ in Eqn. (7) is an exponential moving average stop-gradient version of $\theta$, updated by

$\mathrm{sg}(\theta) \leftarrow \text{stop-gradient}(\mu\, \mathrm{sg}(\theta) + (1 - \mu)\theta)$  (28)

We follow the hyperparameters used in EDM, setting $\sigma_{\min} = \epsilon = 0.002$, $\sigma_{\max} = T = 80.0$, $\sigma_{\mathrm{data}} = 0.5$, and $\rho = 7$. The training configurations are summarized in Table 3. We run the experiments on a cluster of NVIDIA A800 GPUs. For CIFAR-10 (unconditional), we train the model with a batch size of 256 for 200K iterations, which takes 5 days on 4 GPUs. For CIFAR-10 (conditional), we train the model with a batch size of 512 for 150K iterations, which takes 4 days on 8 GPUs. For FFHQ 64×64 (unconditional), we train the model with a batch size of 256 for 150K iterations, which takes 5 days on 8 GPUs. For ImageNet 64×64 (conditional), we train the model with a batch size of 2048 for 60K iterations, which takes 8 days on 32 GPUs.

B.3 EVALUATION DETAILS

For both single-step and multi-step sampling of CTMs, we utilize their deterministic sampling procedure, jumping along a set of discrete timesteps $T = t_0 > t_1 > \ldots > t_{N-1} > t_N = \epsilon$ with the consistency function, formulated as the update rule $x_{t_n} = f_\theta(x_{t_{n-1}}, t_{n-1}, t_n)$. The timesteps $\{t_i\}_{i=0}^{N}$ are distributed according to EDM's scheduling $(t_{\max}^{1/\rho} + \frac{i}{N}(t_{\min}^{1/\rho} - t_{\max}^{1/\rho}))^\rho$, where $t_{\min} = \epsilon$, $t_{\max} = T$.
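The schedule and the deterministic jump loop above can be sketched as follows; `f_theta` is a stand-in for the trained student consistency function, and the defaults mirror the stated hyperparameters (t_min = 0.002, t_max = 80.0, rho = 7).

```python
import math

def edm_timesteps(n, t_min=0.002, t_max=80.0, rho=7.0):
    """EDM's discretization: t_i = (t_max^(1/rho) + i/n (t_min^(1/rho) - t_max^(1/rho)))^rho,
    giving T = t_0 > t_1 > ... > t_n = epsilon."""
    a, b = t_max ** (1 / rho), t_min ** (1 / rho)
    return [(a + i / n * (b - a)) ** rho for i in range(n + 1)]

def ctm_multistep_sample(f_theta, x_T, n_steps):
    """Deterministic multi-step sampling: repeatedly apply the consistency
    function to jump x_{t_n} = f_theta(x_{t_{n-1}}, t_{n-1}, t_n)."""
    ts = edm_timesteps(n_steps)
    x = x_T
    for n in range(1, len(ts)):
        x = f_theta(x, ts[n - 1], ts[n])
    return x
```

With n_steps = 1 this reduces to a single jump from T to ε (one-step generation); larger n_steps yields the NFE values reported in Table 2.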
We generate 50K random samples with the same seed and report the FID on them.

B.4 LICENSE

Table 4: The used datasets, codes, and their licenses.

Name      URL                                                   Citation                    License
CIFAR-10  https://www.cs.toronto.edu/~kriz/cifar.html           (Krizhevsky et al., 2009)   \
FFHQ      https://github.com/NVlabs/ffhq-dataset                (Karras et al., 2019)       CC BY-NC-SA 4.0
ImageNet  https://www.image-net.org                             (Deng et al., 2009)         \
EDM       https://github.com/NVlabs/edm                         (Karras et al., 2022)       CC BY-NC-SA 4.0
CM        https://github.com/openai/consistency_models_cifar10  (Song et al., 2023)         Apache-2.0
CTM       https://github.com/sony/ctm                           (Kim et al., 2023)          MIT

We list the used datasets, codes, and their licenses in Table 4.

C ADDITIONAL SAMPLES

Figure 8: Random samples produced by CTM and CTM + Analytic-Precond (Ours) with NFE=2: (a, b) CIFAR-10 unconditional, (c, d) CIFAR-10 conditional, (e, f) FFHQ 64×64 unconditional, (g, h) ImageNet 64×64 conditional.
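For completeness, the closed-form optimal denoiser used in the 1-D toy example of Section 5.3 can be computed directly. The sketch below assumes the EDM noising $x_t = x_0 + \sigma n$, so each mixture component's marginal at noise level $\sigma$ is Gaussian and $E[x_0 \mid x_t]$ is a responsibility-weighted posterior mean; function names are illustrative.

```python
import math

# Toy mixture from Section 5.3: (1/3) N(-2, 1) + (2/3) N(1, 0.25)
WEIGHTS = (1 / 3, 2 / 3)
MEANS = (-2.0, 1.0)
VARS = (1.0, 0.25)

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def optimal_denoiser(x_t, sigma):
    """E[x_0 | x_t] for x_t = x_0 + sigma * n, n ~ N(0, 1).

    Component i's marginal at level sigma is N(mu_i, var_i + sigma^2), and its
    posterior mean is the precision-weighted combination
    (var_i * x_t + sigma^2 * mu_i) / (var_i + sigma^2).
    """
    resp, post = [], []
    for w, mu, var in zip(WEIGHTS, MEANS, VARS):
        marg_var = var + sigma ** 2
        resp.append(w * gauss_pdf(x_t, mu, marg_var))   # component responsibility
        post.append((var * x_t + sigma ** 2 * mu) / marg_var)
    z = sum(resp)
    return sum(r * m for r, m in zip(resp, post)) / z
```

As sigma tends to 0 the denoiser approaches the identity, and for large sigma it approaches the mixture mean (1/3)(-2) + (2/3)(1) = 0, consistent with the ground-truth teacher trajectory plotted in Figure 7.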