# Data-free Distillation of Diffusion Models with Bootstrapping

Jiatao Gu¹, Chen Wang², Shuangfei Zhai¹, Yizhe Zhang¹, Lingjie Liu², Joshua M. Susskind¹

¹Apple ²University of Pennsylvania. Correspondence to: Jiatao Gu.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

*Figure 1. Samples of our distilled single-step model with prompts from DiffusionDB.*

## Abstract

Diffusion models have demonstrated great potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few, without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model, or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time-step. Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for previous methods given the fact that the training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.

## 1. Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2020) have become the standard tools for generative applications, such as image (Dhariwal & Nichol, 2021; Rombach et al., 2021; Ramesh et al., 2022; Saharia et al., 2022), video (Ho et al., 2022b;a), 3D (Poole et al., 2022; Gu et al., 2023; Liu et al., 2023b; Chen et al., 2023), audio (Liu et al., 2023a), and text (Li et al., 2022; Zhang et al., 2023) generation. Diffusion models are considered more stable to train than alternative approaches like GANs (Goodfellow et al., 2014a) or VAEs (Kingma & Welling, 2013), as they don't require balancing two modules, making them less susceptible to issues like mode collapse or posterior collapse.

Despite their empirical success, standard diffusion models often have slow inference (around 50-1000× slower than single-step models like GANs), which poses challenges for deployment on consumer devices. This is mainly because diffusion models use an iterative refinement process to generate samples. To address this issue, previous studies have proposed using knowledge distillation to improve the inference speed (Hinton et al., 2015). The idea is to train a faster student model that can replicate the output of a pre-trained diffusion model.
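To make the latency gap concrete, the following is a minimal illustrative sketch (ours, not the paper's code) contrasting the per-sample cost of iterative refinement against a single-step student; `denoise_step` and `student` are hypothetical placeholders for the teacher update and the distilled model:

```python
import torch

# Hypothetical placeholders: in practice each call is one expensive U-Net pass.
denoise_step = lambda x, t: 0.99 * x   # stands in for one teacher refinement step
student = lambda eps: torch.tanh(eps)  # stands in for a distilled one-step model

def sample_diffusion(num_steps=1000):
    """Iterative refinement: num_steps network calls (NFEs) per sample."""
    x = torch.randn(1, 3, 64, 64)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
    return x

def sample_single_step():
    """A distilled student maps noise to a sample in a single NFE."""
    return student(torch.randn(1, 3, 64, 64))
```

With networks of comparable size, the per-sample wall-clock time scales with the NFE count, which is the source of the gap noted above.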
In this work, we focus on learning efficient single-step models that require only one neural function evaluation (NFE). However, previous methods, such as Luhman & Luhman (2021), require executing the full teacher sampling procedure to generate a synthetic target for every student update, which is impractical for distilling large diffusion models like Stable Diffusion (SD, Rombach et al., 2021). Recently, several techniques have been proposed to avoid this sampling using the concept of bootstrapping. For example, Salimans & Ho (2022) gradually reduce the number of inference steps based on the previous stage's student, while Song et al. (2023) and Berthelot et al. (2023) train single-step denoisers by enforcing self-consistency between adjacent student outputs along the same diffusion trajectory. However, these approaches rely on the availability of real data to simulate the intermediate diffusion states as input, which limits their applicability in scenarios where the desired real data is not accessible.

In this paper, we propose BOOT, a data-free knowledge distillation method for denoising diffusion models with single-step inference via bootstrapping. Our inspiration for BOOT partially draws from the insight, presented by the consistency model (CM, Song et al., 2023), that all points on the same diffusion trajectory, a.k.a. the PF-ODE (Song et al., 2020), have a deterministic mapping between each other. We identify two advantages of the proposed method: (i) similar to CM, BOOT enjoys efficient single-step inference, which dramatically facilitates model deployment in scenarios demanding low resource usage or latency; (ii) different from CM, which seeks self-consistency from any $x_t$ to $x_0$ and is thus data-dependent, BOOT predicts all possible $x_t$ given the same noise point $\epsilon$ and a time indicator $t$. Consequently, the BOOT student $g_\theta$ always reads pure Gaussian noise, making it data-free. Moreover, learning all $x_t$ from the same $\epsilon$ enables bootstrapping: it is easier to predict $x_t$ if the model has already learned to generate $x_{t'}$ for $t' > t$. However, formulating bootstrapping in this way presents additional non-trivial challenges, such as noisy sample prediction. To address this, we learn the student model from a novel Signal-ODE derived from the original PF-ODE. We also design objectives and boundary conditions to enhance the sampling quality and diversity. This enables efficient inference of large diffusion models in scenarios where the original training corpus is inaccessible due to privacy or other concerns. For example, we can obtain an efficient model for synthesizing images of a "raccoon astronaut" by distilling a text-to-image model with the corresponding prompts (shown in Figure 2), even though collecting such real data is difficult.

In the experiments, we first demonstrate the efficacy of BOOT on various challenging image generation benchmarks, including unconditional and class-conditional settings. Next, we show that the proposed method can easily be adopted to distill text-to-image diffusion models. An illustration of sampled images from our distilled text-to-image model is shown in Figure 1.

## 2. Preliminaries

### 2.1. Diffusion Models

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) belong to a class of deep generative models that generate data by progressively removing noise from the initial input. In this work, we focus on continuous-time diffusion models (Song et al., 2020; Kingma et al., 2021; Karras et al., 2022) in the variance-preserving formulation (Salimans & Ho, 2022). Given a data point $x \in \mathbb{R}^N$, we model a series of time-dependent latent variables $\{x_t \mid t \in [0, T], x_0 = x\}$ based on a given noise schedule $\{\alpha_t, \sigma_t\}$:

$$q(x_t|x_s) = \mathcal{N}(x_t; \alpha_{t|s} x_s, \sigma_{t|s}^2 I), \qquad q(x_t|x) = \mathcal{N}(x_t; \alpha_t x, \sigma_t^2 I),$$

where $\alpha_{t|s} = \alpha_t/\alpha_s$ and $\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$ for $s < t$. By default, the signal-to-noise ratio (SNR, $\alpha_t^2/\sigma_t^2$) decreases monotonically with $t$.
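As a concrete reference for the notation above, here is a minimal sketch of drawing $x_t \sim q(x_t|x)$, assuming a cosine variance-preserving schedule (this particular schedule is our illustrative choice, not specified by the paper):

```python
import math
import torch

def vp_schedule(t):
    """An assumed variance-preserving schedule on t in [0, 1]: alpha_t^2 + sigma_t^2 = 1,
    so the SNR alpha_t^2 / sigma_t^2 decreases monotonically with t."""
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x) = N(x_t; alpha_t * x, sigma_t^2 * I)."""
    alpha_t, sigma_t = vp_schedule(t)
    return alpha_t * x0 + sigma_t * torch.randn_like(x0)

x0 = torch.randn(4, 3, 32, 32)        # stand-in for a data batch
xt = q_sample(x0, torch.tensor(0.5))  # noisy latent at t = 0.5
```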
A diffusion model $f_\phi$ learns to reverse the diffusion process by denoising $x_t$. After training, one can use ancestral sampling (Ho et al., 2020) to synthesize new data from the learned model. While the conventional method is stochastic, DDIM (Song et al., 2021) demonstrates that one can follow a deterministic sampler to generate the final sample $x_0$, which follows the update rule:

$$x_s = \frac{\sigma_s}{\sigma_t} x_t + \left(\alpha_s - \alpha_t \frac{\sigma_s}{\sigma_t}\right) f_\phi(x_t, t), \quad s < t, \tag{1}$$

with the boundary condition $x_T = \epsilon \sim \mathcal{N}(0, I)$. As noted in Lu et al. (2022), Eq. (1) is equivalent to a first-order ODE solver for the underlying probability-flow (PF) ODE (Song et al., 2020). Therefore, the step size $\delta = t - s$ needs to be small to mitigate error accumulation. Additionally, higher-order solvers such as Runge-Kutta (Süli & Mayers, 2003), Heun's method (Ascher & Petzold, 1998), and others (Lu et al., 2022; Jolicoeur-Martineau et al., 2021) can further reduce the number of function evaluations (NFEs); they are, however, not applicable in the single-step setting.

### 2.2. Knowledge Distillation

Orthogonal to the development of ODE solvers, distillation-based techniques have been proposed to learn faster student models from a pre-trained diffusion teacher. The most straightforward approach is direct distillation (Luhman & Luhman, 2021), where a student model $g_\theta$ is trained to learn from the output of the diffusion model, which is itself computationally expensive to produce:

$$\mathcal{L}^{\text{Direct}}_\theta = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left\| g_\theta(\epsilon) - \text{ODE-Solver}(f_\phi, \epsilon, T \to 0) \right\|_2^2. \tag{2}$$

Here, ODE-Solver refers to any solver, such as the DDIM sampler mentioned above. While this naive approach shows promising results, it typically requires over 50 steps of evaluation to obtain reasonable distillation targets, which becomes a bottleneck when learning large-scale models.

Alternatively, recent studies (Salimans & Ho, 2022; Song et al., 2023; Berthelot et al., 2023) have proposed methods that avoid running the full diffusion path during distillation. For instance, the consistency model (CM, Song et al., 2023) trains a time-conditioned student model $g_\theta(x_t, t)$ to predict self-consistent outputs along the diffusion trajectory:

$$\mathcal{L}^{\text{CM}}_\theta = \mathbb{E}_{x_t \sim q(x_t|x),\; s, t \in [0, T],\; s < t} \left\| g_\theta(x_t, t) - g_{\theta^-}(x_s, s) \right\|_2^2, \tag{3}$$

where $x_s = \text{ODE-Solver}(f_\phi, x_t, t \to s)$ is obtained with a teacher solver step, and $\theta^-$ denotes a slowly updated (e.g., exponential moving average) copy of the student parameters.
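To make the two objectives above concrete, here is a hedged sketch (ours, not the authors' implementation) of the DDIM update of Eq. (1) and the direct distillation loss of Eq. (2); `f_phi` and `g_theta` are placeholders for the teacher and student networks, and `schedule` can be the `vp_schedule` from the Section 2.1 sketch:

```python
import torch

def ddim_step(f_phi, x_t, t, s, schedule):
    """One deterministic DDIM update (Eq. 1), for s < t:
    x_s = (sigma_s/sigma_t) x_t + (alpha_s - alpha_t sigma_s/sigma_t) f_phi(x_t, t)."""
    a_t, sig_t = schedule(t)
    a_s, sig_s = schedule(s)
    return (sig_s / sig_t) * x_t + (a_s - a_t * sig_s / sig_t) * f_phi(x_t, t)

def ode_solve(f_phi, eps, schedule, num_steps=50):
    """Follow the full trajectory T -> 0; costs num_steps teacher NFEs per target."""
    x = eps
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        x = ddim_step(f_phi, x, t, s, schedule)
    return x

def direct_distill_loss(g_theta, f_phi, schedule, batch=8, shape=(3, 32, 32)):
    """Eq. (2): regress the one-step student onto the fully solved teacher output."""
    eps = torch.randn(batch, *shape)
    with torch.no_grad():
        target = ode_solve(f_phi, eps, schedule)  # the expensive synthetic target
    return ((g_theta(eps) - target) ** 2).mean()
```

The `ode_solve` call inside the loss is exactly the bottleneck discussed above: every student update pays for a full teacher trajectory.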
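The consistency objective of Eq. (3), by contrast, replaces the full trajectory with a single teacher step, but it must construct $x_t \sim q(x_t|x)$ from real data $x$. The sketch below (again ours, reusing `ddim_step` and a `schedule` from the previous sketches, with the EMA update of $\theta^-$ omitted) makes this data-dependence explicit:

```python
import torch

def cm_loss(g_theta, g_theta_ema, f_phi, x0, schedule, delta=0.05):
    """Eq. (3): self-consistency between adjacent points on one trajectory.
    x0 is a *real* data batch -- the data-dependence that BOOT removes by
    feeding the student pure Gaussian noise instead of x_t."""
    b = x0.shape[0]
    t = torch.rand(b, 1, 1, 1) * (1.0 - 2 * delta) + 2 * delta  # t in (2*delta, 1]
    s = t - delta                                               # adjacent step, s < t
    a_t, sig_t = schedule(t)
    x_t = a_t * x0 + sig_t * torch.randn_like(x0)               # x_t ~ q(x_t | x0)
    with torch.no_grad():
        x_s = ddim_step(f_phi, x_t, t, s, schedule)             # one teacher step t -> s
        target = g_theta_ema(x_s, s)                            # stop-gradient EMA student
    return ((g_theta(x_t, t) - target) ** 2).mean()
```

BOOT instead conditions the student on the noise $\epsilon$ and the time indicator $t$ directly, so no real batch `x0` is required during distillation.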