# Data-free Distillation of Diffusion Models with Bootstrapping

Jiatao Gu¹, Chen Wang², Shuangfei Zhai¹, Yizhe Zhang¹, Lingjie Liu², Joshua M. Susskind¹

¹Apple ²University of Pennsylvania. Correspondence to: Jiatao Gu.

*Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).*

*Figure 1. Samples of our distilled single-step model with prompts from DiffusionDB.*

## Abstract

Diffusion models have demonstrated great potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few, without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model, or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time-step. Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for previous methods given the fact that the training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.

## 1. Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2020) have become the standard tools for generative applications, such as image (Dhariwal & Nichol, 2021; Rombach et al., 2021; Ramesh et al., 2022; Saharia et al., 2022), video (Ho et al., 2022b;a), 3D (Poole et al., 2022; Gu et al., 2023; Liu et al., 2023b; Chen et al., 2023), audio (Liu et al., 2023a), and text (Li et al., 2022; Zhang et al., 2023) generation. Diffusion models are considered more stable to train than alternative approaches like GANs (Goodfellow et al., 2014a) or VAEs (Kingma & Welling, 2013), as they don't require balancing two modules, making them less susceptible to issues like mode collapse or posterior collapse.

Despite their empirical success, standard diffusion models often have slow inference (around 50-1000× slower than single-step models like GANs), which poses challenges for deployment on consumer devices. This is mainly because diffusion models use an iterative refinement process to generate samples. To address this issue, previous studies have proposed using knowledge distillation to improve the inference speed (Hinton et al., 2015). The idea is to train a faster student model that can replicate the output of a pre-trained diffusion model.
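To make the latency gap concrete, the following is a minimal illustrative sketch (ours, not the paper's code) contrasting the per-sample cost of iterative refinement against a single-step student; `denoise_step` and `student` are hypothetical placeholders for the teacher update and the distilled model:

```python
import torch

# Hypothetical placeholders: in practice each call is one expensive U-Net pass.
denoise_step = lambda x, t: 0.99 * x   # stands in for one teacher refinement step
student = lambda eps: torch.tanh(eps)  # stands in for a distilled one-step model

def sample_diffusion(num_steps=1000):
    """Iterative refinement: num_steps network calls (NFEs) per sample."""
    x = torch.randn(1, 3, 64, 64)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
    return x

def sample_single_step():
    """A distilled student maps noise to a sample in a single NFE."""
    return student(torch.randn(1, 3, 64, 64))
```

With networks of comparable size, the per-sample wall-clock time scales with the NFE count, which is the source of the gap noted above.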
In this work, we focus on learning efficient single-step models that require only one neural function evaluation (NFE). However, previous methods, such as Luhman & Luhman (2021), require executing the full teacher sampling procedure to generate a synthetic target for every student update, which is impractical for distilling large diffusion models like Stable Diffusion (SD, Rombach et al., 2021). Recently, several techniques have been proposed to avoid this sampling using the concept of bootstrapping. For example, Salimans & Ho (2022) gradually reduce the number of inference steps based on the previous stage's student, while Song et al. (2023) and Berthelot et al. (2023) train single-step denoisers by enforcing self-consistency between adjacent student outputs along the same diffusion trajectory. However, these approaches rely on the availability of real data to simulate the intermediate diffusion states as input, which limits their applicability in scenarios where the desired real data is not accessible.

In this paper, we propose BOOT, a data-free knowledge distillation method for denoising diffusion models with single-step inference via bootstrapping. Our inspiration for BOOT partially draws from the insight, presented by the consistency model (CM, Song et al., 2023), that all points on the same diffusion trajectory, a.k.a. the PF-ODE (Song et al., 2020), have a deterministic mapping between each other. We identify two advantages of the proposed method: (i) similar to CM, BOOT enjoys efficient single-step inference, which dramatically facilitates model deployment in scenarios demanding low resource usage or latency; (ii) different from CM, which seeks self-consistency from any $x_t$ to $x_0$ and is thus data-dependent, BOOT predicts all possible $x_t$ given the same noise point $\epsilon$ and a time indicator $t$. Consequently, the BOOT student $g_\theta$ always reads pure Gaussian noise, making it data-free. Moreover, learning all $x_t$ from the same $\epsilon$ enables bootstrapping: it is easier to predict $x_t$ if the model has already learned to generate $x_{t'}$ for $t' > t$. However, formulating bootstrapping in this way presents additional non-trivial challenges, such as noisy sample prediction. To address this, we learn the student model from a novel Signal-ODE derived from the original PF-ODE. We also design objectives and boundary conditions to enhance the sampling quality and diversity. This enables efficient inference of large diffusion models in scenarios where the original training corpus is inaccessible due to privacy or other concerns. For example, we can obtain an efficient model for synthesizing images of a "raccoon astronaut" by distilling a text-to-image model with the corresponding prompts (shown in Figure 2), even though collecting such real data is difficult.

In the experiments, we first demonstrate the efficacy of BOOT on various challenging image generation benchmarks, including unconditional and class-conditional settings. Next, we show that the proposed method can easily be adopted to distill text-to-image diffusion models. An illustration of sampled images from our distilled text-to-image model is shown in Figure 1.

## 2. Preliminaries

### 2.1. Diffusion Models

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) belong to a class of deep generative models that generate data by progressively removing noise from the initial input. In this work, we focus on continuous-time diffusion models (Song et al., 2020; Kingma et al., 2021; Karras et al., 2022) in the variance-preserving formulation (Salimans & Ho, 2022). Given a data point $x \in \mathbb{R}^N$, we model a series of time-dependent latent variables $\{x_t \mid t \in [0, T], x_0 = x\}$ based on a given noise schedule $\{\alpha_t, \sigma_t\}$:

$$q(x_t|x_s) = \mathcal{N}(x_t; \alpha_{t|s} x_s, \sigma_{t|s}^2 I), \qquad q(x_t|x) = \mathcal{N}(x_t; \alpha_t x, \sigma_t^2 I),$$

where $\alpha_{t|s} = \alpha_t/\alpha_s$ and $\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2$ for $s < t$. By default, the signal-to-noise ratio (SNR, $\alpha_t^2/\sigma_t^2$) decreases monotonically with $t$.
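As a concrete reference for the notation above, here is a minimal sketch of drawing $x_t \sim q(x_t|x)$, assuming a cosine variance-preserving schedule (this particular schedule is our illustrative choice, not specified by the paper):

```python
import math
import torch

def vp_schedule(t):
    """An assumed variance-preserving schedule on t in [0, 1]: alpha_t^2 + sigma_t^2 = 1,
    so the SNR alpha_t^2 / sigma_t^2 decreases monotonically with t."""
    return torch.cos(0.5 * math.pi * t), torch.sin(0.5 * math.pi * t)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x) = N(x_t; alpha_t * x, sigma_t^2 * I)."""
    alpha_t, sigma_t = vp_schedule(t)
    return alpha_t * x0 + sigma_t * torch.randn_like(x0)

x0 = torch.randn(4, 3, 32, 32)        # stand-in for a data batch
xt = q_sample(x0, torch.tensor(0.5))  # noisy latent at t = 0.5
```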
A diffusion model $f_\phi$ learns to reverse the diffusion process by denoising $x_t$. After training, one can use ancestral sampling (Ho et al., 2020) to synthesize new data from the learned model. While the conventional method is stochastic, DDIM (Song et al., 2021) demonstrates that one can follow a deterministic sampler to generate the final sample $x_0$, which follows the update rule:

$$x_s = \frac{\sigma_s}{\sigma_t} x_t + \left(\alpha_s - \alpha_t \frac{\sigma_s}{\sigma_t}\right) f_\phi(x_t, t), \quad s < t, \tag{1}$$

with the boundary condition $x_T = \epsilon \sim \mathcal{N}(0, I)$. As noted in Lu et al. (2022), Eq. (1) is equivalent to a first-order ODE solver for the underlying probability-flow (PF) ODE (Song et al., 2020). Therefore, the step size $\delta = t - s$ needs to be small to mitigate error accumulation. Additionally, higher-order solvers such as Runge-Kutta (Süli & Mayers, 2003), Heun's method (Ascher & Petzold, 1998), and others (Lu et al., 2022; Jolicoeur-Martineau et al., 2021) can further reduce the number of function evaluations (NFEs); they are, however, not applicable in the single-step setting.

### 2.2. Knowledge Distillation

Orthogonal to the development of ODE solvers, distillation-based techniques have been proposed to learn faster student models from a pre-trained diffusion teacher. The most straightforward approach is direct distillation (Luhman & Luhman, 2021), where a student model $g_\theta$ is trained to learn from the output of the diffusion model, which is itself computationally expensive to produce:

$$\mathcal{L}^{\text{Direct}}_\theta = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left\| g_\theta(\epsilon) - \text{ODE-Solver}(f_\phi, \epsilon, T \to 0) \right\|_2^2. \tag{2}$$

Here, ODE-Solver refers to any solver, such as the DDIM sampler mentioned above. While this naive approach shows promising results, it typically requires over 50 steps of evaluation to obtain reasonable distillation targets, which becomes a bottleneck when learning large-scale models.

Alternatively, recent studies (Salimans & Ho, 2022; Song et al., 2023; Berthelot et al., 2023) have proposed methods that avoid running the full diffusion path during distillation. For instance, the consistency model (CM, Song et al., 2023) trains a time-conditioned student model $g_\theta(x_t, t)$ to predict self-consistent outputs along the diffusion trajectory:

$$\mathcal{L}^{\text{CM}}_\theta = \mathbb{E}_{x_t \sim q(x_t|x),\; s, t \in [0, T],\; s < t} \left\| g_\theta(x_t, t) - g_{\theta^-}(x_s, s) \right\|_2^2, \tag{3}$$

where $x_s = \text{ODE-Solver}(f_\phi, x_t, t \to s)$ is obtained with a teacher solver step, and $\theta^-$ denotes a slowly updated (e.g., exponential moving average) copy of the student parameters.
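To make the two objectives above concrete, here is a hedged sketch (ours, not the authors' implementation) of the DDIM update of Eq. (1) and the direct distillation loss of Eq. (2); `f_phi` and `g_theta` are placeholders for the teacher and student networks, and `schedule` can be the `vp_schedule` from the Section 2.1 sketch:

```python
import torch

def ddim_step(f_phi, x_t, t, s, schedule):
    """One deterministic DDIM update (Eq. 1), for s < t:
    x_s = (sigma_s/sigma_t) x_t + (alpha_s - alpha_t sigma_s/sigma_t) f_phi(x_t, t)."""
    a_t, sig_t = schedule(t)
    a_s, sig_s = schedule(s)
    return (sig_s / sig_t) * x_t + (a_s - a_t * sig_s / sig_t) * f_phi(x_t, t)

def ode_solve(f_phi, eps, schedule, num_steps=50):
    """Follow the full trajectory T -> 0; costs num_steps teacher NFEs per target."""
    x = eps
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        x = ddim_step(f_phi, x, t, s, schedule)
    return x

def direct_distill_loss(g_theta, f_phi, schedule, batch=8, shape=(3, 32, 32)):
    """Eq. (2): regress the one-step student onto the fully solved teacher output."""
    eps = torch.randn(batch, *shape)
    with torch.no_grad():
        target = ode_solve(f_phi, eps, schedule)  # the expensive synthetic target
    return ((g_theta(eps) - target) ** 2).mean()
```

The `ode_solve` call inside the loss is exactly the bottleneck discussed above: every student update pays for a full teacher trajectory.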
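The consistency objective of Eq. (3), by contrast, replaces the full trajectory with a single teacher step, but it must construct $x_t \sim q(x_t|x)$ from real data $x$. The sketch below (again ours, reusing `ddim_step` and a `schedule` from the previous sketches, with the EMA update of $\theta^-$ omitted) makes this data-dependence explicit:

```python
import torch

def cm_loss(g_theta, g_theta_ema, f_phi, x0, schedule, delta=0.05):
    """Eq. (3): self-consistency between adjacent points on one trajectory.
    x0 is a *real* data batch -- the data-dependence that BOOT removes by
    feeding the student pure Gaussian noise instead of x_t."""
    b = x0.shape[0]
    t = torch.rand(b, 1, 1, 1) * (1.0 - 2 * delta) + 2 * delta  # t in (2*delta, 1]
    s = t - delta                                               # adjacent step, s < t
    a_t, sig_t = schedule(t)
    x_t = a_t * x0 + sig_t * torch.randn_like(x0)               # x_t ~ q(x_t | x0)
    with torch.no_grad():
        x_s = ddim_step(f_phi, x_t, t, s, schedule)             # one teacher step t -> s
        target = g_theta_ema(x_s, s)                            # stop-gradient EMA student
    return ((g_theta(x_t, t) - target) ** 2).mean()
```

BOOT instead conditions the student on the noise $\epsilon$ and the time indicator $t$ directly, so no real batch `x0` is required during distillation.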