
Published in Transactions on Machine Learning Research (09/2025)

Sparse-to-Sparse Training of Diffusion Models

Inês Cardoso Oliveira i.oliveira@uni.lu University of Luxembourg

Decebal Constantin Mocanu decebal.mocanu@uni.lu University of Luxembourg

Luis A. Leiva luis.leiva@uni.lu University of Luxembourg

Reviewed on OpenReview: https://openreview.net/forum?id=iRupdoPLJa

Diffusion models (DMs) are a powerful class of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity on model performance. Our experiments show that sparse DMs are able to match and often outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.

1 Introduction

Diffusion models (DMs) are a class of deep generative models that exhibit extraordinary performance in producing diverse and high-quality data. DMs currently dominate the generative field in computer vision, having been applied to a wide range of tasks such as (un)conditional image generation (Ho et al., 2020b; Rombach et al., 2021; Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021; Nichol et al., 2022; Blattmann et al., 2022; Das et al., 2023), image super-resolution (Saharia et al., 2021; Chung et al., 2022), and image inpainting (Nichol et al., 2022; Chung et al., 2022; Saharia et al., 2022), among others. DMs have also shown incredible potential in other domains, including speech generation (Liu et al., 2023a), text generation (Li et al., 2022; Gong et al., 2023), and time-series prediction and imputation (Rasul et al., 2021; Tashiro et al., 2021).

Despite these advantages, DMs are notorious for their slow training, demanding significant computational resources and resulting in a considerable carbon footprint (Strubell et al., 2020). Due to the extensive number of diffusion timesteps required to produce a single sample (e.g., Rombach et al. (2021) mentioned up to 500 steps), DMs also suffer from slow sampling speed (Song et al., 2021). Even though progress has been made in improving inference speed, DMs are still considerably slower than other generative approaches such as GANs and VAEs (Rombach et al., 2021). This inefficiency impacts not only end users, but also the research community, by hindering further developments due to the lengthy process of model training and evaluation.

Reducing the computational costs and memory requirements of DMs is a critical challenge for the broad implementation and adoption of these models, and an active field of research. Much of the existing literature has addressed this challenge through improvements to the inference stage (Song et al., 2021; Nichol & Dhariwal, 2021; Fang et al., 2023; Shang et al., 2023; Li et al., 2023; Salimans & Ho, 2022; Meng et al., 2023). Efforts have also been made in the direction of training efficiency, exploring different architectures and training strategies (Wang et al., 2023; Ding et al., 2023; Rombach et al., 2021; Phung et al., 2022), but training DMs is still an extensive and costly process.

In the last few years, sparse-to-sparse training has emerged as a promising approach to significantly reduce the computational cost of deep learning models, by training sparse networks from scratch (Mocanu et al., 2018; Bellec et al., 2018; Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Zhang et al., 2024b). Interestingly, sparse neural networks have been shown to match, or even outperform, their Dense counterparts in classification tasks (Mocanu et al., 2018; Liu et al., 2021a), generative modeling using GANs (Liu et al., 2023b), and Reinforcement Learning (Sokar et al., 2022), all while requiring less memory and reducing the number of floating-point operations (FLOPs). We should note that, currently, most sparse neural networks require roughly the same amount of time to train as their dense counterparts, since today's hardware is optimized for dense matrix operations. However, growing interest in sparse models is reshaping the landscape; see Appendix A for a discussion in this regard.

We propose to lower the computational cost of DMs by incorporating, for the first time, the paradigm of sparse-to-sparse training for unconditional generation. As such, we introduce three different methods, Static-DM (static strategy), RigL-DM, and MagRan-DM (both dynamic strategies), that can be easily integrated with existing DMs. Since our goal is to study the effect of these techniques on the performance of DMs, we experiment using two state-of-the-art DMs in two domains: Latent Diffusion (Rombach et al., 2021) for image generation (continuous, pixel-level data) and ChiroDiff (Das et al., 2023) for sketch generation (discrete, spatiotemporal sequence data). In sum, we make the following contributions:

- We introduce sparse-to-sparse training to unconditional DMs, with both static and dynamic strategies. We consider various sparsity levels, two state-of-the-art models (Latent Diffusion and ChiroDiff), and six datasets in total. We also perform some experiments using conditional DMs.

- Our experiments show great promise of sparse-to-sparse training for DMs, as we were able to train a sparse DM for each model/dataset case with comparable performance to its respective Dense counterpart, while significantly reducing the parameter count and FLOPs. In most cases, at least one sparse DM outperformed its Dense version.

- We identify safe and effective values to perform sparse-to-sparse training of DMs. Higher performance is achieved using dynamic sparse training with 25–50% sparsity levels. For models with a higher sparsity ratio, a conservative prune and regrowth ratio of 0.05 provides better results.

2 Background and Related Work

2.1 Diffusion Models

DMs (Sohl-Dickstein et al., 2015; Ho et al., 2020a; Song et al., 2020) are probabilistic models designed to learn a data distribution $q(x)$ through two processes: a forward noising process and a reverse denoising process. The forward process is defined as a Markov chain of length $T$ in which Gaussian noise is added at each timestep $t$, producing a sequence of increasingly noisy samples:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{1}$$

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{2}$$

where $x_0$ is the original data point, $x_t$ is the data point at timestep $t$, and $\beta_t$ is the pre-defined amount of noise added at timestep $t$.

The reverse denoising process $q(x_{t-1} \mid x_t)$ attempts to recover the original data, but it is intractable, as it depends on the entire data distribution $q(x)$. As such, we need to parameterize a neural network $p_\theta$ to approximate it. This network $p_\theta$ can be optimized by training with the simplified objective $L = \mathbb{E}_{t \sim [1,T],\, x_0,\, \epsilon \sim \mathcal{N}(0,1)} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right]$, where $x_t$ is a noisy version of the input $x_0$ at timestep $t$, and $\epsilon_\theta$ is the prediction of the neural network $p_\theta$.
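As a rough illustration of the forward process and the simplified objective, the following NumPy sketch applies the per-step Gaussian noising repeatedly and evaluates the squared error for a placeholder noise predictor. The linear noise schedule, the toy data, and the names `forward_step` and `simplified_loss` are assumptions made for illustration only; they are not taken from the models studied in this paper.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) (one forward step)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def simplified_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise, averaged over the batch."""
    return np.mean(np.sum((eps - eps_pred) ** 2, axis=-1))

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
x0 = np.full((512, 8), 3.0)          # toy "data": a batch far from N(0, I)

# Run the Markov chain from x_0 to x_T, one noising step per timestep.
x = x0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
x_T = x

# Placeholder predictor that always outputs zeros (stands in for eps_theta).
eps = rng.standard_normal(x0.shape)
loss = simplified_loss(eps, np.zeros_like(eps))
```

After all `T` steps the toy batch should be close to standard Gaussian noise, which is what makes the reverse process a generative model once the noise predictor is trained.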

2.2 Efficiency in Diffusion Models

Increasing the efficiency of DMs has been primarily addressed through accelerating the sampling process, by reducing the number of diffusion steps through faster sampling (Song et al., 2021; Karras et al., 2022) and model distillation (Salimans & Ho, 2022; Meng et al., 2023; Yin et al., 2024). As for training acceleration, some works have proposed shifting the diffusion process to the latent space (Rombach et al., 2021; Vahdat et al., 2021). Interestingly, Phung et al. (2022) used discrete wavelet transforms to decompose images into sub-bands, employing these sub-bands to perform the diffusion more efficiently.

Previous studies have also presented refinements to the training process of DMs. For example, Wang et al. (2023) introduced a plug-and-play training strategy that utilizes patches instead of the full images, to improve training speed. Hang et al. (2024) proposed treating DMs as a multitask learning problem and introduced a weighting strategy to balance the different timesteps, achieving a significant improvement in training convergence speed.

From the perspective of network compression, prior works have explored techniques such as structural pruning (Fang et al., 2023), post-training quantization (Shang et al., 2023; Li et al., 2023), knowledge distillation (Yang et al., 2023), and the lottery ticket hypothesis (Frankle & Carbin, 2019; Jiang et al., 2023). Very recently, Wang et al. (2024) proposed the incorporation of sparse masks into pre-trained DMs before fine-tuning, and achieved a 50% reduction in multiply-accumulate operations (MACs) with only a slight average decrease of image quality (as measured by the FID score). Although these techniques work in increasing efficiency, they still require pre-training of full DMs. Our work proposes training sparse DMs from scratch, which has the potential to both accelerate training and inference, and reduce the memory footprint.

2.3 Sparse-to-Sparse Training

Nowadays, most computational models are so-called Dense networks, comprising a stack of layers containing multiple neurons, each connected to all neurons in the following layer. Sparse-to-sparse training techniques aim to train sparse neural networks from scratch, thus reducing the number of parameters and computations. If we define the connectivity graph of a Dense neural network as $G(V, E)$, where $V$ represents the set of neurons (vertices) and $E$ the set of connections between them (edges), a sparse version of that neural network would be defined as $G(V', E')$, with $V' \subseteq V$ and $E' \subseteq E$ being subsets of the neurons and connections of the Dense network. Sparse networks can be obtained using structured methods, where $V' \neq V$, and unstructured methods, where $V' = V$. Overall, sparse-to-sparse training techniques can be divided into static sparse training (SST) and dynamic sparse training (DST).

Static Sparse Training. In SST methods, the connectivity pattern between neurons is set at initialization and remains fixed during training. This concept was first introduced by Mocanu et al. (2016), who proposed a non-uniform scale-free topology for Restricted Boltzmann Machines, with the sparse models achieving better results than their Dense counterparts. Later, Liu et al. (2022) investigated the efficacy of random pruning at initialization, and found that, using appropriate layer-wise sparsity ratios, a randomly pruned subnetwork of WideResNet-50 can outperform a dense WideResNet-50 on ImageNet. Many other criteria have been proposed to set layer-wise sparsity ratios before training, by trying to identify important connections using information such as connection sensitivity, as in SNIP (Lee et al., 2019), or gradient flow, as in GraSP (Wang et al., 2020). Very recently, two new initialization criteria have been proposed that utilize concepts from network science theory: Bipartite Scale-Free and Bipartite Small-World (Zhang et al., 2024a;b).

Dynamic Sparse Training. In DST methods, the network is initialized with a connectivity pattern and dynamically explores different connections throughout training (Mocanu et al., 2018; Bellec et al., 2018). This was first proposed by Mocanu et al. (2018) through Sparse Evolutionary Training (SET), an algorithm that adjusts the connections using a prune-and-grow scheme every N training steps. In SET, weights are dropped based on their magnitude (ensuring an equal amount of positive and negative weights) and regrown randomly. RigL (Evci et al., 2020) proposes an alternative method that prunes the weights based on the absolute magnitude, and regrows them based on the gradients by calculating the dense gradients only at the update step. Although further pruning methods have been proposed (Lee et al., 2019; Yuan et al., 2021), a study by Nowak et al. (2023) found only minor differences between the tested criteria. The contrast was higher in lower density patterns, with magnitude pruning giving the best performance. Other growing criteria have been proposed based on randomness (Mostafa & Wang, 2019) and momentum (Dettmers & Zettlemoyer, 2019).

Recently, Zhang et al. (2024b) proposed Epitopological Sparse Meta-deep Learning (ESML), a brain-inspired, gradient-free method, which aims to shift the focus from the weights to the network topology, and uses concepts from network theory. By leveraging ESML, the authors train a sparse network that, using just 1% of the connections, is able to surpass dense networks, as well as other DST methods, in several image classification tasks.

DST has also been applied to the field of generative modelling: Liu et al. (2023b) proposed STU-GAN, which comprises a generator with high sparsity and a denser discriminator. STU-GAN was able to outperform a dense BigGAN on CIFAR-10 with an 80% sparse generator and a 70% sparse discriminator.

3 Methodology

Our study aims to understand the effect of sparse-to-sparse training techniques on DMs. We focus on unstructured sparsity due to its ability to maintain high performance even at very high levels of sparsity (Evci et al., 2020). Thus, our experiments cannot rely on current hardware to accelerate sparse computations; for example, NVIDIA A100 and Ampere cards only support 2:4 structured sparsity, which requires enforcing a fixed sparsity level of 50%. In the following sections, we present three methods of introducing sparsity in DMs: one SST technique, Static-DM, and two DST techniques, MagRan-DM and RigL-DM.
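To make the hardware constraint mentioned above concrete, the sketch below builds a 2:4 mask, i.e., a pattern with exactly two non-zero entries in every group of four consecutive weights. Selecting the two largest-magnitude entries per group, as well as the name `two_four_mask`, are assumptions chosen for illustration; the code does not call any NVIDIA library and is not part of our method.

```python
import numpy as np

def two_four_mask(weights):
    """Keep the 2 largest-magnitude entries in every contiguous group of 4."""
    flat = weights.reshape(-1, 4)                    # size must be divisible by 4
    keep = np.argsort(-np.abs(flat), axis=1)[:, :2]  # top-2 positions per group
    mask = np.zeros(flat.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
mask_24 = two_four_mask(w)
sparsity_24 = 1.0 - mask_24.mean()   # fixed by the pattern, independent of w
```

Whatever the weight values are, the resulting sparsity is pinned by the pattern itself, which is why such hardware support cannot be used for the range of unstructured sparsity levels explored here.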

3.1 Static Sparse Training: Static-DM

Static-DM is a sparse DM trained from scratch, with fixed connectivity between neurons. The pseudocode for Static-DM is shown in Algorithm 1. The training process closely resembles that of a dense DM, with the addition of a sparse initialization step. In this step, the graph underlying the neural network is sparsified by setting a fraction of the neuron connections to zero.

Algorithm 1 Static-DM

    Input: Dataset D, Network f_θ, Number of Epochs N, Diffusion steps T_d, Sparsity ratio S
    θ ← sparse initialization using S
    for i = 1 to N do
        x_0 ∼ D
        t ∼ U({1, 2, ..., T_d})
        ϵ ∼ N(0, I)
        θ_i = AdamW(∇θ, L_DIF(f_θ(x_0, t), ϵ))
    end for
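The idea of a fixed connectivity pattern can be sketched on a toy problem as follows. This is only an illustration under simplifying assumptions: a single weight matrix stands in for the network, a least-squares loss replaces the diffusion loss, plain gradient descent replaces AdamW, and the mask is drawn uniformly at random rather than with the layer-wise allocation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 0.5                                   # sparsity ratio
W = 0.1 * rng.standard_normal((16, 16))   # toy weight matrix standing in for theta
mask = rng.random(W.shape) >= S           # connectivity fixed at initialization
W = W * mask

X = rng.standard_normal((256, 16))        # toy inputs
Y = X @ rng.standard_normal((16, 16))     # toy regression targets

losses = []
for step in range(200):
    err = X @ W - Y
    losses.append(np.mean(np.sum(err ** 2, axis=1)))
    grad = 2.0 * X.T @ err / len(X)       # gradient of the loss above w.r.t. W
    W -= 0.01 * grad * mask               # masked step: pruned weights stay zero
```

Because the gradient is multiplied by the same mask at every step, the set of non-zero connections never changes during training, which is the defining property of static sparse training.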

Following the findings of Liu et al. (2022), we randomly prune the connections at initialization using the Erdős–Rényi (ER) strategy (Mocanu et al., 2018) to allocate the non-zero weights to non-convolutional layers. With this strategy, larger layers get assigned higher sparsity than smaller layers. The sparsity of each layer scales with $s^{l} \propto 1 - \frac{n^{l} + n^{l-1}}{n^{l}\, n^{l-1}}$, where $n^{l}$ and $n^{l-1}$ represent the number of neurons in layers $l$ and $l-1$, respectively.

For convolutional layers, we use a modification of ER, ERK (Evci et al., 2020), which takes into account the size of the kernels: $s^{l} \propto 1 - \frac{n^{l} + n^{l-1} + w^{l} + h^{l}}{n^{l}\, n^{l-1}\, w^{l}\, h^{l}}$, where $n^{l}$ and $n^{l-1}$ represent the number of neurons in layers $l$ and $l-1$, respectively, and $w^{l}$ and $h^{l}$ the width and height of the corresponding convolutional kernel.
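One possible way to turn the ER and ERK scaling rules into concrete per-layer densities is sketched below. The rescaling constant `eps`, the rule that layers whose density would exceed 1 are kept fully dense, and the function name `erk_densities` are assumptions made for this illustration (they follow a common implementation pattern for ERK, not necessarily the exact code used in our experiments). The function expects a global sparsity strictly between 0 and 1.

```python
import numpy as np

def erk_densities(shapes, sparsity):
    """Per-layer densities for a global sparsity target (0 < sparsity < 1).

    shapes: weight shapes, (n_out, n_in) for linear layers and
            (n_out, n_in, h, w) for convolutional layers.
    """
    sizes = np.array([np.prod(s) for s in shapes], dtype=float)
    # Density score (sum of dims) / (product of dims), i.e. 1 minus the sparsity score.
    scores = np.array([np.sum(s) / np.prod(s) for s in shapes], dtype=float)
    target = (1.0 - sparsity) * sizes.sum()       # total non-zero weights allowed
    dense = np.zeros(len(shapes), dtype=bool)
    while True:
        # Budget left after the layers that are forced to stay fully dense.
        budget = target - sizes[dense].sum()
        eps = budget / np.sum(scores[~dense] * sizes[~dense])
        densities = np.where(dense, 1.0, eps * scores)
        over = densities > 1.0
        if not over.any():
            return densities
        dense |= over                             # cap these layers and rescale

shapes = [(64, 32, 3, 3), (256, 784), (10, 256)]  # toy conv, hidden and output layers
densities = erk_densities(shapes, sparsity=0.9)
```

In this toy example the small output layer ends up fully dense while the large hidden layer receives the lowest density, matching the statement that larger layers are assigned higher sparsity.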


3.2 Dynamic Sparse Training: Mag Ran-DM and Rig L-DM

The key aspect of DST algorithms lies in the process of pruning and regrowing weights. We opted to test the two most common regrowth methods, random growth and gradient growth, combined with the magnitude pruning criterion. Magnitude pruning is a simple criterion that has been shown to perform well in high sparsity regimes for supervised classification, as well as in other generative models (Nowak et al., 2023; Liu et al., 2023b).

RigL, proposed by Evci et al. (2020), combines gradient growth and magnitude pruning, hence the name of our model, RigL-DM. The combination of random growth and magnitude pruning closely resembles the SET algorithm (Mocanu et al., 2018), and has been studied before for other types of models (Nowak et al., 2023), although it has never been named. For simplicity, we refer to this method as MagRan-DM.

Algorithm 2 RigL-DM and MagRan-DM

    Input: Dataset D, Network f_θ, Number of Epochs N, Diffusion steps T_d, Sparsity ratio S, Exploration frequency T_e, Pruning rate p, Sparse method method
    θ ← sparse initialization using S
    for i = 1 to N do
        x_0 ∼ D
        t ∼ U({1, 2, ..., T_d})
        ϵ ∼ N(0, I)
        θ_i = AdamW(∇θ, L_DIF(f_θ(x_0, t), ϵ))
        if i mod T_e = 0 then
            θ_i^p = TopMag(|θ_i|, 1 − p)          // Magnitude pruning
            if method is RigL-DM then
                θ_i^g = TopGrad(|∇θ L_DIF|, p)    // Gradient growth
            else if method is MagRan-DM then
                θ_i^g = Random(p)                 // Random growth
            end if
            θ_i ← update activated weights using θ_i^g and θ_i^p
        end if
    end for

The full pseudocode for the training process of MagRan-DM and RigL-DM can be found in Algorithm 2. At the start of the training process, the network is sparsely initialized using the same strategy as described for Static-DM. After every Te training iterations, a cycle of connection pruning and growth is performed. First, we drop (i.e., set to zero) a fraction p of the activated weights with the lowest magnitude from the network, determined using TopMag(|θ_i|, 1 − p), which returns the indices of the top 1 − p of weights by magnitude. After pruning, we regrow new weights in the same proportion in order to maintain the sparsity level. For RigL-DM, the connections to regrow are given by TopGrad(|∇θ L_DIF|, p), which returns the indices of the top p of weights with the highest-magnitude gradients. For MagRan-DM, the regrowth is determined by Random(p), which outputs the indices of a random fraction p of connections.
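The prune-and-regrow cycle described above can be sketched for a single flat parameter vector as follows. This is a minimal illustration, not our training code: the function name `prune_and_grow`, the choice of regrowth candidates (every position that is inactive after pruning), and the zero initialization of regrown weights are assumptions made for the example.

```python
import numpy as np

def prune_and_grow(weights, mask, grads, p, method, rng):
    """One prune-and-regrow update on flat arrays; returns (weights, mask)."""
    active = np.flatnonzero(mask)
    k = int(p * active.size)                      # number of connections to swap
    # Magnitude pruning: drop the k active weights with the smallest |w|.
    drop = active[np.argsort(np.abs(weights[active]))[:k]]
    new_mask = mask.copy()
    new_mask[drop] = False
    # Assumption: every currently inactive position is a regrowth candidate.
    candidates = np.flatnonzero(~new_mask)
    if method == "rigl":                          # gradient growth
        order = np.argsort(-np.abs(grads[candidates]))
        grow = candidates[order[:k]]
    elif method == "magran":                      # random growth
        grow = rng.choice(candidates, size=k, replace=False)
    else:
        raise ValueError(f"unknown method: {method}")
    new_mask[grow] = True
    new_weights = np.where(new_mask, weights, 0.0)
    new_weights[grow] = 0.0                       # assumption: regrown weights start at zero
    return new_weights, new_mask

rng = np.random.default_rng(0)
weights = np.arange(1.0, 101.0)                   # toy flat parameter vector
mask = np.zeros(100, dtype=bool)
mask[:50] = True                                  # 50% sparsity
grads = np.arange(100.0)                          # toy dense gradient magnitudes
w_rigl, m_rigl = prune_and_grow(weights * mask, mask, grads, 0.2, "rigl", rng)
w_rand, m_rand = prune_and_grow(weights * mask, mask, grads, 0.2, "magran", rng)
```

In both variants the number of active connections is unchanged after the update, so the overall sparsity level is preserved; only the way the replacement connections are chosen differs.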

3.3 Experimental Setup

Note that our goal is not to directly compare performance between models or datasets, but to compare the performance of Dense and sparse versions of the same models across different datasets, to gain insights into the impact of sparsity in DM training.

3.3.1 Models and Benchmarks

We test Static-DM, MagRan-DM, and RigL-DM against the Dense baseline on two different DMs, Latent Diffusion (Rombach et al., 2021) and ChiroDiff (Das et al., 2023), given their popularity in the research literature, on the task of unconditional image generation. Although image generation is the most common application and main direction of current research in DMs, we seek to offer a more extensive look, and examine DMs for different modalities, with different backbone architectures. More detailed information about the model architectures and choice of datasets can be found in Appendix B.

Figure 1: Sparsification of Latent Diffusion (1a) and ChiroDiff (1b) models. (a) Latent Diffusion: Sparsity is applied to the U-Net, leaving the autoencoder parts (E, D) fully dense. (b) ChiroDiff: Sparsity is applied throughout the whole network.

Latent Diffusion. Latent Diffusion is a DM that creates high-quality images while reducing computational requirements by training in a compressed lower-dimensional latent space. Although we focus on unconditional generation tasks, Latent Diffusion also allows for conditional generation, by using a general-purpose mechanism based on cross attention (Vaswani et al., 2017). Latent Diffusion first employs pre-trained autoencoders to obtain a latent representation of the input, and then performs the diffusion process on these representations, using a U-Net (Ronneberger et al., 2015). Performing the denoising process in the latent space allows the model to focus on the semantically relevant information in the data. We sparsify only the U-Net model, as shown in Figure 1a, and utilize off-the-shelf autoencoders provided by Rombach et al. (2021), keeping them dense. We evaluate on the LSUN-Bedrooms (Yu et al., 2015), CelebA-HQ (Karras et al., 2018) and Imagenette (Howard, 2019) datasets.

ChiroDiff. ChiroDiff is a DM specifically designed to model continuous-time chirographic data, such as sketches or handwriting, in the form of a sequence of strokes containing both spatial and temporal information. ChiroDiff can handle sequences of variable length and, as a non-autoregressive model, is able to capture holistic concepts, leading to higher quality samples. This model employs a Bidirectional GRU encoder as backbone architecture. The encoder is fed the spatial coordinates, their point-wise velocities, as well as the entire sequence as context, which provides full context of the sequence during the generation process. Sparsity is applied to the entire network, as shown in Figure 1b. We evaluate it on KanjiVG, QuickDraw (Ha & Eck, 2018), and VMNIST (Das et al., 2022). Following the original paper, we use a preprocessed version of KanjiVG.¹ For QuickDraw, we use the following categories: crab, cat, and mosquito; and all results are averaged.

3.3.2 Experimental Details

We train the models on a set of sparsity rates S ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. For DST methods, we set the exploration frequency Te = 1100 for all Latent Diffusion datasets, and Te = 800 for all ChiroDiff datasets. The weight prune and regrowth ratio was set to p = 0.5 for all main experiments. These values of Te and p were based on a small random search experiment.

¹ https://github.com/hardmaru/sketch-rnn-datasets/tree/master/kanji

Due to computing limitations, we use 12500/500 training/validation images for CelebA-HQ and 10598/2500 images for LSUN-Bedrooms. In Appendix C we conduct experiments using a selection of models with the full CelebA-HQ dataset to demonstrate that using more data does not greatly influence the results. Further, in Appendix D we perform experiments using the ImageNet-1k dataset, which contains over 1M images.

To be able to compare the performance of different methods and different sparsity levels, we train the models for a predefined number of epochs: 150 for Latent Diffusion datasets, and 600 for ChiroDiff datasets. For a complete description of training details please refer to Appendix B.3. Given the extensive number of experiments we conducted, we opted for a shorter training regime. For sampling, we use DDIM sampling (Song et al., 2021) with 100 steps for Latent Diffusion, and 50 steps for ChiroDiff, following the guidance provided in the original papers.

For completeness, we also conducted some experiments on conditional DMs, specifically on class-conditional ImageNet and class-conditional QuickDraw. See Appendix E for details on this setup.

For our experiments, we performed approximately 620 training runs of Dense, Static-DM, RigL-DM, and MagRan-DM models, using two high-performance computing (HPC) clusters equipped with NVIDIA Tesla V100 SXM2 and A100 GPUs. Each DM was trained on only one GPU. All experiments consumed around 6,900 GPU hours.

3.3.3 Evaluation Metrics

We follow common practice and calculate the FID score (Heusel et al., 2017) to assess the performance of all models. Refer to Appendix B.4 for more information on FID calculation. For evaluation completeness, we also report the Kernel Inception Distance (KID) (Bińkowski et al., 2018), with results presented in Appendix I. To evaluate the computational savings of the sparse methods, we report the network size (number of parameters) as a proxy for memory requirement, and the FLOPs, to estimate the computational cost of training and inference. We follow the method of FLOPs calculation described by Evci et al. (2020).

4 Experimental Results

We analyze the performance of Static-DM, MagRan-DM, and RigL-DM across various sparsity levels, and compare the results against the original Dense baseline. Additionally, we perform experiments regarding the training dynamics of Dense vs. sparse models. Later on, in Section 4.4, we present experiments comparing a selection of DST vs. Dense models across various diffusion timesteps. Examples of the generated samples can be found in Appendix J. Results from the class-conditional experiments are provided in Appendix E.

4.1 Latent Diffusion

The results of the studied sparse methods for Latent Diffusion are shown in Figure 2. For CelebA-HQ, 50% of the connections can be removed with minimal to no loss in image quality. With a higher sparsity level of 75%, the three methods still perform comparably to the Dense model, especially Static-DM. However, when the network is very sparse, S = 0.9, all models fail to generate high-quality data.

On LSUN-Bedrooms, a similar overall trend can be observed: performance steadily increases as the sparsity level decreases, down to 25%. Interestingly, MagRan-DM with S = 0.1 shows worse performance than the Dense model, and also a significant decrease compared to MagRan-DM with S = 0.25. While this goes against the general expectation that more sparsity leads to increasingly worse performance, our intuition is that this might be related to the balance between regularization and expressiveness of the model. When the sparsity is low, the regularization benefits are not very strong, and the model might suffer from a loss of expressiveness due to the reduction in parameters, thus obtaining worse results. As such, MagRan-DM with S = 0.25 is likely striking a better balance between these two factors. This behaviour can be observed in all three datasets, although it is less pronounced in CelebA-HQ. However, exploring this topic in depth is beyond the scope of this paper.


Figure 2: FID score comparisons between Dense, and Static-DM, MagRan-DM and RigL-DM with various sparsity levels, for Latent Diffusion, with prune and regrowth ratio p = 0.5. Values are averaged over 3 runs.

Imagenette experiments exhibit the same overall tradeoff between sparsity and performance, with the best results being found in 10% and 25% sparse models.

For all datasets, we successfully trained at least one sparse DM that outperforms the original Dense version. Table 1 presents the metrics for the best sparse models for each method. In CelebA-HQ, only RigL-DM at S = 0.25 surpasses Dense performance. In LSUN-Bedrooms, both Static-DM and MagRan-DM were able to outperform it. In Imagenette, all methods were able to achieve superior performance, albeit at different sparsity levels. We note that the variance observed in the models is similar when comparing dense and sparse versions in all cases.

Table 1: Performance and cost of training and testing of Dense and best Static-DM, RigL-DM, and MagRan-DM versions for Latent Diffusion. Values are averaged over 3 runs. The FLOPs of sparse DMs are normalized with the FLOPs of their Dense versions. Test FLOPs were calculated for one sample. Sparse models that outperform the Dense version are marked in bold. The top-performing sparse model is underlined.

| Dataset | Approach | FID ± SD (↓) | Params | Train FLOPs | Test FLOPs |
|---|---|---|---|---|---|
| CelebA-HQ | Dense | 32.74 ± 3.68 | 274.1M | 9.00e16 | 1.92e13 |
| CelebA-HQ | Static-DM, S = 0.5 | 33.19 ± 2.39 | 0.50 | 0.68 | 0.68 |
| CelebA-HQ | RigL-DM, S = 0.25 | 32.12 ± 3.10 | 0.75 | 0.91 | 0.91 |
| CelebA-HQ | MagRan-DM, S = 0.5 | 32.83 ± 1.68 | 0.50 | 0.67 | 0.67 |
| LSUN-Bedrooms | Dense | 31.09 ± 12.42 | 274.1M | 7.64e16 | 1.92e13 |
| LSUN-Bedrooms | Static-DM, S = 0.25 | 28.79 ± 12.65 | 0.75 | 0.91 | 0.91 |
| LSUN-Bedrooms | RigL-DM, S = 0.10 | 37.80 ± 13.55 | 0.90 | 0.97 | 0.97 |
| LSUN-Bedrooms | MagRan-DM, S = 0.25 | 28.20 ± 7.64 | 0.75 | 0.91 | 0.91 |
| Imagenette | Dense | 123.42 ± 4.25 | 274.1M | 6.83e16 | 1.92e13 |
| Imagenette | Static-DM, S = 0.10 | 119.92 ± 5.94 | 0.90 | 0.97 | 0.97 |
| Imagenette | RigL-DM, S = 0.10 | 121.59 ± 6.91 | 0.90 | 0.97 | 0.97 |
| Imagenette | MagRan-DM, S = 0.25 | 117.32 ± 8.52 | 0.75 | 0.91 | 0.91 |

Memory and computational savings. In Table 1, we can observe that the top-performing sparse DM on CelebA-HQ, RigL-DM with S = 0.25, outperforms the Dense model while reducing the number of parameters by 25% and the number of FLOPs by 10%. Although Static-DM S = 0.5 and MagRan-DM S = 0.5 achieve slightly inferior performance, they are able to reduce FLOPs and number of parameters more significantly, by 30% and 50%, respectively. On LSUN-Bedrooms and Imagenette, the top-performing sparse DM reduces the number of FLOPs by 10% and the number of parameters by 25%.
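As a small worked example of how the normalized values in Table 1 translate into absolute costs, the snippet below uses the Dense and RigL-DM (S = 0.25) entries reported for CelebA-HQ. The variable and function names are illustrative only; the numbers are copied from Table 1.

```python
dense_params = 274.1e6        # Dense Latent Diffusion parameters (Table 1)
dense_train_flops = 9.00e16   # Dense training FLOPs on CelebA-HQ (Table 1)

def reduction(normalized):
    """Convert a value normalized by the Dense model into a fractional reduction."""
    return 1.0 - normalized

# RigL-DM, S = 0.25: Params = 0.75 and Train FLOPs = 0.91 relative to Dense.
rigl_params = 0.75 * dense_params
rigl_train_flops = 0.91 * dense_train_flops
param_reduction = reduction(0.75)
flops_reduction = reduction(0.91)
```

The parameter reduction is exactly 25%, while the normalized FLOPs value of 0.91 corresponds to a reduction of about 9%, which is reported as roughly 10% in the text.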

Prune and regrowth rate experiments. In all datasets, Static-DM has better performance than the dynamic methods in higher sparsity setups, S > 0.5. This is interesting, as it departs from the usual patterns found in sparse-to-sparse training for supervised learning applications and even other generative models such as GANs, where DST usually outperforms SST (Mocanu et al., 2018; Liu et al., 2023b). Liu et al. (2021c) found that, in image classification tasks, DST models consistently achieve better performance than SST with appropriate parameter exploration, i.e., exploration frequency Te and prune and regrowth ratio p. To provide insights on the importance of p for DST experiments, we conducted an experiment using a prune and regrowth rate p ∈ {0.05, 0.1, 0.2, 0.3, 0.5}. The results are provided in Figure 8 in Appendix H. The best results were obtained with p = 0.05.

Following this experiment, we repeated all experiments for DST methods presented in Figure 2, using p = 0.05, and show the results in Figure 9 and Appendix H. One particularly interesting finding is that, in high sparsity regimes, such as S = 0.9 and S = 0.75, DST methods have consistently better performance when p = 0.05, even outperforming Static-DM. However, this performance advantage disappears when using the more aggressive prune and regrowth rate of p = 0.5. Please refer to Appendix H for a more in-depth analysis.

4.2 Chiro Diff

Figure 3: FID score comparisons between Dense, and Static-DM, MagRan-DM and RigL-DM with various sparsity levels, for ChiroDiff, with prune and regrowth rate p = 0.5. Values are averaged over 3 runs.

Figure 3 shows the FID scores of the studied sparse methods for ChiroDiff. For QuickDraw, we observe that both Static-DM and RigL-DM exhibit variations around the performance of the Dense model, with only a subtle tendency to deteriorate as sparsity increases. MagRan-DM consistently matches the FID of the Dense model, and is able to outperform it at 90% sparsity. These results suggest that this model is overparameterized, which would explain why it benefits significantly from sparsity, even when removing 90% of the weights. We repeated the same experiments using a smaller version of the model, the results of which are shown in Appendix F. Although most sparse models still perform comparably to the dense version, the trend of diminishing performance with increased sparsity is apparent, in contrast to the results observed with the larger model.

On KanjiVG, the impact of sparsity is more pronounced, as all three methods demonstrate a downward trend in performance as sparsity increases. Dynamic methods have consistently better performance than Static-DM, and RigL-DM exhibits top performance in all sparsity levels except for S = 0.75.

In VMNIST experiments, there is, again, a pattern of better performance as sparsity decreases. Similarly to Latent Diffusion experiments, SST has better performance in higher sparsity settings, S > 0.5. In this dataset, there is a slightly larger gap in performance between the sparse and dense models.

We successfully trained at least one sparse DM from each method that demonstrates a comparable performance to the Dense counterpart, and show the results in Table 2. RigL-DM was the top-performing method on QuickDraw, with S = 0.1, and on KanjiVG, with S = 0.25, while in VMNIST, the top method was MagRan-DM, with S = 0.10. For QuickDraw, the top sparse DM was able to outperform the Dense network.

Memory and computational savings. Table 2 shows that the top-performing sparse DM on Kanji VG achieves a reduction in the number of parameters and FLOPs of about 30%, while achieving a similar FID score. On Quick Draw, Mag Ran-DM with 90% sparsity achieves a considerable reduction of 88%, and even though it is not the top-performing sparse model, it also outperforms the Dense model. The top sparse model on VMNIST requires about 89% of the FLOPs of the Dense model, a reduction of about 11%.

Table 2: Performance and cost of training and testing of the Dense and best Static-DM, Rig L-DM, and Mag Ran-DM for Chiro Diff. Values averaged over 3 runs. The FLOPs of sparse DMs are normalized with the FLOPs of the dense versions, and test FLOPs were calculated for one sample. Sparse models that outperform the Dense version are marked in bold. The top-performing sparse model is underlined.

| Dataset | Approach | FID (↓) | SD | Params | Train FLOPs | Test FLOPs |
|---|---|---|---|---|---|---|
| Quick Draw | Dense | 29.78 | 0.59 | 736027 | 5.12e14 | 1.29e10 |
| Quick Draw | Static-DM, S = 0.25 | 29.39 | 0.24 | 0.75 | 0.75 | 0.75 |
| Quick Draw | Rig L-DM, S = 0.10 | 29.38 | 0.27 | 0.89 | 0.89 | 0.89 |
| Quick Draw | Mag Ran-DM, S = 0.90 | 29.45 | 0.39 | 0.10 | 0.10 | 0.10 |
| Kanji VG | Dense | 21.10 | 0.25 | 416859 | 1.80e13 | 7.35e9 |
| Kanji VG | Static-DM, S = 0.5 | 22.36 | 0.87 | 0.50 | 0.51 | 0.51 |
| Kanji VG | Rig L-DM, S = 0.25 | 21.14 | 0.71 | 0.70 | 0.70 | 0.70 |
| Kanji VG | Mag Ran-DM, S = 0.25 | 21.73 | 1.18 | 0.39 | 0.39 | 0.39 |
| VMNIST | Dense | 44.21 | 0.62 | 65019 | 1.69e12 | 7.11e8 |
| VMNIST | Static-DM, S = 0.25 | 47.29 | 1.96 | 0.75 | 0.74 | 0.74 |
| VMNIST | Rig L-DM, S = 0.10 | 46.81 | 1.98 | 0.90 | 0.90 | 0.89 |
| VMNIST | Mag Ran-DM, S = 0.10 | 46.00 | 1.71 | 0.90 | 0.89 | 0.89 |

Prune and regrowth rate experiments. Similar to Latent Diffusion, we repeated the DST experiments using the more conservative prune and regrowth rate of 0.05. The biggest improvement was seen on the Quick Draw dataset, where DST methods performed considerably better than with p = 0.5 (Figure 3). Akin to the Latent Diffusion results, in higher sparsity regimes DST models mostly obtain better performance when using p = 0.05. Please refer to Appendix H for a more in-depth analysis.
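To make the role of the prune and regrowth rate p concrete, the following is a minimal sketch of one topology update on a flat list of weights. It assumes magnitude-based pruning followed by random regrowth, with regrown weights initialised to zero; the function name `prune_and_regrow` and the flat-list representation are illustrative assumptions and not the implementation used in the experiments. A gradient-based criterion would replace the random choice of regrown positions.

```python
import random

def prune_and_regrow(weights, mask, p, rng):
    """One illustrative topology update: drop a fraction p of the active
    weights with the smallest magnitude, then regrow the same number of
    connections at random among positions that were inactive beforehand."""
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    k = min(int(p * len(active)), len(inactive))
    # Prune: the k active weights closest to zero are removed.
    pruned = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Regrow: k previously inactive positions, chosen uniformly at random.
    regrown = rng.sample(inactive, k)
    new_weights, new_mask = list(weights), list(mask)
    for i in pruned:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in regrown:
        new_mask[i] = 1
        new_weights[i] = 0.0  # assumption: regrown connections start at zero
    return new_weights, new_mask

rng = random.Random(0)
weights = [0.9, -0.05, 0.4, 0.0, 0.0, -0.7, 0.02, 0.0]
mask = [1, 1, 1, 0, 0, 1, 1, 0]
new_weights, new_mask = prune_and_regrow(weights, mask, p=0.5, rng=rng)
```

The number of active connections is unchanged by the update, so the sparsity level S is preserved. In this toy example, a smaller rate such as p = 0.05 exchanges no connections at all, which illustrates why lower rates correspond to a more conservative exploration of the topology.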

4.3 Training Dynamics

In addition to final FID scores, we analysed training dynamics to better understand the behaviour of sparse models during training. In order to perform this analysis, we selected the sparsity level that achieved the overall best results in previous experiments, S = 0.25, and plotted the FID scores across several epochs.

In general, sparse models follow the trend of the corresponding dense model, which suggests that they retain the stable training behaviour of dense diffusion models.

Figure 4: Comparison of training dynamics between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with S = 0.25, for Latent Diffusion.

Figure 5: Comparison of training dynamics between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with S = 0.25, for Chiro Diff.

4.4 Impact of Diffusion Steps

The number of timesteps is an important parameter in DMs: too few can lead to insufficient denoising and low-quality images, while too many increase computational cost without improving output quality. We explored the relationship between the number of timesteps and model sparsity, aiming to determine whether a very sparse model (S = 0.75) with an increased number of sampling steps can achieve performance comparable to that of a dense model with fewer sampling steps. We performed experiments using Celeb A-HQ for Latent Diffusion, and Kanji VG for Chiro Diff, the results of which are presented in Figure 6.

Figure 6: FID score comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with S = 0.75, using varied diffusion timesteps for Latent Diffusion (Celeb A-HQ), and Chiro Diff (Kanji VG). Values averaged over 3 runs.

In general, the number of sampling steps does not affect how sparse and dense versions compare when both use the same number of timesteps. More experiments are presented in Appendix G. On Kanji VG, no sparse model is able to match any version of the dense model, and varying the number of timesteps appears to have little influence on the quality of the output. On Celeb A-HQ, when comparing different numbers of timesteps, we observe that Mag Ran-DM and Static-DM with both 100 and 150 timesteps are able to outperform the Dense model using 50 timesteps. As an example, comparing Static-DM, S = 0.75 with 100 timesteps against the dense model with 50 timesteps, Static-DM in theory requires only 0.29 of the dense model's training FLOPs and 0.57 of its testing FLOPs, while creating better quality samples.
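The comparison above combines two factors: the per-step cost of the sparse network relative to the dense one, and the number of sampling steps each model uses. A rough sketch of this arithmetic follows, assuming that sampling cost scales linearly with the number of steps. The per-step ratio of 0.285 is a hypothetical value, chosen only so that the result lands near the 0.57 quoted above, and the function name is illustrative.

```python
def relative_sampling_cost(per_step_ratio, sparse_steps, dense_steps):
    """Sampling FLOPs of a sparse model relative to a dense one, assuming
    the cost grows linearly with the number of denoising steps."""
    return per_step_ratio * sparse_steps / dense_steps

# Hypothetical per-step ratio for a model with S = 0.75 (an assumption).
ratio = relative_sampling_cost(0.285, sparse_steps=100, dense_steps=50)
print(f"{ratio:.2f}")
```

Under this assumption, a sparse model stays cheaper than the dense one at sampling time as long as the ratio of sampling steps is smaller than the inverse of the per-step cost ratio.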

4.5 Limitations and Future Work

Apart from the previously mentioned computational limitations when training on Celeb A-HQ and LSUN-Bedrooms, our findings demonstrate systematic trends that prompt further investigation. Training for more epochs could provide deeper insights into the capabilities of sparse models. Additionally, there is potential in exploring other pruning strategies and other DST hyperparameters such as Te. Another interesting direction is to adjust DST hyperparameters based on the training phase, in response to changes in training dynamics. Furthermore, employing multiple sparsity masks with varying sparsity levels and dynamically changing them during training, according to the denoising timestep, is a promising line of research.

Table 3: Overview of dense vs sparse Latent Diffusion (LD) and Chiro Diff (CD) models. Experiments where the sparse model outperforms the dense version are marked in grey.

| Model | Dataset | Dense FID | Best sparse model (S, p) | Sparse FID | FLOPs (Train & Test) |
|---|---|---|---|---|---|
| LD | Celeb A-HQ | 32.74 | Rig L-DM (25%, 0.5) | 32.12 | 0.91 |
| LD | Bedrooms | 31.09 | Mag Ran-DM (10%, 0.05) | 25.12 | 0.97 |
| LD | Imagenette | 123.42 | Mag Ran-DM (25%, 0.5) | 117.32 | 0.91 |
| CD | Quick Draw | 29.78 | Rig L-DM (25%, 0.05) | 24.91 | 0.75 |
| CD | Kanji-VG | 21.10 | Mag Ran-DM (50%, 0.05) | 20.32 | 0.51 |
| CD | VMNIST | 44.21 | Mag Ran-DM (10%, 0.5) | 46.00 | 0.89 |

5 Conclusion

We have introduced sparse-to-sparse training of DMs. Our experiments show that both SST and DST methods are able to match and often outperform the dense DMs, as shown in Table 3, while reducing memory and computational costs. We highlight the importance of choosing the correct method and sparsity level, depending on the model (and even the dataset) that is being used. Taken together, our findings show the great potential of sparse-to-sparse training in improving the efficiency of both training and sampling from DMs.

Open Science: Our code and models are available at https://github.com/iclbo/sparse_to_sparse_diffusion

Acknowledgements

Research supported by Fonds National de la Recherche Luxembourg - FNR (SCRIPTOR project, grant AFR/22/17177001) and the European Innovation Council Pathfinder program (SYMBIOTIK project, grant 101071147).

The experiments presented in this paper were carried out using the HPC facilities of the University of Luxembourg (https://hpc.uni.lu) and Luxembourg's national supercomputer MeluXina. The authors gratefully acknowledge the ULHPC and LuxProvide teams for their expert support.

References

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, and Mark Kurtz. Enabling high-sparsity foundational llama models with efficient pretraining and deployment. arXiv, abs/2405.03594, 2024.

Zahra Atashgahi, Ghada Sokar, Tim van der Lee, Elena Mocanu, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders. Machine Learning, 2022.

Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. In Proc. International Conference on Learning Representations (ICLR), 2018.

Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Semi-parametric neural image synthesis. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022.

Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Selima Curci, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Truly sparse neural networks at scale. arXiv, abs/2102.01732, 2022.

Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, and Yi-Zhe Song. SketchODE: Learning neural sketch representation in continuous time. In Proc. International Conference on Learning Representations (ICLR), 2022.

Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, and Yi-Zhe Song. Chiro Diff: Modelling chirographic data with diffusion models. In Proc. International Conference on Learning Representations (ICLR), 2023.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv, abs/1907.04840, 2019.

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Zheng Ding, Mengqi Zhang, Jiajun Wu, and Zhuowen Tu. Patched denoising diffusion models for high-resolution image synthesis. arXiv, abs/2308.01316, 2023.

Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: making all tickets winners. In Proc. International Conference on Machine Learning (ICML), 2020.

Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proc. International Conference on Learning Representations (ICLR), 2019.

Songwei Ge, Vedanuj Goswami, C. Lawrence Zitnick, and Devi Parikh. Creative sketch generation. arXiv, abs/2011.10039, 2020.

Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In Proc. International Conference on Learning Representations (ICLR), 2023.

David Ha and Douglas Eck. A neural representation of sketch drawings. In Proc. International Conference on Learning Representations (ICLR), 2018.

Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. arXiv, abs/2303.09556, 2024.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv, abs/2006.11239, 2020a.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2020b.

Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from imagenet, 2019. URL https://github.com/fastai/imagenette.

Chao Jiang, Bo Hui, Bohan Liu, and Da Yan. Successfully applying lottery ticket hypothesis to diffusion model. arXiv, abs/2310.18823, 2023.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proc. International Conference on Learning Representations (ICLR), 2018.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In Proc. International Conference on Learning Representations (ICLR), 2019.

Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. Diffusion-LM improves controllable text generation. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2022.

Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

Sean Lie. Cerebras architecture deep dive: First look inside the hw/sw co-design for deep learning: Cerebras systems. In Proc. IEEE Hot Chips 34 Symposium (HCS), 2022.

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. In Proc. International Conference on Machine Learning (ICML), 2023a.

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Zahra Atashgahi, Lu Yin, Huanyu Kou, Li Shen, Mykola Pechenizkiy, Zhangyang Wang, and Decebal Constantin Mocanu. Sparse training via boosting pruning plasticity with neuroregeneration. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021a.

Shiwei Liu, Decebal Constantin Mocanu, Amarsagar Reddy Ramapuram Matavalam, Yulong Pei, and Mykola Pechenizkiy. Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware. arXiv, abs/1901.09181, 2021b.

Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, and Mykola Pechenizkiy. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. arXiv, abs/2102.02887, 2021c.

Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. In Proc. International Conference on Learning Representations (ICLR), 2022.

Shiwei Liu, Yuesong Tian, Tianlong Chen, and Li Shen. Don't be so dense: Sparse-to-sparse GAN training without sacrificing performance. International Journal of Computer Vision, 2023b.

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Decebal Mocanu, Elena Mocanu, Phuong Nguyen, Madeleine Gibescu, and Antonio Liotta. A topological insight into restricted boltzmann machines. Machine Learning, 2016.

Decebal Mocanu, Elena Mocanu, Peter Stone, Phuong Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 2018.

Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. arXiv, abs/1902.05967, 2019.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proc. International Conference on Machine Learning (ICML), 2021.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. International Conference on Machine Learning (ICML), 2022.

Aleksandra Nowak, Bram Grooten, Decebal Constantin Mocanu, and Jacek Tabor. Fantastic weights and how to find them: Where to prune in dynamic sparse training. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023.

Hao Phung, Quan Dao, and A. Tran. Wavelet diffusion models are fast and scalable image generators. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In Proc. International Conference on Machine Learning (ICML), 2021.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv, abs/2112.10752, 2021.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv, abs/2104.07636, 2021.

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In Proc. ACM SIGGRAPH, 2022.

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In Proc. International Conference on Learning Representations (ICLR), 2022.

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv, abs/1503.03585, 2015.

Ghada Sokar, Elena Mocanu, Decebal Constantin Mocanu, Mykola Pechenizkiy, and Peter Stone. Dynamic sparse training for deep reinforcement learning. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2022.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proc. International Conference on Learning Representations (ICLR), 2021.

Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv, abs/2011.13456, 2020.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for modern deep learning research. Proc. AAAI Conference on Artificial Intelligence, 2020.

Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2017.

Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In Proc. International Conference on Learning Representations (ICLR), 2020.

Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, and Jun Zhu. Sparsedm: Toward sparse efficient diffusion models. arXiv, abs/2404.10445, 2024.

Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2023.

Xingyi Yang, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Diffusion probabilistic model made slim. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv, abs/1506.03365, 2015.

Geng Yuan, Xiaolong Ma, Wei Niu, Zhengang Li, Zhenglun Kong, Ning Liu, Yifan Gong, Zheng Zhan, Chaoyang He, Qing Jin, Siyue Wang, Minghai Qin, Bin Ren, Yanzhi Wang, Sijia Liu, and Xue Lin. MEST: Accurate and fast memory-economic sparse training framework on the edge. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2021.

Yingtao Zhang, Jialin Zhao, Ziheng Liao, Wenjing Wu, Umberto Michieli, and Carlo Vittorio Cannistraci. Brain-inspired sparse training in mlp and transformers with network science modeling via cannistraci-hebb soft rule. Preprints, 2024a.

Yingtao Zhang, Jialin Zhao, Wenjing Wu, Alessandro Muscoloni, and Carlo Vittorio Cannistraci. Epitopological learning and cannistraci-hebb network shape intelligence brain-inspired theory for ultra-sparse advantage in deep learning. In Proc. International Conference on Learning Representations (ICLR), 2024b.

Aojun Zhou, Yukun Ma, Junnan Zhu, Jianbo Liu, Zhijie Zhang, Kun Yuan, Wenxiu Sun, and Hongsheng Li. Learning n:m fine-grained structured sparse neural networks from scratch. arXiv, abs/2102.04010, 2021.

A Hardware and Software Support

One of the main challenges in sparse neural network research is that most hardware optimized for deep learning is designed for dense matrix operations. As a result, most current research simulates sparsity by applying a binary mask over the weights, so that sparse networks offer, in practice, no better training efficiency than dense networks. However, industry is catching up, and it is only a matter of time before hardware truly leverages sparse operations.
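The mask-based simulation of sparsity mentioned above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the helper names `random_mask` and `masked_sgd_step` are hypothetical, the mask is drawn uniformly at random, and plain SGD stands in for whatever optimizer is actually used.

```python
import random

def random_mask(n, sparsity, rng):
    """Binary mask with round(sparsity * n) zeros at random positions."""
    n_zero = round(sparsity * n)
    zeros = set(rng.sample(range(n), n_zero))
    return [0 if i in zeros else 1 for i in range(n)]

def masked_sgd_step(weights, grads, mask, lr):
    """Dense SGD update followed by re-applying the mask, so pruned
    positions stay at zero even though they are stored and computed."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

rng = random.Random(0)
mask = random_mask(8, sparsity=0.75, rng=rng)
weights = [w * m for w, m in zip([0.5] * 8, mask)]
weights = masked_sgd_step(weights, grads=[0.1] * 8, mask=mask, lr=1.0)
```

Every position is still stored and updated before being masked, which is why this style of simulation reproduces the accuracy of a sparse network but not its theoretical savings.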

There is a growing trend towards developing hardware that better supports sparse operations. In 2021, NVIDIA released the A100 GPU, which supports accelerating operations in a 2:4 sparsity pattern. Several works have already leveraged this feature (Zhou et al., 2021; Wang et al., 2024). In order to use this capability, the sparse matrices must follow a specific structure: among each group of four contiguous values, two values must be zero, thereby fixing the sparsity level at 50%. While this structure enables significant acceleration, it supports only one static sparsity level, and makes it impossible to vary the sparsity ratio between layers.
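The 2:4 constraint described above can be illustrated with a short sketch that builds a compliant mask for one row of weights. Selecting the two largest-magnitude entries per group is an assumption made here for illustration; the hardware only requires that two of every four contiguous values be zero, not any particular selection rule. The function name `two_four_mask` is hypothetical.

```python
def two_four_mask(row):
    """Keep the two largest-magnitude values in each group of four
    contiguous entries and zero out the other two (2:4 pattern)."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        mask.extend(1 if j in keep else 0 for j in range(4))
    return mask

row = [0.3, -1.2, 0.05, 0.8, 0.0, 0.1, -0.4, 0.2]
mask = two_four_mask(row)  # [0, 1, 0, 1, 0, 0, 1, 1]
```

Because every group of four keeps exactly two values, the resulting sparsity is always 50%, regardless of the layer or of the weight distribution.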

More recently, Cerebras introduced the CS-3 AI accelerator (Lie, 2022), capable of accelerating sparse training and supporting unstructured sparsity. Using Cerebras CS-3 AI to accelerate training, and Neural Magic's inference server to accelerate inference, Agarwalla et al. (2024) trained an accurate sparse Llama-2 7B model. Its accelerated training closely matched the theoretical speedup, while achieving 91.8% accuracy recovery of Llama Evaluation metrics, with 70% sparsity. This significant finding underscores the potential of sparse training to produce more efficient neural networks in practice, not just in theory.

In parallel, there have also been advancements in creating software implementations that support truly sparse-to-sparse neural network training, mostly for supervised learning tasks (Liu et al., 2021b; Curci et al., 2022). In addition, a sparse-to-sparse denoising autoencoder has been developed by Atashgahi et al. (2022), to perform fast and robust feature selection.

These developments in both hardware and software point towards a future where sparse-to-sparse training may become the de facto approach for developing neural networks, enabling faster, more memory-efficient, and energy-efficient deep learning models.

B Experiments setup

B.1 Model Architectures

In Latent Diffusion experiments, the model architecture is the same for the LSUN-Bedrooms, Celeb A-HQ and Imagenette datasets. The DM follows the architecture proposed by Rombach et al. (2021). For the autoencoder, we utilize a pre-trained model released by the Latent Diffusion authors on the project's GitHub,² with spatial size 64×64×3, VQ-reg regularization, and downsampling factor f = 4.

In Chiro Diff experiments, we adopt the architecture proposed by Das et al. (2023). The backbone network is a bidirectional GRU encoder with 3 layers, with 96 hidden units for Kanji VG, and 128 hidden units for Quick Draw. For VMNIST, the backbone network is a 2-layer bidirectional GRU encoder with 48 hidden units. We also use the code available on the project's GitHub repository.³

B.2 Choice of Datasets

We evaluated Latent Diffusion on LSUN-Bedrooms and Celeb A-HQ, following their use in the original paper. Additionally, we included Imagenette, a subset of the popular Image Net (Deng et al., 2009) dataset. For Chiro Diff, we used the same datasets evaluated in the original study: Quick Draw, Kanji VG and VMNIST. While the authors of Chiro Diff analysed seven categories of Quick Draw, namely {cat, crab, mosquito, bus, fish, yoga, flower}, we opted to reduce the number of categories to {cat, crab, mosquito} given the large number of experiments involved in our investigation. In Appendix C below we demonstrate that dataset size does not change the main outcomes. Ultimately, our goal is to compare and contrast sparse and dense models, independent of dataset size.

² https://github.com/CompVis/latent-diffusion
³ https://github.com/dasayan05/chirodiff

B.3 Training Regime

We follow the configurations provided in the GitHub repositories of the original papers and present the main aspects below. The only alterations made were to the batch size and learning rate. The one exception is Imagenette, which was not included in the original paper; for this dataset, we applied the same configuration settings as those used for LSUN-Bedrooms.

Latent Diffusion on LSUN-Bedrooms: We use a batch size of 12, AdamW optimizer with weight decay 1e-2 and static learning rate 2.4e-5. We train for 150 epochs. We use 1000 denoising steps (T), linear noise schedule from 0.0015 to 0.0195, and sinusoidal embeddings for the timestep.

Latent Diffusion on Celeb A-HQ: We use a batch size of 12, AdamW optimizer with weight decay 1e-2 and static learning rate 2.0e-6. We train for 150 epochs. We use 1000 denoising steps (T), linear noise schedule from 0.0015 to 0.0195, and sinusoidal embeddings for the timestep.

Latent Diffusion on Imagenette: We use a batch size of 12, AdamW optimizer with weight decay 1e-2 and static learning rate 2.4e-5. We train for 150 epochs. We use 1000 denoising steps (T), linear noise schedule from 0.0015 to 0.0195, and sinusoidal embeddings for the timestep.

Chiro Diff on Quick Draw: We use a batch size of 128, AdamW optimizer with weight decay 1e-2 and static learning rate 1e-3. We train for 600 epochs. We use 1000 denoising steps (T), linear noise schedule from 1e-4 to 2e-2, and random Fourier features for the timestep embedding.

Chiro Diff on Kanji VG: We use a batch size of 128, AdamW optimizer with weight decay 1e-2 and static learning rate 1e-3. We train for 600 epochs. We use 1000 denoising steps (T), linear noise schedule from 1e-4 to 2e-2, and random Fourier features for the timestep embedding.

Chiro Diff on VMNIST: We use a batch size of 128, AdamW optimizer with weight decay 1e-2 and static learning rate 1e-3. We train for 600 epochs. We use 1000 denoising steps (T), linear noise schedule from 1e-4 to 2e-2, and random Fourier features for the timestep embedding.
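The noise schedules listed in the configurations above can be sketched as follows. This is an illustration under one assumption: the variances themselves are spaced linearly between the two endpoints. Some codebases interpolate in a transformed space instead (for example, in the square root of the variance), so the exact values used in the experiments may differ. The function names are hypothetical.

```python
def linear_beta_schedule(beta_start, beta_end, num_steps):
    """Linearly spaced noise variances beta_1..beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), the signal fraction kept."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

# Endpoints and T = 1000 as quoted in the configurations above.
ld_betas = linear_beta_schedule(0.0015, 0.0195, 1000)  # Latent Diffusion
cd_betas = linear_beta_schedule(1e-4, 2e-2, 1000)      # Chiro Diff
ld_alpha_bar = cumulative_alphas(ld_betas)
cd_alpha_bar = cumulative_alphas(cd_betas)
```

In both cases the cumulative product decays to nearly zero by the final step, meaning that the last latent is close to pure noise.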

Each setup was trained for 5 sparsity values [0.1, 0.25, 0.5, 0.75, 0.9], and we perform 3 runs for each model/dataset/sparsity combination. For Chiro Diff on Quick Draw, we trained each category {cat, crab, mosquito} for 3 runs, resulting in a total of 9 runs per sparsity level.

B.4 FID calculation

For Latent Diffusion, FID is calculated using the torch-fidelity Python package, and estimated based on 10k samples and the entire training set, as in the original work. For Chiro Diff, following the original paper, we plot and save the chirographic sequences as images, and calculate the FID using the inception model provided by Ge et al. (2020), pre-trained on the Quick Draw dataset, using 10k generated samples and 20k real samples.
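For reference, FID compares Gaussian fits to the feature embeddings of real and generated samples. Writing the feature means and covariances as $(\mu_r, \Sigma_r)$ for real samples and $(\mu_g, \Sigma_g)$ for generated samples (notation introduced here for illustration, not taken from the text above), the standard definition is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate closer feature statistics. The two setups above differ in the feature extractor and in the number of real and generated samples used to estimate these statistics, so FID values are not comparable across the two models.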

C Experiments using the Full Celeb A-HQ Dataset

We conducted experiments using Static-DM, Mag Ran-DM, and Rig L-DM with S = 0.5 on the full Celeb A-HQ dataset, for 150 epochs, and compare the results with the previous models trained on 50% of the dataset. As shown in Table 4, the FID scores are similar across both datasets for each respective method. This supports our decision to focus on a subset of the dataset for our main experiments, to save valuable computational resources. Interestingly, all sparse models are able to outperform their dense version when trained on the full dataset.

Table 4: Comparison of FID and KID scores for Latent Diffusion on Celeb A-HQ using full dataset vs. reduced dataset. Results are based on the first run. Sparse models that outperform their Dense version are marked in bold. The top-performing sparse model is underlined.

| Methods | FID (↓), full dataset | FID (↓), reduced dataset | KID (↓), full dataset | KID (↓), reduced dataset |
|---|---|---|---|---|
| Dense | 32.20 | 29.68 | 0.0259 | 0.0241 |
| Static-DM, S = 0.50 | 29.71 | 29.91 | 0.0242 | 0.0243 |
| Rig L-DM, S = 0.50 | 30.98 | 30.82 | 0.0250 | 0.0254 |
| Mag Ran-DM, S = 0.50 | 26.70 | 30.71 | 0.0208 | 0.0253 |

D Experiments using the Full Image Net-1k Dataset

In this section, we present the results for the most promising sparse DM trained with the Image Net-1k dataset, comprising 1000 classes spanning 1,281,167 training images, 50,000 validation images and 100,000 test images.

Table 5: FID and KID scores for Latent Diffusion on Image Net.

| Methods | FID (↓) | KID (↓) |
|---|---|---|
| Dense | 63.95 | 0.0538 |
| Mag Ran-DM, S = 0.50 | 77.39 | 0.0714 |

E Experiments using Conditional Models

While the main focus of this paper is investigating how sparse-to-sparse training affects unconditional models, we also performed some initial experiments on conditional models to explore its applicability in this setting. Specifically, we conducted experiments on class-conditional Image Net using Latent Diffusion, and class-conditional Quick Draw using Chiro Diff.

For these experiments, models were trained for 50 epochs on Image Net and 150 epochs on Quick Draw. For Image Net, all classes were utilized in training, and sampling was performed using η = 1.0 and a classifier-free guidance scale of 3.0. For Quick Draw, seven classes were used: cat, crab, mosquito, bus, flower, yoga and fish. Examples of generated samples can be found in Figure 21 and Figure 22 in Appendix J.

On class-conditional Image Net, the dense model outperforms the sparse version, with a moderate gap between them. On Quick Draw, the models show very similar performance, with the best results achieved by Mag Ran-DM with 50% sparsity. These results suggest that sparse-to-sparse training may also be effective for conditional models.

Table 6: FID and KID scores for Latent Diffusion on class-conditional Image Net.

| Methods | FID (↓) | KID (↓) |
|---|---|---|
| Dense | 17.33 | 0.0071 |
| Mag Ran-DM, S = 0.50 | 20.33 | 0.0088 |

Table 7: FID and KID scores for Chiro Diff on class-conditional Quick Draw. Sparse models that outperform their Dense version are marked in bold. The top-performing sparse model is underlined.

| Methods | FID (↓) | KID (↓) |
|---|---|---|
| Dense | 31.23 | 0.0327 |
| Static-DM, S = 0.50 | 31.57 | 0.0310 |
| Mag Ran-DM, S = 0.50 | 30.20 | 0.0300 |
| Rig L-DM, S = 0.50 | 32.12 | 0.0315 |

F Experiments using a smaller model

In order to explore the strong results of high sparsity models (S = 0.90) on Quick Draw using Chiro Diff, we repeated the experiments in Figure 3, using a smaller model. In this smaller model, the backbone network is a bidirectional GRU encoder consisting of 3 layers with 96 hidden units each, compared to 128 in the larger model.
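As a rough indication of how much smaller this backbone is, the sketch below counts the weights of a stacked bidirectional GRU under explicit assumptions: the parameter layout follows the PyTorch `nn.GRU` convention (two bias vectors per gate block), both models are taken to have 3 layers, and the input feature size of 2 is a hypothetical placeholder. The function name `gru_param_count` is likewise illustrative. Layers outside the GRU stack are ignored, so the totals are not the models' full parameter counts.

```python
def gru_param_count(input_size, hidden_size, num_layers, bidirectional=True):
    """Count parameters of a stacked GRU, following the PyTorch nn.GRU layout.

    Per layer and direction: input-to-hidden weights (3h x in),
    hidden-to-hidden weights (3h x h), and two bias vectors (3h each).
    """
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated outputs of all directions.
        in_size = input_size if layer == 0 else hidden_size * directions
        per_direction = 3 * hidden_size * (in_size + hidden_size) + 2 * 3 * hidden_size
        total += directions * per_direction
    return total


INPUT_SIZE = 2  # hypothetical input feature size, not taken from the paper

small = gru_param_count(INPUT_SIZE, hidden_size=96, num_layers=3)
large = gru_param_count(INPUT_SIZE, hidden_size=128, num_layers=3)
print(f"small: {small:,}  large: {large:,}  ratio: {small / large:.2f}")
```

Under these assumptions, the 96-unit stack holds a little over half as many GRU weights as the 128-unit stack, because the recurrent weight matrices grow quadratically with the hidden size.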

In contrast with the larger model, the smaller model exhibits the trend of diminishing performance as sparsity increases that is present in the other datasets. Most sparse models perform comparably to the Dense model, although some performance degradation is apparent in high-sparsity models.

Figure 7: FID and KID score comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with various sparsity levels, for the Quick Draw dataset using the smaller model, with prune and regrowth rate p = 0.5.

G Experiments using Various Diffusion Timesteps

In Table 8, we report the results of the models listed in Table 1 using 50, 100 and 200 sampling steps. These experiments confirm that the number of sampling steps typically does not affect whether a sparse model outperforms a dense model. In other words, a sparse model that performs better than a dense model at 100 timesteps also outperforms it at 50 and 200 timesteps.

For Celeb A-HQ, the variation in timesteps does not change the top-performing model, which is consistently Rig L-DM with S = 0.25. However, in LSUN-Bedrooms, the top-performing method varies with different timesteps.
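The two observations above can be checked directly against the reported numbers. The following sketch uses the FID values from Table 8; the first block of the table is taken to be Celeb A-HQ based on the accompanying text, and the helper names `beats_dense` and `best_per_step` are hypothetical.

```python
# FID values copied from Table 8 (first run; lower is better).
STEPS = (50, 100, 200)
table8 = {
    "Celeb A-HQ": {  # dataset label inferred from the surrounding text
        "Dense": (38.14, 29.68, 26.47),
        "Static-DM, S = 0.5": (38.16, 29.91, 26.44),
        "Rig L-DM, S = 0.25": (36.55, 28.00, 25.25),
        "Mag Ran-DM, S = 0.5": (39.31, 30.71, 28.02),
    },
    "LSUN-Bedrooms": {
        "Dense": (20.42, 20.14, 20.58),
        "Static-DM, S = 0.25": (20.01, 18.96, 19.26),
        "Rig L-DM, S = 0.10": (19.01, 17.96, 20.49),
        "Mag Ran-DM, S = 0.25": (20.69, 18.23, 17.87),
    },
}


def beats_dense(rows):
    """For each sparse model, flag the sampling steps at which it beats Dense."""
    dense = rows["Dense"]
    return {
        name: tuple(fid < ref for fid, ref in zip(fids, dense))
        for name, fids in rows.items()
        if name != "Dense"
    }


def best_per_step(rows):
    """Return the sparse model with the lowest FID at each sampling step."""
    sparse = {name: fids for name, fids in rows.items() if name != "Dense"}
    return tuple(
        min(sparse, key=lambda name: sparse[name][i]) for i in range(len(STEPS))
    )


for dataset, rows in table8.items():
    print(dataset)
    for name, flags in beats_dense(rows).items():
        print(f"  {name:22s} beats Dense at steps {STEPS}: {flags}")
    print(f"  best sparse model per step: {best_per_step(rows)}")
```

The flags are mostly constant across sampling steps but not always, which is consistent with the word "typically" in the text.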

Table 8: Comparison of FID scores for models listed in Table 1 using various DDIM sampling steps. Results based on the first run. Sparse models that outperform the Dense version, in the respective sampling steps, are marked in bold. The top-performing sparse model for each sampling step is underlined.

| Dataset | Method | FID (↓), 50 steps | FID (↓), 100 steps | FID (↓), 200 steps |
| --- | --- | --- | --- | --- |
| Celeb A-HQ | Dense | 38.14 | 29.68 | 26.47 |
| Celeb A-HQ | Static-DM, S = 0.5 | 38.16 | 29.91 | 26.44 |
| Celeb A-HQ | Rig L-DM, S = 0.25 | 36.55 | 28.00 | 25.25 |
| Celeb A-HQ | Mag Ran-DM, S = 0.5 | 39.31 | 30.71 | 28.02 |
| LSUN-Bedrooms | Dense | 20.42 | 20.14 | 20.58 |
| LSUN-Bedrooms | Static-DM, S = 0.25 | 20.01 | 18.96 | 19.26 |
| LSUN-Bedrooms | Rig L-DM, S = 0.10 | 19.01 | 17.96 | 20.49 |
| LSUN-Bedrooms | Mag Ran-DM, S = 0.25 | 20.69 | 18.23 | 17.87 |

H Prune and Regrowth Rate Experiments

To provide insight on the importance of the prune and regrowth rate for DST experiments, we conducted an experiment using varying values, with the top Mag Ran-DM and Rig L-DM models for the Celeb A-HQ dataset, listed in Table 1. We report the results of the experiments in Figure 8. Although all FID values are extremely similar, the best performing models for both algorithms use prune and regrowth ratio p = 0.05, and both outperform the Dense version. This suggests that selecting an optimal ratio can improve model performance, even if only slightly for the lower-sparsity models tested.

Figure 8: FID scores comparison between Dense and DST models with various prune and regrowth ratios, for Latent Diffusion on Celeb A-HQ.

Informed by the results of Figure 8, we conducted experiments for DST methods using the same setup as in Figure 2 and Figure 3, but using a prune and regrowth rate of p = 0.05. As can be observed in Figure 9 and Table 9, the general trend of diminishing performance when sparsity increases still remains, with the exception of Quick Draw, in which all DST models had a significant increase in performance.

When looking at very high sparsity regimes, S = 0.90, we observe that most models continue to suffer from a significant performance drop when compared to their Dense version, except on Quick Draw, where the new prune and regrowth rate provides a remarkable improvement, and on LSUN-Bedrooms, where Mag Ran-DM has an impressively high performance. However, when S = 0.90 (and to a lower degree S = 0.75), DST methods using p = 0.05 have consistently better performance than their p = 0.5 counterparts. When comparing the DST methods to Static-DM at S = 0.90, in Table 9, we observe that at least one DST method is able to outperform Static-DM in Celeb A-HQ and LSUN-Bedrooms, or closely match it in Imagenette, which did not happen with p = 0.5. Similarly, for Chiro Diff, in Table 10, almost all DST methods in all three datasets are able to outperform Static-DM.
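To illustrate what the prune and regrowth rate controls, the sketch below performs a single topology update on a flat list of weights: a fraction p of the active weights with the smallest magnitude is pruned, and the same number of previously inactive positions is regrown at random. This is a minimal sketch assuming magnitude-based pruning and random regrowth, in the spirit of what the name Mag Ran-DM suggests; it is not the implementation used in the experiments (Rig L, for instance, regrows connections based on gradient information). The function names, the zero initialization of regrown weights, and the layer size of 1000 are illustrative assumptions.

```python
import random


def make_sparse_layer(n, sparsity, rng):
    """Create a flat layer of n weights with a random mask at the given sparsity."""
    n_active = n - round(sparsity * n)
    mask = [False] * n
    for i in rng.sample(range(n), n_active):
        mask[i] = True
    weights = [rng.gauss(0.0, 1.0) if m else 0.0 for m in mask]
    return weights, mask


def prune_and_regrow(weights, mask, p, rng):
    """Apply one topology update in place and return the number of replaced weights.

    Prunes the fraction p of active weights with the smallest magnitude, then
    activates the same number of positions that were inactive before the update.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(int(p * len(active)), len(inactive))
    # Prune: deactivate the k active weights closest to zero.
    for i in sorted(active, key=lambda i: abs(weights[i]))[:k]:
        mask[i] = False
        weights[i] = 0.0
    # Regrow: activate k randomly chosen positions, initialized to zero (assumption).
    for i in rng.sample(inactive, k):
        mask[i] = True
        weights[i] = 0.0
    return k


rng = random.Random(0)
for p in (0.5, 0.05):
    weights, mask = make_sparse_layer(1000, 0.90, rng)
    replaced = prune_and_regrow(weights, mask, p, rng)
    print(f"p = {p}: replaced {replaced} weights, {sum(mask)} remain active")
```

In this toy setting with S = 0.90, a rate of p = 0.5 replaces half of the remaining active weights at every update, whereas p = 0.05 replaces only one in twenty, while the overall sparsity level stays unchanged in both cases.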

All in all, these findings suggest that a prune and regrowth ratio of 0.5 is too aggressive, and that a more conservative choice of 0.05 is more appropriate for DMs. Previous work has mentioned that DST methods are consistently superior to SST as long as there is appropriate parameter exploration (Liu et al., 2021c), an observation that aligns with our findings.

Figure 9: FID score comparisons between Dense and Sparse versions (Static-DM, Mag Ran-DM, Rig L-DM) considering various sparsity levels for Latent Diffusion. DST methods use a prune and regrowth rate of 0.05. Values averaged over 3 runs.

Figure 10: FID score comparisons between Dense and Sparse versions (Static-DM, Mag Ran-DM, Rig L-DM) considering various sparsity levels for Chiro Diff. DST methods use a prune and regrowth rate of 0.05. Values averaged over 3 runs.

Table 9: Comparison of FID scores for SST (Static-DM) and DST (Rig L-DM, Mag Ran-DM) models, with S = 0.9 using two different prune and regrowth rates (p = 0.5 and p = 0.05) for Latent Diffusion. DST models that outperform SST are marked in bold.

| Dataset | Static-DM | Rig L-DM, p = 0.5 | Rig L-DM, p = 0.05 | Mag Ran-DM, p = 0.5 | Mag Ran-DM, p = 0.05 |
| --- | --- | --- | --- | --- | --- |
| Celeb A-HQ | 52.48 ± 4.88 | 65.65 ± 4.32 | 46.07 ± 11.08 | 60.77 ± 6.58 | 48.39 ± 14.05 |
| Bedrooms | 46.18 ± 13.42 | 71.45 ± 18.84 | 58.64 ± 22.88 | 46.22 ± 10.11 | 33.80 ± 3.98 |
| Imagenette | 147.47 ± 7.74 | 168.48 ± 15.15 | 148.93 ± 12.03 | 167.19 ± 8.20 | 159.08 ± 14.68 |

Table 10: Comparison of FID scores for SST (Static-DM) and DST (Rig L-DM, Mag Ran-DM) models, with S = 0.9 using two different prune and regrowth rates (p = 0.5 and p = 0.05) for Chiro Diff. DST models that outperform SST are marked in bold.

| Dataset | Static-DM | Rig L-DM, p = 0.5 | Rig L-DM, p = 0.05 | Mag Ran-DM, p = 0.5 | Mag Ran-DM, p = 0.05 |
| --- | --- | --- | --- | --- | --- |
| Quick Draw | 30.25 ± 0.43 | 30.26 ± 0.63 | 28.84 ± 0.37 | 29.45 ± 0.39 | 28.60 ± 0.37 |
| Kanji VG | 30.75 ± 2.16 | 28.54 ± 0.74 | 29.12 ± 0.57 | 33.02 ± 3.28 | 29.01 ± 1.48 |
| VMNIST | 52.35 ± 0.84 | 54.08 ± 1.57 | 52.25 ± 0.20 | 53.65 ± 0.69 | 51.94 ± 1.12 |

I Additional evaluation metrics

In this section, we present the KID results corresponding to previously reported experiments, in Figure 2, Figure 3, Figure 9 and Figure 10.

Figure 11: KID comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with various sparsity levels, for Latent Diffusion, with prune and regrowth rate p = 0.5.

Figure 12: KID comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with various sparsity levels, for Latent Diffusion, with prune and regrowth rate p = 0.05.

Figure 13: KID comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with various sparsity levels, for Chiro Diff, with prune and regrowth rate p = 0.5.


Figure 14: KID comparisons between Dense, and Static-DM, Mag Ran-DM and Rig L-DM with various sparsity levels, for Chiro Diff, with prune and regrowth rate p = 0.05.

J Examples of Generated Samples

Figures 15 to 20 showcase examples of samples generated by Latent Diffusion and Chiro Diff across the evaluated datasets. Examples are unconditionally sampled from the Dense and the top-performing sparse model in each case. Figures 21 and 22 present examples of class-conditional generated samples on Image Net and Quick Draw.


(b) Static-DM, S = 0.25

Figure 15: Samples from Latent Diffusion trained on LSUN-Bedrooms. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Rig L-DM, S = 0.25

Figure 16: Samples from Latent Diffusion trained on Celeb A-HQ. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Mag Ran-DM, S = 0.25

Figure 17: Samples from Latent Diffusion trained on Imagenette. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Rig L-DM, S = 0.10

Figure 18: Samples from Chiro Diff trained on Quick Draw. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Rig L-DM, S = 0.25

Figure 19: Samples from Chiro Diff trained on Kanji. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Mag Ran-DM, S = 0.10

Figure 20: Samples from Chiro Diff trained on VMNIST. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Mag Ran-DM, S = 0.50

Figure 21: Samples from Latent Diffusion trained on class-conditional Image Net. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.


(b) Mag Ran-DM, S = 0.50

Figure 22: Samples from Chiro Diff trained on class-conditional Quick Draw. The top row presents samples generated by Dense models, whereas the bottom row presents samples generated by the top-performing sparse model.