# On Memorization in Diffusion Models

Published in Transactions on Machine Learning Research (02/2025)

Xiangming Gu (xiangming@u.nus.edu, National University of Singapore); Chao Du (duchao@sea.com, Sea AI Lab); Tianyu Pang (tianyupang@sea.com, Sea AI Lab); Chongxuan Li (chongxuanli@ruc.edu.cn, Gaoling School of Artificial Intelligence, Renmin University of China); Min Lin (linmin@sea.com, Sea AI Lab); Ye Wang (wangye@comp.nus.edu.sg, National University of Singapore)

Reviewed on OpenReview: https://openreview.net/forum?id=D3DBqvSDbj

Due to their capacity to generate novel and high-quality samples, diffusion models have attracted significant research interest in recent years. Notably, the typical training objective of diffusion models, i.e., denoising score matching, has a closed-form optimal solution that can only generate samples replicating the training data. This indicates that memorization behavior is theoretically expected, which contradicts the common generalization ability of state-of-the-art diffusion models, and thus calls for a deeper understanding. Looking into this, we first observe that memorization behaviors tend to occur on smaller-sized datasets, which motivates our definition of effective model memorization (EMM), a metric measuring the maximum size of training data at which a model approximates its theoretical optimum. Then, we quantify the impact of the influential factors on these memorization behaviors in terms of EMM, focusing primarily on data distribution, model configuration, and training procedure. Besides comprehensive empirical results identifying the influential factors, we surprisingly find that conditioning training data on uninformative random labels can significantly trigger memorization in diffusion models. Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.
Code is available at https://github.com/sail-sg/DiffMemorize.

Author footnotes: Work done during an internship at Sea AI Lab. Corresponding authors.

## 1 Introduction

In the last few years, diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021b) have achieved significant success across diverse domains of generative modeling, including image generation (Dhariwal & Nichol, 2021; Karras et al., 2022), text-to-image synthesis (Rombach et al., 2022; Ramesh et al., 2022), audio / speech synthesis (Kim et al., 2022; Huang et al., 2023), graph generation (Xu et al., 2022; Vignac et al., 2022), and 3D content generation (Poole et al., 2023; Lin et al., 2023). Substantial empirical evidence attests to the ability of diffusion models to generate diverse and novel high-quality samples (Dhariwal & Nichol, 2021; Nichol & Dhariwal, 2021; Nichol et al., 2021), underscoring their powerful capability of abstracting and comprehending the characteristics of the training data.

Figure 1: Overall motivation. Generated images (top row) and their ℓ2-nearest training samples in D (bottom row) by (a) the theoretical optimum defined in equation 2; (b) EDM (Karras et al., 2022). Memorization ratios (%) of EDM models trained with different |D|: (c) within 4k training epochs; (d) when extending to 40k training epochs. Panels (c) and (d) plot the ratio (%) of memorization against training epochs.

Diffusion models posit a forward diffusion process $\{z_t\}_t$ ($t \in [0, T]$) that gradually introduces Gaussian noise to a data point $x$, resulting in a transition distribution $q(z_t \mid x) = \mathcal{N}(z_t \mid \alpha_t x, \sigma_t^2 I)$. The coefficients $\alpha_t$ and $\sigma_t$ are chosen such that the initial distribution $q_0(z_0)$ aligns with the data distribution $P(x)$ while steering it towards an approximately Gaussian distribution $q_T(z_T)$.
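As a minimal sketch of the forward transition $q(z_t \mid x) = \mathcal{N}(z_t \mid \alpha_t x, \sigma_t^2 I)$ described above, the snippet below draws $z_t$ for a flattened data point. The function name `forward_sample` and the particular values of `alpha_t` and `sigma_t` are illustrative assumptions; the text does not fix a noise schedule at this point.

```python
import random


def forward_sample(x, alpha_t, sigma_t, rng):
    """Draw z_t ~ N(alpha_t * x, sigma_t^2 I) for a flattened data point x.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    return [alpha_t * xi + sigma_t * rng.gauss(0.0, 1.0) for xi in x]


rng = random.Random(0)
x = [0.5, -0.25, 1.0]

# With sigma_t = 0 the transition is deterministic and returns alpha_t * x.
z_clean = forward_sample(x, alpha_t=1.0, sigma_t=0.0, rng=rng)

# With a large sigma_t (an arbitrary value here) the sample is dominated by noise.
z_noisy = forward_sample(x, alpha_t=1.0, sigma_t=80.0, rng=rng)
```

Setting `alpha_t=1.0` corresponds to a schedule that only adds noise without rescaling the data, which is one common choice and is assumed here purely for simplicity.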
Sampling from the data distribution $P$ can then be achieved by reversing this process, for which a critical unknown term is the data score $\nabla_{z_t} \log q_t(z_t)$ (Song et al., 2021b). Diffusion models approximate the data scores with a score model $s_\theta(z_t, t)$, which is typically learned via denoising score matching (DSM) (Vincent, 2011):

$$\mathcal{J}_{\mathrm{DSM}}(\theta) \triangleq \frac{1}{2N} \sum_{n=1}^{N} \mathbb{E}_{t,\epsilon}\left[ \left\| s_\theta(\alpha_t x_n + \sigma_t \epsilon, t) + \frac{\epsilon}{\sigma_t} \right\|_2^2 \right], \tag{1}$$

given $\epsilon \sim \mathcal{N}(0, I)$ and a dataset of $N$ training samples $D \triangleq \{x_n \mid x_n \sim P(x)\}_{n=1}^{N}$. Interestingly, it is not difficult to identify the optimal solution of equation 1 (assuming sufficient capacity of $\theta$, see proof in Appendix A.1):

$$s^*(z_t, t) = \frac{1}{\sigma_t^2} \sum_{n=1}^{N} \frac{\exp\left(-\frac{\|\alpha_t x_n - z_t\|_2^2}{2\sigma_t^2}\right)}{\sum_{n'=1}^{N} \exp\left(-\frac{\|\alpha_t x_{n'} - z_t\|_2^2}{2\sigma_t^2}\right)} \left(\alpha_t x_n - z_t\right), \tag{2}$$

which, however, leads the reverse process towards the empirical data distribution, defined as $\hat{P}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n)$. Consequently, the optimal score model in equation 2 can only produce samples that replicate the training data, as shown in Figure 1(a), suggesting a memorization behavior (van den Burg & Williams, 2021).¹ This evidently contradicts the typical generalization capability exhibited by state-of-the-art diffusion models such as EDM (Karras et al., 2022), as illustrated in Figure 1(b).

¹ We also provide a theoretical analysis through the lens of the backward process in Appendix A.2.

Such an intriguing gap prompts inquiries into (i) the conditions under which the learned diffusion models can faithfully approximate the optimum $s^*$ (essentially showing memorization) and (ii) the influential factors governing memorization behaviors in diffusion models. Besides a clear issue of potential adverse generalization performance (Yoon et al., 2023), it further raises a crucial concern that diffusion models trained with equation 1 might imperceptibly memorize the training data, exposing several risks such as privacy leakage (Somepalli et al., 2023b) and copyright infringement (Somepalli et al., 2023a; Zhao et al., 2023; Wang et al., 2024). For example, Carlini et al.
(2023) show that it is possible to extract a few training images from Stable Diffusion (Rombach et al., 2022), substantiating a tangible hazard.

In response to these inquiries and concerns, this paper presents a comprehensive empirical study on memorization behavior in widely adopted diffusion models, including EDM (Karras et al., 2022) and Stable Diffusion (Rombach et al., 2022). We start with an analysis of EDM, noting that memorization tends to occur when trained on smaller-sized datasets, while remaining undetectable on larger datasets. This motivates our definition of effective model memorization (EMM), a metric quantifying the maximum number of training data points (sampled from distribution P) at which a diffusion model M demonstrates memorization behavior similar to that of the theoretical optimum after the training procedure T. We then quantify the impact of critical factors on memorization in terms of EMM on the CIFAR-10, FFHQ (Karras et al., 2019), and Imagenette (Deng et al., 2009; Somepalli et al., 2023b) datasets, considering the three facets of P, M, and T. Among these results, we surprisingly observe that memorization can be triggered by conditioning training data on completely random and uninformative labels. Specifically, using such a conditioning design, we show that more than 65% of samples generated by diffusion models trained on the 50k CIFAR-10 images are replicas of training data, in sharp contrast to the original 0%. Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.

## 2 Memorization in Diffusion Models

We start by examining memorization in the widely adopted EDM (Karras et al., 2022), which is one of the state-of-the-art diffusion models for image generation. To determine whether a generated image x is a memorized replica from the training data D, we adopt the criterion introduced in Yoon et al.
(2023), which considers x as memorized if its ℓ2 distance to the nearest neighbor is smaller than 1/3 of that to the second nearest neighbor in the training data. Here the factor 1/3 is an empirical threshold that aligns well with human perception of memorization (Yoon et al., 2023). We train an EDM model on the CIFAR-10 dataset without applying data augmentation (to avoid any ambiguity regarding memorization) and evaluate the ratio of memorization among 10k generated images. Remarkably, we observe a memorization ratio of zero throughout the entire training process, as illustrated by the bottom curve in Figure 1(c).

Intuitively, we hypothesize that the default configuration of EDM lacks the essential capacity to memorize the 50k training images in CIFAR-10, which motivates our exploration into whether the expected memorization behavior will manifest when reducing the training dataset size. In particular, we generate a sequence of training datasets with different sizes of {20k, 10k, 5k, 2k}, by sampling subsets from the original set of 50k CIFAR-10 training images. We follow the default EDM training procedure (Karras et al., 2022) with consistent training epochs on these smaller datasets. As shown in Figure 1(c), when |D| = 20k or 10k, the memorization ratio remains close to zero. However, upon further reducing the training dataset size to |D| = 5k or 2k, the EDM models exhibit noticeable memorization. This observation indicates the substantial impact of training dataset size on memorization.

Additionally, we notice that the memorization ratio increases with more training epochs. To observe this, we extend the training duration to 40k epochs, ten times longer than that in Karras et al. (2022). In Figure 1(d), when |D| = 1k, the model achieves over 90% memorization. However, even with 40k training epochs, the diffusion model still struggles to replicate a large portion of training samples when |D| = 5k.
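The memorization criterion used in this section (nearest-neighbor ℓ2 distance below 1/3 of the second-nearest distance) can be sketched in a few lines. This is a brute-force illustration on flattened vectors, not the evaluation code used in the study; the function names and the toy data are assumptions made for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def is_memorized(sample, train_set, threshold=1.0 / 3.0):
    """Flag a sample whose nearest training neighbor is much closer than the second nearest."""
    dists = sorted(l2_distance(sample, x) for x in train_set)
    if len(dists) < 2:
        raise ValueError("need at least two training samples")
    nearest, second = dists[0], dists[1]
    return nearest < threshold * second


def memorization_ratio(samples, train_set, threshold=1.0 / 3.0):
    """Fraction of generated samples flagged as memorized."""
    flags = [is_memorized(s, train_set, threshold) for s in samples]
    return sum(flags) / len(flags)


# Toy data (hypothetical): two samples sit next to a training point, one sits in between.
train = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
generated = [[0.1, 0.0], [5.0, 5.0], [9.9, 0.2]]
ratio = memorization_ratio(generated, train)
```

In this toy case the first and third samples are flagged and the equidistant middle sample is not, so `ratio` is 2/3. For real image sets the sorted full distance list would be replaced by a batched nearest-neighbor search.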
Based on our findings, we seek to quantify the maximum dataset size at which diffusion models demonstrate behavior similar to the theoretical optimum, which is crucial in understanding memorization behavior. To formalize this notion, considering a data distribution P, a diffusion model configuration M, and a training procedure T, we introduce the concept of effective model memorization, defined as follows and inspired by Yoon et al. (2023).

Figure 2: Memorization ratios (%) of diffusion models on CIFAR-10 under different factors of data distribution P: (a) varying data dimensions (resolutions 32 × 32, 16 × 16, 8 × 8); (b) varying inter-diversity (C = 10, 5, 2, 1); (c) varying intra-diversity (α = 1.0, 0.8, 0.5, 0.2, 0.0). Each panel plots the ratio (%) of memorization against data size |D|. The intersections of the dashed line (90%) and the different curves are the estimates of EMMs.

Definition 2.1 (Effective model memorization). The effective model memorization (EMM) with respect to P, M, T and parameter ζ > 0 is defined as

$$\mathrm{EMM}_\zeta(P, M, T) = \max_{N} \left\{ \mathbb{E}\left[R_{\mathrm{Mem}}(D, M, T)\right] \ge 1 - \zeta \;\middle|\; D \sim P,\ |D| = N \right\}, \tag{3}$$

where $R_{\mathrm{Mem}}$ refers to the ratio of memorization.

EMM indicates the condition under which the learned diffusion model approximates the theoretical optimum and reveals how P, M, and T interact and affect memorization. Our definition assumes that a higher memorization ratio tends to occur on smaller-sized training datasets, which is stated as follows.

Hypothesis 2.2. Given M, T and two training datasets $D_1$ and $D_2$, both of which are sampled from the same data distribution P, the ratio of memorization satisfies

$$R_{\mathrm{Mem}}(D_1, M, T) \ge R_{\mathrm{Mem}}(D_2, M, T) \quad \text{if } D_1 \subseteq D_2 \text{ and } D_1, D_2 \sim P. \tag{4}$$

Based on Hypothesis 2.2, we provide a viable way to estimate EMM.
Specifically, we sample a series of training datasets $D_1, D_2, \ldots$ with different sizes from the data distribution P, and then train diffusion models with configuration M following the training procedure T. Afterwards, we evaluate the ratios of memorization $R_{\mathrm{Mem}}(D_1, M, T), R_{\mathrm{Mem}}(D_2, M, T), \ldots$, and then determine the size of the training dataset D for which $R_{\mathrm{Mem}}(D, M, T) \approx 1 - \zeta$. We note that it is computationally intractable to determine the exact value of EMM. Therefore, we interpolate the value of EMM based on two consecutive sampled datasets $D_i, D_{i+1}$ such that $R_{\mathrm{Mem}}(D_i, M, T) > 1 - \zeta$ and $R_{\mathrm{Mem}}(D_{i+1}, M, T) < 1 - \zeta$. This study is thus formulated as how the above three factors P, M, and T affect the measurement of EMM. There is no principled way to select the value of ζ, so we set it to 0.1 throughout our study based on our experiments in Figure 1(d). We include an analysis of the sensitivity to ζ in Appendix H.

We set CIFAR-10 as our default dataset, and the basic experimental setup, which is a well-adopted recipe for diffusion models, is introduced in Appendix B. We highlight that compared to Karras et al. (2022), we run 10 times the number of training epochs. The experiments related to FFHQ (Karras et al., 2019) and Imagenette (Deng et al., 2009; Somepalli et al., 2023b) are shown in Appendix D and Section 6, respectively. For evaluation, we also consider an image quality metric, namely Fréchet Inception Distance (FID) (Heusel et al., 2017), in Appendix E, and an alternative memorization metric in Appendix F. It is worth mentioning that when diffusion models memorize a significant proportion of training data, the metrics related to image quality (including FID) are also close to optimal. This intuition is in line with our experimental results in Appendix E. Moreover, when considering alternative memorization metrics, the relationship between the various factors and memorization remains consistent.
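The estimation procedure above (find two consecutive dataset sizes whose memorization ratios bracket 1 − ζ, then interpolate) can be sketched as follows. The text does not state the interpolation scheme, so linear interpolation in the dataset size is an assumption here, as is the function name `estimate_emm`. The example values are the two memorization ratios reported in Table 1 for batch size 512.

```python
def estimate_emm(sizes, ratios, zeta=0.1):
    """Interpolate the dataset size at which the memorization ratio crosses 1 - zeta.

    sizes:  dataset sizes |D_i|
    ratios: measured memorization ratios in [0, 1], one per size
    Returns None when no consecutive pair brackets the target.
    Linear interpolation in |D| is an assumption, not the paper's stated scheme.
    """
    target = 1.0 - zeta
    points = sorted(zip(sizes, ratios))
    for (n_lo, r_lo), (n_hi, r_hi) in zip(points, points[1:]):
        if r_lo >= target > r_hi:
            frac = (r_lo - target) / (r_lo - r_hi)
            return n_lo + frac * (n_hi - n_lo)
    return None


# Ratios for |D| = 1k and 2k at batch size 512 (Table 1), as fractions.
emm = estimate_emm([1000, 2000], [0.9209, 0.6093], zeta=0.1)
```

With these two points the estimate lands slightly above 1k, in line with the EMM of approximately 1k reported for the default 32 × 32 setting. Interpolating in log dataset size would be an equally plausible choice and would shift the estimate slightly.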
## 3 Data Distribution P

In the preceding section, we have illustrated the substantial impact of the size of training data on memorization in diffusion models and how we evaluate the value of effective model memorization (EMM). We now proceed to investigate the influence of specific attributes of the data distribution P on the EMM, focusing primarily on the dimensions and diversity of the data. We keep both the model configuration M and the training procedure T fixed throughout this section.

### 3.1 Data Dimension

As likelihood-based generative models, diffusion models could face challenges when fitting high-resolution images, stemming from their mode-covering behavior as noted by Rombach et al. (2022). To explore the influence of data dimensionality on the memorization tendencies of diffusion models, we evaluate the EMMs on CIFAR-10 at various resolutions: 32 × 32, 16 × 16 and 8 × 8, where the latter two are obtained by downsampling. Note that the U-Net (Ronneberger et al., 2015) seamlessly accommodates inputs of different resolutions, requiring no modification of the model configuration M. For each resolution, we sample a series of training datasets D with varying sizes and evaluate the memorization ratios of trained diffusion models. As illustrated in Figure 2(a), we estimate the EMM for each resolution by determining the intersection between the line of 90% memorization (dashed line) and the memorization curve. The results show how the EMM varies with the input dimension. Specifically, for the 32 × 32 input resolution, we observe an EMM of approximately 1k. Transitioning to a 16 × 16 resolution, the EMM slightly surpasses 4k, while for the 8 × 8 resolution, it reaches approximately 8k. Furthermore, even for |D| = 16k, the ratio of memorization still exceeds 60% when trained on 8 × 8 CIFAR-10 images.
These results underscore the profound impact of data dimensionality on memorization within diffusion models.

### 3.2 Data Diversity

Number of Classes. We consider four different data distributions by selecting C ∈ {1, 2, 5, 10} classes of images from CIFAR-10 and then evaluate the EMMs. While varying the data size |D| when probing the EMMs, we ensure that each class contains an equal share of |D|/C data instances. The results of EMMs for different C are shown in Figure 2(b). We find that as the number of classes increases, diffusion models tend to exhibit a lower memorization ratio and a lower EMM, which is consistent with the intuition that diverse data is harder to memorize. Note, however, that this effect is subtle, as evidenced by the nearly identical EMMs observed for C = 5 and C = 10.

Intra-Class Diversity. We also explore the impact of intra-class diversity, which measures variations within individual classes. We conduct experiments with C = 1, where only the Dog class of CIFAR-10 is used. To control this diversity, we gradually blend images (scaled to a resolution of 32 × 32) from the Dog class in ImageNet (Deng et al., 2009) into the Dog class of CIFAR-10. We introduce an interpolation ratio, denoted as α (α ∈ [0, 1]), representing the proportion of ImageNet data in the constructed training dataset. Notably, ImageNet's Dog class contains 123 sub-classes, indicating higher intra-class diversity compared to CIFAR-10. Consequently, a larger α corresponds to higher intra-class diversity. As shown in Figure 2(c), an increased blend of ImageNet data results in a slightly lower EMM in the trained diffusion models. Similar to our experiments concerning the number of classes, these results reaffirm that diversity has only a limited effect on memorization.

## 4 Diffusion Model Configuration M

In this section, we study the influence of different model configurations on the memorization tendencies of diffusion models.
Our evaluation encompasses several aspects of model design, such as model size (width and depth), the way to incorporate time embedding, and the presence of skip connections in the U-Net. Similar to previous sections, we probe the EMMs by training multiple models on training data of different sizes from the same data distribution P (CIFAR-10), while keeping the training procedure T fixed.

### 4.1 Model Size

Diffusion models are usually constructed using the U-Net architecture (Ronneberger et al., 2015). We explore the influence of model size on memorization using two distinct approaches. First, we increase the channel multiplier, thereby augmenting the width of the model. Alternatively, we raise the number of residual blocks per resolution, as demonstrated by Song et al. (2021b), to increase the model depth.

Figure 3: Memorization ratios (%) of diffusion models on CIFAR-10 under different factors of model configuration M: (a) varying model widths (128, 192, 256, 320); (b) varying model depths (2, 4, 6, 8, 10); (c) varying time embeddings (Fourier or positional, each with depth 2 and 4). Each panel plots the ratio (%) of memorization against data size |D|.

Model width. We explore different channel multipliers, specifically {128, 192, 256, 320}, while keeping the number of residual blocks per resolution fixed at 2. As illustrated in Figure 3(a), the EMMs rise monotonically as the model width increases. Notably, scaling the channel multiplier to 320 yields an EMM of approximately 5k, representing a four times increase compared to the EMM observed with a channel multiplier set at 128.
These results show the direct relationship between model width and memorization in diffusion models.

Model depth. We vary model depth by adjusting the number of residual blocks per resolution (ranging from 2 to 12), while maintaining a constant channel multiplier of 128. In contrast to the scenario of varying model width, modifying model depth yields non-monotonic effects on memorization. As presented in Figure 3(b), the EMM initially increases as we scale the number of residual blocks per resolution from 2 to 6. However, when further increasing the model depth, the EMM starts to decrease. This non-monotonic trend is further shown in Figure 7(b).

We assess the above two approaches for scaling the model size of diffusion models. Model size scales linearly when augmenting model depth but quadratically when increasing model width. With a channel multiplier set to 320, the diffusion model encompasses roughly 219M trainable parameters, yielding an EMM of approximately 5k. In contrast, with the number of residual blocks per resolution set to 12, the model contains approximately 138M parameters, resulting in an EMM only between 1k and 2k. Further increasing the model depth may lead to training failures. Consequently, scaling model width emerges as a more viable approach for increasing the memorization ratio of diffusion models. We conduct more experiments in Appendix C.1 to further confirm our conclusions.

### 4.2 Time Embedding

In our experimental setup (see Appendix B), we employ the model architecture of DDPM++ (Song et al., 2021b; Karras et al., 2022), which incorporates positional embedding (Vaswani et al., 2017) to encode the diffusion time step. In addition to positional embedding, Song et al. (2021b) used random Fourier features (Tancik et al., 2020) in their NCSN++ models. Therefore, we conduct experiments to test both time embedding methods and assess their impact on memorization.
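To make the two time-embedding options concrete, the sketch below contrasts a sinusoidal positional embedding (fixed, geometrically spaced frequencies) with random Fourier features (Gaussian frequencies drawn once and then frozen). It is an illustration in plain Python, not the DDPM++ or NCSN++ implementation; the defaults `max_positions=10000` and `scale=16.0`, the cos-then-sin ordering, and both function names are assumptions.

```python
import math
import random


def positional_embedding(t, num_channels, max_positions=10000):
    """Sinusoidal embedding of a scalar time t with fixed geometric frequencies."""
    half = num_channels // 2
    freqs = [(1.0 / max_positions) ** (i / half) for i in range(half)]
    angles = [t * f for f in freqs]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


def make_fourier_embedding(num_channels, scale=16.0, seed=0):
    """Return an embedding function using random Gaussian frequencies, fixed after creation."""
    rng = random.Random(seed)
    freqs = [rng.gauss(0.0, 1.0) * scale for _ in range(num_channels // 2)]

    def embed(t):
        angles = [2.0 * math.pi * f * t for f in freqs]
        return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]

    return embed


pos = positional_embedding(0.5, num_channels=8)
fourier = make_fourier_embedding(num_channels=8)
four = fourier(0.5)
```

Both embeddings map a scalar time to a vector of the same length; they differ only in how the frequencies are chosen, which is the design axis compared in this subsection.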
To further support our conclusion, we consider two diffusion models with 2 and 4 residual blocks per resolution, as depicted in Figure 3(c). We observe a significant decrease in the memorization ratio (and thus EMM) when using the Fourier features in DDPM++, highlighting the noteworthy effect of time embedding on memorization. The same trend is observed on FFHQ (see Appendix D.2).

### 4.3 Skip Connections

We investigate the impact of skip connections, which are known for their significance in the success of U-Net (Ronneberger et al., 2015), on the memorization of diffusion models. Specifically, in our experimental setup (see Appendix B), the number of skip connections is 3(n + 1), where n corresponds to the number of residual blocks per resolution. For each resolution, namely, 32 × 32, 16 × 16, and 8 × 8, we establish n + 1 skip connections that bridge the output of the encoder to the input of the decoder. To conduct our experiments, we set the size of the training dataset |D| to 1k, and the value of n to 2, resulting in a total of 9 skip connections in our model architecture. Initially, we explore the influence of the quantity of skip connections on memorization.
Intriguingly, our observations reveal that even with a limited number of skip connections, the trained diffusion models are capable of maintaining a memorization ratio equivalent to that achieved with full skip connections, as demonstrated in Appendix C.2. This underscores the role of skip connection sparsity as a significant factor influencing memorization in diffusion models, prompting our exploration of the effects of specific skip connections. In the following experiments, we consider both the DDPM++ and NCSN++ architectures in Song et al. (2021b); Karras et al. (2022). Utilizing full skip connections, we observe that NCSN++ attains only an approximately 45% memorization ratio, whereas DDPM++ achieves a memorization ratio exceeding 90%, all under the identical training dataset D.

Figure 4: Memorization ratio (%) of diffusion models on CIFAR-10 when retaining (a) skip connections of certain spatial resolution for DDPM++; (b) single skip connection at different locations for DDPM++; (c) skip connections of certain spatial resolution for NCSN++; (d) single skip connection at different locations for NCSN++. Panels (a) and (c) plot the ratio (%) of memorization against spatial resolution (32 × 32, 16 × 16, 8 × 8) for 1, 2, or 3 retained skip connections; panels (b) and (d) plot it against the index of the retained skip connection (1 to 9), grouped by resolution.

Skip connection resolution. We narrow our examination to the inclusion of a selected number of skip connections, specifically m = 1, 2, 3, all situated at a particular spatial resolution. The results, illustrated in Figure 4(a) and Figure 4(c), unveil notable trends.
We note that different markers represent distinct quantities of skip connections. Our observations reveal that skip connections situated at higher resolutions contribute more significantly to memorization. Interestingly, we also find that an increase in the number of skip connections does not consistently result in a higher memorization ratio. For instance, the DDPM++ model with m = 3 at a spatial resolution of 16 × 16 exhibits a lower memorization ratio compared to cases where m = 1 or m = 2.

Skip connection location. Additionally, we retain a single skip connection but alter its placement within the model architecture. As depicted in Figure 4(b) and Figure 4(d), with the presence of just one skip connection, DDPM++ can achieve a memorization ratio exceeding 90%, while NCSN++ attains a memorization ratio surpassing 40%, provided that this single skip connection is positioned at a resolution of 32 × 32. This further reinforces that skip connections at higher resolutions play a more substantial role in memorization of diffusion models.

## 5 Training Procedure T

In Section 2, we have demonstrated that the ratio of memorization increases with the progression of training epochs, indicating the influence of training procedure T on memorization. Consequently, in this section, we delve into the impact of various training hyperparameters on CIFAR-10.

Batch size. In light of prior research highlighting the substantial role of batch size in the performance of discriminative models (Goyal et al., 2017; Keskar et al., 2016), we hypothesize that it may also influence memorization in diffusion models. Therefore, we investigate a range of batch sizes, from 128 to 896. Given that we maintain a constant number of training epochs, we apply the linear scaling rule proposed by Goyal et al. (2017) to adjust the learning rate accordingly for each batch size. Based on our basic experimental setup, we ensure a consistent ratio of learning rate to batch size of $2 \times 10^{-4} / 512$.
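The linear scaling rule described above keeps the ratio of learning rate to batch size fixed at 2 × 10⁻⁴ / 512. A minimal sketch of the resulting learning rates follows; the helper name `scaled_learning_rate` is an assumption introduced for illustration.

```python
BASE_LR = 2e-4      # learning rate of the basic setup
BASE_BATCH = 512    # batch size of the basic setup


def scaled_learning_rate(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH):
    """Scale the learning rate linearly with batch size (Goyal et al., 2017)."""
    return base_lr * batch_size / base_batch


batch_sizes = (128, 256, 384, 512, 640, 768, 896)
learning_rates = {b: scaled_learning_rate(b) for b in batch_sizes}
```

For the batch sizes studied here this gives learning rates from 0.5 × 10⁻⁴ at batch size 128 up to 3.5 × 10⁻⁴ at batch size 896, matching the learning-rate row of Table 1.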
As indicated in Table 1, a larger batch size correlates with a higher memorization ratio.

Weight decay. Weight decay is typically adopted to prevent neural networks from overfitting. Memorization can be regarded as an overfitting scenario in terms of diffusion models. Motivated by this, we explore its effect on memorization. In our basic experimental setup (see Appendix B), we follow Ho et al. (2020); Song et al. (2021b); Karras et al. (2022) and set the weight decay to zero. Now we set different values of weight decay and show the memorization results in Table 2(left). We find that when weight decay ranges between 0 and $1 \times 10^{-3}$, it has only a subtle effect on memorization. When further increasing the value, the memorization ratio drastically decreases.

Table 1: Memorization ratios (%) of diffusion models on CIFAR-10 under different batch sizes.

| Batch size | 128 | 256 | 384 | 512 | 640 | 768 | 896 |
|---|---|---|---|---|---|---|---|
| Learning rate (×10⁻⁴) | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 |
| R_mem (%), \|D\| = 1k | 89.52 | 90.87 | 91.52 | 92.09 | 91.40 | 92.05 | 92.77 |
| R_mem (%), \|D\| = 2k | 56.67 | 60.36 | 60.31 | 60.93 | 62.61 | 63.33 | 61.95 |

Table 2: Memorization ratios (%) of diffusion models on CIFAR-10 under different (left) weight decay values; (right) EMA values.

(left)

| Weight decay | R_mem (%), \|D\| = 1k | R_mem (%), \|D\| = 2k |
|---|---|---|
| 0 | 92.09 | 60.93 |
| 1 × 10⁻⁵ | 91.67 | 61.63 |
| 1 × 10⁻⁴ | 92.47 | 61.03 |
| 1 × 10⁻³ | 92.11 | 58.39 |
| 1 × 10⁻² | 89.07 | 35.88 |
| 5 × 10⁻² | 13.79 | 0.03 |
| 1 × 10⁻¹ | 1.33 | 0.00 |

(right)

| EMA | R_mem (%), \|D\| = 1k | R_mem (%), \|D\| = 2k |
|---|---|---|
| 0.99929 | 92.09 | 60.93 |
| 0.999 | 91.72 | 61.38 |
| 0.99 | 91.45 | 59.27 |
| 0.9 | 90.16 | 58.31 |
| 0.5 | 90.42 | 57.80 |
| 0.1 | 90.78 | 58.00 |
| 0.0 | 90.19 | 59.20 |

Exponential model average. EMA was shown to effectively stabilize FIDs (Heusel et al., 2017) and remove artifacts in generated samples (Song & Ermon, 2020), as also shown in Table 6(right). It is widely adopted in current diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021; Karras et al., 2022). Motivated by this, we explore its effect on memorization.
Previously, we fixed the EMA rate at 0.99929 after the warmup, following Karras et al. (2022). As presented in Table 2(right), we investigate the memorization of diffusion models with varying EMA rates, from 0.0 to 0.99929. Although EMA values are significant for FIDs, they have only a limited effect on memorization.

## 6 Unconditional vs. Conditional Generation

It has been shown that conditional diffusion models typically yield lower FIDs than their unconditional counterparts (Karras et al., 2022). This observation motivates us to investigate whether the input condition also exerts an influence on the memorization of diffusion models. It is worth noting that to train conditional diffusion models, we slightly change M by incorporating a class embedding layer and adjust T through modifications of the training objective. We show the training objective of class-conditioned diffusion models and its optimal solution in Appendix A.3. The experiments are conducted using both EDM (Karras et al., 2022) and Stable Diffusion (Rombach et al., 2022).

Class conditioning. As depicted in Figure 5(a), we note that class-conditional diffusion models exhibit higher memorization ratios compared to their unconditional counterparts (and thus larger EMMs). We hypothesize that this observation is attributed to the additional information introduced by the true class labels. To validate this hypothesis, we substitute the true class labels with random labels. Intriguingly, we find that the memorization of diffusion models with random labels remains at a similar level to that of models with true labels. This contradicts our previous hypothesis, as random labels are uninformative regarding training images. These results align with earlier research in the realm of discriminative models, e.g., Zhang et al. (2017), which suggests that deep neural networks can memorize training data even with randomly assigned labels.
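The random-label substitution above can be sketched as follows. The text does not say how the random labels are drawn, so independent uniform sampling over C classes is an assumption here, as is the function name `assign_random_labels`; the key property is only that each label is chosen without looking at the image.

```python
import random


def assign_random_labels(num_samples, num_classes, seed=0):
    """Assign each training sample a label drawn uniformly from {0, ..., C-1}.

    Labels are independent of image content, so they carry no information
    about the data. Uniform i.i.d. sampling is an illustrative assumption.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_classes) for _ in range(num_samples)]


# Example: 5k training images conditioned on C = 10 random labels.
labels = assign_random_labels(num_samples=5000, num_classes=10)
```

Because `num_classes` is a free parameter, the same helper covers settings where the number of random classes differs from the 10 true classes of CIFAR-10.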
It is worth noting that when monitoring the FIDs in Figure 15, we observe that diffusion models conditioned on true labels yield lower FIDs than those conditioned on random labels. We hypothesize that with true labels as conditioning, similar training samples tend to share the same "individual model" for each class, resulting in better "individual models".

Number of classes. When employing random labels as conditions, the class number C is not restricted to 10 as in CIFAR-10. Therefore, we vary C for random labels and then observe the effects on memorization. Initially, we compare Figure 5(a) and Figure 5(b), revealing that unconditional diffusion models and conditional models with C = 1 maintain similar memorization ratios. However, a discernible trend emerges as we introduce more classes, with diffusion models exhibiting increased memorization (also observed on FFHQ in Figure 11(c)). Intriguingly, with a training dataset of size |D| = 5k, even starting at a modest 10% memorization ratio for C = 1, conditional diffusion models can attain over 90% memorization ratio with C = 1k random labels. Based on these observations, we conclude that the number of classes significantly influences EMMs.

Figure 5: Memorization ratios (%) of (a) unconditional diffusion models and conditional diffusion models on CIFAR-10 with true / random labels, plotted against data size |D|; (b) conditional diffusion models with C random labels, plotted against the class number; (c) conditional EDM with unique labels and unconditional EDM at |D| = 50k during training, plotted against training epochs.

Unique class condition.
We consider an extreme scenario where each training sample in D is assigned a unique class label, which can be regarded as the case of C = |D|. We compare the memorization dynamics of this unique-class conditional scenario with the true-class conditional and unconditional ones during training, as illustrated in Figure 9. Notably, within only 8k epochs, the diffusion model with unique labels achieves a memorization ratio of more than 95%. On the FFHQ dataset, the memorization ratios of unique-class conditional diffusion models are also much higher than those of unconditional ones, as presented in Figure 11(d). Inspired by this, we extend this conditioning mechanism to the entire CIFAR-10 dataset, which contains |D| = 50k samples. Note that we train an EDM model (Karras et al., 2022) to facilitate a comparison with our initial observations in Section 2. As illustrated in Figure 5(c), the unconditional EDM maintains a memorization ratio of zero throughout training, even when training is extended to 12k epochs. However, upon conditioning on the unique labels, we observe a notable shift in memorization. The trained conditional EDM achieves a memorization ratio of more than 65% within 12k training epochs, a significant increase over the unconditional model. We also visualize the images generated by these two models in Appendix C.3 to further validate their memorization behaviors. These results imply that with unique labels, the training samples become strongly associated with their input conditions, rendering them more readily accessible during generation when the same conditions are applied.

Extension to text condition. Beyond unconditional and class-conditioned diffusion models, we also conduct experiments on a state-of-the-art text-to-image diffusion model, Stable Diffusion (SD) (Rombach et al., 2022).
As it is computationally infeasible to train Stable Diffusion from scratch, we opt to fine-tune the U-Net in SD on the Imagenette dataset², which is a realistic scenario. This dataset consists of C = 10 classes from the ImageNet dataset (Deng et al., 2009) and has also been adopted in Somepalli et al. (2023b). We consider two types of text prompts as conditions: (1) plain conditioning: using "a picture" as the text prompt for all images; (2) class conditioning: incorporating class labels in the text prompt, i.e., "a picture of" followed by the class label. For fine-tuning, we consider both full fine-tuning and LoRA fine-tuning with different ranks (Hu et al., 2021). More experimental setups are detailed in Appendix G.

As shown in Table 3, we compare the two types of text-prompt conditioning under training data of different sizes |D|. Firstly, we observe that although SD was pretrained on billions of images, it is still prone to memorizing training data when fully fine-tuned on customized data. In contrast, LoRA fine-tuning is less prone to memorizing training data. Additionally, we find that with a higher LoRA rank, SD models demonstrate higher memorization ratios. Secondly, the memorization ratio drops as |D| increases. Furthermore, the memorization ratios of SD models fine-tuned on class-conditioning data are much higher than those fine-tuned on plain-conditioning data. To explain this, class conditioning is similar to the case of C = 10 classes in the class-conditioned diffusion model, while plain conditioning is similar to the case of C = 1. Our results on Stable Diffusion reaffirm the significant influence of the class number on memorization in diffusion models.

² https://github.com/fastai/imagenette

Table 3: Memorization ratio (%) of Stable Diffusion (SD) with different fine-tuning approaches on the Imagenette dataset under two different text conditioning types.
| Fine-tuning | Condition | \|D\| = 400 | \|D\| = 500 | \|D\| = 800 | \|D\| = 1000 | \|D\| = 2000 |
|---|---|---|---|---|---|---|
| LoRA r = 16 | plain | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| LoRA r = 16 | class | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| LoRA r = 256 | plain | 0.08 | 0.01 | 0.00 | 0.00 | 0.00 |
| LoRA r = 256 | class | 5.22 | 0.25 | 0.00 | 0.00 | 0.00 |
| LoRA r = 1024 | plain | 0.76 | 0.20 | 0.00 | 0.00 | 0.00 |
| LoRA r = 1024 | class | 8.77 | 2.07 | 0.04 | 0.00 | 0.00 |
| Full fine-tuning | plain | 21.95 | 20.63 | 7.30 | 2.85 | 0.06 |
| Full fine-tuning | class | 46.89 | 38.79 | 20.46 | 10.67 | 0.21 |

7 Related Work

Memorization in discriminative models. For discriminative models, memorization and generalization are closely intertwined. Zhang et al. (2017) first demonstrated that deep learning models can memorize the training data, even with random labeling, but still generalize well. Additionally, Feldman (2020); Feldman & Zhang (2020) showed that this memorization is necessary for achieving close-to-optimal generalization under a long-tailed assumption on the data distribution. In follow-up works, Baldock et al. (2021); Stephenson et al. (2021) showed that memorization predominantly happens in the deep layers, while Maini et al. (2023) argued that memorization is confined to a few neurons across various layers of deep neural networks. Although discriminative models can largely memorize training data, this phenomenon does not apply to diffusion models. For instance, Zhang et al. (2017) showed that the Inception v3 model (Szegedy et al., 2016), with under 25M trainable parameters, can almost memorize the ImageNet dataset (Deng et al., 2009) with approximately 1.28M images. However, the EDM model (about 56M parameters) (Karras et al., 2022) cannot memorize even the CIFAR-10 dataset with 50k images, as presented in Section 2. From another view, Bartlett et al. (2020); Nakkiran et al. (2020) showed that over-parameterized models generalize well to the real data distribution even when they perfectly fit the training dataset, which is called benign overfitting. Nevertheless, diffusion models demonstrate adverse generalization performance (Yoon et al., 2023).

Memorization in generative models.
Somepalli et al. (2023a); Carlini et al. (2023) represent the initial explorations of memorization in diffusion models. They investigated a range of diffusion models and demonstrated that these models memorize a few training samples and replicate them globally or at the object level during generation. For instance, Carlini et al. (2023) identified only 50 memorized training images from 175 million images generated by Stable Diffusion (Rombach et al., 2022) and extracted 200-300 training images from $2^{20}$ images generated by DDPM and its variant (Ho et al., 2020; Nichol & Dhariwal, 2021). This highlights the memorization gap between empirical diffusion models and the theoretical optimum defined in equation 2. Somepalli et al. (2023a;b) showed empirically that training data duplication and text conditioning play significant roles in the memorization of diffusion models. Another work (Wen et al., 2023) introduced a novel approach to detect memorized prompts and mitigate memorization for Stable Diffusion (Rombach et al., 2022). However, these conclusions are mostly derived from text-to-image diffusion models. The nature of memorization in diffusion models, especially unconditional ones, remains unexplored. One recent paper (Hintersdorf et al., 2024) localized memorization to specific neurons in text-to-image diffusion models. Another recent paper (Liu et al., 2024) proposed ensembling model parameters and discarding samples with small loss values during diffusion model training, thereby mitigating memorization. Apart from diffusion models, several studies investigate memorization in Generative Adversarial Networks (GANs) (Webster et al., 2021; Feng et al., 2021) and language models (Carlini et al., 2021; 2022).

Generalization in diffusion models. Yoon et al. (2023); Kadkhodaie et al. (2023) analyzed the relationship between memorization and generalization. Yoon et al.
(2023) hypothesized that generalization of diffusion models is a failure to memorize the training data when the data size is large and the model width is comparatively small, and that this memorization-generalization dichotomy may manifest at the level of classes. Our work focuses on the timing of occurrence and the influential factors regarding memorization through a much more thorough quantitative analysis. Kadkhodaie et al. (2023) also showed that diffusion models memorize samples when trained on small data. However, that work focused on examining the inductive bias of DNN denoisers and showed mathematically that generalization arises from the geometry-adaptive harmonic basis (GAHB). Additionally, Zhang et al. (2023) showed that given the same noise input, different diffusion models often generate remarkably similar samples through a deterministic sampler, manifesting in both memorization and generalization regimes. This provides a new perspective on understanding memorization and generalization of diffusion models.

One recent paper (Kamb & Ganguli, 2024) shares motivations similar to ours. Kamb & Ganguli (2024) identified two inductive biases (locality and equivariance) in common architectures of diffusion models that prevent convergence to the ideal score function. Building on this insight, they theoretically derived equivariant local score (ELS) machines, which mirror the formulation of the ideal score function while exhibiting the generalization behavior of empirically trained models. However, these two inductive biases do not fully capture the behavior of trained models, particularly those with self-attention architectures. Analytical characterization of the exact properties of trained diffusion models remains challenging. Our work, on the empirical side, investigates how data, model designs, training processes, and conditioning prevent trained models from converging to the ideal score function.
8 Conclusion

In this study, we first showed that the theoretical optimum of diffusion models can only replicate training data, representing a memorization behavior. This contradicts the typical generalization ability demonstrated by state-of-the-art diffusion models. To understand the memorization gap, we found that when trained on smaller-sized datasets, learned diffusion models tend to approximate the theoretical optimum. Motivated by this, we defined the notion of effective model memorization (EMM) to quantify this memorization behavior. Afterwards, we explored the impact of critical factors on memorization through the lens of EMM, from the three facets of data distribution, model configuration, and training procedure. We found that data dimension, model size, time embedding, skip connections, and class conditions play significant roles in memorization. Among all illuminating results, the memorization of diffusion models can be triggered by conditioning training data on completely random and uninformative labels. Intriguingly, when incorporating such a conditioning design, more than 65% of samples generated by diffusion models trained on the entire 50k CIFAR-10 images are replicas of training data, in contrast to the original 0%. We believe that our study deepens the understanding of memorization in diffusion models and offers clues to theoretical research in the area of generative modeling.

Acknowledgments

We would like to thank our action editor Kamalika Chaudhuri and the reviewers for their detailed and constructive comments and feedback.

References

Robert Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems (NeurIPS), 34:10876–10889, 2021.
Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650, 2021.
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959, 2020.
Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems (NeurIPS), 33:2881–2891, 2020.
Qianli Feng, Chenqi Guo, Fabian Benitez-Quiroz, and Aleix M Martinez. When do gans replicate? on the choice of dataset size. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6701–6710, 2021.
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6626–6637, 2017.
Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, and Franziska Boenisch. Finding nemo: Localizing neurons responsible for memorization in diffusion models. arXiv preprint arXiv:2406.02366, 2024.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representation. arXiv preprint arXiv:2310.02557, 2023.
Mason Kamb and Surya Ganguli. An analytic theory of creativity in convolutional diffusion models. arXiv preprint arXiv:2412.20292, 2024.
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE International Conference on Computer Vision (CVPR), 2019.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems (NeurIPS), 35:26565–26577, 2022.
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Heeseung Kim, Sungwon Kim, and Sungroh Yoon. Guided-tts: A diffusion model for text-to-speech via classifier guidance. In International Conference on Machine Learning (ICML), pp. 11119–11133. PMLR, 2022.
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Diederik P Kingma and Ruiqi Gao. Understanding the diffusion objective as a weighted integral of elbos. arXiv preprint arXiv:2303.00848, 2023.
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 300–309, 2023.
Xiao Liu, Xiaoliu Guan, Yu Wu, and Jiaxu Miao. Iterative ensemble training with anti-gradient control for mitigating memorization in diffusion models. arXiv preprint arXiv:2407.15328, 2024.
Pratyush Maini, Michael C Mozer, Hanie Sedghi, Zachary C Lipton, J Zico Kolter, and Chiyuan Zhang. Can neural network memorization be localized? arXiv preprint arXiv:2307.09542, 2023.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations (ICLR), 2020.
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pp. 8162–8171. PMLR, 2021.
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In International Conference on Learning Representations (ICLR), 2023.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Springer, 2015.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), pp. 2256–2265. PMLR, 2015.
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6048–6058, 2023a.
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Understanding and mitigating copying in diffusion models. arXiv preprint arXiv:2305.20086, 2023b.
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021a.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NeurIPS), pp. 11895–11907, 2019.
Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021b.
Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and Sue Yeon Chung. On the geometry of generalization and memorization in deep neural networks. arXiv preprint arXiv:2105.14602, 2021.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016.
Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems (NeurIPS), 33:7537–7547, 2020.
Gerrit van den Burg and Chris Williams. On memorization in probabilistic deep generative models. Advances in Neural Information Processing Systems (NeurIPS), 34:27916–27928, 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. In International Conference on Learning Representations (ICLR), 2022.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, and Kenji Kawaguchi. The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline. arXiv preprint arXiv:2401.04136, 2024.
Ryan Webster, Julien Rabin, Loic Simon, and Frederic Jurie. This person (probably) exists. identity membership attacks against gan generated faces. arXiv preprint arXiv:2107.06018, 2021.
Yuxin Wen, Yuchen Liu, Chen Chen, and Lingjuan Lyu. Detecting, explaining, and mitigating memorization in diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations (ICLR), 2022.
Mingyang Yi, Jiacheng Sun, and Zhenguo Li. On the generalization of diffusion model. arXiv preprint arXiv:2305.14712, 2023.
Tae Ho Yoon, Joo Young Choi, Sehyun Kwon, and Ernest K Ryu. Diffusion probabilistic models generalize when they fail to memorize. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Liyue Shen, and Qing Qu. The emergence of reproducibility and consistency in diffusion models. arXiv preprint arXiv:2310.05264, 2023.
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, and Min Lin. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023.

A Optimal Solution of Diffusion Models

A.1 Derivation of the Theoretical Optimum

In this section, we prove the closed form of the optimal score model defined in equation 2.
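Firstly, the empirical denoising score matching (DSM) objective (Vincent, 2011) is:

$$
\begin{aligned}
\mathcal{J}_{\mathrm{DSM}}(\theta) &\triangleq \frac{1}{2N}\sum_{n=1}^{N}\mathbb{E}_{t\sim[0,T]}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\left[\lambda(t)\left\|s_\theta(z_t,t)+\frac{\epsilon}{\sigma_t}\right\|_2^2\right] && (5)\\
&= \mathbb{E}_{t\sim[0,T]}\,\lambda(t)\,\mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\left[\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)+\frac{\epsilon}{\sigma_t}\right\|_2^2\right] && (6)\\
&= \mathbb{E}_{t\sim[0,T]}\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)+\frac{\epsilon}{\sigma_t}\right\|_2^2\,\mathcal{N}(\epsilon;0,I)\,\mathrm{d}\epsilon\right]. && (7)
\end{aligned}
$$

Compared to equation 1, we add a positive weighting function $\lambda(t) > 0$, which is normally used in the training of diffusion models (Song et al., 2021b). Since $z_t = \alpha_t x_n + \sigma_t\epsilon$, we have $\epsilon = \frac{z_t - \alpha_t x_n}{\sigma_t}$, so that $\frac{\epsilon}{\sigma_t} = -\frac{\alpha_t x_n - z_t}{\sigma_t^2}$. Therefore, the derivative of $\epsilon$ w.r.t. $z_t$ is $\mathrm{d}\epsilon = \frac{\mathrm{d}z_t}{\sigma_t}$, and

$$
\begin{aligned}
\mathcal{J}_{\mathrm{DSM}}(\theta) &= \mathbb{E}_{t\sim[0,T]}\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (8)\\
&= \mathbb{E}_{t\sim[0,T]}\left[\lambda(t)\int\mathcal{J}_{\mathrm{DSM}}(\theta,z_t,t)\,\mathrm{d}z_t\right]. && (9)
\end{aligned}
$$

To minimize the empirical DSM objective $\mathcal{J}_{\mathrm{DSM}}(\theta)$, we can minimize $\mathcal{J}_{\mathrm{DSM}}(\theta,z_t,t)$ for each $z_t$, since $\lambda(t) > 0$. The minimization of $\mathcal{J}_{\mathrm{DSM}}(\theta,z_t,t)$ is a convex optimization problem w.r.t. the score $s_\theta(z_t,t)$, which can be solved by setting the gradient w.r.t. $s_\theta(z_t,t)$ to zero:

$$
\begin{aligned}
0 &= \nabla_{s_\theta(z_t,t)}\left[\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\right] && (10)\\
\Rightarrow\; 0 &= \sum_{n=1}^{N} 2\left(s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right)\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I) && (11)\\
\Rightarrow\; s^*(z_t,t)\sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I) &= \sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\frac{\alpha_t x_n - z_t}{\sigma_t^2}. && (12)
\end{aligned}
$$

The optimal diffusion model can therefore be written as

$$
\begin{aligned}
s^*(z_t,t) &= \frac{\sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\frac{\alpha_t x_n - z_t}{\sigma_t^2}}{\sum_{n'=1}^{N}\mathcal{N}(z_t;\alpha_t x_{n'},\sigma_t^2 I)} && (13)\\
&= \frac{\sum_{n=1}^{N}\exp\left(-\frac{\|\alpha_t x_n - z_t\|_2^2}{2\sigma_t^2}\right)\frac{\alpha_t x_n - z_t}{\sigma_t^2}}{\sum_{n'=1}^{N}\exp\left(-\frac{\|\alpha_t x_{n'} - z_t\|_2^2}{2\sigma_t^2}\right)} && (14)\\
&= \sum_{n=1}^{N}\mathbb{S}\left(-\frac{\|\alpha_t x_n - z_t\|_2^2}{2\sigma_t^2}\right)\frac{\alpha_t x_n - z_t}{\sigma_t^2}, && (15)
\end{aligned}
$$

where $\mathbb{S}$ refers to the softmax operation over the $N$ training samples.

Note that when $s_\theta(z_t,t)$ is parameterized by a neural network with parameters $\theta$, the objective in equation 5 is typically highly non-convex w.r.t. $\theta$. Therefore, achieving this theoretical optimum necessitates a model $\theta$ with sufficient capacity. First, the theoretical optimum may lie outside the solution space of the model family of $\theta$. Second, even if the solution space encompasses the theoretical optimum, reaching it remains challenging due to the non-convex nature of the optimization in equation 5 w.r.t. the model parameters $\theta$: the optimization may get stuck in a local optimum.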
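To make the closed form above concrete, the following is a minimal NumPy sketch of the softmax-weighted optimal score in equation 15. It is an illustration under stated assumptions rather than the authors' code: the function name `optimal_score` and its argument layout are hypothetical, and the max-subtraction is a standard numerical-stability step that is not part of the derivation.

```python
import numpy as np


def optimal_score(z_t, x_train, alpha_t, sigma_t):
    """Closed-form optimal score s*(z_t, t) for a finite training set.

    z_t:      array of shape (d,), the noisy point.
    x_train:  array of shape (N, d), the training samples x_n.
    alpha_t, sigma_t: scalars of the forward process N(alpha_t * x, sigma_t^2 I).
    """
    diff = alpha_t * x_train - z_t                       # (N, d): alpha_t x_n - z_t
    logits = -np.sum(diff ** 2, axis=1) / (2.0 * sigma_t ** 2)
    logits = logits - logits.max()                       # stabilise the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()                    # softmax over training samples
    return weights @ diff / sigma_t ** 2                 # sum_n w_n (alpha_t x_n - z_t) / sigma_t^2


# Toy example with three 2-D "training images" and a very small noise level.
x_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([0.9, 0.1])
score = optimal_score(z, x_train, alpha_t=1.0, sigma_t=0.01)
denoised = 0.01 ** 2 * score + z  # (sigma_t^2 s* + z_t) / alpha_t with alpha_t = 1
```

At this small noise level the softmax weights concentrate on the training sample closest to `z`, so `denoised` essentially coincides with `x_train[1]`, which is the replication behavior the derivation predicts.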
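Given the existence of the theoretical optimum, we find that the empirical DSM objective can be rewritten as

$$
\mathcal{J}_{\mathrm{DSM}}(\theta) = \frac{1}{2N}\sum_{n=1}^{N}\mathbb{E}_{t,\epsilon}\left[\lambda(t)\left\|s_\theta(\alpha_t x_n+\sigma_t\epsilon,t)-s^*(\alpha_t x_n+\sigma_t\epsilon,t)\right\|_2^2\right] + C, \qquad (16)
$$

where $C = \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right]$ is a constant that does not involve the trained diffusion model $\theta$. This equivalence shows that diffusion models are trained to approximate the theoretical optimum. The proof is as follows:

$$
\begin{aligned}
\mathcal{J}_{\mathrm{DSM}}(\theta) &= \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (17)\\
&= \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-s^*(z_t,t)+s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (18)\\
&= \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-s^*(z_t,t)\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (19)\\
&\quad+ \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}2\left(s_\theta(z_t,t)-s^*(z_t,t)\right)^{\top}\left(s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right)\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (20)\\
&\quad+ \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (21)\\
&= \mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}\left\|s_\theta(z_t,t)-s^*(z_t,t)\right\|_2^2\,\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] + C\\
&= \frac{1}{2N}\sum_{n=1}^{N}\mathbb{E}_{t,\epsilon}\left[\lambda(t)\left\|s_\theta(\alpha_t x_n+\sigma_t\epsilon,t)-s^*(\alpha_t x_n+\sigma_t\epsilon,t)\right\|_2^2\right] + C. && (22)
\end{aligned}
$$

The above equation holds since the cross term vanishes by the optimality condition in equation 11:

$$
\begin{aligned}
&\mathbb{E}_t\left[\lambda(t)\int\frac{1}{2N}\sum_{n=1}^{N}2\left(s_\theta(z_t,t)-s^*(z_t,t)\right)^{\top}\left(s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right)\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\,\mathrm{d}z_t\right] && (23)\\
&= \mathbb{E}_t\left[\lambda(t)\int\frac{1}{N}\left(s_\theta(z_t,t)-s^*(z_t,t)\right)^{\top}\sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\sigma_t\left(s^*(z_t,t)-\frac{\alpha_t x_n - z_t}{\sigma_t^2}\right)\mathrm{d}z_t\right] && (24)\\
&= 0. && (25)
\end{aligned}
$$

In Karras et al. (2022), the authors derived the optimal denoised function, while in Yi et al. (2023), the authors provided the closed form of the optimal DDPM (Ho et al., 2020). Therefore, we show the equivalence of our equation 2 with the above two forms.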
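For DDPM, which is trained under the noise prediction objective, Kingma & Gao (2023) gave the transformation between the two parameterizations:

$$
s_\theta(z_t,t) = -\frac{\epsilon_\theta(z_t,t)}{\sigma_t}. \qquad (26)
$$

Therefore, the optimal DDPM can be represented as

$$
\begin{aligned}
\epsilon^*(z_t,t) = -\sigma_t s^*(z_t,t) &= \frac{\sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,\frac{z_t-\alpha_t x_n}{\sigma_t}}{\sum_{n'=1}^{N}\mathcal{N}(z_t;\alpha_t x_{n'},\sigma_t^2 I)} && (27)\\
&= \frac{z_t}{\sigma_t} - \frac{\alpha_t}{\sigma_t}\,\frac{\sum_{n=1}^{N}\exp\left(-\frac{\|\alpha_t x_n - z_t\|_2^2}{2\sigma_t^2}\right)x_n}{\sum_{n'=1}^{N}\exp\left(-\frac{\|\alpha_t x_{n'} - z_t\|_2^2}{2\sigma_t^2}\right)}. && (28)
\end{aligned}
$$

In DDPM (Ho et al., 2020), the forward process is defined as $z_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, so the closed form of the optimal DDPM is

$$
\epsilon^*(z_t,t) = \frac{z_t}{\sqrt{1-\bar{\alpha}_t}} - \frac{\sqrt{\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}}\,\frac{\sum_{n=1}^{N}\exp\left(-\frac{\|z_t-\sqrt{\bar{\alpha}_t}x_n\|_2^2}{2(1-\bar{\alpha}_t)}\right)x_n}{\sum_{n'=1}^{N}\exp\left(-\frac{\|z_t-\sqrt{\bar{\alpha}_t}x_{n'}\|_2^2}{2(1-\bar{\alpha}_t)}\right)}, \qquad (29)
$$

which is the same as Theorem 2 in Yi et al. (2023).

For the denoised function, which is trained under the data prediction objective, Kingma & Gao (2023) gave:

$$
s_\theta(z_t,t) = -\frac{z_t-\alpha_t D_\theta(z_t,t)}{\sigma_t^2}. \qquad (30)
$$

Therefore, the optimal denoised function can be represented as

$$
D^*(z_t,t) = \frac{\sigma_t^2 s^*(z_t,t)+z_t}{\alpha_t} = \frac{\sum_{n=1}^{N}\mathcal{N}(z_t;\alpha_t x_n,\sigma_t^2 I)\,x_n}{\sum_{n'=1}^{N}\mathcal{N}(z_t;\alpha_t x_{n'},\sigma_t^2 I)}. \qquad (31)
$$

In EDM (Karras et al., 2022), the forward process is defined as $z_t = x + \sigma_t\epsilon$, so the closed form of the optimal denoised function is

$$
D^*(z_t,t) = \frac{\sum_{n=1}^{N}\mathcal{N}(z_t;x_n,\sigma_t^2 I)\,x_n}{\sum_{n'=1}^{N}\mathcal{N}(z_t;x_{n'},\sigma_t^2 I)}, \qquad (32)
$$

which is the same as Eq. (57) in Karras et al. (2022).

A.2 Backward Process of the Optimal Diffusion Model

We analyze the memorization behavior of the optimal diffusion model $s^*(z_t,t)$ defined in equation 2 through the lens of the backward process. As shown in Kingma et al. (2021), the backward process of our defined diffusion models is governed by the following stochastic differential equation (SDE):

$$
\mathrm{d}z_t = \left[f(t)z_t - g^2(t)s_\theta(z_t,t)\right]\mathrm{d}t + g(t)\,\mathrm{d}W_t, \qquad (33)
$$

where $W_t$ is a standard Brownian motion, and $f(t)$ and $g(t)$ follow

$$
f(t) = \frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}, \qquad g^2(t) = \frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2. \qquad (34)
$$

Besides the SDE, Song et al. (2021b) showed that there exists an ordinary differential equation (ODE) for a deterministic backward process:

$$
\mathrm{d}z_t = \left[f(t)z_t - \frac{1}{2}g^2(t)s_\theta(z_t,t)\right]\mathrm{d}t. \qquad (35)
$$

We first show how to adopt the above ODE to generate samples using the optimal score model defined in equation 2. Specifically, we sample multiple time steps $0 = t_0 < \xi = t_1 < \dots < t_n = T$, where $\xi$ refers to a small value close to 0 and $T > 0$ represents the maximum time step. For simplicity, we consider the Euler solver, which gives the following update rule:

$$
z_{t_n} = z_{t_{n+1}} + \left[\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}z_t - \frac{1}{2}\left(\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2\right)s_\theta(z_t,t)\right]_{t=t_{n+1}}(t_n - t_{n+1}). \qquad (36)
$$

We also use the Euler method to approximate $\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}$ and $\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t}$, considering

$$
\begin{aligned}
\lim_{t_n - t_{n+1}\to 0}\left.\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\right|_{t=t_{n+1}} &= \lim_{t_n - t_{n+1}\to 0}\left.\frac{1}{\alpha_t}\frac{\mathrm{d}\alpha_t}{\mathrm{d}t}\right|_{t=t_{n+1}} = \lim_{t_n - t_{n+1}\to 0}\frac{1}{\alpha_{t_{n+1}}}\,\frac{\alpha_{t_n}-\alpha_{t_{n+1}}}{t_n - t_{n+1}}, && (37)\\
\lim_{t_n - t_{n+1}\to 0}\left.\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t}\right|_{t=t_{n+1}} &= \lim_{t_n - t_{n+1}\to 0}\left.2\sigma_t\frac{\mathrm{d}\sigma_t}{\mathrm{d}t}\right|_{t=t_{n+1}} = \lim_{t_n - t_{n+1}\to 0}2\sigma_{t_{n+1}}\,\frac{\sigma_{t_n}-\sigma_{t_{n+1}}}{t_n - t_{n+1}}. && (38)
\end{aligned}
$$

Then we have

$$
\begin{aligned}
z_{t_n} &= \frac{\alpha_{t_n}}{\alpha_{t_{n+1}}}z_{t_{n+1}} - \left[\sigma_{t_{n+1}}(\sigma_{t_n}-\sigma_{t_{n+1}}) - \frac{(\alpha_{t_n}-\alpha_{t_{n+1}})\sigma_{t_{n+1}}^2}{\alpha_{t_{n+1}}}\right]s_\theta(z_{t_{n+1}},t_{n+1}) && (39)\\
&= \frac{\alpha_{t_n}}{\alpha_{t_{n+1}}}z_{t_{n+1}} - \left[\sigma_{t_{n+1}}\sigma_{t_n} - \frac{\alpha_{t_n}\sigma_{t_{n+1}}^2}{\alpha_{t_{n+1}}}\right]s_\theta(z_{t_{n+1}},t_{n+1}). && (40)
\end{aligned}
$$

For $t_0 = 0$, we know that $\alpha_0 = 1$ and $\sigma_0 = 0$. Here $z_0$ refers to the generated sample, and we have

$$
\begin{aligned}
z_0 &= \frac{z_\xi}{\alpha_\xi} + \frac{\sigma_\xi^2}{\alpha_\xi}s_\theta(z_\xi,\xi) && (41)\\
&= \frac{z_\xi}{\alpha_\xi} + \frac{\sigma_\xi^2}{\alpha_\xi}\sum_{n=1}^{N}\frac{\exp\left(-\frac{\|\alpha_\xi x_n - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}{\sum_{n'=1}^{N}\exp\left(-\frac{\|\alpha_\xi x_{n'} - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}\,\frac{\alpha_\xi x_n - z_\xi}{\sigma_\xi^2} && (42)\\
&= \sum_{n=1}^{N}\frac{\exp\left(-\frac{\|\alpha_\xi x_n - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}{\sum_{n'=1}^{N}\exp\left(-\frac{\|\alpha_\xi x_{n'} - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}\,x_n. && (43)
\end{aligned}
$$

From the above equation, we conclude that the samples generated by the optimal diffusion model are softmax-weighted linear combinations of the training samples in $\mathcal{D}$. Next, we consider a discrete distribution, and suppose $z = \frac{z_\xi}{\alpha_\xi}$:

$$
p(z = x_n,\xi) = \frac{\exp\left(-\frac{\|z - x_n\|_2^2}{2\alpha_\xi^{-2}\sigma_\xi^2}\right)}{\sum_{n'=1}^{N}\exp\left(-\frac{\|z - x_{n'}\|_2^2}{2\alpha_\xi^{-2}\sigma_\xi^2}\right)}, \qquad n = 1,2,\dots,N. \qquad (44)
$$

Suppose $x_m = \mathrm{NN}_1(z,\mathcal{D}) = \arg\min_{x\in\mathcal{D}}\|z - x\|_2^2$, then we have

$$
\|z - x_m\|_2^2 - \|z - x_{n\neq m}\|_2^2 < 0. \qquad (45)
$$

When $\xi\to 0^+$, $\eta = \alpha_\xi^{-1}\sigma_\xi\to 0^+$, and then we have

$$
\begin{aligned}
\lim_{\xi\to 0^+}p(z = x_m,\xi) &= \lim_{\eta\to 0^+}\frac{\exp\left(-\frac{\|z - x_m\|_2^2}{2\eta^2}\right)}{\sum_{n'=1}^{N}\exp\left(-\frac{\|z - x_{n'}\|_2^2}{2\eta^2}\right)} && (46)\\
&= \lim_{\eta\to 0^+}\frac{1}{\sum_{n'\neq m}\exp\left(-\frac{\|z - x_{n'}\|_2^2 - \|z - x_m\|_2^2}{2\eta^2}\right)+1} && (47)\\
&= \frac{1}{\sum_{n'\neq m}\lim_{\eta\to 0^+}\exp\left(-\frac{\|z - x_{n'}\|_2^2 - \|z - x_m\|_2^2}{2\eta^2}\right)+1} && (48)\\
&= \frac{1}{\sum_{n'\neq m}\lim_{\eta\to+\infty}\exp\left(\frac{\eta^2}{2}\left(\|z - x_m\|_2^2 - \|z - x_{n'}\|_2^2\right)\right)+1} && (49)\\
&= 1. && (50)
\end{aligned}
$$

Since the probabilities sum to one,

$$
p(z = x_m,\xi) + \sum_{n'\neq m}p(z = x_{n'},\xi) = 1, \qquad (51)
$$

we have

$$
\sum_{n'\neq m}\lim_{\xi\to 0^+}p(z = x_{n'},\xi) = 0. \qquad (52)
$$

Given $p(z = x_i,\xi)\geq 0$,

$$
\lim_{\xi\to 0^+}p(z = x_{n'\neq m},\xi) = 0. \qquad (53)
$$

Therefore,

$$
\lim_{\xi\to 0^+}z_0 = \lim_{\xi\to 0^+}\mathbb{E}_{p(z = x_n,\xi)}[z] = \mathrm{NN}_1(z,\mathcal{D}). \qquad (54)
$$

From the above analysis, we conclude that when $t_1 = \xi$ is close to 0, the probabilistic ODE solver returns a training sample in $\mathcal{D}$.

Next, we consider an SDE solver, for which the update rule is

$$
\begin{aligned}
z_{t_n} &= z_{t_{n+1}} + \left[\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}z_t - \left(\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2\right)s_\theta(z_t,t)\right]_{t=t_{n+1}}(t_n - t_{n+1})\\
&\quad+ \sqrt{\left.\left(\frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\frac{\mathrm{d}\log\alpha_t}{\mathrm{d}t}\sigma_t^2\right)\right|_{t=t_{n+1}}(t_{n+1} - t_n)}\;\epsilon && (55)\\
&= \frac{\alpha_{t_n}}{\alpha_{t_{n+1}}}z_{t_{n+1}} - 2\left[\sigma_{t_{n+1}}\sigma_{t_n} - \frac{\alpha_{t_n}\sigma_{t_{n+1}}^2}{\alpha_{t_{n+1}}}\right]s_\theta(z_{t_{n+1}},t_{n+1})\\
&\quad+ \sqrt{2\left(\frac{\alpha_{t_n}\sigma_{t_{n+1}}^2}{\alpha_{t_{n+1}}} - \sigma_{t_{n+1}}\sigma_{t_n}\right)}\;\epsilon, && (56)
\end{aligned}
$$

where $\epsilon\sim\mathcal{N}(0,I)$ is Gaussian noise. Similarly, we consider the update step at $t_0 = 0$:

$$
\begin{aligned}
z_0 &= \frac{z_\xi}{\alpha_\xi} + 2\frac{\sigma_\xi^2}{\alpha_\xi}s_\theta(z_\xi,\xi) + \sqrt{\frac{2\sigma_\xi^2}{\alpha_\xi}}\;\epsilon\\
&= 2\sum_{n=1}^{N}\frac{\exp\left(-\frac{\|\alpha_\xi x_n - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}{\sum_{n'=1}^{N}\exp\left(-\frac{\|\alpha_\xi x_{n'} - z_\xi\|_2^2}{2\sigma_\xi^2}\right)}\,x_n - \frac{z_\xi}{\alpha_\xi} + \sqrt{\frac{2\sigma_\xi^2}{\alpha_\xi}}\;\epsilon, && (57)
\end{aligned}
$$

so that, since $\sigma_\xi\to 0$ as $\xi\to 0^+$,

$$
\lim_{\xi\to 0^+}z_0 = 2\,\mathrm{NN}_1\left(\lim_{\xi\to 0^+}\frac{z_\xi}{\alpha_\xi},\mathcal{D}\right) - \lim_{\xi\to 0^+}\frac{z_\xi}{\alpha_\xi}. \qquad (58)
$$

Considering that $\lim_{\xi\to 0^+}\frac{z_\xi}{\alpha_\xi} = \lim_{\xi\to 0^+}z_0$, we obtain

$$
\lim_{\xi\to 0^+}z_0 = \mathrm{NN}_1\left(\lim_{\xi\to 0^+}\frac{z_\xi}{\alpha_\xi},\mathcal{D}\right). \qquad (59)
$$

To summarize, through our analysis, we find that the optimal diffusion model always replicates training data through the backward process.

[Figure 6 panels: x-axis: sample size (5k to 50k); y-axes: ratio (%) of memorization (left) and its standard deviation (right).]

Figure 6: (Left) Memorization ratios (%) and (Right) standard deviations of memorization ratios (%) of diffusion models with different sample sizes.

A.3 Optimal Class-conditioned Diffusion Model

In the above, we provide the derivation of the optimal diffusion model for unconditional generation under the assumption of the empirical data distribution $x\sim\frac{1}{N}\sum_{n=1}^{N}\delta(x - x_n)$. Next, we consider the scenario of class-conditional generation. The dataset can be represented as $\mathcal{D}\triangleq\{x_n,y_n\}_{n=1}^{N}$, $y_n\in\{1,2,\dots,C\}$, where $C$ is the number of classes. Then the empirical joint distribution of data and label is $x,y\sim\frac{1}{N}\sum_{n=1}^{N}\delta(x - x_n)\delta(y - y_n)$.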
For class-conditional generation, the empirical DSM objective is written as

$$J_{\mathrm{DSM}}(\theta) = \frac{1}{2N}\sum_{n=1}^{N} \mathbb{E}_{t,\epsilon}\left[\lambda(t)\left\| s_\theta(\alpha_t x_n + \sigma_t\epsilon, y_n, t) + \frac{\epsilon}{\sigma_t}\right\|_2^2\right] = \frac{1}{2N}\sum_{c=1}^{C}\sum_{n=1}^{N_c} \mathbb{E}_{t,\epsilon}\left[\lambda(t)\left\| s_\theta(\alpha_t x_n^c + \sigma_t\epsilon, c, t) + \frac{\epsilon}{\sigma_t}\right\|_2^2\right], \tag{60}$$

where $x_n^c$ refers to the n-th sample with class label $c$, and $N_c$ represents the number of samples with class label $c$. Similarly, by taking the gradient w.r.t. $s_\theta(z_t, c, t)$, we derive the optimal class-conditioned diffusion model for each class condition $c$:

$$s_\theta^*(z_t, c, t) = \frac{\sum_{n=1}^{N_c} \exp\left(-\frac{\|\alpha_t x_n^c - z_t\|_2^2}{2\sigma_t^2}\right)\frac{\alpha_t x_n^c - z_t}{\sigma_t^2}}{\sum_{n'=1}^{N_c} \exp\left(-\frac{\|\alpha_t x_{n'}^c - z_t\|_2^2}{2\sigma_t^2}\right)}. \tag{61}$$

B Implementation Details of Our Basic Experimental Setup

Data distribution. Most of our experiments are conducted on the CIFAR-10 dataset, which consists of 50k RGB images with a spatial resolution of 32 × 32. CIFAR-10 has 10 classes, each of which has 5k images. We also include experiments on the FFHQ dataset (Karras et al., 2019) to reinforce our findings. The FFHQ dataset comprises 70k RGB images of human faces with an original resolution of 1024 × 1024. We down-sample these images to a resolution of 64 × 64 following Karras et al. (2022) to conduct our experiments. When modifying the intra-diversity of the data distribution, we blend in several images from the ImageNet dataset (Deng et al., 2009) to construct training datasets. In our study, we disable data augmentation to prevent any ambiguity regarding memorization.

Model configuration. We consider the baseline VP configuration in Karras et al. (2022).³ The model architecture is DDPM++, which is based on U-Net (Ronneberger et al., 2015). As our basic model, we set the number of residual blocks per resolution in the U-Net to 2 instead of 4 as in the original implementation. For diffusion models trained on CIFAR-10, the channel multiplier is 128, resulting in 256 channels at all resolutions of 32 × 32, 16 × 16, and 8 × 8.
For diffusion models trained on FFHQ, the employed U-Net has 128 channels at the resolution of 64 × 64, and 256 channels at the resolutions of 32 × 32, 16 × 16, and 8 × 8. The time embedding is positional encoding (Vaswani et al., 2017).

Training procedure. Our diffusion models are trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $2 \times 10^{-4}$ and a batch size of 512 for CIFAR-10 and 256 for FFHQ. The training duration is 40k epochs, compared with 4k epochs in Karras et al. (2022). It is worth mentioning that the number of training epochs is the same for all training data sizes. Therefore, smaller training datasets are trained for fewer total steps. This setup ensures that each image is drawn equally often during training. We schedule the learning rate and EMA rate similarly to Karras et al. (2022), but in an epoch-wise manner. In the first 200 epochs, we warm up the learning rate and the EMA rate as the number of training iterations increases. Afterwards, the learning rate is fixed to $2 \times 10^{-4}$ and the EMA rate is fixed to 0.99929.

Generation and evaluation. We follow the backward process in Karras et al. (2022) to generate the images used to compute the memorization metric. To decide on an appropriate sample size, we first train two diffusion models on CIFAR-10 according to our basic experimental setup at |D| = 1k and |D| = 2k. Afterwards, we generate 50k images with each model and then bootstrap different numbers of samples to compute memorization ratios. As presented in Figure 6, the memorization ratio has negligible variance when the sample size exceeds 10k images. Therefore, we generate 10k images throughout this study. We report the highest memorization ratio observed during the training process. To ensure consistency in our analysis, we use the same set of 10k samples for computing FIDs and alternative memorization metrics.
Specifically, we compute FIDs using 10k generated samples and the training dataset D (with varying sizes |D|).

Experimental compute resource. All experiments were run on 8 NVIDIA A100 GPUs, each with 80GB of memory. The running time of each experiment depends heavily on the model configuration and data size. For example, it takes about 120 hours to train an unconditional model or a conditional model with unique labels in Figure 5(c).

Experimental statistical significance. We also evaluate the variance of memorization ratios across multiple trials. Specifically, on CIFAR-10, we run a pair of experiments focused on data distribution, primarily by altering the number of classes. We set the training data size to |D| = 1k and the number of classes to C = 2 and C = 5, and then execute each experiment with three distinct random seeds. The memorization ratio is 94.59 ± 0.19% for C = 2 and 92.32 ± 0.14% for C = 5. The variation of memorization ratios across seeds is below 0.2%, which is insignificant. Considering this minimal fluctuation and the difficulty of repeating all experiments, we run each experiment once in the main paper.

³ We run the training configuration C in Karras et al. (2022) after adjusting hyper-parameters and redistributing capacity compared to Song et al. (2021b).

C More Empirical Results on CIFAR-10

C.1 Model Size

Model width. We investigate the influence of different channel multipliers, specifically exploring the values in {64, 96, 128, 160, 192, 224, 256}. We also consider two scenarios for the number of residual blocks per resolution: 2 and 4. As illustrated in Figure 7(a), when |D| = 2k, we observe a consistent and monotonic increase in the memorization ratio as the model width grows. This observation aligns with the conclusions drawn in Sec. 4.1.
Furthermore, we provide a dynamic view of the memorization ratios during the training process in Figure 7(c) and Figure 7(e). It is worth noting that wider diffusion models consistently exhibit higher levels of memorization throughout the entire training process.

Model depth. We examine the impact of varying the number of residual blocks per resolution, considering values in {2, 4, 6, 8, 10}. We use two different channel multiplier values, 128 and 256, and set the training data size to |D| = 2k. The results, as presented in Figure 7(b), confirm the non-monotonic effect of model depth on memorization. When the channel multiplier is set to 256, the curve relating the memorization ratio to the number of residual blocks per resolution exhibits multiple peaks. To gain a better understanding of this non-monotonic effect, we visualize the training process in Figure 7(d) and Figure 7(f). Occasionally, deeper diffusion models yield lower memorization ratios than shallower ones throughout the whole training process. It is noteworthy that when both model width and depth are set to large values (e.g., the number of residual blocks per resolution is 8 and the channel multiplier is 256), the memorization of the diffusion model also exhibits non-monotonic behavior.

C.2 Skip Connections

We employ DDPM++ (Song et al., 2021b; Karras et al., 2022) to explore the influence of the number of skip connections on memorization. Specifically, we consider retaining m = 1, 3, 5, 7, 8, 9 skip connections. To keep the costs tractable, we randomly select five distinct combinations of skip connections for each specific m. Our findings, depicted in Figure 8(a), show a consistent memorization ratio for larger values of m, whereas considerable variance is observed for smaller values of m. Notably, when m = 1, the memorization ratio varies substantially, ranging from approximately 0% to over 90%.
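The random selection of skip-connection combinations described above can be sketched as follows. This is a hedged illustration, not the authors' script: the function name `sample_skip_subsets` is hypothetical, and the total of 9 skip connections is an assumption inferred from the largest value of m and the index range shown in the skip-connection figures.

```python
import itertools
import random

def sample_skip_subsets(total, m, k=5, seed=0):
    """Pick up to k distinct subsets of m skip connections out of `total`.

    Skip connections are indexed 1..total (an illustrative convention).
    When fewer than k subsets exist (e.g. m == total), all of them are returned.
    """
    all_subsets = list(itertools.combinations(range(1, total + 1), m))
    rng = random.Random(seed)
    return rng.sample(all_subsets, min(k, len(all_subsets)))

# Assumed total of 9 skip connections; values of m follow the text.
subsets_by_m = {m: sample_skip_subsets(9, m) for m in (1, 3, 5, 7, 8, 9)}
```

Enumerating all combinations is cheap here because 9 choose m is at most 126; for m = 9 only a single configuration exists, so fewer than five combinations are returned in that case.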
In addition to retaining specific skip connections at different resolutions, as in the main paper, we also investigate the effect of selectively deleting certain skip connections at varying resolutions. As presented in Figure 8(b), when deleting one or two skip connections at any spatial resolution, the memorization ratios consistently remain above 90%. The memorization ratio only falls below 90% when three skip connections are deleted at the resolution of 32 × 32. These findings further confirm the pivotal role that skip connections at higher resolutions play in the memorization of diffusion models.

C.3 Unconditional vs. Conditional Generation

We compare the memorization ratios of unconditional diffusion models, conditional diffusion models with true labels, and conditional diffusion models with unique labels. As summarized in Table 4, with unique labels as class conditions, the trained diffusion models can only replicate training data, as the memorization ratios are close to 100% when |D| ≤ 5k. Additionally, we visualize the memorization ratios of diffusion models during training in Figure 9, which reaffirms that diffusion models with unique labels as input conditions memorize training data quickly. Typically, they reach memorization ratios above 90% within 8k training epochs.

Finally, when trained on the entire CIFAR-10 dataset, i.e., |D| = 50k, the memorization ratio of the conditional EDM with unique labels exceeds 65% within 12k training epochs, as presented in Figure 5(c). However, its unconditional counterpart still maintains a memorization ratio of zero. To further demonstrate this memorization gap, we visualize the generated images and their ℓ2-nearest training samples in D for these two models in Figure 10. The conditional EDM with unique labels replicates a large proportion of training data, while the unconditional EDM generates novel samples.
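To make the conditioning schemes compared above concrete, the following is a minimal sketch of how the label arrays could be constructed. The function name, the scheme identifiers, and the encoding of the unconditional case as a single shared label are illustrative assumptions; the random-label scheme from the main paper is included for completeness.

```python
import numpy as np

def make_condition_labels(true_labels, scheme, num_random_classes=10, seed=0):
    """Build the label array fed to a class-conditional model (illustrative).

    scheme:
      "uncond" -> one shared label for all samples (assumed equivalent to C = 1)
      "true"   -> the dataset's ground-truth class labels
      "random" -> uninformative labels drawn uniformly from C classes
      "unique" -> a distinct label per training sample (C = |D|)
    """
    true_labels = np.asarray(true_labels, dtype=np.int64)
    n = true_labels.shape[0]
    if scheme == "uncond":
        return np.zeros(n, dtype=np.int64)
    if scheme == "true":
        return true_labels.copy()
    if scheme == "random":
        rng = np.random.default_rng(seed)
        return rng.integers(0, num_random_classes, size=n)
    if scheme == "unique":
        return np.arange(n, dtype=np.int64)
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical toy labels for a 10-class dataset with |D| = 2k.
toy_true = np.arange(2000) % 10
unique_labels = make_condition_labels(toy_true, "unique")
random_labels = make_condition_labels(toy_true, "random", num_random_classes=50)
```

Under the unique scheme every class contains exactly one sample, so the class-conditional optimum for each label reduces to a single training point.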
Figure 7: At the size of training data |D| = 2k, memorization ratio (%) of diffusion models on CIFAR-10 with (a) varying model widths under 2 or 4 residual blocks per resolution; (b) varying model depths under the channel multiplier as 128 or 256; (c) varying model widths under 2 residual blocks per resolution during the training; (d) varying model depths under the channel multiplier as 128 during the training; (e) varying model widths under 4 residual blocks per resolution during the training; (f) varying model depths under the channel multiplier as 256 during the training.

D More Empirical Results on FFHQ

D.1 Data Dimension

We investigate the influence of data dimension on the memorization of diffusion models, particularly those trained using the FFHQ dataset. Specifically, we evaluate various resolutions: 64 × 64, 32 × 32, 16 × 16, with the latter two resolutions achieved by downsampling. We keep the model configurations and training procedure the same. As shown in Figure 11(a), for the 64 × 64 input resolution, we observe an EMM between 500 and 1k.
For the 32 × 32 input resolution, the EMM is close to 4k. Furthermore, when the input resolution is 16 × 16, the memorization ratio is still above 90% even for |D| = 8k, and the EMM is estimated to be between 8k and 10k. These results further indicate the significance of data dimensionality for memorization in diffusion models.

Figure 8: Memorization ratio (%) of diffusion models on CIFAR-10 when (a) retaining different numbers of skip connections; (b) deleting skip connections of certain spatial resolution.

Figure 9: Memorization ratios (%) of unconditional / conditional (with true labels) / conditional (with unique labels) diffusion models on CIFAR-10 during the training at the training data size (a) |D| = 2k; (b) |D| = 3k; (c) |D| = 4k; (d) |D| = 5k.

D.2 Time Embedding

We conduct experiments to compare two distinct time embedding methods within the model structure of DDPM++ (Song et al., 2021b; Karras et al., 2022): positional embedding (Vaswani et al., 2017) and random Fourier features (Tancik et al., 2020).
As illustrated in Figure 11(b), across the training data sizes |D| considered, there is a notable decline in the memorization ratio when employing random Fourier features for time embeddings in the DDPM++ model, which reconfirms our conclusions in the main paper.

Table 4: Memorization ratios (%) of unconditional / conditional (true labels) / conditional (unique labels) diffusion models on CIFAR-10.

                               |D| = 1k   |D| = 2k   |D| = 3k   |D| = 4k   |D| = 5k
Unconditional                  92.09      60.93      34.60      17.12      9.00
Conditional (true labels)      96.59      79.46      55.62      37.59      23.00
Conditional (unique labels)    100.00     99.88      99.89      99.57      99.66

Figure 10: Generated images (top three rows) and their ℓ2-nearest training samples in D (bottom three rows) by (a) the conditional EDM with unique labels; (b) the unconditional EDM.

D.3 Unconditional vs. Conditional Generation

In the main paper, we demonstrated the significant contribution of random labels to memorization in diffusion models. To further substantiate these findings, we conduct additional experiments using the FFHQ dataset. FFHQ has no ground-truth class labels, so we only consider random labels and unique labels in our experimental design. First, we construct a training dataset with |D| = 2k. The number of classes C for random labels is then varied within the set {1, 10, 50, 100, 200, 400, 2000}, and we train conditional diffusion models on the training data with each set of random labels. As demonstrated in Figure 11(c), the memorization ratio for C = 1 (equivalent to the unconditional scenario) is only about 20%. However, it exceeds 90% once C is increased to 400. Additionally, we conduct another experiment to investigate the effects of unique labels. As visualized in Figure 11(d), when unique labels are provided as conditions to diffusion models, the memorization ratio still exceeds 90% at a training data size of |D| = 5k.
In contrast, for unconditional diffusion models, the memorization ratio already falls below 90% when |D| = 1k. To summarize, these findings show that the number of classes for random labels plays a pivotal role in the memorization of diffusion models, in line with our earlier results from experiments on the CIFAR-10 dataset.

Figure 11: Memorization ratio (%) of diffusion models on FFHQ under (a) different data dimension; (b) different time embeddings; (c) different numbers of classes for random labels; (d) unconditional diffusion models and conditional diffusion models with unique conditions.

Figure 12: FIDs of diffusion models on CIFAR-10 under different (a) data dimensions; (b) inter-diversity; (c) intra-diversity.

E Experimental Results of FIDs

In this section, we provide the Fréchet Inception Distances (FIDs) (Heusel et al., 2017) in parallel with the memorization ratios in the main paper. The results on CIFAR-10 are shown in Figures 12-15 and Tables 5-6, while the results on FFHQ are presented in Figure 16. By analyzing the FID results, we observe that a configuration with a higher memorization ratio tends to result in a lower FID.
This observation is in line with intuition: when diffusion models memorize a significant proportion of the training data, FIDs are also close to optimal.

Figure 13: FIDs of diffusion models on CIFAR-10 under different (a) model widths; (b) model depths; (c) time embeddings.

Figure 14: FIDs of diffusion models on CIFAR-10 when retaining (a) skip connections of certain spatial resolution for DDPM++; (b) single skip connection at different locations for DDPM++; (c) skip connections of certain spatial resolution for NCSN++; (d) single skip connection at different locations for NCSN++.

Table 5: FIDs of diffusion models on CIFAR-10 under different batch sizes.

Batch size               128     256     384     512     640     768     896
Learning rate (×10^-4)   0.5     1.0     1.5     2.0     2.5     3.0     3.5
FID, |D| = 1k            8.89    8.60    8.14    7.99    8.25    7.98    7.85
FID, |D| = 2k            17.56   15.57   15.23   14.41   14.83   14.56   14.85

Table 6: FIDs of diffusion models on CIFAR-10 under different (left) weight decay values; (right) EMA values.
(left)
Weight decay   FID, |D| = 1k   FID, |D| = 2k
0              7.99            14.41
1 × 10^-5      8.17            15.73
1 × 10^-4      8.08            14.36
1 × 10^-3      9.37            12.75
1 × 10^-2      13.47           17.02
5 × 10^-2      41.08           51.70
1 × 10^-1      63.78           48.21

(right)
EMA            FID, |D| = 1k   FID, |D| = 2k
0.99929        7.99            14.41
0.999          8.11            15.75
0.99           8.55            17.11
0.9            9.83            18.04
0.5            10.32           21.28
0.1            12.38           23.12
0.0            11.77           22.39

Figure 15: FIDs of (a) unconditional diffusion models and conditional diffusion models with true / random / unique labels on CIFAR-10; (b) conditional diffusion models with random labels of different C.

Figure 16: FIDs of diffusion models on FFHQ under (a) different data dimension; (b) different time embeddings; (c) different numbers of classes for random labels; (d) unconditional diffusion models and conditional diffusion models with unique conditions.

Figure 17: KNN distances of diffusion models on CIFAR-10 under different (a) data dimensions; (b) inter-diversity; (c) intra-diversity.
Figure 18: KNN distances of diffusion models on CIFAR-10 under different (a) model widths; (b) model depths; (c) time embeddings.

F Experimental Results of Alternative Memorization Criteria

In this section, we aim to corroborate our findings regarding memorization of diffusion models by incorporating alternative memorization metrics. The memorization ratio, as defined in the main paper, can be formally written as

$$R_{\mathrm{mem}} = \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}\left(\frac{\|x'_n - \mathrm{NN}_1(x'_n, D)\|_2}{\|x'_n - \mathrm{NN}_2(x'_n, D)\|_2} < \frac{1}{3}\right), \tag{62}$$

where $\mathbb{I}$ is an indicator function, $\mathrm{NN}_j(x'_n, D)$ is the j-th nearest training sample of the generated sample $x'_n$ in D, and $N$ is the number of samples generated by the diffusion model. We also consider the KNN distance used in Carlini et al. (2023) as a surrogate memorization metric:

$$\text{KNN distance} = \frac{1}{N}\sum_{n=1}^{N} \|x'_n - \mathrm{NN}_1(x'_n, D)\|_2. \tag{63}$$

When the memorization ratio is high or the KNN distance is low, the diffusion model memorizes more of the training data. We then re-evaluate the results in the main paper using the KNN distance as the memorization metric. The results on CIFAR-10 are shown in Figures 17-20 and Tables 7-8, while the results on FFHQ are presented in Figure 21. These new results are in alignment with our original conclusions based on the memorization ratio metric.
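The two metrics in this section can be computed with a few lines of NumPy. The sketch below is illustrative rather than the authors' evaluation code: the function names are hypothetical, images are assumed to be flattened into vectors, and the ratio threshold is exposed as a parameter with 1/3 as an assumed default. It uses brute-force pairwise distances, which is only practical for small toy inputs.

```python
import numpy as np

def nearest_two_distances(generated, train):
    """L2 distances from each generated sample to its 1st and 2nd nearest training sample."""
    diffs = generated[:, None, :] - train[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # shape: (num_generated, num_train)
    two_smallest = np.sort(dists, axis=1)[:, :2]
    return two_smallest[:, 0], two_smallest[:, 1]

def memorization_ratio(generated, train, threshold=1.0 / 3.0):
    """Fraction of generated samples whose nearest/second-nearest distance ratio is below threshold."""
    d1, d2 = nearest_two_distances(generated, train)
    # Written as d1 < threshold * d2 to avoid dividing by zero.
    return float(np.mean(d1 < threshold * d2))

def knn_distance(generated, train):
    """Average L2 distance from generated samples to their nearest training sample."""
    d1, _ = nearest_two_distances(generated, train)
    return float(np.mean(d1))

# Toy check: half of the "generated" samples are exact copies of training points.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 16))
generated = np.concatenate([train[:5], rng.normal(size=(5, 16))])
ratio = memorization_ratio(generated, train)
avg_dist = knn_distance(generated, train)
```

In this toy check the five copied samples have zero distance to their nearest neighbour and are counted as memorized, whereas independent Gaussian draws in 16 dimensions sit at comparable distances from their first and second neighbours and are not.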
Figure 19: KNN distances of diffusion models on CIFAR-10 when retaining (a) skip connections of certain spatial resolution for DDPM++; (b) single skip connection at different locations for DDPM++; (c) skip connections of certain spatial resolution for NCSN++; (d) single skip connection at different locations for NCSN++.

Figure 20: KNN distances of (a) unconditional diffusion models and conditional diffusion models with true labels or random labels on CIFAR-10; (b) conditional diffusion models with random labels of different C; (c) conditional EDM with unique labels and unconditional EDM at |D| = 50k during the training.

Table 7: KNN distances of diffusion models on CIFAR-10 under different batch sizes.

Batch size                 128     256     384     512     640     768     896
Learning rate (×10^-4)     0.5     1.0     1.5     2.0     2.5     3.0     3.5
KNN distance, |D| = 1k     7.35    6.73    6.51    6.35    6.51    6.40    6.30
KNN distance, |D| = 2k     15.43   14.92   14.88   14.77   14.53   14.12   14.57

Table 8: KNN distances of diffusion models on CIFAR-10 under different (left) weight decay values; (right) EMA values.
(left)
Weight decay   KNN distance, |D| = 1k   KNN distance, |D| = 2k
0              6.35                     14.77
1 × 10^-5      6.45                     14.63
1 × 10^-4      6.42                     15.29
1 × 10^-3      7.32                     16.99
1 × 10^-2      9.99                     23.40
5 × 10^-2      26.69                    38.73
1 × 10^-1      34.44                    39.51

(right)
EMA            KNN distance, |D| = 1k   KNN distance, |D| = 2k
0.99929        6.35                     14.77
0.999          6.44                     14.58
0.99           6.51                     15.21
0.9            7.02                     15.39
0.5            7.11                     15.04
0.1            6.90                     15.08
0.0            7.08                     14.59

Figure 21: KNN distances of diffusion models on FFHQ under (a) different data dimension; (b) different time embeddings; (c) different numbers of classes for random labels; (d) unconditional diffusion models and conditional diffusion models with unique conditions.

G Implementation Details of Finetuning on Stable Diffusion

Data distribution. Imagenette consists of C = 10 classes from the ImageNet dataset (Deng et al., 2009). The classes are "chain saw", "garbage truck", "tench", "French horn", "gas pump", "English springer", "parachute", "church", "cassette player", and "golf ball". In our experiments, the image resolution is set to 256 × 256. As in our experiments on CIFAR-10 and FFHQ, we disable all data augmentation.

Model configuration. Stable Diffusion (Rombach et al., 2022) includes an image encoder, an image decoder, a U-Net, and a text encoder. The image encoder encodes images into latent representations, while the decoder reconstructs images from the latents. The U-Net serves as the latent diffusion model conditioned on text embeddings. Before fine-tuning the model on the Artbench dataset, we load the model weights from pre-trained Stable Diffusion.⁴

Training procedure.
During fine-tuning, we only train the U-Net in Stable Diffusion and freeze the weights of the image encoder/decoder and the text encoder. We adopt learning rates of $1 \times 10^{-4}$ and $3 \times 10^{-4}$ for full fine-tuning and LoRA fine-tuning (Hu et al., 2021), respectively, with a cosine learning rate schedule and a weight decay of $1 \times 10^{-6}$. The batch size is set to 16 and the number of gradient accumulation steps is set to 4. For each training dataset D, we enable EMA and fine-tune Stable Diffusion until 400k images have been observed.

Text-to-image generation. We follow the DDIM pipeline (Song et al., 2021a) with 50 steps to sample 10k images from the fine-tuned Stable Diffusion. During generation, we disable the safety checker to prevent generating black images. For plain conditioning, we use the text prompt "a picture". For class conditioning, we sample 1k images per class using the prompt "a picture of" followed by the class name, totaling 10k generated samples.

⁴ https://huggingface.co/lambdalabs/miniSD-diffusers

Figure 22: EMMs of diffusion models with various ζ (ζ = 0.1 and ζ = 0.2) under configurations of (a) different data dimensions; (b) different model widths; (c) different numbers of classes for random labels.

H Sensitivity of Effective Model Memorization on ζ

The EMM metric defined in Sec. 2.1 is used to quantify the conditions under which empirically trained diffusion models exhibit memorization behavior similar to the theoretical optimum. The choice of ζ determines how close the empirical model must be to the theoretical one. Generally, a smaller ζ requires a higher memorization ratio for the empirical diffusion model to approximate the theoretical optimum. In Figure 22, we select various ζ in different configurations. We observe that a smaller ζ typically results in a smaller EMM. Meanwhile, the conclusions in our main paper remain consistent.
For instance, smaller data resolutions, larger model widths, and larger numbers of classes for conditioning all lead to more memorization.
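The dependence of EMM on ζ can be illustrated with a coarse grid-based estimate. The sketch below assumes that EMM is the largest training size whose memorization ratio is at least 1 − ζ, which paraphrases the definition in Sec. 2.1 rather than reproducing it, and it only considers the measured sizes without interpolation. The function name is hypothetical; the ratios are the CIFAR-10 values reported in Table 4.

```python
def estimate_emm(sizes, ratios, zeta):
    """Largest measured training size whose memorization ratio is >= 1 - zeta.

    `ratios` are fractions in [0, 1]. Returns None when no measured size
    qualifies. This is a grid-level estimate under an assumed threshold rule.
    """
    threshold = 1.0 - zeta
    qualifying = [n for n, r in zip(sizes, ratios) if r >= threshold]
    return max(qualifying) if qualifying else None

# Memorization ratios from Table 4 (CIFAR-10), converted to fractions.
sizes = [1000, 2000, 3000, 4000, 5000]
uncond = [0.9209, 0.6093, 0.3460, 0.1712, 0.0900]
cond_true = [0.9659, 0.7946, 0.5562, 0.3759, 0.2300]
cond_unique = [1.0000, 0.9988, 0.9989, 0.9957, 0.9966]

emm_uncond = estimate_emm(sizes, uncond, zeta=0.1)
emm_true = estimate_emm(sizes, cond_true, zeta=0.25)
emm_unique = estimate_emm(sizes, cond_unique, zeta=0.1)
```

On this grid, a stricter ζ can only keep the estimate the same or shrink it, consistent with the trend described above; for unique labels the estimate saturates at the largest measured size, so the true EMM lies at or beyond 5k.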