# Addressing Negative Transfer in Diffusion Models

Hyojun Go1, Jin Young Kim1, Yunsung Lee2, Seunghyun Lee3, Shinhyeok Oh3, Hyeongdon Moon4, Seungtaek Choi5
Twelvelabs1, Wrtn Technologies2, Riiid3, EPFL4, Yanolja5
{gohyojun15, seago0828}@gmail.com1, sung@wrtn.io2, {seunghyun.lee, shinhyeok.oh}@riiid.co3, hyeongdon.moon@epfl.ch4, seungtaek.choi@yanolja.com5

Co-first author. 1,2,4,5 Work done while at Riiid. Corresponding author. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## Abstract

Diffusion-based generative models have achieved remarkable success in various domains. They train a shared model on denoising tasks that encompass different noise levels simultaneously, representing a form of multi-task learning (MTL). However, analyzing and improving diffusion models from an MTL perspective remains underexplored. In particular, MTL can sometimes lead to the well-known phenomenon of negative transfer, which results in the performance degradation of certain tasks due to conflicts between tasks. In this paper, we first analyze diffusion training from an MTL standpoint, presenting two key observations: (O1) the task affinity between denoising tasks diminishes as the gap between noise levels widens, and (O2) negative transfer can arise even in diffusion training. Building upon these observations, we aim to enhance diffusion training by mitigating negative transfer. To achieve this, we propose leveraging existing MTL methods, but the presence of a huge number of denoising tasks makes it computationally expensive to calculate the necessary per-task loss or gradient. To address this challenge, we propose clustering the denoising tasks into small task clusters and applying MTL methods to them. Specifically, based on (O2), we employ interval clustering to enforce temporal proximity among denoising tasks within clusters. We show that interval clustering can be solved using dynamic programming, utilizing signal-to-noise ratio, timestep, and task affinity for clustering objectives. Through this, our approach addresses the issue of negative transfer in diffusion models by allowing for efficient computation of MTL methods. We validate the efficacy of the proposed clustering and its integration with MTL methods through various experiments, demonstrating 1) improved generation quality and 2) faster training convergence of diffusion models. Our project page is available at https://gohyojun15.github.io/ANT_diffusion/.

## 1 Introduction

Diffusion-based generative models [20, 66, 71] have accomplished remarkable achievements in various generative tasks, including image [8], video [21, 23], 3D shape [44, 54], and text generation [38]. In particular, they have shown excellent performance and flexibility in a wide range of image generation settings, including unconditional [28, 47], class-conditional [22], and text-conditional image generation [1, 48, 55]. Consequently, improving diffusion models has garnered significant interest.

The framework of diffusion models [20, 66, 71] comprises gradually corrupting the data towards a given noise distribution and its subsequent reverse process. A model is optimized by minimizing the weighted sum of denoising score-matching losses across various noise levels [20, 69] for learning the reverse process. This can be interpreted as diffusion training aiming to train a single shared model to denoise its input across various noise levels.
Therefore, diffusion training is inherently multi-task learning (MTL), where each noise level represents a distinct denoising task. However, analyzing and improving diffusion models from an MTL perspective remains underexplored. In particular, sharing one model between tasks may lead to competition between conflicting tasks, resulting in a phenomenon known as negative transfer [24, 25, 57, 78], which leads to poorer performance compared to learning individual tasks with separate models. Negative transfer has been a critical issue in MTL research, and related works have demonstrated that the performance of multi-task models can be improved by remediating negative transfer [24, 25, 57, 78, 83]. Considering this, we argue that negative transfer should be investigated in diffusion models, and, if present, addressing it is a potential direction for improving diffusion models.

In this paper, we characterize how multi-task the diffusion model is and whether negative transfer exists among denoising tasks. In particular, (O1) we first observe that the task affinity [12, 78] between two denoising tasks is negatively correlated with the difference in their noise levels, indicating that they may conflict less as the noise levels become more similar [78]. This suggests that denoising tasks adjacent in noise level should be considered more harmonious than non-adjacent tasks. Next, (O2) we observe the presence of negative transfer in diffusion model training. During sampling within a specific timestep interval, a model trained exclusively on the denoising tasks within that interval generates higher-quality samples than a model trained on all denoising tasks simultaneously. This finding implies that simultaneously learning all denoising tasks can degrade denoising within a specific time interval, indicating the occurrence of negative transfer.

Based on these observations, we focus on improving diffusion models by addressing negative transfer. To achieve this, we first propose to leverage existing multi-task learning techniques, such as those dealing with conflicting gradients [5, 83], differences in gradient magnitudes [42, 46, 64], and imbalanced loss scales [4, 16, 29]. However, unlike previous MTL studies that typically focus on small sets of tasks, the presence of a large number of denoising tasks (thousands) in diffusion models makes this computationally expensive, since MTL methods generally require calculating a per-task loss or gradient in each iteration [4, 5, 16, 24, 29, 42, 46, 64, 78, 83]. To address this, we propose a strategy that first clusters the entire set of denoising tasks and then applies multi-task learning methods to the resulting clusters. Specifically, inspired by (O1), we formulate the interval clustering problem, which groups denoising tasks by pairwise disjoint timestep intervals. Building on this formulation, we propose timestep-, signal-to-noise-ratio-, and task-affinity-score-based interval clustering and show that these objectives can be optimized by dynamic programming, as in [2, 76, 49]. Through our strategy, we can address the issue of negative transfer in diffusion models by allowing for efficient computation of multi-task learning methods.

We evaluated our proposed methods through extensive experiments on widely recognized datasets: FFHQ [27], CelebA-HQ [26], and ImageNet [7].
For a comprehensive analysis, we employed various models, including the Ablated Diffusion Model (ADM) [8], Latent Diffusion Model (LDM) [56], and Diffusion Transformer (DiT) [52]. These models represent diverse diffusion architectures spanning pixel-space, latent-space, and transformer-based paradigms. Our results underscore a significant enhancement in image generation quality, attributed to a marked reduction in negative transfer. This affirms the merits of our clustering proposition and its synergistic integration with MTL techniques.

## 2 Related Work

**Diffusion Models** Diffusion models [20, 66, 71] are a family of generative models that generate samples from noise via a learned denoising process. Diffusion models beat other likelihood-based models, such as autoregressive models [62, 75], flow models [9, 10], and variational autoencoders [32], in terms of sample quality, and sometimes outperform GANs [14] in certain cases [8]. Moreover, pre-trained diffusion models can easily be applied to downstream image synthesis tasks such as image editing [30, 45] and plug-and-play generation [13, 15]. Owing to these advantages, several works have applied diffusion models to various domains [3, 23, 38, 44, 54] and large-scale models [48, 56, 58].

Several studies have focused on improving diffusion models in various aspects, such as architecture [1, 8, 28, 52, 82], sampling speed [33, 60, 67], and training objectives [6, 17, 31, 70, 74]. Among these, the most closely related studies are those improving training objectives, as we aim to enhance optimization between denoising tasks from the perspective of multi-task learning (MTL). Several works [31, 70, 74] redesign training objectives to improve likelihood estimation. However, these objectives may lead to sample quality degradation and training instability and require additional techniques such as importance sampling [70, 74] and sophisticated parameterization [31] to be successfully applied. On the other hand, P2 [6] proposes a weighted training objective that prioritizes denoising tasks at certain noise levels, where the model is expected to learn perceptually rich features. Similar to P2, we aim to improve the sample quality of diffusion models from an MTL perspective, and we will show that our method is also beneficial to P2. As concurrent work, Min-SNR [17] shares a common insight with us that diffusion training is essentially multi-task learning. However, their observation lacks a direct connection to negative transfer in terms of sample quality. They address the instability and inefficiency of multi-task learning optimization in diffusion models, mainly due to the large number of denoising tasks. In contrast, our work delves deeper into exploring negative transfer and task affinity, and we propose the application of MTL methods through task clustering to overcome the challenges identified in Min-SNR.

**Multi-Task Learning** Multi-task learning (MTL) is an approach that trains a single model to perform multiple tasks simultaneously [57]. Although sharing parameters between tasks can reduce the overall number of parameters, it may also result in negative transfer, causing performance degradation because of conflicting tasks during the training procedure [24, 25, 57, 78]. Prior works have tracked down three causes of negative transfer: (1) conflicting gradients, (2) differences in gradient magnitude, and (3) imbalanced loss scales.
First, conflicting gradients among different tasks may negate each other, resulting in poorer updates for a subset of tasks, or even for all of them. PCgrad [83] and GradDrop [5] mitigate this by projecting out the conflicting components of gradients and by dropping gradient elements based on the degree of conflict, respectively. Second, tasks with larger gradients may dominate tasks with smaller gradients due to differences in gradient magnitude across tasks. Different optimization schemes have been proposed to equalize gradient magnitudes, including MGDA-UB [64], IMTL-G [42], and NashMTL [46]. Similarly, imbalanced loss scales may cause tasks with smaller losses to be dominated by those with larger losses. To balance task losses, uncertainty [29], task difficulty [16], and gradient norm [4] are exploited.

Adapting MTL methods and negative transfer formulations to diffusion models is challenging, since these techniques are typically designed for scenarios with a small number of tasks and easily measurable individual task performance. Our goal is to address this challenge and demonstrate that observing negative transfer in diffusion models and mitigating it can improve them.

## 3 Preliminaries and Observation

We first provide the necessary background on diffusion models and their multi-task nature. Next, we conduct analyses that yield two important observations: (O1) the task affinity between two tasks is negatively correlated with the difference in their noise levels, and (O2) negative transfer indeed exists in diffusion training, i.e., the model is overburdened with different, potentially conflicting tasks.

Figure 1 (panels: (a) ADM, timestep $t$; (b) ADM, log-SNR; (c) LDM, timestep $t$; (d) LDM, log-SNR): Task affinity scores plotted against timestep and log-SNR axes in ADM and LDM. As the timestep and SNR differences decrease, task affinity increases, implying more aligned gradient directions between denoising tasks and reduced negative impact on their joint training.

### 3.1 Preliminaries

A diffusion model [20, 66, 71] consists of two processes: a forward process and a reverse process. The forward process $q$ gradually injects noise into a datapoint $x_0$ to obtain noisy latents $\{x_1, \ldots, x_T\}$ as:

$$q(x_t \mid x_0) = \mathcal{N}(x_t \mid \alpha_t x_0, \sigma_t^2 I), \qquad q(x_t \mid x_s) = \mathcal{N}(x_t \mid \alpha_{t|s} x_s, (\sigma_t^2 - \alpha_{t|s}^2 \sigma_s^2) I), \qquad 1 \le s < t \le T, \qquad (1)$$

where $\alpha_t, \sigma_t$ characterize the signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$, and $\alpha_{t|s} = \alpha_t / \alpha_s$. Here, $\mathrm{SNR}(t)$ decreases in $t$, such that by the designated final timestep $t = T$, $q(x_T) \approx \mathcal{N}(0, I)$.

The reverse process is a parameterized model trained to restore the original data from data corrupted during the forward process. The widely adopted training scheme uses a simple noise-prediction objective [8, 20, 34, 56, 59] that trains the model to predict the noise component $\epsilon$ of the latent $x_t = \alpha_t x_0 + \sigma_t \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. More formally, the objective is as follows:

$$\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}[L_t], \quad \text{where } L_t = \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2. \qquad (2)$$

Let us denote by $D_t$ the denoising task at timestep $t$, trained by minimizing the loss $L_t$ (Eq. 2). Then, since a diffusion model jointly learns multiple denoising tasks $\{D_t\}_{t=1,\ldots,T}$ using a single shared model $\epsilon_\theta$, it can be regarded as a multi-task learner. Also, we denote by $D_{[t_1,t_2]}$ the set of tasks $\{D_{t_1}, D_{t_1+1}, \ldots, D_{t_2}\}$ henceforth.
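As a concrete reference for Eq. (1)-(2), the following is a minimal PyTorch-style sketch (not the authors' code) of one training step: sample a timestep $t$, form $x_t = \alpha_t x_0 + \sigma_t \epsilon$, and compute the per-task loss $L_t$. The variance-preserving linear schedule and the toy stand-in model are illustrative assumptions.

```python
import torch

T = 1000
# Assumed variance-preserving linear beta schedule (illustrative; any schedule
# defining alpha_t, sigma_t with decreasing SNR(t) = alpha_t^2 / sigma_t^2 works).
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)
alpha = alphas_bar.sqrt()          # alpha_t in Eq. (1)
sigma = (1.0 - alphas_bar).sqrt()  # sigma_t in Eq. (1)

def simple_loss(eps_model, x0):
    """L_t = ||eps - eps_theta(x_t, t)||_2^2 for a uniformly sampled t (Eq. 2)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                       # denoising task index
    eps = torch.randn_like(x0)                          # target noise
    a = alpha[t].view(b, 1, 1, 1)
    s = sigma[t].view(b, 1, 1, 1)
    x_t = a * x0 + s * eps                              # forward-process sample
    pred = eps_model(x_t, t)                            # eps_theta(x_t, t)
    return ((eps - pred) ** 2).flatten(1).sum(-1).mean()

# Toy usage with a stand-in "model" (a real denoiser such as ADM/LDM would go here).
toy_model = lambda x, t: torch.zeros_like(x)
loss = simple_loss(toy_model, torch.randn(4, 3, 32, 32))
```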
Figure 2 (two panels, ADM and LDM; x-axis: denoising task groups $D_{[1,200]}, D_{[201,400]}, D_{[401,600]}, D_{[601,800]}, D_{[801,1000]}$): Negative transfer gap (NTG) in FID of ADM and LDM for denoising tasks $D_{[t_1,t_2]}$. If the NTG is negative, the $D_{[t_1,t_2]}$-trained model outperforms the model trained on the entire set of denoising tasks at denoising the latents $\{x_t\}_{t \in [t_1,t_2]}$, showing the occurrence of negative transfer. Negative transfer occurs in both ADM and LDM.

### 3.2 Observation

By considering diffusion training as a form of multi-task learning, we can analyze how the diffusion model learns the denoising tasks. We experimentally analyze diffusion models with two concepts from multi-task learning: 1) task affinity [72, 12], measuring which combinations of denoising tasks may yield a more positive impact on performance; and 2) negative transfer [68, 24, 25, 57, 78, 83], the degradation of denoising tasks caused by multi-task learning. We use the lightweight ADM [8] used in [6] and the LDM [56] with the FFHQ 256x256 dataset [27] to analyze diffusion models trained in both pixel and latent space.

**(O1) Task Affinity Analysis** We first analyze how the denoising tasks $D_{[1,T]}$ relate to each other by measuring task affinities [72, 12]. In particular, we adopt the gradient-direction-based task affinity score [78]: for two given tasks $D_i$ and $D_j$, we calculate the pairwise cosine similarity between the gradients of each task loss, i.e., $\nabla_\theta L_i$ and $\nabla_\theta L_j$, and then average the similarities across training iterations. The task affinity score assumes that cooperative (conflicting) tasks produce similar (conflicting) gradient directions, and it has been shown to correlate with the MTL model's overall performance [78]. Although there have been attempts to divide diffusion model phases using the signal-to-noise ratio [6] and the trace of the covariance of training targets [81], we are the first to provide an explicit and fine-grained analysis of task affinities among denoising tasks.

In Fig. 1, we visualize the task affinity scores among denoising tasks for both ADM and LDM, with both timestep and log-SNR as axes. As can be seen in Fig. 1, the task affinity between two tasks $D_i, D_j$ is high for neighboring tasks, i.e., $i \approx j$, and decreases smoothly as the difference in SNRs (or timesteps) increases. This suggests that tasks sharing temporal/noise-level proximity can be learned cooperatively without significant conflict. This result also hints at the possibility that denoising tasks at vastly different SNRs (distant in timesteps) may potentially conflict.

**(O2) Negative Transfer Analysis** Next, we show that there exist negative transfers among the denoising tasks $D_{[1,T]}$. Negative transfer refers to a multi-task learner's performance degradation due to task conflicts, and it can be identified by observing the performance gap between a multi-task learner and task-specific learners. For ease of observation, we group tasks by intervals, based on observation (O1) that tasks neighboring in timestep have higher task affinity. Specifically, we investigate whether the task group $D_{[t_1,t_2]}$ suffers negative impacts from the remaining tasks. To quantify negative transfer, we follow this procedure: first, we generate samples $\{\tilde{x}_0\}$ using a model trained on all denoising tasks $D_{[1,T]}$. Next, we repeat the same sampling procedure, except that we replace the model with a model trained only on $D_{[t_1,t_2]}$ for the latents $\{x_t\}_{t \in [t_1,t_2]}$; we denote the resulting samples by $\{\tilde{x}_0^{[t_1,t_2]}\}$.
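To make this measurement procedure concrete, the sketch below contrasts the two sample sets: $\{\tilde{x}_0\}$ from the model trained on all tasks, and $\{\tilde{x}_0^{[t_1,t_2]}\}$ obtained by swapping in the $D_{[t_1,t_2]}$-only model for timesteps in $[t_1, t_2]$; comparing their FIDs gives the negative transfer gap formalized in Eq. (3) below. The reverse-process step function, model handles, and `compute_fid` callable are stand-ins, not the authors' code.

```python
import torch

@torch.no_grad()
def reverse_sample(model_for_t, x_T, timesteps, step_fn):
    """Generic reverse-process loop: at timestep t, `model_for_t(t)` picks the denoiser
    to query, and `step_fn` applies one (e.g., DDIM) update. All pieces are stand-ins."""
    x = x_T
    for t in timesteps:                                  # e.g., 50 DDIM steps, T -> 1
        eps = model_for_t(t)(x, torch.full((x.shape[0],), t))
        x = step_fn(x, eps, t)
    return x

def negative_transfer_gap(full_model, interval_model, t1, t2,
                          x_T, timesteps, step_fn, compute_fid):
    """NTG(D_[t1,t2]) = FID(samples denoised by the interval-only model on [t1, t2])
                        - FID(samples denoised by the full model everywhere).
    A negative value means the interval-only model denoises [t1, t2] better, i.e. the
    jointly trained model suffered negative transfer on those tasks."""
    baseline = reverse_sample(lambda t: full_model, x_T, timesteps, step_fn)
    swapped = reverse_sample(
        lambda t: interval_model if t1 <= t <= t2 else full_model,
        x_T, timesteps, step_fn)
    return compute_fid(swapped) - compute_fid(baseline)
```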
If $\{\tilde{x}_0^{[t_1,t_2]}\}$ exhibits superior quality compared to $\{\tilde{x}_0\}$, it indicates that the model trained solely on $D_{[t_1,t_2]}$ performs better at denoising the latents $\{x_t\}_{t \in [t_1,t_2]}$ than the model trained on the entire set of denoising tasks. This suggests that $D_{[t_1,t_2]}$ suffers from negative transfer by learning the other tasks. More formally, given a performance metric $P$ (FID [18] in this paper), we define the negative transfer gap:

$$\mathrm{NTG}(D_{[t_1,t_2]}) := P(\{\tilde{x}_0^{[t_1,t_2]}\}) - P(\{\tilde{x}_0\}), \qquad (3)$$

where $\mathrm{NTG} < 0$ indicates that negative transfer occurs. The relationship between the negative transfer gap in previous literature and our negative transfer gap is described in Appendix A.

We visualize the negative transfers among denoising tasks for both the lightweight ADM [6, 8] and LDM [56] in Fig. 2. The results indicate that negative transfer occurs in three out of the five considered task groups for both models. Notably, negative transfers often have a significant impact, such as a 7.56 increase in FID for ADM in the worst case. Therefore, we hypothesize that there is room for improving the performance of diffusion models by mitigating negative transfer, which motivates us to leverage well-designed MTL methods for diffusion training.

## 4 Methodology

In Section 3.2, we made two observations: (O1) denoising tasks with a larger difference in $t$ and $\mathrm{SNR}(t)$ exhibit lower task affinity, and (O2) negative transfer occurs in diffusion training. Inspired by these observations, we aim to remediate negative transfer in diffusion models by leveraging MTL methods. Although MTL methods are reported to be effective when there are only a few tasks, they are impractical for diffusion models with a large number of denoising tasks, since they require computing per-task gradients or losses at each iteration. In this section, to deal with this challenge, we propose a strategy that first groups the denoising tasks into task clusters and then applies multi-task learning methods by regarding each task cluster as one distinct task.

### 4.1 Interval Clustering

Here, we first introduce a scheme that groups all denoising tasks $D_{[1,T]}$ into a small number of task clusters. This is a necessary step for applying well-established MTL methods, for they usually involve computationally expensive subroutines such as computing per-task gradients or losses in each training iteration. Our key idea is to enforce temporal proximity of denoising tasks within task clusters, given our observation (O1) that task affinity is higher for tasks closer in timestep. Therefore, we assign tasks to pairwise disjoint time intervals. To obtain the disjoint time intervals, we leverage an interval clustering algorithm [2, 49] that can optimize various clustering costs. In our case, interval clustering assigns the diffusion timesteps $X = \{1, \ldots, T\}$ to $k$ contiguous intervals $I_1, \ldots, I_k$, with $\bigsqcup_{i=1}^{k} (I_i \cap X) = X$, where $\sqcup$ denotes disjoint union. Let $I_i = [l_i, r_i]$, $l_i \le r_i$ for $i = 1, \ldots, k$; then we have $l_1 = 1$ and $r_i = l_{i+1} - 1$ ($i < k$, and $r_k = T$). The interval clustering problem is defined as:

$$\min_{l_1 = 1 < l_2 < \cdots < l_k \le T} \; \sum_{i=1}^{k} L_{\mathrm{cluster}}(X \cap I_i), \qquad (4)$$

where $L_{\mathrm{cluster}}$ denotes the clustering objective.
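One choice of clustering objective builds on the gradient-direction-based task affinity score from Section 3.2 (measured as described in Appendix B). A minimal sketch of that score is given below, assuming flattened per-task (or per-cluster) gradients have already been collected at several training checkpoints; the names and shapes are illustrative, not the authors' pipeline.

```python
import torch

def task_affinity_matrix(per_task_gradients):
    """per_task_gradients: list over checkpoints; each entry is a (num_tasks, dim)
    tensor of flattened gradients grad_theta L_t, one row per denoising task (or
    task cluster). Returns the (num_tasks, num_tasks) matrix of pairwise cosine
    similarities averaged across checkpoints, as visualized in Fig. 1."""
    affinities = []
    for g in per_task_gradients:
        g = torch.nn.functional.normalize(g, dim=1)  # unit-norm each task gradient
        affinities.append(g @ g.T)                   # pairwise cosine similarities
    return torch.stack(affinities).mean(dim=0)

# Toy usage: 5 task groups, 3 checkpoints, gradient dimension 1000.
grads = [torch.randn(5, 1000) for _ in range(3)]
affinity = task_affinity_matrix(grads)  # affinity[i, j] ~ averaged cos(grad_i, grad_j)
```

Averaged similarities of this kind can serve as the pairwise signal behind the task-affinity-based clustering objective.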
## A Negative Transfer Gap

In previous MTL literature, the negative transfer gap compares a model $\Theta(\mathcal{T}_{tgt})$ trained only on a target task $\mathcal{T}_{tgt}$ with a model $\Theta(\mathcal{T}_{tgt}, \mathcal{T}_{src})$ trained jointly on the target task and a source task $\mathcal{T}_{src}$, using a performance metric $P$ for which higher is better:

$$\mathrm{NTG}(\mathcal{T}_{tgt}, \mathcal{T}_{src}) := P(\Theta(\mathcal{T}_{tgt})) - P(\Theta(\mathcal{T}_{tgt}, \mathcal{T}_{src})), \qquad (5)$$

where $\mathrm{NTG} > 0$ indicates that negative transfer occurs, showing that additionally training on $\mathcal{T}_{src}$ negatively affects the learning of $\mathcal{T}_{tgt}$. In our study of negative transfer in diffusion models, the target task involves the denoising tasks within a specific timestep interval, $\mathcal{T}_{tgt} = D_{[t_1,t_2]}$, while the source task comprises the remaining denoising tasks, $\mathcal{T}_{src} = D_{[1,T]} \setminus D_{[t_1,t_2]}$. However, since a model trained on only a subset of the entire set of denoising tasks cannot generate samples properly, we cannot use sample quality metrics (e.g., FID [18]) as $P$ to measure $P(\Theta(\mathcal{T}_{tgt}))$ in Eq. 5 for arbitrary timestep intervals. This differs from a typical MTL setting, where the performance of each task can be measured. Alternatively, we redefine the NTG via the difference in sample quality resulting from denoising by the different models, $\Theta(\mathcal{T}_{tgt})$ and $\Theta(\mathcal{T}_{tgt}, \mathcal{T}_{src})$, in the $[t_1, t_2]$ interval. During the sampling procedure with a model trained on the entire set of denoising tasks, we use either $\Theta(\mathcal{T}_{tgt})$ or $\Theta(\mathcal{T}_{tgt}, \mathcal{T}_{src})$ in $[t_1, t_2]$. Denote the resulting samples with $\Theta(\mathcal{T}_{tgt}, \mathcal{T}_{src})$ as $\{\tilde{x}_0\}$ and the resulting samples with $\Theta(\mathcal{T}_{tgt})$ as $\{\tilde{x}_0^{[t_1,t_2]}\}$. Then, by comparing the quality of these samples as in Eq. 3, we can measure how much the denoising of $[t_1, t_2]$ degrades in terms of sampling quality. Furthermore, the success of multi-expert denoisers in prior studies [37, 13, 1] suggests the potential existence of negative transfer. By distinctly separating parameters for denoising tasks, they might mitigate this negative transfer, leading to enhanced generation performance.

## B Detailed Experimental Settings for Observational Study

In this section, we provide the details of the experimental settings in Section 3. The training details and the architectures used are the same as those in Section 5. All experiments are conducted with a single A100 GPU and with the FFHQ dataset [27].

For the pixel-space diffusion model, we use the same lightweight ADM as in [6]. It inherits the architecture of ADM [8], but it uses fewer base channels, fewer residual blocks, and self-attention at a single resolution. Specifically, the model uses one residual block per resolution with 128 base channels and 16x16 self-attention with 64 head channels. A linear schedule with T = 1000 is used for diffusion scheduling. We referenced the training scripts in the official code (https://github.com/jychoi118/P2-weighting) for implementation.

For the latent-space diffusion model, we use the LDM architecture with the same settings as the FFHQ experiments in [56]. Specifically, an LDM-4-VQ encoder and decoder are used, in which the resolution of the latent vectors is reduced by a factor of four compared to the original images, with a vector quantization layer of 8092 codewords. The denoising model has 224 base channels with resolution multipliers 1, 2, 3, 4 and two residual blocks per resolution. Self-attention with 32 head channels is used at the 32, 16, and 8 resolutions. For diffusion scheduling, the linear schedule with T = 1000 is used. We conducted experiments with the official code (https://github.com/CompVis/latent-diffusion). In general, we utilized the pre-trained weights provided by LDM. However, if our retraining results demonstrated superior performance, we reported them.

Figure 7 (panels: (a) average conflict in PCgrad between clusters $I_1$-$I_5$; (b) gradient weights of NashMTL over training iterations; (c) loss weights of UW over training iterations): Behavior of multi-task learning methods through SNR-based interval clustering across training iterations. A similar trend as in Fig. 3 is observed. The average-conflict values from panel (a):

| | $I_1$ | $I_2$ | $I_3$ | $I_4$ | $I_5$ |
|---|---|---|---|---|---|
| $I_1$ | 0.00 | 0.42 | 0.67 | 0.74 | 0.79 |
| $I_2$ | 0.42 | 0.00 | 0.56 | 0.66 | 0.71 |
| $I_3$ | 0.67 | 0.56 | 0.00 | 0.52 | 0.57 |
| $I_4$ | 0.74 | 0.66 | 0.52 | 0.00 | 0.29 |
| $I_5$ | 0.79 | 0.71 | 0.57 | 0.29 | 0.00 |

**Task Affinity Analysis** To measure the task affinity score between denoising tasks, we first calculate $\nabla_\theta L_t$ for $t = 1, \ldots, T$ every 10K iterations during training. The gradient is calculated with 1000 samples from the training dataset. Then, the pairwise cosine similarities between the gradients are computed and
averaged over the snapshots collected every 10K iterations. Finally, we can plot the average cosine similarity against the timestep axis as in Fig. 1. For plotting against the log-SNR axis, the axis values were adjusted, and the empty parts were filled with linear interpolation. For ADM and LDM, the pairwise cosine similarity between gradients is calculated over 1M and 400K training iterations, respectively.

**Negative Transfer Analysis** To calculate the negative transfer gap in Eq. 3, we need to additionally train the model on the denoising tasks within a specific timestep interval $[t_1, t_2]$. Since we plot five intervals, [1, 200], [201, 400], [401, 600], [601, 800], and [801, 1000], we trained a model on the denoising tasks of each interval. Each model is trained for 600K iterations for ADM and 300K iterations for LDM on the FFHQ dataset. For the model trained on the entire set of denoising tasks, we used the same trained model as in Section 5.1: ADM is trained for 1M iterations and LDM is trained for 400K iterations. All of these models are trained with the same batch size and learning rate as the experiments in Section 5.1 (see Appendix E). The DDIM 50-step sampler [67] was used for generation. FID is calculated with Clean-FID [51], using the entire 70K-image FFHQ dataset as reference images. Since the official code of Clean-FID (https://github.com/GaParmar/clean-fid) supports FID calculation with statistics from these reference images, we used it and report FID with 10K generated images.

## C Implementation Details for MTL Methods

We describe how MTL methods are applied in Section 4.2. To be more self-contained, we hereby present implementation details for the MTL methods. For the implementation, we used the official code of LibMTL [39] (https://github.com/median-research-group/LibMTL). NashMTL [46] supports a practical speed-up by updating the gradient weights $\alpha$ every few iterations rather than every iteration. We utilize this by updating $\alpha$ every 25 training iterations.

## D Additional Experimental Results

We present additional experimental results to supplement the empirical findings presented in Section 5. In Section D.1, we provide visualizations of the behavior of MTL methods with the other clustering methods that were not covered in Section 5.3. Furthermore, we examine the impact of our hyperparameter, the number of clusters $k$, in Section D.2. To validate the effectiveness of interval clustering compared to other clustering methods, we present additional results in Section D.3. In Section D.4, we delve deeper into comparing the performance of strong MTL baselines such as Linear Scalarization (LS) [80, 35] and Random Loss Weighting (RLW) [40] with our proposed approach.

Figure 8 (panels: (a) average conflict in PCgrad between clusters $I_1$-$I_5$; (b) gradient weights of NashMTL over training iterations; (c) loss weights of UW over training iterations): Behavior of multi-task learning methods through gradient-based interval clustering across training iterations. A similar trend as in Fig. 3 is observed. The average-conflict values from panel (a):

| | $I_1$ | $I_2$ | $I_3$ | $I_4$ | $I_5$ |
|---|---|---|---|---|---|
| $I_1$ | 0.00 | 0.29 | 0.57 | 0.72 | 0.79 |
| $I_2$ | 0.29 | 0.00 | 0.49 | 0.64 | 0.73 |
| $I_3$ | 0.57 | 0.49 | 0.00 | 0.55 | 0.64 |
| $I_4$ | 0.72 | 0.64 | 0.55 | 0.00 | 0.48 |
| $I_5$ | 0.79 | 0.73 | 0.64 | 0.48 | 0.00 |
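Complementing the implementation notes in Appendix C, the sketch below illustrates how an MTL method can operate on the $k$ interval clusters once they are fixed, using uncertainty weighting (UW) [29]: each sampled timestep in a batch is mapped to its interval, one loss is accumulated per cluster, and the cluster losses are combined with learned log-variance weights. This is a hedged illustration under assumed names and a simplified UW form, not the LibMTL-based implementation used by the authors.

```python
import torch

class UncertaintyWeighting(torch.nn.Module):
    """UW over k task-cluster losses: total = sum_i exp(-s_i) * L_i + s_i,
    where s_i is a learned log-variance per cluster (simplified, illustrative form)."""
    def __init__(self, num_clusters):
        super().__init__()
        self.log_var = torch.nn.Parameter(torch.zeros(num_clusters))

    def forward(self, cluster_losses):             # tensor of shape (k,)
        return (torch.exp(-self.log_var) * cluster_losses + self.log_var).sum()

def per_cluster_losses(per_sample_loss, t, interval_bounds):
    """Average a batch's per-sample losses into one loss per interval cluster.
    interval_bounds: list of (l_i, r_i) timestep intervals from interval clustering."""
    losses = []
    for (l, r) in interval_bounds:
        mask = (t >= l) & (t <= r)
        losses.append(per_sample_loss[mask].mean() if mask.any()
                      else per_sample_loss.new_zeros(()))
    return torch.stack(losses)

# Toy usage: 5 clusters over T = 1000, a batch of 8 sampled timesteps.
bounds = [(1, 200), (201, 400), (401, 600), (601, 800), (801, 1000)]
uw = UncertaintyWeighting(len(bounds))
t = torch.randint(1, 1001, (8,))
loss_per_sample = torch.rand(8)                    # stand-in for ||eps - eps_theta||^2
total = uw(per_cluster_losses(loss_per_sample, t, bounds))
total.backward()
```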
Table 5: FID-10K scores of the LDM trained using the combined UW and PCgrad method on the FFHQ dataset while varying the value of $k$. Notably, integrating MTL methods with only two clusters already significantly improves FID scores. Increasing $k$ from 2 to 5 also enhances FID scores, but further increasing $k$ from 5 to 8 shows similar results.

| Clustering | k = 2 | k = 5 | k = 8 |
|---|---|---|---|
| Timestep | 9.563 | 9.151 | 9.083 |
| SNR | 9.606 | 9.410 | 9.367 |
| Gradient | 9.634 | 9.033 | 9.145 |

### D.1 Visualization of the Behavior of MTL Methods with Other Clustering Methods

Due to space constraints in the main paper, we were unable to include the behavior analysis of MTL methods for SNR-based and gradient-based interval clustering. We present these results in Figs. 7 and 8, which show trends similar to the observations depicted in Fig. 4. These findings provide valuable insights into the behavior of MTL methods, regardless of the clustering objective. Firstly, we observed a notable increase in the occurrence of conflicting gradients as the timestep difference between tasks increased. This observation suggests that the temporal distance between denoising tasks plays a crucial role in determining the frequency of conflicting gradients. Secondly, we noted that both loss- and gradient-balancing methods assign higher weights to task clusters with higher levels of noise, indicating that these methods allocate more importance to the noisier tasks.

### D.2 Analysis: The Number of Interval Clusters

To understand the impact of the number of clusters $k$, we conducted experiments varying $k$ over 2, 5, and 8. We trained a model for timestep-based, SNR-based, and gradient-based clustering with each $k$, resulting in nine trained models. For the MTL methods, we used the combined UW [29] and PCgrad [83] method as in Section 5.4. All training configurations, such as learning rate and training iterations, are the same as in Section 5.1. We evaluate 10K generated samples from the DDIM 50-step sampler [67] for all methods with the FID score [51, 18]. Table 5 shows the results. Notably, we made an intriguing observation regarding the integration of MTL methods with only two clusters, which resulted in a noteworthy enhancement in FID scores. Additionally, we found that increasing the number of clusters $k$ from 2 to 5 also had a positive impact on FID scores. However, further increasing $k$ from 5 to 8 did not yield significant improvements and produced similar outcomes. From these results, we conjecture that increasing the number of clusters beyond five has no significant effect.

Table 7: Results of Random Loss Weighting (RLW) and Linear Scalarization (LS) on the FFHQ dataset with the ADM architecture.

| Clustering | Method | FID | Precision | Recall |
|---|---|---|---|---|
| - | Vanilla | 24.95 | 0.5427 | 0.3996 |
| Timestep | RLW | 38.06 | 0.4634 | 0.3293 |
| Timestep | LS | 25.34 | 0.5443 | 0.3868 |
| SNR | RLW | 35.13 | 0.4675 | 0.3404 |
| SNR | LS | 25.69 | 0.5369 | 0.3843 |
| Gradient | RLW | 36.19 | 0.4643 | 0.3392 |
| Gradient | LS | 26.12 | 0.5120 | 0.3878 |

### D.3 Comparison of Interval Clustering with Task Grouping Methods

To show the effectiveness of interval clustering for denoising-task grouping in diffusion models, we compare against high-order approximation (HOA)-based grouping methods [72, 12]. For grouping $N$ tasks in deep neural networks, the early attempt [72] established a two-stage procedure: (1) compute the MTL performance gain for all task combinations, and (2) search for the best groups maximizing the MTL performance gain across groups. However, performing (1) requires huge computation, since the MTL performance gain must be measured for all $2^N - 1$ combinations.
Therefore, they reduce computation via HOA, which utilizes MTL gains on only pairwise task combinations. The HOA scheme is also inherited by the follow-up work on task affinity grouping [12], which uses its task affinity score instead of MTL gains. Different from these works, our interval clustering groups the tasks under interval constraints.

Table 6: Comparison of interval clustering and high-order approximation-based task grouping. The DDIM 50-step sampler is used.

| Clustering | FID-10K |
|---|---|
| HOA | 9.873 |
| Interval | 9.033 |

For a fair comparison, we use the pairwise gradient similarity between denoising tasks, averaged across training iterations, as the objective for both HOA-based grouping and interval clustering. In this case, HOA-based grouping becomes the cosine similarity grouping used in [12], and interval clustering becomes the gradient-based clustering in our method. However, for HOA-based grouping, a brute-force search with a branch-and-bound-like algorithm [72, 12] requires computational complexity of $O(2^N)$, which incurs enormous costs in diffusion models with many denoising tasks. Therefore, we use the beam-search scheme of [68]. We set the number of clusters to 5 for both methods. We apply the combined UW [29] and PCgrad [83] method as in Section 5.3 to the clusters resulting from both HOA-based grouping and interval clustering. We trained the model on the FFHQ dataset [27] with the LDM architecture [56]. All training configurations are the same as in Section 5.1. For the evaluation metric, we use FID with the same configuration as in Section 5.1. Table 6 shows the results, indicating that interval clustering outperforms HOA-based task grouping.

### D.4 Comparison to Random Loss Weighting and Linear Scalarization

Linear Scalarization (LS) [80, 35] and Random Loss Weighting (RLW) [40] can serve as strong baselines for MTL methods. Therefore, validating the superiority of our method over them emphasizes the necessity of applying sophisticated MTL methods such as UW, PCgrad, and NashMTL. Accordingly, we provide the results of comparative experiments for LS and RLW on the FFHQ dataset using the ADM architecture in Table 7. We note that all experimental configurations are the same as for vanilla training in Section 5.1. As shown in the results, LS achieves slightly worse performance than vanilla training, which suggests that simply re-framing diffusion training as an MTL task and applying LS is not enough. Also, RLW achieves much worse performance than vanilla training. It appears that the randomness introduced by loss weighting interferes with diffusion training. These results indicate that sophisticated MTL methods are indeed responsible for the significant performance gains.

## E Detailed Experimental Settings in Section 5

In this section, we describe the details of the experimental settings in Section 5. To validate effectiveness for both pixel-space and latent-space diffusion models in unconditional generation, we used ADM [8] and LDM [56] as in our observational study (refer to Appendix B for architecture details).

### E.1 Detailed Settings of Comparative Evaluation and Analysis (Sections 5.1 and 5.3)

A single A100 GPU is used for the experiments in Sections 5.1 and 5.3.

**Setups for Unconditional Generation** We trained the models on the FFHQ [27] and CelebA-HQ [26] datasets. All training was performed with the AdamW optimizer [43] with a learning rate of 1e-4 or 2e-5, and the better results were reported.
For ADM, we trained for 1M iterations with batch size 8 on the FFHQ dataset and for 400K iterations with batch size 16 on the CelebA-HQ dataset. For LDM, we trained for 400K iterations with batch size 30 on both the FFHQ and CelebA-HQ datasets. We generate 10K samples with the DDIM 50-step sampler and measure FID [18], Precision [36], and Recall [36] scores. For all evaluation metrics, we use all training data as reference data. FID is calculated with Clean-FID [51], and Precision and Recall are computed with publicly available code (https://github.com/youngjung/improved-precision-and-recall-metric-pytorch). All analyses are conducted on the above trained models.

**Setups for Class-Conditional Generation** We trained DiT-S/2 [52] on the ImageNet dataset [7]. All training was performed with the AdamW optimizer [43] with a learning rate of 1e-4 or 2e-5, and the better results were reported. As in DiT [52], we applied classifier-free guidance [19] and trained for 800K iterations with a batch size of 50. All samples are generated by a DDPM 250-step sampler. For evaluation metrics, we follow the evaluation protocol of ADM [8], using their evaluation code (https://github.com/openai/guided-diffusion/tree/main/evaluations). We used the cosine schedule [47] for noise scheduling and the SD-XL VAE [53] for our VAE.

### E.2 Detailed Settings of Comparison to Loss Weighting Methods (Section 5.2)

We trained DiT-L/2 [52] on the ImageNet dataset [7]. All training was performed with the AdamW optimizer [43] with a learning rate of 1e-4. As in DiT [52], we applied classifier-free guidance [19] and trained for 400K iterations with a batch size of 256. All samples are generated by a DDPM 250-step sampler with a classifier-free guidance scale of 1.5. We used the cosine schedule [47] for noise scheduling. For these experiments, we used 8 A100 GPUs.

### E.3 Detailed Settings of Combining MTL Methods with Sophisticated Training Objectives (Section 5.4)

We trained three different models: vanilla LDM, vanilla LDM with P2 [6], and vanilla LDM with P2, PCgrad [83], and UW [29] applied simultaneously. All training configurations are the same as in Section 5.1, except that we use 500K iterations. We generate 50K samples for evaluation with a DDIM 200-step sampler and evaluate FID.

## F Qualitative Results

In this section, we provide qualitative comparison results, which were omitted from the main paper due to space constraints. In Figures 9, 10, 11, and 12, we visualize the images generated by all models used for the results in Table 1. As shown in the results, we can observe that incorporating MTL methods into diffusion training can improve the quality of the generated images. One noteworthy observation is that UW [29] tends to generate higher-quality images than NashMTL [46] and PCgrad [83]. This finding aligns with the results observed in Table 1. Moreover, we plot randomly selected samples from the 50K generated images in Fig. 13. Despite being randomly selected, the majority of the generated images exhibit remarkable fidelity.

Figure 9 (panels (b)-(j): PCgrad, NashMTL, and UW combined with timestep-, SNR-, and gradient-based clustering): Qualitative comparison of ADM trained on the FFHQ dataset.

Figure 10 (panels (b)-(j): PCgrad, NashMTL, and UW combined with timestep-, SNR-, and gradient-based clustering): Qualitative comparison of LDM trained on the FFHQ dataset.
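For reference, the classifier-free guidance scale mentioned in the settings of E.1 and E.2 enters sampling through the guided noise prediction sketched below (the standard combination rule of [19]; the conditional-denoiser interface and the null-class token are assumptions of this sketch, not the authors' code).

```python
import torch

def classifier_free_guidance(eps_model, x_t, t, y, null_y, scale=1.5):
    """Guided noise prediction used at each sampling step:
    eps = eps_uncond + scale * (eps_cond - eps_uncond).
    `eps_model(x, t, y)` is an assumed class-conditional denoiser interface;
    `null_y` is the "no class" label/embedding index used for the unconditional pass."""
    eps_cond = eps_model(x_t, t, y)
    eps_uncond = eps_model(x_t, t, null_y)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```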
## G Dynamic Programming Algorithm for Interval Clustering

In this section, we introduce the algorithm for optimizing the interval clusters and its implementation details. The optimal solution of interval clustering can be found using dynamic programming for an $L_{\mathrm{cluster}}$ function [2, 49, 76]. The sub-problem is defined as finding the minimum cost of clustering $X_{1,i} = \{1, \ldots, i\}$ into $m$ clusters. By saving the minimum cost of clustering $X_{1,i} = \{1, \ldots, i\}$ into $m$ clusters in the matrix entry $D[i, m]$, the value $D[T, k]$ represents the minimum clustering cost of the original problem in Eq. 4. For some timestep $m \le j \le i$, $D[j-1, m-1]$ must contain the minimum cost of clustering $X_{1,j-1}$ into $(m-1)$ clusters [49, 76]. This establishes the optimal substructure for dynamic programming, which leads to the following recurrence:

$$D[i, m] = \min_{m \le j \le i} \left\{ D[j-1, m-1] + L_{\mathrm{cluster}}(X_{j,i}) \right\}, \qquad 1 \le i \le T, \; 1 \le m \le k. \qquad (6)$$

To obtain the optimal intervals $l_1, \ldots, l_k$, we use $S[i, m]$ to record the argmin solution of Eq. 6. Then, we backtrack the solution in $O(k)$ time from $S[T, k]$ by assigning $l_m = S[l_{m+1} - 1, m]$ from $m = k$ down to $m = 1$, initializing with $l_k = S[T, k]$.

Interval clustering with SNR-based or gradient-based objectives can produce intervals of unbalanced sizes, which causes an unbalanced allocation of task clusters due to the randomly sampled timestep $t$. Therefore, we add constraints on the size of each cluster to avoid seriously unbalanced task clusters. To constrain the size of each cluster $n_i = |I_i| = r_i - l_i + 1$ for $i = 1, \ldots, k$, we define its lower and upper bounds as $m_I$ and $M_I$ with $m_I \le n_i \le M_I$. In Eq. 6, the $m$-th cluster (i.e., $X_{j,i}$) size $n_m$ must range from $m_I$ to $M_I$, yielding $i + 1 - M_I \le j \le i + 1 - m_I$. Furthermore, to satisfy the $(m-1)$-clusters constraint, $1 + (m-1) m_I \le j$. Finally, Eq. 6 with constraints on the cluster sizes is derived as follows:

$$D[i, m] = \min_{\max\{1 + (m-1) m_I,\; i + 1 - M_I\} \le j \le i + 1 - m_I} \left\{ D[j-1, m-1] + L_{\mathrm{cluster}}(X_{j,i}) \right\}, \qquad 1 \le i \le T, \; 1 \le m \le k. \qquad (7)$$

Specifically, we set $m_I$ to $T/(2k)$, with $M_I$ set correspondingly.
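A compact sketch of the dynamic program in Eq. (6)-(7): `D[i][m]` stores the minimum cost of clustering timesteps 1..i into m intervals, `S[i][m]` records the argmin split, and the intervals are recovered by backtracking. The quadratic cluster cost used here (within-interval sum of squared deviations of a per-timestep feature such as log-SNR) is only an illustrative stand-in for the paper's timestep-, SNR-, or gradient-based objectives, and the toy size bounds are assumptions.

```python
import math

def interval_clustering(features, k, m_lo, m_hi):
    """Cluster timesteps 1..T into k contiguous intervals minimizing the summed
    cluster cost (Eq. 4), via the DP of Eq. (6)-(7) with cluster sizes in [m_lo, m_hi].
    features[t-1] is a scalar summary of denoising task t (e.g., log-SNR); the cost
    of an interval is its within-interval sum of squared deviations (illustrative)."""
    T = len(features)
    ps = [0.0] * (T + 1)   # prefix sums for O(1) interval-cost queries
    ps2 = [0.0] * (T + 1)
    for t in range(1, T + 1):
        ps[t] = ps[t - 1] + features[t - 1]
        ps2[t] = ps2[t - 1] + features[t - 1] ** 2
    def cost(j, i):        # cost of interval {j, ..., i}, 1-indexed, inclusive
        n = i - j + 1
        s, s2 = ps[i] - ps[j - 1], ps2[i] - ps2[j - 1]
        return s2 - s * s / n
    INF = math.inf
    D = [[INF] * (k + 1) for _ in range(T + 1)]   # D[i][m]: minimum cost, Eq. (6)
    S = [[0] * (k + 1) for _ in range(T + 1)]     # argmin left endpoint of the m-th interval
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for i in range(1, T + 1):
            lo = max(1 + (m - 1) * m_lo, i + 1 - m_hi)   # constraints of Eq. (7)
            hi = i + 1 - m_lo
            for j in range(lo, hi + 1):
                c = D[j - 1][m - 1] + cost(j, i)
                if c < D[i][m]:
                    D[i][m], S[i][m] = c, j
    # Backtrack the interval endpoints from S[T][k] (l_m = S[l_{m+1} - 1, m]).
    bounds, i = [], T
    for m in range(k, 0, -1):
        l = S[i][m]
        bounds.append((l, i))
        i = l - 1
    return list(reversed(bounds))                 # [(l_1, r_1), ..., (l_k, r_k)]

# Toy usage: T = 1000 timesteps summarized by a decreasing log-SNR-like curve.
log_snr = [9.0 - 18.0 * t / 999 for t in range(1000)]
print(interval_clustering(log_snr, k=5, m_lo=100, m_hi=1000))
```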
Figure 11 (panels (b)-(j): PCgrad, NashMTL, and UW combined with timestep-, SNR-, and gradient-based clustering): Qualitative comparison of ADM trained on the CelebA-HQ dataset.

Figure 12 (panels (b)-(j): PCgrad, NashMTL, and UW combined with timestep-, SNR-, and gradient-based clustering): Qualitative comparison of LDM trained on the CelebA-HQ dataset.

## H Broader Impacts

**Revisiting Diffusion Models through Multi-Task Learning** Our work revisits diffusion model training from a multi-task learning perspective. We show that negative transfer still occurs in diffusion models and that addressing it with MTL methods can improve them. Starting from our work, a better understanding of the multi-task learning characteristics of diffusion models can lead to further advancements in diffusion models.

**Negative Societal Impacts** Generative models, including diffusion models, have the potential to impact privacy in various ways. For instance, in the context of DeepFake applications, where generative models are used to create realistic synthetic media, the training data plays a critical role in shaping the model's behavior. When the training data is biased or contains problematic content, the generative model can inherit these biases and potentially generate harmful or misleading outputs. This highlights the importance of carefully selecting and curating the training data for generative models, particularly when privacy and ethical considerations are at stake.

Figure 13: Randomly selected images from the images generated by LDM with the combined UW, PCgrad, and P2 method on the FFHQ dataset. The DDIM 250-step sampler is used.

## I Limitations

Our work has two limitations that can be regarded as future work. Firstly, we have not yet completely resolved the issue of negative transfer in the training of diffusion models, as shown in Fig. 5. This indicates that learning the entire set of denoising tasks still causes degradation in certain denoising tasks. By successfully addressing this degradation and enabling the model to harmoniously learn all denoising tasks, we anticipate significant improvements in the performance of diffusion models. Secondly, our study does not delve into the architectural design aspects of multi-task learning methods. While our focus lies on model-agnostic approaches in MTL, it is worthwhile to explore the possibilities of designing appropriate architectures within an MTL framework. Previous works in diffusion models utilize the timestep and noise level as input, which can be considered a task embedding scheme [73]. By revisiting these aspects, the architecture of diffusion models can be further advanced in future work.