# longtailed_diffusion_models_with_oriented_calibration__2129053c.pdf

Published as a conference paper at ICLR 2024

LONG-TAILED DIFFUSION MODELS WITH ORIENTED CALIBRATION

Tianjiao Zhang1, Huangjie Zheng3, Jiangchao Yao1,2, Xiangfeng Wang4, Mingyuan Zhou3, Ya Zhang1,2, Yanfeng Wang1,2,B
1 Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
2 Shanghai Artificial Intelligence Laboratory
3 The University of Texas at Austin
4 East China Normal University
{xiaoeyuztj, Sunarker, ya zhang, wangyanfeng}@sjtu.edu.cn, xfwang@cs.ecnu.edu.cn, huangjie.zheng@utexas.edu, mingyuan.zhou@mccombs.utexas.edu

ABSTRACT

Diffusion models are acclaimed for generating high-quality and diverse images. However, their performance notably degrades when trained on data with a long-tailed distribution. For long-tailed diffusion model generation, current works focus on calibrating and enhancing tail generation with head-tail knowledge transfer. The transfer process relies on the abundant diversity derived from the head classes and, more significantly, on the conditioning capacity of the model prediction. However, the dependency on the conditional model prediction to realize the knowledge transfer might introduce bias during training, leading to unsatisfactory generation results and a lack of robustness. Utilizing a Bayesian framework, we develop a weighted denoising score-matching technique for knowledge transfer directly from head to tail classes. Additionally, we incorporate a gating mechanism in the knowledge transfer process. We provide statistical analysis to validate this methodology, revealing that the effectiveness of such knowledge transfer depends on both the label distribution and sample similarity, providing the insight to consider sample similarity when re-balancing the label proportion in training. We extensively evaluate our approach with experiments on multiple benchmark datasets, demonstrating its effectiveness and superior performance compared to existing methods. Code: https://github.com/MediaBrain-SJTU/OC_LT.

1 INTRODUCTION

Diffusion models have emerged as a powerful class of deep probabilistic models. These models leverage techniques from statistical physics and probabilistic modeling to generate high-quality, realistic samples from complex data distributions (Sohl-Dickstein et al., 2015). The effective implementation of a diffusion model necessitates extensive training on a diverse and sizable collection of image data. In practice, a long-tail distribution is prevalent (Yang et al., 2022), wherein a vast majority of images belong to a few dominant categories, while a significant portion of the dataset comprises less frequently occurring categories. As a consequence, training diffusion models with long-tailed data remains a formidable challenge owing to the skew of the entire dataset. In current works, many attempts have been made to address the lack of diversity and mode collapse in the generation of tail classes. A series of methods based on Generative Adversarial Networks (GANs) has been proposed. One common approach is to adopt strategies that refine the general model's generation ability by improving its conditional modeling (Rangwani et al., 2022) on tail categories, which heavily depend on the GAN structure.
Another line of research focuses on alleviating the scarcity of samples in tail classes through appropriate data augmentation techniques (Karras et al., 2020; Zhao et al., 2020; Rangwani et al., 2023) and the diffusion process (Zheng et al., 2023b; Wang et al., 2023). Nevertheless, such methods may not effectively capture the underlying data distribution or introduce meaningful variations (Yoo et al., 2020). For long-tailed diffusion models, the Class Balancing Diffusion Model (CBDM) (Qin et al., 2023) proposes a distribution adjustment regularizer that enhances tail generation based on the model prediction on the head classes. However, augmentation that relies on the condition and prediction of the model might cause bias during training, resulting in generated outcomes that do not meet expectations and leading to a lack of robustness.

Figure 1: Illustration of our motivation. Former methods rely on the condition capacity of the model (prediction regularization under the label condition); we directly utilize head samples as direct references for tail augmentation instead of depending on the model's condition capacity.

In order to alleviate this issue, a direct knowledge transfer from head to tail categories should be established. Let us review a recent study on the diffusion process: Xu et al. (2023) observed that the score function exhibits the highest variance during the intermediate steps, which is a critical period for semantic formation (Zhang et al., 2023) when the target is still undetermined. Therefore, based on this evidence, we can utilize the score information from head classes to calibrate and enhance the generation of tail classes in this period and thereby improve the overall generation performance. In this study, leveraging the evidence that the score of the diffusion model can be estimated by referencing multiple targets in the dataset (Xu et al., 2023), we propose a calibration strategy for the scores of the tail classes that directly makes use of head samples as references, as shown in Figure 1. By employing a strategy that leverages the similarity of the underlying data distribution, the reliance on the conditional capacity of the model is mitigated and the generation performance is improved. To realize the augmentation of the tail scores, we begin by modeling the score as a weighted average of scores toward different targets. For conditional generation, the score estimation of noisy tail samples is augmented by properly increasing the contribution of the scores from the head samples in the reference batch, denoted as a T2H (noisy Tail to clean Head) operation, to enhance the diversity of tail generation. Simultaneously, a Batch Re-sample approach is utilized to alleviate the limitation on the overall head-to-tail transfer strength in T2H mode. Besides that, in unconditional generation, the score function is predominantly influenced by samples from the head classes; batch re-sampling is also employed to address this issue, while the method H2T (noisy Head to clean Tail), the reverse direction of T2H, proves effective in unconditional generation. Our contributions can be summarized as follows: (1) We develop a method denoted as T2H, based on the multi-target nature of score estimation, to effectively calibrate and enhance the generation of tail classes in the semantic formation period, thereby significantly improving the overall generation performance.
(2) A Batch Re-sample strategy is employed to construct a balanced reference batch, in order to address the extreme dominance of scores from the head classes and to promote head-to-tail transfer in T2H mode. (3) We conduct extensive testing on three different long-tailed datasets (CIFAR10, CIFAR100, TinyImageNet) to validate the effectiveness of our proposed method. The results consistently demonstrate the superiority of our approach.

2 PRELIMINARY

The diffusion model involves slowly adding noise to the existing training data in the forward process, and then utilizing a deep learning network to gradually recover the original data from the noise in the reverse process. In the forward process, a clean image is progressively transformed by adding carefully calibrated Gaussian noise (Ho et al., 2020) at each diffusion step t,

q(x_t | x_0) = N(α_t x_0, σ_t^2 I), 0 ≤ t ≤ T,

where the coefficients α_t and σ_t are chosen so that q(x_t) is close to the initial data density at t ≈ 0 and close to Gaussian at t ≈ T. For the reverse process, the diffusion model first samples from a Gaussian noise distribution p(x_T) ~ N(0, I), and then gradually incorporates various structures and semantic information at each step.

Figure 2: Overall flowchart for the two strategies (H2T, T2H), where the denoising target (initial or transferred) is determined by evaluating the probabilities of labels before and after the transfer to identify the permitted mode.

The transition p(x_{t-1} | x_t) is estimated by training an image-to-image Unet parameterized by θ, p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_t). Instead of directly predicting the mean value, the Unet is trained to predict the noise content of x_t, which is optimized with the following loss:

L_Diff(θ, x) = E_{t ~ U(0,T), ε ~ N(0,I)} [ ||ε − ε_θ(x_t; t)||_2^2 ]   (1)

Another perspective on the diffusion model is to start from a parameter-free, pre-designed stochastic differential equation (SDE), dx = f(x, t) dt + g(t) dw, which transfers the initial data distribution to a prior distribution as time goes from 0 to T. Here w is the standard Wiener process. Sampling from the diffusion model proceeds via a reverse-time SDE (Anderson, 1982): dx = [f(x, t) − g(t)^2 ∇_{x_t} log p(x_t)] dt̄ + g(t) dw̄, where the bar denotes time reversal. The score function, namely s(x_t, t) = ∇_{x_t} log p(x_t), is estimated in the training stage. The predicted noise in Eq. (1) is related to the score function via denoising score matching (Vincent, 2011):

ε_θ(x_t, t) = −σ_t s_θ(x_t, t)   (2)

From Eq. (2) we learn that the score is proportional to the noise predicted by the model. In this study, we start from the optimal score ∇_{x_t} log p(x_t) to formulate our methodology for the long-tailed distribution. Our problem involves training data {(x^(i), y^(i))}_{i=1}^M that exhibits a long-tail distribution, sampled from p(x, y), and a diffusion model parameterized by a denoising Unet ε_θ(x_t, t). Here y^(i) ∈ {C_1, C_2, ..., C_L} is the label of x^(i), and we assume the classes are ordered in descending probability of label occurrence, i.e., if i < j then n_i > n_j, where n_i is the number of training samples belonging to class C_i. With a proper training methodology using the long-tailed training data and the diffusion model, we want to generate a more balanced and diversified data distribution p*(x_0) at inference time.
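To make the preliminaries concrete, the following is a minimal PyTorch sketch of the denoising objective in Eq. (1) and the noise-score relation in Eq. (2). The names eps_model, alphas, and sigmas are illustrative placeholders (they are not part of the paper's released code), standing for the denoising Unet and a pre-computed noise schedule.

```python
import torch

def ddpm_loss(eps_model, x0, alphas, sigmas, T=1000):
    """Denoising loss of Eq. (1) for a batch of clean images x0 of shape (B, C, H, W).

    eps_model(x_t, t) is assumed to return the predicted noise; alphas and sigmas
    are 1-D tensors of length T holding the schedule coefficients alpha_t, sigma_t.
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)        # t ~ U(0, T)
    eps = torch.randn_like(x0)                              # eps ~ N(0, I)
    a_t = alphas[t].view(b, 1, 1, 1)
    s_t = sigmas[t].view(b, 1, 1, 1)
    x_t = a_t * x0 + s_t * eps                              # forward perturbation q(x_t | x_0)
    eps_hat = eps_model(x_t, t)                             # predicted noise eps_theta(x_t; t)
    score_hat = -eps_hat / s_t                              # implied score s_theta(x_t, t), Eq. (2)
    loss = ((eps - eps_hat) ** 2).mean()
    return loss, score_hat
```

Minimizing this loss therefore trains the same network to act as a score estimator, which is the quantity the calibration strategy below operates on.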
During the inference stage, a diffusion model utilizes a step-by-step reverse operation from a prior distribution p(x_T) to the data distribution p(x_0) with the reverse SDE discussed in Section 2. The reverse SDE uses the score function s(x_t, t) at time step t obtained from the training stage. The optimal score s*(x_t, t) = ∇_{x_t} log q_t(x_t) can be expressed as an expectation:

∇_{x_t} log q_t(x_t) = E_{q(x_0|x_t)} [ ∇_{x_t} log q(x_t | x_0) ],   (3)

where the proof is provided in Appendix A. As the equation illustrates, the score for a given x_t can be calculated as an expectation under the distribution q(x_0 | x_t). Since this distribution is intractable, we utilize importance sampling to sample from the initial data distribution q(x_0). Using the Bayes rule, we have q(x_0 | x_t) = q(x_0) q(x_t | x_0) / q(x_t). Given q(x_t) = ∫ q(x_t | x_0) q(x_0) dx_0, we can derive

E_{q(x_0|x_t)} [ ∇_{x_t} log q(x_t | x_0) ] = E_{q(x_0)} [ ( q(x_t | x_0) / E_{x'_0 ~ q(x_0)} q(x_t | x'_0) ) ∇_{x_t} log q(x_t | x_0) ].   (4)

Thus, with a mini-batch of samples from q(x_0), i.e., x_0^(1:N) iid ~ q(x_0), we have

E_{q(x_0|x_t)} [ ∇_{x_t} log q(x_t | x_0) ] ≈ Σ_i ( q(x_t | x_0^(i)) / Σ_{j=1}^N q(x_t | x_0^(j)) ) ∇_{x_t} log q(x_t | x_0^(i)).   (5)

Specifically, we have expanded the original one-to-one optimal noise estimator in denoising score matching into a one-to-many distributional matching technique, with a conditional distribution that determines the mapping probability, which encourages a mode-covering behavior in the score matching and enhances the diversity of generative modeling (Zheng & Zhou, 2021). We approximate the expectation in the equation with a weighted average of scores with respect to different samples {x_0^(i)}, denoted as a reference batch.

T2H. In conditional generation, the labels are involved and conditioned on. The unconditional score ∇_{x_t} log q(x_t) and the conditional score ∇_{x_t} log q(x_t | y) are both estimated in the training stage. With labels participating, the sampling distribution should be x_0^(i) ~ q(x_0 | y) with label y instead of x_0^(i) ~ q(x_0) in Eq. (5):

∇_{x_t} log q(x_t | y) ≈ Σ_i ( q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y_0^(j)) ) ∇_{x_t} log q(x_t | x_0^(i), y),   {(x_0^(k), y_0^(k))}_{k=1}^M ~ q(x_0, y_0),   (6)

where the ratio q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y_0^(j)) is the mixing score weight, and the proof is provided in Appendix B. In the denominator, we sample from q(x_0, y_0) instead of q(x_0) in the self-normalizing technique with labels. The distribution q(x_t | x_0^(i)) is Gaussian and can be calculated as B exp(−||x_0^(i) − x_t||^2 / (2σ_t^2)), where B is a normalizing constant and no label is involved. The mixing score weight of (x_0^(j), y_0^(j)) is therefore dominated by the L2 distance with respect to x_t. In addition to the L2 distance in the Gaussian kernel, we make the further assumption that the distribution q(x_t | x_0, y_0) is adjusted by q(y_0)^β:

q(x_t | x_0, y_0) ≈ B q(y_0)^β exp(−||x_t − x_0||_2^2 / (2σ_t^2)),   (7)

where β is a pre-defined parameter that controls the overall distribution density with respect to y_0. When β = 0, the distribution of x_t depends only on x_0.

Figure 3: Mixing score weights of the sample with ID 8 toward different targets (the initial sample and head samples) in a reference batch, under different weight calculations of q(x_t | x_0, y_0).

For the score estimation of x_t obtained from a noisy tail sample (x_0^T, y_0^T), we employ a method of score-oriented calibration. The approach enhances the contribution of head class samples (x_0^H, y_0^H) by increasing their mixing score weight in Eq. (6). Consequently, it leverages the rich diversity of the head classes to improve the performance of the tail classes. To increase the mixing score weight of the head samples, we can set β = 1 in Eq. (7), since q(y_0^H) in the equation has a relatively larger value. As shown in Figure 3, the mixing weights toward head samples are then increased compared with weights calculated solely from exp(−||x_0^(i) − x_t||^2 / (2σ_t^2)).
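As an illustration of Eqs. (5)-(7), the sketch below computes the self-normalized mixing score weights of a noisy sample against a reference batch, including the q(y_0)^β adjustment of Eq. (7). The argument label_freq is an illustrative name for the empirical label probabilities q(y_0^(j)) of the reference samples; it is not a symbol defined in the paper.

```python
import torch
import torch.nn.functional as F

def mixing_weights(x_t, ref_x0, label_freq, sigma_t, beta=0.0):
    """Mixing score weights q(x_t | x_0^(j), y_0^(j)) / sum_j q(...) over a reference batch.

    x_t: noisy sample, shape (C, H, W); ref_x0: reference batch of clean samples, (K, C, H, W);
    label_freq: (K,) empirical probabilities q(y_0^(j)); beta tilts the weights by q(y_0)^beta.
    """
    sq_dist = ((ref_x0 - x_t) ** 2).flatten(1).sum(dim=1)   # ||x_t - x_0^(j)||^2 per reference sample
    log_w = -sq_dist / (2.0 * sigma_t ** 2)                 # Gaussian kernel in log space
    log_w = log_w + beta * torch.log(label_freq)            # q(y_0)^beta adjustment of Eq. (7)
    return F.softmax(log_w, dim=0)                          # self-normalized weights, summing to 1
```

With beta = 0 this reduces to the plain weights of Eq. (5); beta = 1 tilts the weights toward head-class targets, which is the behavior T2H exploits.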
Furthermore, for faster and easier implementation, we first sample a mini-batch {x_0^(i), y_0^(i)}_{i=1}^K. Then, following the typical training strategy of the diffusion model, for each sample x_0^(i) we sample a random t ~ U(0, T) and random Gaussian noise ε_i ~ N(0, I) and obtain the perturbed noisy x_t. The training objective is to predict the noise ε̂_i with our parameterized Unet, ε_θ(x_t, t; y) = −σ_t s_θ(x_t, t; y). Since the score toward each target satisfies ∇_{x_t} log q(x_t | x_0^(i), y) ∝ −(1/σ_t) ε_i, we transform Eq. (6) from a score-weighted mixing problem into a score selection problem:

ε̂_i = ε_z;   z ~ p_sel(z) = q(x_t | x_0^(z), y_0^(z)) / Σ_j q(x_t | x_0^(j), y_0^(j)),   z ∈ {1, ..., K},   (8)

where p_sel(z) denotes the probability of selecting the score with index z in the batch of size K, and ε_z = (x_t − α_t x_0^(z)) / σ_t, i.e., the estimated noise implied by the new clean target and the current noisy sample. So we first calculate the multinomial distribution in Eq. (8) with the Gaussian kernel q(x_t | x_0^(z), y_0^(z)) ≈ B exp(−||x_0^(z) − x_t||_2^2 / (2σ_t^2)), and then sample z from this distribution. If q(y_0^(z)) ≥ q(y_0^(i)), the transfer is allowed and the target noise is substituted with ε_z, which means the noisy tail sample is mapped to a clean head sample. If not, the transfer is forbidden and the target falls back to ε_i, which means a noisy head sample is forbidden from mapping to a clean tail sample. Since a noisy tail sample is retargeted to a clean head sample, we denote this transfer mode as T2H (noisy Tail sample to clean Head sample). Here, we amplify the contribution of head samples by selectively transferring the target solely to the clean head samples in the reference batch. The equivalence between T2H and directly enhancing the contribution of head samples by setting β = 1 in score mixing is also validated in the experiments, as shown in Figure 5. We summarize the procedure in Algorithm 1. For H2T mode, we only need to change the boundary condition in the algorithm from q(y_0^(z)) ≥ q(y_0^(i)) to q(y_0^(z)) ≤ q(y_0^(i)).

Algorithm 1: T2H algorithm for conditional long-tail generation
  Sample a mini-batch {(x_0^(i), y_0^(i))}_{i=1}^K with the balanced distribution q*(x, y)
  for each sample (x_0^(i), y_0^(i)) in the mini-batch do
    Sample x_t with a random t and Gaussian noise ε_i ~ N(0, σ_t^2 I)
    Calculate p_sel(z) according to Eq. (8) with the Gaussian kernel C exp(−||x_0^(z) − x_t||_2^2 / (2σ_t^2)), z ∈ {1, ..., K}
    Sample z ~ p_sel(z)
    if q(y_0^(z)) ≥ q(y_0^(i)) then ε̂_i = ε_z else ε̂_i = ε_i end if
    Compute the denoising loss:
      if conditional then L_Diff = ||ε̂_i − ε_θ(x_t, t; y)||_2^2 else L_Diff = ||ε̂_i − ε_θ(x_t, t)||_2^2
  end for

For a specific sample of a tail class C_T, we can thus augment the score by encouraging the model to predict the score toward the head classes. Indeed, by leveraging the rich semantics of the head classes and increasing the diversity of scores from the tail categories along the generation path, our approach enhances the overall diversity of the generated samples. The increased variety ensures a more comprehensive generation of different classes and ultimately improves the overall quality and diversity of the generated samples. The fidelity of the transfer is examined in Appendix G.
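The following is a sketch of one retargeting step of Algorithm 1 for a single shared diffusion step t (the algorithm samples t per example; sharing one t keeps the example compact). The names x0, labels, label_freq, alpha_t, and sigma_t are illustrative, with label_freq playing the role of q(y) estimated from the original long-tailed training set.

```python
import torch

def t2h_retarget(x0, labels, label_freq, alpha_t, sigma_t):
    """One T2H step: perturb a balanced mini-batch and possibly retarget the noise.

    x0: (K, C, H, W) clean mini-batch; labels: (K,) tensor of class indices;
    label_freq: (L,) tensor with the empirical q(y) of the long-tailed training set.
    Returns the noisy batch x_t and the (possibly transferred) noise targets eps_hat.
    """
    eps = torch.randn_like(x0)
    x_t = alpha_t * x0 + sigma_t * eps                       # forward perturbation
    # p_sel(z) of Eq. (8): Gaussian-kernel weights over the reference batch
    sq = ((x_t.unsqueeze(1) - x0.unsqueeze(0)) ** 2).flatten(2).sum(-1)   # (K, K) pairwise distances
    p_sel = torch.softmax(-sq / (2.0 * sigma_t ** 2), dim=1)
    z = torch.multinomial(p_sel, 1).squeeze(1)                # sampled target index per sample
    # T2H gate: transfer only toward targets from classes at least as frequent
    allow = label_freq[labels[z]] >= label_freq[labels]
    eps_z = (x_t - alpha_t * x0[z]) / sigma_t                 # noise implied by the new clean target
    eps_hat = torch.where(allow.view(-1, 1, 1, 1), eps_z, eps)
    return x_t, eps_hat
```

The denoising loss is then ||eps_hat − ε_θ(x_t, t; y)||^2 as in Algorithm 1; flipping the comparison in `allow` to `<=` gives the H2T variant, and dropping the gate entirely gives the Full mode.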
Batch Re-sample. T2H achieves enhancement of the tail categories through head-to-tail transfer in conditional generation. Here we evaluate the strength of this transfer quantitatively. Consider two samples (x_0^H, y_0^H) and (x_0^T, y_0^T) from a head class C_H and a tail class C_T, and consider the distributions q(x_t | x_0^H, y_0^H) and q(x_t | x_0^T, y_0^T) under the same perturbation noise level N(0, σ_t^2 I). To measure the strength of the transition from C_T to C_H based on Langevin dynamics, the quantity E_{q(x_t | x_0^H, y_0^H)} [ q(x_t | x_0^T, y_0^T) / Σ_k q(x_t | x_0^(k), y_0^(k)) ] should be calculated (Song & Ermon, 2020). We then have the following proposition:

Proposition 3.1 Let q(x_t | x_0, y_0) ≈ B q(y_0)^β exp(−||x_t − x_0||_2^2 / (2σ_t^2)). Then:

E_{q(x_t | x_0^H, y_0^H)} [ q(x_t | x_0^T, y_0^T) / Σ_k q(x_t | x_0^(k), y_0^(k)) ] ≤ (B/2) (q(y^H) q(y^T))^β exp(−||x_0^H − x_0^T||_2^2 / (8σ_t^2)).   (9)

As the proposition shows, the transition strength is determined by the L2 distance between the two samples, expressed through a Gaussian kernel, and by the product of their label probabilities raised to the power β. From the perspective of Eq. (7) and the T2H algorithm, when β = 1 the transfer strength is constrained by the product of the q(y) values of the head and tail classes, which is relatively small. Therefore, employing a Batch Re-sample strategy during training, which equalizes the appearance probability of each category, can significantly enhance the head-to-tail transfer. Besides, in unconditional generation, the training data q(x_0) exhibits a long-tail distribution during the training process. As a result, if we directly use the score estimated in Eq. (5), the generated p(x_0) will also follow a long-tail distribution: the frequent occurrence of head samples leads to the unconditional scores being dominated by the scores of the head samples. To intuitively illustrate the necessity of the Batch Re-sample, we employ a toy example to illustrate the phenomenon of head-class dominance in long-tailed unconditional generation, as shown in Figure 4. The Batch Re-sample also alleviates this issue. More details can be found in Appendix J.

Figure 4: A toy example simulating the score distribution. The red and blue points denote head and tail samples, respectively.

Here, we follow a common assumption in long-tail recognition: the balanced distribution and the initial long-tail distribution are related by sharing the same conditional probability, q(x|y) = q*(x|y) (Zhang et al., 2013).

H2T, Full. In the former analysis, in order to augment the tail classes, we proposed a method denoted as noisy Tail to clean Head (T2H). Conversely, there exists a reverse method, H2T, denoting noisy Head to clean Tail. In H2T, the weight of the scores toward target tail samples is increased, corresponding to a smaller β in Eq. (7), e.g., β = −1, as the inverse value q(y)^{-1} for the tail classes is larger than for the head classes. Consequently, for a noisy head sample, the contribution of the tail sample targets is enhanced. In addition to H2T and T2H, if we do not assess the probabilities of the transferred labels, the mode is denoted as Full (meaning both transfer directions are allowed).
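To give a numerical reading of Proposition 3.1 and of the Batch Re-sample step, the sketch below evaluates the (unnormalized) head-tail transfer strength of Eq. (9) and builds per-sample weights that equalize class frequencies in a mini-batch. The function names and the constant B are illustrative; the weights can be fed to a standard weighted sampler such as torch.utils.data.WeightedRandomSampler.

```python
import numpy as np

def transfer_strength(x_head, x_tail, q_head, q_tail, sigma_t, beta=1.0, B=1.0):
    """Transfer strength bound of Proposition 3.1, up to the absorbed constant B."""
    sq = float(np.sum((np.asarray(x_head) - np.asarray(x_tail)) ** 2))
    return 0.5 * B * (q_head * q_tail) ** beta * np.exp(-sq / (8.0 * sigma_t ** 2))

def balanced_sample_weights(labels):
    """Batch Re-sample: per-sample weights so that every class appears equally often."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    return 1.0 / counts[labels]
```

With β = 1 the strength is scaled by q(y^H) q(y^T), so without re-balancing it is suppressed by the small tail-class frequency; equalizing the class frequencies in the reference batch removes this suppression, which is exactly the role of Batch Re-sample.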
Connection with CBDM (Qin et al., 2023). CBDM employs a score-aligning loss in the training stage: (1/|Y|) Σ_{y' ∈ Y} t ||ε_θ(x_t, y) − ε_θ(x_t, y')||^2. The conditional score of x_t with label y is regularized toward the conditional score with another label y', and the alignment strength is weighted by the diffusion time t. We can further prove that the optimal score of CBDM can be written as a weighted sum of the initial denoising score and an adjustment term, which is similar to Eq. (6). When the diffusion model converges, the optimal minimizer ε*(x_t, y) of the CBDM loss

L_CBDM(x_t, y, t, ε) = ||ε_θ(x_t, y) − ε||^2 + (τt / |Y|) Σ_{y' ∈ Y} ||ε_θ(x_t, y) − ε_θ(x_t, y')||^2

can be written as

ε*(x_t, y) = (1 / (1 + tτ)) ε + (tτ / ((1 + tτ) |Y|)) Σ_{y'} ε_θ(x_t, y'),

where the proof is provided in Appendix D. This operation can also be considered as augmenting the tail classes with the head classes through labels. The difference is that the reference of our method is obtained from the data, whereas CBDM's reference is based on the predictions and condition capacity of the model, which may cause biased scores, as discussed in Section 1. Our method possesses greater robustness in addition to state-of-the-art performance.

4 EXPERIMENTS

Experimental Setup. We start by selecting two datasets widely utilized in image synthesis, namely CIFAR10/CIFAR100, with their long-tailed versions CIFAR10LT and CIFAR100LT.

Table 1: Ablation study for both conditional and unconditional generation on CIFAR10LT.

| Model | FID (↓) | IS (↑) |
| A | 25.31 ± 0.12 | 7.01 ± 0.02 |
| B | 16.92 ± 0.17 | 8.15 ± 0.03 |
| C | 16.85 ± 0.10 | 8.16 ± 0.02 |
| D | 16.78 ± 0.09 | 8.11 ± 0.05 |
| E | 16.09 ± 0.11 | 8.27 ± 0.02 |
| F | 10.72 ± 0.23 | 9.37 ± 0.03 |
| G | 10.20 ± 0.13 | 9.65 ± 0.01 |
| H | 8.20 ± 0.09 | 9.77 ± 0.01 |
| I | 7.52 ± 0.12 | 9.73 ± 0.02 |
| J | 6.89 ± 0.09 | 9.75 ± 0.05 |

The construction of CIFAR10LT and CIFAR100LT follows the methodology proposed in Cao et al. (2019), wherein the size of each category decreases exponentially with its category index, adhering to an imbalance factor of imb = 0.01. For the CIFAR10LT dataset, we also implement a more skewed version with an imbalance factor of imb = 0.001. Two commonly used metrics for image generation are adopted, namely the Frechet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016). At inference time, we generate 50k images for the evaluation of the metrics. A DDIM sampler (Song et al., 2020a) is utilized with 100 steps (a skip of 10 steps), compared with the 1000 steps of the initial DDPM (Ho et al., 2020). Our training schedule strictly follows the implementation of CBDM (Qin et al., 2023), which follows the DDPM settings. The experiments are conducted with two settings corresponding to the methods: unconditional generation and conditional generation. For unconditional generation, we only adjust the training strategy, with no label information injected into the diffusion model for optimization; at the inference stage, the unconditional diffusion model is asked to generate 50k images freely. For conditional generation, label information is injected into the diffusion model in the training stage, while at the inference stage the diffusion model is asked to generate 50k/L images for each class, where L is the number of classes.

Table 2: Comparison with other long-tail generation methods on datasets with different imbalance factors.

| | CIFAR10LT (imb 0.01) | | CIFAR10LT (imb 0.001) | | CIFAR100LT (imb 0.01) | |
| Method | FID (↓) | IS (↑) | FID (↓) | IS (↑) | FID (↓) | IS (↑) |
| CBGAN (Rangwani et al., 2021) | 37.23 | 6.01 | 46.61 | 5.77 | 33.01 | 7.04 |
| gSR-GAN (Rangwani et al., 2022) | 12.86 | 8.56 | 38.71 | 6.89 | 13.96 | 10.01 |
| DDPM (Ho et al., 2020) | 10.72 | 9.37 | 15.00 | 9.16 | 10.25 | 12.96 |
| CBDM (Qin et al., 2023) | 7.27 | 9.37 | 12.71 | 9.01 | 7.82 | 12.40 |
| Ours (DDPM+T2H) | 6.89 | 9.75 | 11.56 | 9.17 | 6.68 | 12.94 |
Ablation study. We conduct both conditional and unconditional generation experiments on CIFAR10LT; the strategies discussed above are applied to a base DDPM unconditional/conditional generation model. As shown in Table 1, in unconditional generation, Batch Re-sample with H2T and T2H improves the performance, where H2T is more effective, as discussed in the former section. In conditional generation, T2H is more efficient than H2T and Full, since increasing the contribution of head samples for noisy tail samples enhances the diversity of the tail classes and promotes the overall generation performance.

Table 3: Results on the TinyImageNet200LT dataset with diffusion baselines.

| Method | FID (↓) | IS (↑) |
| Base DDPM | 19.24 | 18.20 |
| CBDM | 18.07 | 18.01 |
| Ours + T2H | 17.81 | 18.12 |

Comparison with other methods. We conduct conditional generation on the CIFAR10LT and CIFAR100LT datasets, and compare with two GAN-based long-tail generation methods, CBGAN (Rangwani et al., 2021) and gSR-GAN (Rangwani et al., 2022), and one diffusion-based method, CBDM. The results are shown in Table 2. As illustrated in Table 2, in the comparative analysis of conditional generation, the T2H method achieves the best FID of 6.89, surpassing the base DDPM by 3.83 and CBDM by 0.38. On the more skewed version of CIFAR10LT with an imbalance factor of 0.001, T2H achieves an improvement of more than 1.0 FID over CBDM. Under the CIFAR100LT benchmark, our method also improves over CBDM by more than 1.0 FID. The results show that our method obtains a further improvement for the diffusion model on long-tailed distributions, illustrating that calibration and augmentation from the data distribution are more effective than from the model prediction. We also conduct experiments on TinyImageNet200LT, a dataset with more classes and higher resolution, which is the long-tailed version of TinyImageNet200 (Tavanaei, 2020). The metric is based on 10k generated images referenced against its validation set, as shown in Table 3.

Figure 5: FID scores versus different β values. The performance of H2T (unconditional) and T2H (conditional) is also marked on the figure as pentagrams.

T2H and H2T relation with different β values. To validate the assumption in Eq. (7), we directly use the formula to calculate p_sel(z) instead of running H2T and T2H. As shown in Figure 5, in the case of conditional generation, the generation performance improves as β increases, whereas in the case of unconditional generation, the generation performance deteriorates as β increases, consistent with T2H (β = 1) being suited to conditional generation and H2T (β = −1) to unconditional generation.

Training Robustness of our method. In order to achieve better performance for image generation, a diffusion model needs to be trained for a long time. Due to various forms of data imbalance, which is essentially the long-tail nature of the data, the model tends to overfit, which leads to a decrease in performance, as observed in Figure 6. This phenomenon is evident in both conditional and unconditional diffusion model training without any additional processing. When employing CBDM in conditional generation, the situation can be mitigated to some extent; however, as training progresses, CBDM's augmentation heavily relies on the model's predictions for other labels, so the phenomenon persists. In contrast, our method effectively suppresses the issue and stabilizes performance over long training.
Figure 6: FID versus the number of training iterations (in units of 100k) on the CIFAR10LT dataset, for (a) conditional generation (DDPM Base, DDPM+CBDM, DDPM+T2H) and (b) unconditional generation (DDPM Base, DDPM+H2T). Our method achieves better performance and stabilizes the generation quality.

Transfer Probability with diffusion time. We study the selection probability p_sel(z) in Eq. (8) as a function of the diffusion time. The transfer probability Σ_{z ≠ i} p_sel(z), which denotes shifting the denoising target in the training stage, is counted and calculated over 100k samples perturbed with noise levels corresponding to different diffusion times within a mini-batch. As shown in Figure 7, when the diffusion step is between 500 and 800, the probability of transferring the denoising target increases from 0 to 1 as the step increases; if the step exceeds 800, the transfer occurs with approximately probability 1.0. We also restrict the transfer cutting step, meaning that the target transfer (T2H, H2T) is only allowed below this diffusion step. The FID scores decrease dramatically for both conditional and unconditional generation with a cutting step in the range 500-800, which is consistent with the observation of the high-variance phase in Xu et al. (2023).

Figure 7: The left subfigure shows the transfer probability (with transfer vs. no transfer) versus diffusion time. The right subfigure shows the FID scores with different transfer cutting times.

5 RELATED WORK

Diffusion Models. For the diffusion generation model, Ho et al. (2020) first showed that training the model can be accomplished by utilizing a weighted variational bound. Song et al. (2020b) proposed an alternative approach to constructing a diffusion model, which utilizes a stochastic differential equation (SDE) that gradually injects noise to smoothly transform a complex data distribution into a known prior distribution. Karras et al. (2022) present a design space that distinctly delineates the concrete design choices of former works. As for the diffusion process, Xu et al. (2023) observed three phases with distinct behaviors affecting the generation of the diffusion model. Raya & Ambrogioni (2023) demonstrate that the diffusion process can be modeled in a manner analogous to symmetry breaking in physics.

Long Tail Recognition. Long-tail recognition refers to the task of accurately recognizing and classifying rare or infrequently occurring classes in a given dataset together with frequently occurring classes (Zhou et al., 2022). There are several approaches to address the problem, including re-weighting (Huang et al., 2016), logit adjustment (Menon et al., 2020; Zhou et al., 2023), robust distributional matching (Zheng et al., 2023a; Chen et al., 2024), and knowledge transfer (Wang et al., 2017; Chen et al., 2022; 2023b). Cui et al. (2019) observe a diminishing-return phenomenon: as the number of samples increases, there is a decreasing marginal benefit for a model to extract additional information from the data due to information overlap.

Long Tail Generation. The objective of long-tail generation is to generate a more balanced and diverse dataset when training on a long-tailed dataset. CB-GAN (Rangwani et al., 2021) uses a regularizer that relies on a pretrained classifier in the training stage to ensure balanced learning of all classes in the dataset.
gSR-GAN (Rangwani et al., 2022) observes that the performance decline in long-tail generation primarily occurs because of class-specific mode collapse in tail classes, which is correlated with a spectral explosion of the conditioning parameter matrix, and proposes a corresponding group spectral regularizer. CBDM (Qin et al., 2023) makes use of a distribution adjustment regularizer in the training stage for the purpose of augmenting the tail classes.

6 CONCLUSION

The main challenge for long-tail diffusion generation is the lack of diversity in tail-class generation. To tackle this challenge, based on the multi-target characteristic of the denoising score, a T2H augmentation for the score estimation of noisy tail samples is achieved by increasing the score contribution of head samples in the reference batch. At the same time, the Batch Re-sample operation helps alleviate the dominant effect of head samples on the scores and promotes head-to-tail transfer. Experiments validate our approach on multiple benchmark datasets, demonstrating effectiveness compared with baseline methods and robustness over training time.

ACKNOWLEDGEMENT

This work is supported by the National Key R&D Program of China (No. 2022ZD0160702), STCSM (No. 22511106101, No. 22511105700, No. 21DZ1100100), 111 plan (No. BP0719010) and National Natural Science Foundation of China (No. 62306178). We also thank Fei Zhang for his insightful discussion.

REFERENCES

Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982.

Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, 32, 2019.

Mengxi Chen, Jiangchao Yao, Linyu Xing, Yu Wang, Ya Zhang, and Yanfeng Wang. Redundancy-adaptive multimodal learning for imperfect data. arXiv preprint arXiv:2310.14496, 2023a.

Xu Chen, Siheng Chen, Jiangchao Yao, Huangjie Zheng, Ya Zhang, and Ivor W. Tsang. Learning on attribute-missing graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2):740-757, 2022. doi: 10.1109/TPAMI.2020.3032189.

Xu Chen, Zida Cheng, Jiangchao Yao, Chen Ju, Weilin Huang, Jinsong Lan, Xiaoyi Zeng, and Shuai Xiao. Enhancing cross-domain click-through rate prediction via explicit feature augmentation. arXiv preprint arXiv:2312.00078, 2023b.

Xu Chen, Yuangang Pan, Ivor Tsang, and Ya Zhang. Learning node representations against perturbations. Pattern Recognition, 145:109976, 2024. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2023.109976. URL https://www.sciencedirect.com/science/article/pii/S003132032300674X.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268-9277, 2019.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375-5384, 2016.

Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104-12114, 2020.

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565-26577, 2022.

Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314, 2020.

Yiming Qin, Huangjie Zheng, Jiangchao Yao, Mingyuan Zhou, and Ya Zhang. Class-balancing diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18434-18443, 2023.

Harsh Rangwani, Konda Reddy Mopuri, and R Venkatesh Babu. Class balancing GAN with a classifier in the loop. In Uncertainty in Artificial Intelligence, pp. 1618-1627. PMLR, 2021.

Harsh Rangwani, Naman Jaswani, Tejan Karmali, Varun Jampani, and R Venkatesh Babu. Improving GANs for long-tailed data through group spectral regularization. In European Conference on Computer Vision, pp. 426-442. Springer, 2022.

Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, and R. Venkatesh Babu. NoisyTwins: Class-consistent and diverse image generation through StyleGANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.

Gabriel Raya and Luca Ambrogioni. Spontaneous symmetry breaking in generative diffusion models. arXiv preprint arXiv:2305.19693, 2023.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438-12448, 2020.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.

Amirhossein Tavanaei. Embedded encoder-decoder in convolutional networks towards explainable AI. arXiv preprint arXiv:2007.06712, 2020.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. Advances in Neural Information Processing Systems, 30, 2017.

Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Diffusion-GAN: Training GANs with diffusion. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=HZf7UbpWHuA.

Yilun Xu, Shangyuan Tong, and Tommi Jaakkola. Stable target field for reduced variance score estimation in diffusion models. arXiv preprint arXiv:2302.00670, 2023.
Lu Yang, He Jiang, Qing Song, and Jun Guo. A survey on long-tailed visual recognition. International Journal of Computer Vision, 130(7):1837-1872, 2022.

Jaejun Yoo, Namhyuk Ahn, and Kyung-Ah Sohn. Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8375-8384, 2020.

Fei Zhang, Tianfei Zhou, Boyang Li, Hao He, Chaofan Ma, Tianjiao Zhang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Uncovering prototypical knowledge for weakly open-vocabulary semantic segmentation. Advances in Neural Information Processing Systems, 36, 2024.

Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819-827. PMLR, 2013.

Zijian Zhang, Zhou Zhao, Jun Yu, and Qi Tian. ShiftDDPMs: Exploring conditional diffusion models by shifting diffusion trajectories. arXiv preprint arXiv:2302.02373, 2023.

Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33:7559-7570, 2020.

Huangjie Zheng and Mingyuan Zhou. Exploiting chain rule and Bayes' theorem to compare probability distributions. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 14993-15006. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/7e0ff37942c2de60cbcbd27041196ce3-Paper.pdf.

Huangjie Zheng, Xu Chen, Jiangchao Yao, Hongxia Yang, Chunyuan Li, Ya Zhang, Hao Zhang, Ivor Tsang, Jingren Zhou, and Mingyuan Zhou. Contrastive attraction and contrastive repulsion for representation learning. Transactions on Machine Learning Research, 2023a. ISSN 2835-8856. URL https://openreview.net/forum?id=f39UIDkwwc.

Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=HDxgaKk956l.

Zhihan Zhou, Jiangchao Yao, Yan-Feng Wang, Bo Han, and Ya Zhang. Contrastive learning with boosted memorization. In International Conference on Machine Learning, pp. 27367-27377. PMLR, 2022.

Zhihan Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Bo Han, and Yanfeng Wang. Combating representation learning disparity with geometric harmonization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

A THE EXPECTATION FORMULATION OF THE OPTIMAL SCORE FOR x_t
The optimal score for x_t can be calculated as:

∇_{x_t} log q(x_t) = E_{q(x_0|x_t)} [ ∇_{x_t} log q(x_t | x_0) ]

Proof A.1 Based on Bayes' theorem and the definition of expectation:

∇_{x_t} log q(x_t) = (1 / q(x_t)) ∇_{x_t} q(x_t)
= (1 / q(x_t)) ∇_{x_t} ∫ q(x_t | x_0) q(x_0) dx_0
= ∫ (q(x_0) / q(x_t)) ∇_{x_t} q(x_t | x_0) dx_0
= ∫ q(x_t | x_0) (q(x_0) / q(x_t)) ∇_{x_t} log q(x_t | x_0) dx_0
= ∫ (q(x_t, x_0) / q(x_t)) ∇_{x_t} log q(x_t | x_0) dx_0
= ∫ q(x_0 | x_t) ∇_{x_t} log q(x_t | x_0) dx_0
= E_{q(x_0|x_t)} [ ∇_{x_t} log q(x_t | x_0) ]

B THE SCORE FOR CONDITIONAL GENERATION

The optimal score for x_t given y can be calculated as:

∇_{x_t} log q(x_t | y) ≈ Σ_i ( q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y) ) ∇_{x_t} log q(x_t | x_0^(i), y);   (x_0^(i), y_0^(i)) ~ q(x_0, y_0)   (10)

Starting from ∇_{x_t} log q(x_t | y) = E_{q(x_0|x_t,y)} [ ∇_{x_t} log q(x_t | x_0, y) ], we use importance sampling:

E_{q(x_0|x_t,y)} [ ∇_{x_t} log q(x_t | x_0, y) ] = E_{q(x_0|y)} [ ( q(x_0 | x_t, y) / q(x_0 | y) ) ∇_{x_t} log q(x_t | x_0, y) ],   (11)

since, by the conditional probability formula,

q(x_0 | x_t, y) / q(x_0 | y) = q(x_t | x_0, y) q(x_0 | y) / ( q(x_t | y) q(x_0 | y) ) = q(x_t | x_0, y) / q(x_t | y).   (12)

Compared with unconditional generation, we here calculate q(x_t | y) as q(x_t | y) = ∫ q(x_t | x_0, y) q(x_0 | y) dx_0, which we estimate with a Monte Carlo sample:

q(x_t | y) ≈ Σ_j q(x_t | x_0^(j), y),   x_0^(j) ~ q(x_0 | y).   (13)

Substituting this into Eq. (12) and Eq. (11):

∇_{x_t} log q(x_t | y) ≈ Σ_i ( q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y) ) ∇_{x_t} log q(x_t | x_0^(i), y),   x_0^(i) ~ q(x_0 | y).   (14)

Here, we want to augment the distribution q(x_0 | y) for larger generation diversity. We therefore analyze the samples (x̃_0^(j), y) that cannot be sampled during the score estimation stage in Eq. (14). For a fixed small threshold probability p_s:

i) for small q(x_t | x̃_0^(j), y) < p_s, the final scores are almost unaffected by these samples;

ii) for q(x_t | x̃_0^(j), y) ≥ p_s, we make the assumption that there exists (x_0^(j'), y_0^(j')) such that

∇_{x_t} log q(x_t | x_0^(j'), y_0^(j')) ≈ ∇_{x_t} log q(x_t | x̃_0^(j), y),   (x_0^(j'), y_0^(j')) ~ q(x_0, y_0).   (15)

We give a brief justification of this assumption. Suppose x_t is obtained by adding sampled noise with probability p_1 to (x_0^1, y), and consider another sample (x̃_0^2, y) ~ q(x_0 | y) with probability q(x_t | x̃_0^2, y) ≥ p_s. Under q(x_0 | x_t, y) ≈ C exp(−||x_0 − x_t||_2^2 / (2σ_t^2)), we can bound the similarity between x_0^1 and x̃_0^2 as ||x_0^1 − x̃_0^2||_2^2 ≤ −2σ_t^2 (log p_1 + log p_s), following a procedure similar to Proposition G.1. Thus x̃_0^2 lies in a ball of radius r = sqrt(−2σ_t^2 (log p_1 + log p_s)) centered at x_0^1. If there exists an (x_0^2', y_0^2') ~ q(x_0, y_0) in this ball, which can be sampled during the training stage with another label y_0^2', we can approximately substitute x̃_0^2 with x_0^2' for both the probability and the score. The substitution error of the probability is bounded by |q(x_t | x_0^2', y_0^2') − q(x_t | x̃_0^2, y)| ≤ max(q(x_t | x̃_0^2, y), q(x_t | x_0^2', y_0^2')) (log p_1 + r / (2σ_t^2)), and the substitution error of the scores is bounded by ||∇_{x_t} log q(x_t | x_0^2', y_0^2') − ∇_{x_t} log q(x_t | x̃_0^2, y)||_2 ≤ r / σ_t^2. Both substitution errors are suppressed at larger time steps with larger σ_t. We then broaden the distribution q(x_0 | y) to the entire distribution q(x_0) as the reference:

∇_{x_t} log q(x_t | y) ≈ Σ_i ( q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y) ) ∇_{x_t} log q(x_t | x_0^(i), y);   (x_0^(i), y_0^(i)) ~ q(x_0, y_0)   (16)

Note that in practice we can evaluate the density with q(x_t | x_0^(j)), as the diffusion process is not related to the label. So we slightly abuse the notation, substituting q(x_t | x_0^(j), y) with q(x_t | x_0^(j), y_0^(j)) to represent the correspondence between x_0^(j) and y_0^(j):

∇_{x_t} log q(x_t | y) ≈ Σ_i ( q(x_t | x_0^(i), y) / Σ_j q(x_t | x_0^(j), y_0^(j)) ) ∇_{x_t} log q(x_t | x_0^(i), y);   (x_0^(i), y_0^(i)) ~ q(x_0, y_0)   (17)

Note that the sampling is initially operated with label y instead of y_0^(j).
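As a sanity check on the self-normalized estimators derived above (and on Eq. (5) in the main text), the following 1-D example compares the exact score of a Gaussian-smoothed empirical data distribution with its reference-batch estimate. The two-component mixture is an illustrative stand-in for q(x_0); all names are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_t = 1.0

# A toy 1-D stand-in for q(x_0): many "head" points near -3 and a few "tail" points near 3
x0 = np.concatenate([rng.normal(-3.0, 0.3, 5000), rng.normal(3.0, 0.3, 500)])

def exact_score(xt):
    """Exact score of q_sigma(x_t), the empirical data distribution smoothed with N(0, sigma_t^2)."""
    w = np.exp(-(xt - x0) ** 2 / (2 * sigma_t ** 2))
    return np.sum(w * (x0 - xt)) / (sigma_t ** 2 * np.sum(w))

def batch_score(xt, batch_size=256):
    """Self-normalized reference-batch estimate of Eq. (5)."""
    ref = rng.choice(x0, size=batch_size, replace=False)
    log_w = -(xt - ref) ** 2 / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return np.sum(w * (ref - xt) / sigma_t ** 2)

xt = 1.0
print(exact_score(xt), batch_score(xt))   # the two estimates should be close
```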
C THE PROOF FOR PROPOSITION 3.1

Proposition C.1 Let q(x_t | x_0^(i), y_0^(i)) ≈ B q(y_0^(i))^β exp(−||x_t − x_0^(i)||_2^2 / (2σ_t^2)). Then:

E_{q(x_t | x_0^(i), y_0^(i))} [ q(x_t | x_0^(j), y_0^(j)) / Σ_k q(x_t | x_0^(k), y_0^(k)) ] ≤ (B/2) (q(y_0^(i)) q(y_0^(j)))^β exp(−||x_0^(i) − x_0^(j)||_2^2 / (8σ_t^2))

Proof. Writing the expectation as an integral,

E_{q(x_t | x_0^(i), y_0^(i))} [ q(x_t | x_0^(j), y_0^(j)) / Σ_k q(x_t | x_0^(k), y_0^(k)) ] = ∫ q(x_t | x_0^(i), y_0^(i)) q(x_t | x_0^(j), y_0^(j)) / Σ_k q(x_t | x_0^(k), y_0^(k)) dx_t,

the term inside the integral can be bounded as

q(x_t | x_0^(i), y_0^(i)) q(x_t | x_0^(j), y_0^(j)) / Σ_k q(x_t | x_0^(k), y_0^(k)) ≤ q(x_t | x_0^(i), y_0^(i)) q(x_t | x_0^(j), y_0^(j)) / ( q(x_t | x_0^(i), y_0^(i)) + q(x_t | x_0^(j), y_0^(j)) ) = 1 / ( 1/q(x_t | x_0^(i), y_0^(i)) + 1/q(x_t | x_0^(j), y_0^(j)) ) ≤ (1/2) sqrt( q(x_t | x_0^(i), y_0^(i)) q(x_t | x_0^(j), y_0^(j)) ).

Substituting the assumption q(x_t | x_0^(i), y_0^(i)) ≈ B q(y_0^(i))^β exp(−||x_t − x_0^(i)||_2^2 / (2σ_t^2)) into the product gives

q(x_t | x_0^(i), y_0^(i)) q(x_t | x_0^(j), y_0^(j)) = B q(y_0^(i))^β exp(−||x_t − x_0^(i)||_2^2 / (2σ_t^2)) · B q(y_0^(j))^β exp(−||x_t − x_0^(j)||_2^2 / (2σ_t^2)) = B^2 (q(y_0^(i)) q(y_0^(j)))^β exp(−(||x_t − x_0^(i)||_2^2 + ||x_t − x_0^(j)||_2^2) / (2σ_t^2)).

Taking the integral and excluding terms unrelated to x_t, the bound becomes

(B/2) (q(y_0^(i)) q(y_0^(j)))^β ∫ exp(−(||x_t − x_0^(i)||_2^2 + ||x_t − x_0^(j)||_2^2) / (4σ_t^2)) dx_t.

We extract the numerator term of the exponential:

||x_t − x_0^(i)||_2^2 + ||x_t − x_0^(j)||_2^2
= (x_t − x_0^(i))^T (x_t − x_0^(i)) + (x_t − x_0^(j))^T (x_t − x_0^(j))
= 2 x_t^T x_t − 2 x_t^T (x_0^(i) + x_0^(j)) + x_0^(i)T x_0^(i) + x_0^(j)T x_0^(j)
= 2 ||x_t − (x_0^(i) + x_0^(j))/2||_2^2 + x_0^(i)T x_0^(i) + x_0^(j)T x_0^(j) − (x_0^(i) + x_0^(j))^T (x_0^(i) + x_0^(j)) / 2.

The first term, which depends on x_t, integrates to a constant K that does not involve x_0^(i) or x_0^(j),

∫ exp(−||x_t − (x_0^(i) + x_0^(j))/2||_2^2 / (2σ_t^2)) dx_t = K,

owing to Gaussian normalization. The remaining term is

x_0^(i)T x_0^(i) + x_0^(j)T x_0^(j) − (x_0^(i) + x_0^(j))^T (x_0^(i) + x_0^(j)) / 2 = (1/2) ( x_0^(i)T x_0^(i) + x_0^(j)T x_0^(j) − 2 x_0^(i)T x_0^(j) ) = (1/2) ||x_0^(i) − x_0^(j)||_2^2.

Substituting back into the initial formula, we obtain

E_{q(x_t | x_0^(i), y_0^(i))} [ q(x_t | x_0^(j), y_0^(j)) / Σ_k q(x_t | x_0^(k), y_0^(k)) ]
≤ (B/2) (q(y_0^(i)) q(y_0^(j)))^β ∫ exp(−(||x_t − x_0^(i)||_2^2 + ||x_t − x_0^(j)||_2^2) / (4σ_t^2)) dx_t
= (B/2) (q(y_0^(i)) q(y_0^(j)))^β ∫ exp(−||x_t − (x_0^(i) + x_0^(j))/2||_2^2 / (2σ_t^2)) dx_t · exp(−||x_0^(i) − x_0^(j)||_2^2 / (8σ_t^2))
= (BK/2) (q(y_0^(i)) q(y_0^(j)))^β exp(−||x_0^(i) − x_0^(j)||_2^2 / (8σ_t^2)).

The constant K is absorbed into B, which finishes the proof.

D THE RELATION WITH THE CBDM LOSS AND ANALYSIS

Proposition D.1 When the diffusion model converges, the optimal minimizer ε*(x_t, y) of the CBDM loss

L_CBDM(x_t, y, t, ε) = ||ε_θ(x_t, y) − ε||^2 + (τt / |Y|) Σ_{y' ∈ Y} ||ε_θ(x_t, y) − ε_θ(x_t, y')||^2

can be written as

ε*(x_t, y) = (1 / (1 + tτ)) ε + (tτ / ((1 + tτ) |Y|)) Σ_{y'} ε_θ(x_t, y').

Proof D.1 When the diffusion model converges, for a specific label y, the scores ε_θ(x_t, y') for the other labels y' are fixed. Expanding the loss,

L_CBDM(x_t, y, t, ε) = ε_θ(x_t, y)^T ε_θ(x_t, y) + ε^T ε − 2 ε_θ(x_t, y)^T ε + (τt / |Y|) Σ_{y' ∈ Y} ( ε_θ(x_t, y)^T ε_θ(x_t, y) + ε_θ(x_t, y')^T ε_θ(x_t, y') − 2 ε_θ(x_t, y)^T ε_θ(x_t, y') )
= (1 + τt) ε_θ(x_t, y)^T ε_θ(x_t, y) − 2 ε_θ(x_t, y)^T ( ε + (τt / |Y|) Σ_{y' ∈ Y} ε_θ(x_t, y') ) + Const,

where Const does not depend on ε_θ(x_t, y).
The term related to ε_θ(x_t, y) can be expressed as a quadratic form:

L_CBDM(x_t, y, t, ε) = (1 + τt) ε_θ(x_t, y)^T ε_θ(x_t, y) − 2 ε_θ(x_t, y)^T ( ε + (τt / |Y|) Σ_{y' ∈ Y} ε_θ(x_t, y') ) + Const
= (1 + τt) || ε_θ(x_t, y) − ( ε + (τt / |Y|) Σ_{y' ∈ Y} ε_θ(x_t, y') ) / (1 + τt) ||_2^2 + const.

When L_CBDM converges to its minimum, this L2 norm should be approximately zero, so that

ε*(x_t, y) ≈ (1 / (1 + tτ)) ε + (tτ / ((1 + tτ) |Y|)) Σ_{y'} ε_θ(x_t, y').

Analysis. This approach has two limitations. Firstly, referring to scores from other labels requires training the entire model within a conditional generation framework, thus restricting its applicability. Secondly, relying on scores from other labels for the same input x_t introduces potential biases, particularly when there is a substantial semantic difference between the two classes, which can lead to some degree of offset. In contrast, our method addresses these limitations. First, it can be used in both conditional and unconditional generation scenarios. Second, we utilize the inherent similarity of the data in the L2 space where the diffusion model operates: for example, in Figure 2, we consider the similarity between the red airplane and the red car, ensuring that our enhancement is more logically grounded.

E CLASS-WISE FID SCORES OF T2H COMPARED TO BASE DDPM

We calculate the FID score using 5k images per class for T2H and the base DDPM on the CIFAR10LT dataset. The final column of the table is the overall performance using 50k images.

Table 4: Class-wise FID scores of T2H compared to base DDPM.

| Class | Airpl | Auto | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck | All |
| P_cls | 0.403 | 0.241 | 0.145 | 0.086 | 0.052 | 0.031 | 0.018 | 0.01 | 0.0067 | 0.004 | 1.0 |
| Base | 31.49 | 15.20 | 40.81 | 32.32 | 32.32 | 36.34 | 40.47 | 23.20 | 26.31 | 23.31 | 10.72 |
| T2H | 31.84 | 14.58 | 19.42 | 28.32 | 18.31 | 26.83 | 29.90 | 17.70 | 21.04 | 22.82 | 6.89 (-3.83) |

We have also modified the assignment of head and tail classes based on whether they are animals or vehicles. As shown in the table below, by altering the assignment of classes, our method demonstrates an even greater improvement over the baseline.

Table 5: Class-wise FID scores of T2H compared to base DDPM with shifted categories.

| Class | Horse | Bird | Frog | Deer | Dog | Cat | Truck | Airpl | Auto | Ship | All |
| P_cls | 0.403 | 0.241 | 0.145 | 0.086 | 0.052 | 0.031 | 0.018 | 0.01 | 0.0067 | 0.004 | 1.0 |
| Base | 20.19 | 21.99 | 22.45 | 20.57 | 38.07 | 32.73 | 24.87 | 49.24 | 29.48 | 39.12 | 11.79 |
| T2H | 20.16 | 18.52 | 19.83 | 16.23 | 24.60 | 27.50 | 14.73 | 30.08 | 20.99 | 21.41 | 7.15 (-4.64) |

F THE IMPACT OF DATASET IMBALANCE ON PERFORMANCE

We investigate the impact of dataset balance on performance. As illustrated in the following table, we compute FID scores using different reference sets: one balanced and the other long-tail sampled. It can be observed that, despite the long-tail dataset consisting of real images, its performance metrics are inferior to those of the balanced generated dataset, due to its significant imbalance.

Table 6: The FID (IS) scores for the imbalanced real dataset and the generated datasets. "Bal" means a balanced dataset or generation; "ref" means the reference real dataset.

| | No Bal: No-bal ref | No Bal: T2H | No Bal: Base | Bal: T2H | Bal: Base |
| Bal ref | 26.22 (5.80) | 45.79 (7.04) | 46.66 (7.03) | 8.38 (9.62) | 12.95 (9.52) |
| No-Bal ref | - | 28.68 | 30.34 | 33.19 | 38.81 |

G THE VALIDATION OF TRANSFERRED TARGETS WITH T2H AND H2T

In this section, we aim to verify the reliability and correctness of the transferred target.
Suppose we have a clean sample x_0, perturbed with noise ε ~ N(0, σ_t^2) to obtain a noisy sample x_t with probability p_t = B exp[−||x_0 − x_t||_2^2 / (2σ_t^2)], where B ∝ 1/σ_t is a normalizing constant. Then x_t enters the T2H or H2T algorithm and the target is transferred to x_0^(z) with probability p_sel(z) = q(x_t | x_0^(z), y_0^(z)) / Σ_j q(x_t | x_0^(j), y_0^(j)), where q(x_t | x_0^(z), y_0^(z)) = B exp[−||x_0^(z) − x_t||_2^2 / (2σ_t^2)] is also calculated with the Gaussian kernel. Firstly, let us discuss the similarity of x_0 and x_0^(z), measured by ||x_0 − x_0^(z)||_2^2, with the following proposition.

Proposition G.1 The L2 similarity between x_0 and x_0^(z) is bounded by p_t and p_sel(z):

||x_0 − x_0^(z)||_2^2 ≤ −2σ_t^2 ( log(p_t p_sel(z)) + log(p_t + p_sel(z)) + 2 log σ_t ).   (18)

Proof.

||x_0 − x_0^(z)||_2^2 ≤^(1) 2σ_t^2 ( ||x_0^(z) − x_t||_2^2 / (2σ_t^2) + ||x_0 − x_t||_2^2 / (2σ_t^2) )
= −2σ_t^2 ( log exp[−||x_0^(z) − x_t||_2^2 / (2σ_t^2)] + log exp[−||x_0 − x_t||_2^2 / (2σ_t^2)] )
= −2σ_t^2 log ( p_t · p_sel(z) · Σ_j q(x_t | x_0^(j), y_0^(j)) / B^2 )
≤^(2) −2σ_t^2 ( log(p_t p_sel(z)) + log(p_t + p_sel(z)) + 2 log σ_t ),

where (1) uses the triangle inequality, and (2) comes from Σ_j q(x_t | x_0^(j), y_0^(j)) ≥ q(x_t | x_0, y_0) + q(x_t | x_0^(z), y_0^(z)).

As can be observed from the proposition above, the similarity between x_0 and x_0^(z) has an upper bound. Moreover, in the sampling process, the larger the values of p_t and p_sel(z), the tighter this bound becomes. Furthermore, from an alternative viewpoint, we evaluate the probability of obtaining x_t from a clean x_0^(z) within the framework of a one-to-one clean-to-noisy mapping.

Proposition G.2 The probability p_t^z of obtaining x_t from x_0^(z) under the single-denoising-target scenario satisfies

p_t^z ≥ p_sel(z) p_t / (1 − p_sel(z)),   (19)

where p_sel(z) comes from Eq. (8) and q(x_t | x_0, y_0) ≈ B exp[−||x_0 − x_t||_2^2 / (2σ_t^2)].

Proof.

p_sel(z) = q(x_t | x_0^(z), y_0^(z)) / Σ_j q(x_t | x_0^(j), y_0^(j)) ≤ q(x_t | x_0^(z), y_0^(z)) / ( q(x_t | x_0, y_0) + q(x_t | x_0^(z), y_0^(z)) ).

Because p_t = q(x_t | x_0, y_0) and p_t^z = q(x_t | x_0^(z), y_0^(z)) under the assumption of Eq. (8), we have

p_sel(z) ≤ p_t^z / (p_t^z + p_t),

so we obtain

p_t^z ≥ p_sel(z) p_t / (1 − p_sel(z)).

This gives a lower bound on p_t^z. If we assume p_t ≈ p_sel(z) ≈ 0.5, then p_t^z ≥ 0.5, so x_t → x_0^(z) is also a valid noise-clean pair under the single-target scenario.

H EXPERIMENTS ON THE IMAGENET-LT DATASET

We apply our method to the large-scale ImageNet-LT dataset; the performance is shown in Table 7. We generate 20k images over 1000 classes and use the balanced validation set with 20k images as the reference set for the calculation of the FID scores. As shown in the table, our results at large scale are consistent with the observations on small-scale data, which validates the effectiveness of our method at scale.

Table 7: Experiments on the large-scale ImageNet-LT dataset.

| Method | FID (↓) | IS (↑) |
| Base DDPM | 26.95 | 15.99 |
| CBDM | 28.12 | 15.86 |
| T2H (Ours) | 25.42 | 15.96 |

I FINETUNING EXPERIMENTS FROM A NORMALLY TRAINED MODEL

We also conduct finetuning experiments starting from a model pre-trained with the normal denoising loss function. The finetuning starting step ranges from 100k to 500k, and the results with no pretraining and with no finetuning are also provided. The conditional model is finetuned with the T2H strategy, while the unconditional model is finetuned with the H2T strategy. As shown in the table, finetuning is capable of further improving model performance on top of pretraining. However, as the number of pretraining steps increases, the extent of the improvement gradually diminishes and eventually becomes stable.
Table 8: Finetuning from normally pretrained models with different numbers of pretraining steps.

| | Pretraining steps | No finetune | No pre-train | 100k | 200k | 300k | 400k |
| Base Uncond | FID | 25.31 | 16.09 | 18.13 | 20.31 | 21.65 | 21.10 |
| | IS | 7.01 | 8.27 | 7.93 | 7.86 | 7.32 | 7.33 |
| Base Cond | FID | 10.20 | 6.89 | 7.48 | 7.87 | 8.01 | 8.12 |
| | IS | 9.25 | 9.75 | 9.64 | 9.63 | 9.62 | 9.56 |

J IMPLEMENTATION DETAILS OF THE TOY GAUSSIAN EXAMPLE IN FIGURE 4

We randomly sample 10k points from the distribution N((0, 4), 0.2I) as head samples {x_0^(H_i)}, and 0.1k points from N((0, -4), 0.2I) as tail samples {x_0^(T_j)}. The empirical distribution of the overall dataset can be denoted as

p_data(x) = Σ_i δ(x − x_0^(H_i)) + Σ_j δ(x − x_0^(T_j)).

Simulating the forward diffusion process, we convolve the empirical distribution with Gaussian noise N(0, σ_t^2 I):

p_{σ_t}(x_t) = p_data(x) * N(0, σ_t^2 I) = Σ_i N(x_t; x_0^(H_i), σ_t^2 I) + Σ_j N(x_t; x_0^(T_j), σ_t^2 I).   (20)

The score estimate is calculated as ∇_{x_t} log p_{σ_t}(x_t); substituting the result of Eq. (20):

∇_{x_t} log p_{σ_t}(x_t) = ∇_{x_t} log ( Σ_i N(x_t; x_0^(H_i), σ_t^2 I) + Σ_j N(x_t; x_0^(T_j), σ_t^2 I) )
= −(1/σ_t^2) · [ Σ_i (x_t − x_0^(H_i)) N(x_t; x_0^(H_i), σ_t^2 I) + Σ_j (x_t − x_0^(T_j)) N(x_t; x_0^(T_j), σ_t^2 I) ] / [ Σ_i N(x_t; x_0^(H_i), σ_t^2 I) + Σ_j N(x_t; x_0^(T_j), σ_t^2 I) ].

The last step follows from

∇_{x_t} N(x_t; x, σ_t^2 I) = ∇_{x_t} C exp(−||x_t − x||_2^2 / (2σ_t^2)) = C exp(−||x_t − x||_2^2 / (2σ_t^2)) ∇_{x_t} ( −||x_t − x||_2^2 / (2σ_t^2) ) = −(1/σ_t^2) N(x_t; x, σ_t^2 I) (x_t − x).

The number of points in the head class is 100 times greater than in the tail class, corresponding to an imbalance factor of 0.01. It can be observed that the score distribution across the entire space is dominated by the head class, leading to highly imbalanced data generation.

K FUTURE WORK

Diffusion models are now widely applied to generation and learning research in multimodal fields. Because the definition of the long-tail problem is not yet clear in the context of multimodal data, future effort can be devoted to defining data defects such as long-tail distributions in multimodal data (Chen et al., 2023a), to solving the long-tail generation problem in multimodal scenarios, and to more downstream tasks in multimodal contexts (Zhang et al., 2024).

L VISUALIZATION RESULTS

Figure 8: Visualization of generation results on the CIFAR10LT dataset with T2H conditional generation.

Figure 9: Visualization of generation results on the CIFAR100LT dataset with T2H conditional generation.

Figure 10: Visualization of generation results on the TinyImageNet200 dataset with T2H conditional generation.