# Representative Guidance: Diffusion Sampling with Coherence

Published as a conference paper at ICLR 2025

Anh-Dung Dinh, School of Computer Science, The University of Sydney, anh-dung.dinh@sydney.edu.au
Daochang Liu, School of Physics, Mathematics and Computing, The University of Western Australia, daochang.liu@uwa.edu.au
Chang Xu, School of Computer Science, The University of Sydney, c.xu@sydney.edu.au

The diffusion sampling process faces a persistent challenge stemming from its incoherence, attributable to varying noise directions across different timesteps. Our Representative Guidance (RepG) offers a new perspective on this issue by reformulating the sampling process with a coherent direction toward a representative target. From this perspective, classic classifier guidance reveals its drawback of lacking meaningful representative information: the features it relies on are optimized for discrimination and tend to highlight only a narrow set of class-specific cues. This focus often sacrifices diversity and increases the risk of adversarial generation. In contrast, we leverage self-supervised representations as the coherent target and treat sampling as a downstream task, one that focuses on refining image details and correcting generation errors rather than settling for oversimplified outputs. Representative Guidance achieves superior performance and demonstrates the potential of pre-trained self-supervised models in guiding diffusion sampling. Our findings show that RepG not only significantly improves vanilla diffusion sampling, but also surpasses state-of-the-art benchmarks when combined with classifier-free guidance. Source code: https://github.com/dungdinhanh/rep-guidance.

1 INTRODUCTION

In diffusion sampling processes Ho et al. (2020), a persistent challenge arises from incoherence due to uncontrollable noise introduced at each timestep. As illustrated in Figure 1, at each timestep, $x_t$ is used to predict the original image, which then aids in generating $x_{t-1}$ in the next step. During training, the original images are sampled from the dataset, ensuring a consistent image distribution across all timesteps. However, during inference, the real dataset distribution is unavailable. Instead, the diffusion model draws from varying distributions at each timestep, incorporating different types of information, as shown in the bottom row of Figure 1. This distributional shift between timesteps introduces incoherent features into the generated images.

This paper addresses incoherence by framing it as a discrepancy between the predicted image distributions across successive timesteps. Such discrepancies allow noise information to persist, leading to undesired artifacts in the generated images. For instance, an image of a Leonberger may exhibit bizarre or inconsistent features in consecutive timesteps, hindering its transformation into a realistic depiction as the sampling process progresses, as illustrated in Figure 2. Moreover, the generated images often lack crucial details, such as background elements and fine object features. While efforts such as DDIM Song et al. (2020a) have attempted to alleviate incoherence by removing random noise during sampling, they often do so at the cost of sample quality. As a result, many recent diffusion models continue to rely on the mechanisms of conventional DDPMs Ho et al. (2020).
Figure 1: The top row is the real sampling process during training, where at every timestep real images are drawn from a coherent distribution. During the inference phase, as in the bottom row, the predicted images at every timestep instead follow different distributions: images at earlier timesteps are more blurred than images in the later stages of sampling. This results in incoherence between the intermediate distributions.

Under our formulation of the incoherence, we propose a solution that tunes image features at each timestep to rectify incoherent features. We introduce a guidance scheme termed Representative Guidance (RepG), which leverages information from representative vectors to steer the sampling process toward a coherent direction. Unlike traditional classifier guidance, where one-hot vectors represent classes, RepG represents each class through a set of representative vectors containing features specific to that class. To harness the optimal representative information, we employ self-supervised models, prevalent in representation learning, as our guidance model. The gradients derived from the pre-trained self-supervised model are directly integrated into the sampling process to tune the features of generated images. In this sense, the sampling process can be viewed as a downstream task of the self-supervised model.

In comparison to classifier guidance, a popular method for enhancing diffusion model performance, RepG offers multiple advantages. Firstly, our method provides a better representative target than classifier guidance. The representative vectors for each class inherently contain valuable information for generative tasks. In contrast, classifier guidance relies on one-hot vectors to represent each class, which offer limited information. This overly compact target leads to reliance on discriminative features within the classifier, which often proves insufficient for generative tasks and raises concerns about potential adversarial effects that could degrade the quality of generated images Dinh et al. (2023b). Secondly, self-supervised models are trained to generalize well across datasets rather than being tailored to a single task like classifiers. This characteristic removes the need for noise-aware training of the guidance model, which can be prohibitively expensive, particularly for high-resolution images. Additionally, unlike noise-aware classifiers, self-supervised networks do not need to memorize noise patterns, making the guidance model more lightweight. For instance, our RepG, which leverages a ResNet-50 backbone, reduces computational time during sampling. Thirdly, RepG does not compromise diversity, unlike the classifier guidance approach. While classifier guidance alters images at the class level and sacrifices diversity, RepG fine-tunes images at the feature level. Consequently, while the former encourages generating images with only the most popular features of each class, RepG preserves most of the image content while modifying faulty features and details. In summary, our proposed RepG operates distinctively compared to classifier guidance.
As for classifier-free guidance: while classifier-free guidance offers a trade-off between quality and diversity, our method focuses on upgrading details and fine-tuning features, as depicted in Figure 3.

Figure 2: Condition: Leonberg. The top row is the vanilla diffusion sampling process and the bottom row is the sampling process with our Representative Guidance, shown at timesteps 250, 200, 185, 160, 130, 100, 50, and 1. From timestep 250 to timestep 185, the two processes are similar. However, inconsistent features appear in the vanilla sampling process: a black bubble exists at the head and the tail of the Leonberg at timestep 160. Without RepG, the process struggles to fix these inconsistent features for the rest of the process. In contrast, RepG handles the case by removing the inconsistent features and making the image very clear from timestep 130 onward. The RepG sampling process later focuses on improving other details such as hair, background, and surrounding objects. (Dataset: ImageNet 256x256 / Diffusion Model: ADM)

Consequently, our RepG can complement classifier-free guidance to enhance the generation quality further. Combining our method with classifier-free guidance demonstrates superior performance compared to several state-of-the-art baselines. The contributions of this paper are three-fold: (i) we model the incoherence of the diffusion sampling process and introduce a suitable guidance scheme; (ii) we propose a representative guidance target based on self-supervised pre-trained models; (iii) we validate the results against a number of state-of-the-art baselines.

2 RELATED WORKS

Denoising Diffusion Probabilistic Models (DDPMs) Ho et al. (2020) and their score-based counterparts Song & Ermon (2019); Song et al. (2020b) have become among the most popular generative models and are replacing Generative Adversarial Networks (GANs) Odena et al. (2017); Kang et al. (2021); Sauer et al. (2022). Follow-up works Song et al. (2020a); Nichol & Dhariwal (2021); Dhariwal & Nichol (2021); Bao et al. (2022); Lam et al. (2022) improve the models from different perspectives, such as sampling-time reduction or sample-quality improvement. Recent trends in diffusion models leverage a latent space for the diffusion and denoising processes, such as DiT Peebles & Xie (2023) and Stable Diffusion Rombach et al. (2022), offering diffusion models with less sampling time and good-quality images. Exposure bias Ning et al. (2023); Yu et al. (2023); Li et al. (2023) refers to noise accumulating across timesteps due to the lack of ground truth. The incoherence problem in this paper has a different meaning: incoherence is a mismatch between the distributions of predicted images at two timesteps that should share the same information. This mismatch leaves a gap through which incoherent features are added to the images.

Guidance methods have also emerged as essential techniques to boost the quality of generated samples Dhariwal & Nichol (2021); Nichol et al. (2021); Zheng et al. (2022); Dinh et al. (2023a;b); Liu et al. (2023); Bansal et al. (2023). In general, the gradient of a noise-aware or off-the-shelf classifier/CLIP model is used to guide the diffusion sampling process and improve its FID. Classifier-free guidance Ho & Salimans (2022) offers a different way to trade quality for diversity by combining conditional and unconditional diffusion models in the same framework.
In Dinh et al. (2023b), the authors point out that classifier guidance relies only on the most discriminative features during sampling, which reduces the robustness and diversity of the generated images. Moreover, to achieve superior performance, these methods all give up diversity by significantly modifying image details toward the most common features of the conditional class. In this manuscript, we propose a guidance method that fixes the details of an image through feature-level guidance instead of generating a different image.

Although ProG Dinh et al. (2023b) alleviates diversity suppression by including features of other classes, it is still based on discriminative features from a classifier, which are not diverse enough for a generative task. Thus, our work utilizes self-supervised models that contain more general information. Self-supervised models Chen et al. (2020b); He et al. (2020); Chen & He (2021); Grill et al. (2020); Chen et al. (2020a) aim to learn representative vectors that contain helpful information about the data. While the applications of these models to generative tasks are still limited, this work shows that the pre-trained backbone of a self-supervised model is helpful without any training or fine-tuning. Other works on self-supervised learning in diffusion models aim to fine-tune the diffusion model in a self-supervised manner or utilize the diffusion model itself as a self-supervised model Hu et al. (2023); Zhang et al. (2024).

3 BACKGROUND

DDPMs define $p_\theta(x_0) := \int p_\theta(x_{0:T})\,dx_{1:T}$, where $x_1, x_2, \ldots, x_T$ are latent variables sharing the same dimensionality as the data $x_0 \sim q(x_0)$, and $p(x_T) = \mathcal{N}(x_T; 0, I)$. The main aim of DDPM training is to learn the reverse process $p_\theta(x_{0:T})$, which follows the Markovian property $p_\theta(x_{0:T}) := p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$, where $p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. The reverse process moves from a pure-noise image to a clear image and is therefore used as the generator at inference time. The forward process corrupts the original data $x_0$ to $x_T$ with Gaussian noise in order to train $\theta$ for the reverse purpose. This process is a fixed Markov chain $q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1})$, where $q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$ and $\beta_t$ is a fixed variance schedule set at the start of the process. Denoting $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, the distribution of $x_t$ given $x_0$ can be derived as:

$$q(x_t|x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big) \quad (1)$$

Reversing from $x_t$ given $x_0$, the distribution of $x_{t-1}$ is:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\big) \quad (2)$$

where the mean is $\tilde\mu_t(x_t, x_0) := \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t$ and the variance is $\tilde\beta_t := \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$. With the reparameterization trick, we can sample $x_{t-1}$ as:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,x_0 + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z \quad (3)$$

Similar to a Variational Auto-Encoder Kingma & Welling (2013), $\theta$ is optimized via the variational bound on the negative log-likelihood:

$$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\Big[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\Big] \quad (4)$$

We can re-write Eq. 4 as:

$$\mathbb{E}[-\log p_\theta(x_0)] \le \mathbb{E}_q\Big[D_{KL}\big(q(x_T|x_0)\,\|\,p(x_T)\big) + \sum_{t>1} D_{KL}\big(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big) - \log p_\theta(x_0|x_1)\Big]$$

In the implementation, $\theta$ denotes the parameters of the noise predictor $\epsilon_\theta(x_t, t)$. A noise predictor well trained with Eq. 4 can then be used in the sampling equation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z \quad (5)$$
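To make Eq. 5 concrete, the following is a minimal PyTorch-style sketch of one vanilla DDPM update. The noise predictor `eps_model`, the linear beta schedule, and the choice $\sigma_t = \sqrt{\beta_t}$ are illustrative assumptions rather than the paper's released code.

```python
import torch

def ddpm_step(x_t, t, eps_model, alphas, alphas_bar, sigmas):
    """One vanilla DDPM update (Eq. 5): denoise x_t one step toward x_{t-1}."""
    eps = eps_model(x_t, t)                                        # predicted noise eps_theta(x_t, t)
    a_t, abar_t = alphas[t], alphas_bar[t]
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(a_t)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)  # no noise added at the final step
    return mean + sigmas[t] * z

# Assumed linear schedule with T = 1000 and sigma_t = sqrt(beta_t), as in the simple DDPM variant.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)
sigmas = torch.sqrt(betas)
```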
4 METHODOLOGY

In this section, we first reformulate the sampling process to analyze the coherence issue. From Eq. 5, we re-write the sampling process as:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\Big(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\Big) + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z \quad (6)$$

The complete derivation of Eq. 6 can be found in Eq. 24 in the Appendix. Denote $\hat{x}_0^t$ as the prediction of $x_0$ at timestep $t$. From Eq. 1, we have $\hat{x}_0^t = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$ as the prediction of $x_0$ at sampling step $t$. This results in a new form of the sampling equation:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z \quad (7)$$

Eq. 7 samples from the distribution $q(x_{t-1}|x_t, x_0)$ with mean $\tilde\mu_t(x_t, x_0) = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t$ and $\sigma_t = \sqrt{\tilde\beta_t}$, which matches Eq. 2 and Eq. 3 used in DDPM training. The training reverse phase in Eq. 3 makes the implicit assumption that $x_0 \sim q(x_0)$, so that the information passed from timestep $t$ to timestep $t-1$ comes from a data distribution $q(x_0)$ that is consistent throughout all timesteps. However, this assumption is no longer valid during sampling. Assuming $\epsilon_\theta(x_t, t) \approx \epsilon$, we have:

$$\hat{x}_0^t \sim q(\hat{x}_0^t|x_t) = \mathcal{N}\Big(\hat{x}_0^t;\ \frac{x_t}{\sqrt{\bar\alpha_t}},\ \frac{1-\bar\alpha_t}{\bar\alpha_t}\,I\Big) \quad (8)$$

However, the distributions $q(\hat{x}_0^t|x_t)$ at two different timesteps $t$ are not the same, although both are used to sample $\hat{x}_0^t$, which is in turn used for sampling in Eq. 7. An illustration of the difference between these distributions can be found in Figure 6 in the Appendix. The assumption that $x_0 \sim q(x_0)$ at all timesteps no longer holds, so sampling $x_{t-1}$ from Eq. 7 cannot rely on it. We define the incoherence problem as below:

Definition 4.1. Incoherence is the mismatch between the predicted $\hat{x}_0^t$ distributions at different timesteps $t$, and between the predicted $\hat{x}_0$ distributions and the real data distribution $q(x_0)$:

$$q(\hat{x}_0^{t_1}|x_{t_1}) \neq q(\hat{x}_0^{t_2}|x_{t_2}) \neq q(x_0) \quad \forall\, t_1, t_2 > 1,\ t_1 \neq t_2 \quad (9)$$

The incoherence in the sampling process leaves a gap through which inconsistent features, resulting from random noise, appear in the image at some stage of the process. For example, in the top row of Figure 2, we observe black bubbles at the head and tail of the dog at timestep 160. The consequence is that the generated samples contain many blurred details, inconsistent features, or unnecessary features.

4.1 REPRESENTATIVE GUIDANCE

From Definition 4.1, the gaps that admit inconsistent features result in poor-quality images. Thus, to solve the incoherence, we need to make the distribution of intermediate samples $q(\hat{x}_0^t|x_t)$ as close as possible to $q(x_0)$. However, $q(x_0)$ is intractable during the sampling process, which makes calculating any distance between these two distributions impossible. Instead of computing a direct distance between $q(\hat{x}_0^t|x_t)$ and $q(x_0)$, we inject feature information of $x_0$ into the sampling process in Eq. 7 to force the sampled $\hat{x}_0^t$ to mimic the features of $x_0$ at every timestep. First, we denote $f_\phi(x_0)$ as a feature extractor, parameterized by $\phi$, applied to $x_0$. Our design aims to force $\hat{x}_0^t$ to have features $f_\phi(\hat{x}_0^t)$ similar to those of $x_0$. We denote $d(f_\phi(x_0), f_\phi(\hat{x}_0^t))$ as the distance between the two feature vectors. Once again, since $q(x_0)$ is intractable, $f_\phi(x_0)$ is also intractable. To address this, instead of representing $f_\phi(x_0)$ at an instance-wise level, we encode the features of the entire dataset at a class-wise level using $g(x_0|c)$. Here, $g(x_0|c)$ is defined as an operation on the set $\{f_\phi(x_0') \mid x_0' \sim q(x_0|c)\}$, where $c \in C$ denotes the class. The feature distance is then transformed into $d(g(x_0|c), f_\phi(\hat{x}_0^t))$. Given class $c$ and data $x_0$, $g(x_0|c)$ yields the set of vectors representing the features of class $c$.
Specifically, $g(x_0|c) = \{r_1^c, r_2^c, \ldots, r_n^c\}$, where $n$ denotes the number of vectors required to represent class $c$. With $C$ classes, the entire dataset is encoded as $V = \{g(x_0|1), g(x_0|2), \ldots, g(x_0|C)\}$. This representation encoding enables the model to store a compact set of representative vectors $g(x_0|c)$ for each class $c$, rather than storing the representative vectors $f_\phi(x_0)$ of the entire dataset. The selection of $g$ and $f$ will be discussed in Section 4.2.

As a result, at each timestep, given class $c$, we refine the predicted $x_0$ through the equation below:

$$\hat{x}_0^t := \hat{x}_0^t - \gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big), \quad (10)$$

where $\gamma$ is the guidance scale. From Eq. 10 and Eq. 7, given class $c$, the new sampling process with coherence is given in Eq. 11:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z - \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big) \quad (11)$$

with $\hat{x}_0^t = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, c, t)}{\sqrt{\bar\alpha_t}}$. The guidance features derived from $x_0$ provide a consistent and reliable target for $\hat{x}_0^t$, avoiding the incoherence problem. We discuss the similarity between Eq. 11 and a Stochastic Gradient Descent process in Appendix D. The choice of the distance $d$ can vary; the rest of this section discusses the options for $d$.

Negative cosine similarity: At each timestep, we sample a vector $r_t^c \in g(x_0|c)$. The two vectors $f_\phi(\hat{x}_0^t)$ and $r_t^c$ can be matched via a negative cosine similarity loss:

$$\mathcal{L}_{cs}\big(f_\phi(\hat{x}_0^t), r_t^c\big) = -\frac{f_\phi(\hat{x}_0^t) \cdot r_t^c}{\|f_\phi(\hat{x}_0^t)\|\,\|r_t^c\|} \quad (12)$$

Contrastive loss: Apart from negative cosine similarity, the contrastive loss has been used in many works on representation learning He et al. (2020); Chen et al. (2020b;a). The contrastive loss in our work is closer to supervised contrastive learning than to instance contrastive learning: a positive pair is two vectors from the same class and a negative pair is two vectors from different classes. When sampling an image of class $c$, the contrastive matching loss is:

$$\mathcal{L}_{ct}\big(f_\phi(\hat{x}_0^t), V\big) = -\log\frac{\exp\big(f_\phi(\hat{x}_0^t) \cdot r_t^c / H\big)}{\sum_{i=1, i\neq c}^{C} \exp\big(f_\phi(\hat{x}_0^t) \cdot r_t^i / H\big)} \quad (13)$$

where $H$ is the softmax temperature. Substituting the matching loss of Eq. 12 or Eq. 13 into Eq. 11 as $\mathcal{L}$, we obtain the final guided sampling equation:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z - \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} \mathcal{L}\big(f_\phi(\hat{x}_0^t), V\big) \quad (14)$$
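The following is a minimal PyTorch-style sketch of one guided update following Eq. 10 and Eq. 11, under the assumption that the negative cosine similarity loss of Eq. 12 is used. The names `eps_model`, `feat_model`, `reps` (the pre-computed set V), and the schedule tensors are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def repg_step(x_t, t, c, eps_model, feat_model, reps, alphas, alphas_bar, sigmas, gamma):
    """One guided update (Eq. 10/11): predict x0, nudge its features toward a
    representative vector of class c, then form x_{t-1} as in Eq. 7."""
    a_t, abar_t = alphas[t], alphas_bar[t]
    abar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t, c)
    x0_hat = (x_t - torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(abar_t)  # prediction of x0 at step t

    # Gradient of the matching loss w.r.t. the predicted x0 (negative cosine similarity, Eq. 12).
    x0_hat = x0_hat.detach().requires_grad_(True)
    idx = torch.randint(len(reps[c]), (1,))
    r_c = reps[c][idx]                                                    # one representative vector r_t^c, shape (1, D)
    loss = -F.cosine_similarity(feat_model(x0_hat).flatten(1), r_c, dim=1).mean()
    grad = torch.autograd.grad(loss, x0_hat)[0]

    # Eq. 11: the posterior mean of Eq. 7, with the guidance correction applied to x0_hat (Eq. 10).
    coef_x0 = (1.0 - a_t) * torch.sqrt(abar_prev) / (1.0 - abar_t)
    coef_xt = (1.0 - abar_prev) * torch.sqrt(a_t) / (1.0 - abar_t)
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return coef_x0 * (x0_hat.detach() - gamma * grad) + coef_xt * x_t + sigmas[t] * z
```

Swapping in the contrastive loss of Eq. 13 only changes how `loss` is computed; the rest of the update is identical.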
4.2 REPRESENTATIVE TARGETS

In Section 4.1, we discussed a coherent guidance method given representative information for a class $c$. This section discusses the choice of the mapping function $f_\phi$ and of the representative information for each class. The most straightforward option is to use naive classification for the guidance, where a network such as ResNet He et al. (2016) or a noise-aware classifier Dhariwal & Nichol (2021) serves as the classifier. This is the case of classic classifier guidance: the representative vector $g(x_0|c) \in \{0, 1\}^C$ reduces to a one-hot vector, with $C$ the number of classes. However, using classification as the representative information has several shortcomings. Firstly, the guidance reveals very little detail about the generated images. Since the classifier only processes discriminative features, many details that are less discriminative for a class are missed when the classifier gradient is used to construct the image Dinh et al. (2023b). Secondly, the motivation for using a classifier to construct images in diffusion models has become weaker than that for classifier-free guidance. Since a conditional diffusion model already contains class-conditioned information, the case for using additional classification information to improve a conditional diffusion model is not compelling, and the research community often opts for classifier-free guidance instead Rombach et al. (2022); Peebles & Xie (2023). Thirdly, classifier guidance is often associated with the very expensive training cost of noise-aware classifiers.

Self-supervised models are known to generalize well, to be robust to augmentation and noise, and to separate image samples in representation space according to their features He et al. (2020); Chen & He (2021); Jing et al. (2021). Thus, we choose a self-supervised model as our guidance model. The self-supervised models are pre-trained, and we consider the sampling process a downstream task of the model. Given a real dataset $x_0$ and a pre-trained self-supervised model $f_\phi$, let $x_i^c \in x_0$ be the $i$-th instance of class $c$ in the dataset, with $r_i^c = f_\phi(x_i^c)$. The centre of each class in the representation space is $\bar{r}^c = \mathbb{E}[r_i^c]$. We assume that the instances whose representations are closest to this mean carry the most important features of the class. We represent the whole class $c$ via the representative information $g(x_0|c)$ as below:

$$g(x_0|c) = \{r_k^c\}^K,\ k \in S_c \ \Big|\ \sum_{k \in S_c} \|r_k^c - \bar{r}^c\| \rightarrow \min \quad (15)$$

where $S_c$ is the index list of the $K$ representative vectors selected to be closest to the class mean representative vector $\bar{r}^c$. We discuss in Section 5 the value of $K$ and alternative schemes for selecting the representative vectors $g(x_0|c)$ besides this closest scheme. Given Eq. 15, we obtain $V = \{g(x_0|1), g(x_0|2), \ldots, g(x_0|C)\}$ as a set of representative class vectors for the whole dataset, enabling diffusion sampling with coherence using Eq. 14. Before the sampling process starts, $V$ is calculated in advance and stored alongside the network parameters used for sampling, as sketched below.
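A minimal sketch of this pre-computation of V following Eq. 15 is given below. The frozen `encoder`, the labelled dataloader, and `num_classes` are illustrative assumptions rather than the paper's released code.

```python
import torch

@torch.no_grad()
def build_representative_vectors(encoder, loader, num_classes, K=5):
    """Collect f_phi(x0) per class, then keep the K vectors closest to the class mean (Eq. 15)."""
    feats = {c: [] for c in range(num_classes)}
    for images, labels in loader:                       # (B, 3, H, W), (B,)
        f = encoder(images)                             # (B, D) representative vectors
        for vec, c in zip(f, labels.tolist()):
            feats[c].append(vec)

    V = {}
    for c, vecs in feats.items():
        vecs = torch.stack(vecs)                        # (N_c, D)
        mean = vecs.mean(dim=0, keepdim=True)           # class centre
        dist = (vecs - mean).norm(dim=1)                # distance of each vector to the centre
        V[c] = vecs[dist.argsort()[:K]]                 # K closest vectors represent class c
    return V                                            # computed once, reused at every sampling step
```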
5 EXPERIMENTAL RESULTS

Experiments are conducted on the ImageNet dataset Deng et al. (2009) at two resolutions, 64x64 and 256x256, with 50,000 generated samples. We first verify our claims that the proposed RepG improves details and fixes faulty information in the images, qualitatively in Section 5.1 and quantitatively in Section 5.2. We then compare quantitatively with other state-of-the-art methods: BigGAN Brock et al. (2018), ADM Dhariwal & Nichol (2021), PxP Dinh et al. (2023a), ProG Dinh et al. (2023b), EDS Zheng et al. (2022), IDDPM Nichol & Dhariwal (2021), VQ-VAE-2 Razavi et al. (2019), and classifier-free guidance (CLSFree) Ho & Salimans (2022). Three baseline diffusion models are used to evaluate the improvement brought by the proposed Representative Guidance: ADM Dhariwal & Nichol (2021), IDDPM Nichol & Dhariwal (2021), and DiT Peebles & Xie (2023). ADM or IDDPM denotes the corresponding diffusion model without guidance, and ADM-G denotes ADM with classifier guidance. PxP, ProG, and EDS are advanced techniques for improving classifier guidance and appear after a + sign. ADM-CLSFree and DiT-CLSFree denote classifier-free guidance applied to ADM and DiT, respectively; ADM-CLSFree-G and DiT-CLSFree-G denote the combination of classifier-free and classifier guidance on ADM and DiT, respectively.

5.1 INCOHERENT FEATURES ALLEVIATION

As discussed in Section 4, incoherent features arise during the sampling process due to the incoherence of $\hat{x}_0^t$ across timesteps. This section shows that RepG successfully alleviates the inconsistent features in the generated images in the three categories of Figure 3: RepG improves the diffusion sampling process by fixing faulty features, removing unnecessary features, and upgrading details.

5.2 QUANTITATIVE IMPROVEMENT

This section compares the performance of the proposed RepG with other state-of-the-art baselines, as shown in Table 1. Firstly, RepG improves the performance of vanilla baselines such as ADM and IDDPM: beyond the qualitative improvements in Section 5.1, we observe a significant improvement in FID/sFID and Precision when applying RepG to ADM or IDDPM. Secondly, given the same ADM diffusion model, ADM + RepG achieves a significantly better Recall value than other guidance methods such as ADM-G, ADM-CLSFree, and ADM-G (+PxP, +ProG, +EDS+ProG).

Table 1: Comparison with state-of-the-art generative baselines on ImageNet 64x64 and ImageNet 256x256. Scores are either evaluated from images generated by the published repositories, taken directly from the papers when source code or generated samples are unavailable, or reproduced from the published source code. The proposed RepG achieves better results than the other state-of-the-art methods.

ImageNet 64x64:

| Model | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| BigGAN | 4.06 | 3.96 | 0.79 | 0.48 |
| IDDPM | 2.90 | 3.78 | 0.73 | 0.62 |
| IDDPM + RepG | 2.53 | 3.44 | 0.75 | 0.60 |
| ADM | 2.07 | 4.29 | 0.73 | 0.63 |
| ADM + RepG | 1.69 | 3.42 | 0.75 | 0.62 |
| ADM-G | 2.47 | 4.88 | 0.80 | 0.57 |
| ADM-G + PxP | 1.84 | 3.97 | 0.76 | 0.60 |
| ADM-G + ProG | 1.87 | 4.33 | 0.77 | 0.60 |
| ADM-G + EDS + ProG | 1.77 | 4.25 | 0.77 | 0.61 |
| ADM-CLSFree | 1.89 | 4.45 | 0.77 | 0.60 |
| ADM-CLSFree + ProG | 1.91 | 4.51 | 0.76 | 0.60 |
| ADM-CLSFree + RepG | 1.67 | 3.44 | 0.78 | 0.61 |

ImageNet 256x256:

| Model | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| BigGAN | 7.03 | 7.29 | 0.87 | 0.27 |
| DCTrans | 36.51 | 8.24 | 0.36 | 0.67 |
| VQ-VAE-2 | 31.11 | 17.38 | 0.36 | 0.57 |
| IDDPM | 12.26 | 5.42 | 0.70 | 0.62 |
| ADM | 10.94 | 6.02 | 0.69 | 0.63 |
| ADM + RepG | 7.83 | 5.79 | 0.72 | 0.61 |
| ADM-G | 4.58 | 5.23 | 0.81 | 0.52 |
| ADM-G + EDS | 3.96 | 5.00 | 0.82 | 0.52 |
| ADM-G + PxP | 4.00 | 5.19 | 0.81 | 0.53 |
| ADM-G + ProG | 4.53 | 5.08 | 0.85 | 0.49 |
| ADM-G + ProG + EDS | 3.84 | 5.00 | 0.83 | 0.51 |
| ADM-CLSFree | 3.76 | 4.45 | 0.77 | 0.53 |
| ADM-CLSFree-G + ProG | 3.81 | 4.46 | 0.77 | 0.53 |
| ADM-CLSFree + RepG | 3.34 | 4.60 | 0.85 | 0.52 |
| DiT-CLSFree | 2.27 | 4.80 | 0.82 | 0.58 |
| DiT-CLSFree-G + ProG | 2.25 | 4.56 | 0.82 | 0.58 |
| DiT-CLSFree + RepG | 2.17 | 4.59 | 0.80 | 0.60 |

Table 2: Comparison of different self-supervised models used for our representative guidance (ImageNet 64x64).

| Self-sup Model | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| W/o Guidance | 2.07 | 4.29 | 0.73 | 0.63 |
| MoCo-v2 | 1.69 | 3.42 | 0.75 | 0.62 |
| SimSiam | 1.88 | 3.80 | 0.74 | 0.62 |
| MoCo-v3 | 1.81 | 3.93 | 0.76 | 0.62 |

Table 3: Different K values for our representative guidance, with K = {1, 5, 10, 15} (ImageNet 64x64).

| Model | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| ADM | 2.07 | 4.29 | 0.73 | 0.63 |
| ADM + RepG (K=1) | 1.77 | 3.44 | 0.75 | 0.60 |
| ADM + RepG (K=5) | 1.69 | 3.42 | 0.75 | 0.62 |
| ADM + RepG (K=10) | 1.73 | 3.43 | 0.75 | 0.62 |
| ADM + RepG (K=15) | 1.82 | 3.45 | 0.75 | 0.62 |

Table 4: Comparison of two different representative-vector selection schemes (ImageNet 64x64).

| Model | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| ADM | 2.07 | 4.29 | 0.73 | 0.63 |
| ADM + RepG | 1.69 | 3.42 | 0.75 | 0.62 |
| ADM + RepG Rand | 2.04 | 4.17 | 0.74 | 0.62 |

Table 5: The effect of the two matching losses used in sampling, as discussed in Section 4, on the performance of the diffusion sampling process.
The result indicates that both losses outperform sampling without guidance, with the contrastive loss performing slightly better than the negative cosine similarity loss (ImageNet 64x64).

| Loss | FID | sFID | Prec | Rec |
|---|---|---|---|---|
| W/o Guidance | 2.07 | 4.29 | 0.73 | 0.63 |
| Contrastive | 1.69 | 3.42 | 0.75 | 0.62 |
| Cosine Similarity | 1.75 | 3.57 | 0.75 | 0.60 |

The better Recall of ADM + RepG indicates that RepG preserves diversity better than the other guidance methods (see the Recall column of Table 1). Finally, the combination of RepG and CLSFree guidance outperforms other state-of-the-art guidance methods such as PxP Dinh et al. (2023a), ProG Dinh et al. (2023b), EDS Zheng et al. (2022), and CLSFree Ho & Salimans (2022).

Note: On ImageNet 256x256, RepG improves the ADM baseline significantly but lags behind the other guidance methods. This is expected, as RepG only improves details and keeps diversity, while the other methods sacrifice diversity to achieve better quality. On ImageNet 64x64, RepG outperforms all other guidance methods: ImageNet 64x64 contains less information than its 256x256 counterpart and focuses on foreground objects, so improving object features is enough to beat the other methods.

Classifier guidance fails to improve classifier-free guidance significantly (ADM-CLSFree-G + ProG and DiT-CLSFree-G + ProG in Table 1). This is due to the overlapping nature of the two methods: both trade diversity for quality, which offers little additional improvement when combined. However, RepG successfully improves classifier-free guidance since RepG performs a different task: tuning the details of the images.

5.3 ABLATION STUDY

Sections 5.1 and 5.2 have shown qualitative and quantitative improvements over previous state-of-the-art baselines. In the ablation study, we discuss different choices for our method, such as the choice of self-supervised models, the performance of the proposed method at different guidance scales, the number of representative targets utilized, and the comparison between the contrastive matching loss (Eq. 13) and the cosine similarity matching loss (Eq. 12).

Figure 3: RepG enhances diffusion sampling in three key ways: (1) correcting faulty features to improve realism (row 1), (2) refining object and background details, such as sharpening the dog's hair and enhancing grass, trees, and fences (row 2), and (3) removing unnecessary elements like a human figure or an incorrect body structure (row 3). Panels compare samples without RepG and with RepG. (ImageNet 256x256)

5.3.1 DIFFERENT SELF-SUPERVISED MODELS

All RepG results in Table 1 use MoCo-v2 Chen et al. (2020b) as the guidance backbone. This section compares different choices of pre-trained self-supervised models in Table 2. In detail, three popular pre-trained self-supervised models are utilized: MoCo-v2 Chen et al. (2020b), SimSiam Chen & He (2021), and MoCo-v3 Chen et al. The results show that MoCo-v2 performs best among the three. The advantage of MoCo-v2 may stem from its representations carrying contrastive information, unlike SimSiam, and hence containing more information about the data than merely clustering it. MoCo-v3 delivers better FID than SimSiam but is still not as good as MoCo-v2, while offering better Precision. A frozen backbone of this kind can be dropped in as the feature extractor $f_\phi$, as sketched below.
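A minimal sketch of such a drop-in feature extractor is shown here. The torchvision ResNet-50 stands in for the self-supervised backbone; in practice the weights would come from a MoCo-v2, SimSiam, or MoCo-v3 checkpoint loaded with `load_state_dict`, which is an assumption about setup rather than the released code.

```python
import torch
import torchvision

def make_feature_extractor():
    """A frozen ResNet-50 trunk as f_phi; its weights are assumed to come from a
    self-supervised checkpoint (e.g., MoCo-v2) loaded via load_state_dict."""
    backbone = torchvision.models.resnet50(weights=None)
    backbone.fc = torch.nn.Identity()        # drop the classification head, keep the 2048-d features
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)              # guidance only needs gradients w.r.t. the image, not the weights
    return backbone
```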
5.3.2 GUIDANCE SCALE EFFECTS

Figure 4: Recall trend of classifier guidance versus representative guidance (RepG) as the guidance scale increases from 0 to 10. RepG shows a much more stable trend in diversity than classifier guidance.

Similar to classifier guidance Dhariwal & Nichol (2021); Zheng et al. (2022); Dinh et al. (2023a;b), our RepG can also be controlled by the guidance scale $\gamma$ in Eq. 10. We compare the effect of the guidance scale in the range [0, 10], where $\gamma = 0.0$ corresponds to diffusion sampling without any guidance. Figure 4 shows the trend of FID and Recall as the guidance scale increases. The generation quality of RepG improves steadily without trading off diversity, in contrast to classifier guidance. Improvement without a diversity trade-off is expected, since our method mostly keeps the content of the generated images while upgrading details or fixing faulty features. The effects of increasing the guidance scale can be observed in Figure 5.

5.3.3 REPRESENTATIVE TARGETS

This section shows the effect of selecting the representative targets for each class.

Figure 5: Ablation study on the RepG guidance scale (0, 50, 80). Unlike classifier guidance, where increasing the guidance scale shifts the image toward a simpler region (as shown in Figure 7), increasing the RepG guidance scale enhances image details.

The K values: The choice of the value K in Eq. 15 affects performance. The experiment is conducted on ImageNet 64x64, as shown in Table 3. With K=1, there is only one representative vector per class, which reduces the quality and diversity of the generated samples. However, more than five representative vectors per class confuses the sampling process and downgrades performance. Understandably, including more representative vectors brings more features to be excluded by the contrastive loss; the excluded features might include features shared between classes, which become common as more information is included.

Selection strategy: In the previous experiments, the representative vectors are those closest to the mean of all vectors belonging to a class. We compare this selection scheme with a random selection scheme in Table 4, where RepG Rand denotes random selection. The results show that our proposed selection of representative vectors is essential and verify our hypothesis that the vectors close to the class mean bear the crucial features of that class.

5.3.4 CONTRASTIVE MATCHING VS COSINE SIMILARITY MATCHING

In Section 4.1, we discussed the two losses: the contrastive loss and the cosine similarity loss. Table 5 compares the two losses and shows that both improve performance significantly compared to the ADM baseline of Dhariwal & Nichol (2021).

6 CONCLUSION

In this work, we formulate the problem of incoherence in the diffusion sampling process, defined as the mismatch between the predicted image distributions at two different timesteps. We then propose a guidance method named Representative Guidance (RepG). RepG relies on representative information for each class and on pre-trained self-supervised models to guide the sampling process.
The representative information offers a number of advantages over the one-hot representation used in classifier guidance, such as richer information and a coherent target that avoids the incoherence problem.

REFERENCES

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843-852, 2023.

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020a.

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9620-9629.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, 2021.

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Anh-Dung Dinh, Daochang Liu, and Chang Xu. PixelAsParam: a gradient view on diffusion sampling with guidance. In International Conference on Machine Learning, pp. 8120-8137. PMLR, 2023a.

Anh-Dung Dinh, Daochang Liu, and Chang Xu. Rethinking conditional diffusion sampling with progressive guidance. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271-21284, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729-9738, 2020.

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M Asano, Cees GM Snoek, and Bjorn Ommer. Guided diffusion from self-supervised diffusion features. arXiv preprint arXiv:2312.08825, 2023.
Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.

Minguk Kang, Woohyeon Shim, Minsu Cho, and Jaesik Park. Rebooting ACGAN: Auxiliary classifier GANs with stable training. Advances in Neural Information Processing Systems, 34:23505-23518, 2021.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Max WY Lam, Jun Wang, Dan Su, and Dong Yu. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. arXiv preprint arXiv:2203.13508, 2022.

Mingxiao Li, Tingyu Qu, Wei Sun, and Marie-Francine Moens. Alleviating exposure bias in diffusion models through sampling with shifted time steps. arXiv preprint arXiv:2305.15583, 2023.

Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! Image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 289-299, 2023.

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021.

Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Input perturbation reduces exposure bias in diffusion models. arXiv preprint arXiv:2301.11706, 2023.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, pp. 2642-2651. PMLR, 2017.

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195-4205, 2023.

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1-10, 2022.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.

Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, and Feng Zhao. Debias the training of diffusion models. arXiv preprint arXiv:2310.08442, 2023.

Junyu Zhang, Daochang Liu, Shichao Zhang, and Chang Xu. Contrastive sampling chains in diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
Guangcong Zheng, Shengming Li, Hui Wang, Taiping Yao, Yang Chen, Shouhong Ding, and Xi Li. Entropy-driven sampling and training scheme for conditional diffusion generation. In European Conference on Computer Vision, pp. 754-769. Springer, 2022.

A SAMPLING ALGORITHMS

Algorithm 1: DDPM denoising process with representative guidance.
Input: class label $c$, guidance scale $\gamma$, $V = \{g(x_0|1), g(x_0|2), \ldots, g(x_0|C)\}$ according to Eq. 15.
$x_T \sim \mathcal{N}(0, I)$; pick class $c$ and $g(x_0|c) \in V$.
For $t = T, \ldots, 1$:
  $\hat{x}_0^t \leftarrow \big(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t, c)\big)/\sqrt{\bar\alpha_t}$
  $g \leftarrow \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t}\mathcal{L}\big(f_\phi(\hat{x}_0^t), V\big)$  (according to Eq. 14)
  $x_{t-1} \leftarrow \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t, c)\big) + \sigma_t z - g$
End for

Like DDPMs, our sampling only updates $\hat{x}_0^t$ at every timestep $t$. The set of representative vectors $V$ is obtained in advance and stored alongside the model parameters used for sampling. The mechanism is the same for latent diffusion, except that we first decode the latent vector to $\hat{x}_0^t$; after that, the process follows Algorithm 1.

B EXPERIMENTAL DETAILS

All experiments in this paper are conducted on A100 40GB GPUs. We have three hyperparameters: the number of representative vectors $K$ in Eq. 15, the temperature $H$ in Eq. 13, and the guidance scale $\gamma$ in Eq. 14.

Table 6: All hyperparameters used to produce the results.

| Model | Dataset | K | H | γ |
|---|---|---|---|---|
| IDDPM + RepG | ImageNet 64x64 | 5 | 1 | 10.0 |
| ADM + RepG | ImageNet 64x64 | 5 | 1 | 10.0 |
| ADM-CLSFree + RepG | ImageNet 64x64 | 5 | 1 | 8.0 |
| ADM + RepG | ImageNet 256x256 | 10 | 2 | 20.0 |
| ADM-CLSFree + RepG | ImageNet 256x256 | 10 | 2 | 20.0 |
| DiT-CLSFree + RepG | ImageNet 256x256 | 10 | 2 | 15.0 |
| Table 2: W/o Guidance | ImageNet 64x64 | - | - | 0.0 |
| Table 2: MoCo-v2 / SimSiam / MoCo-v3 | ImageNet 64x64 | 5 | 1 | 10.0 |
| Table 3: ADM + RepG | ImageNet 64x64 | 1, 5, 10, 15 | 1 | 10.0 |
| Table 4: ADM + RepG / ADM + RepG Rand | ImageNet 64x64 | 5 | 1 | 10.0 |
| Figures 1, 2, 3, 5, 6, 7: ADM + RepG | ImageNet 256x256 | 10 | 2 | 0.0, 20.0, 50.0, 80.0 |
| Figure 4: ADM + RepG | ImageNet 64x64 | 5 | 1 | 2.0, 4.0, 6.0, 8.0, 10.0 |

C RUNNING TIME OF REPG COMPARED TO CLASSIFIER GUIDANCE

RepG utilizes a much lighter model than the noise-aware classifier used in classifier guidance Dhariwal & Nichol (2021). As a result, the gradient computation with this model is much cheaper than with noise-aware classifiers. The running time comparison is given in Tables 7 and 8.

Table 7: GPU hours on one GPU needed to generate 50,000 images at 256x256 resolution (diffusion model: ADM, dataset: ImageNet 256x256).

| Model | Computational Cost (GPU hours) |
|---|---|
| No guidance | 171.22 |
| Representative Guidance | 182.36 |
| Classifier Guidance | 247.84 |
| Classifier-free Guidance | 352.89 |

Table 8: GPU hours on one GPU needed to generate 50,000 images at 64x64 resolution (diffusion model: ADM, dataset: ImageNet 64x64).

| Model | Computational Cost (GPU hours) |
|---|---|
| No guidance | 16.71 |
| Representative Guidance | 17.55 |
| Classifier Guidance | 31.52 |
| Classifier-free Guidance | 32.64 |

D FULL DERIVATION OF EQUATIONS

Similarity between Eq. 11 and Stochastic Gradient Descent: We start from Eq. 11:

$$x_{t-1} = \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z - \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big) \quad (16)$$

with $\hat{x}_0^t = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, c, t)}{\sqrt{\bar\alpha_t}}$. Similarly, we have $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0^{t-1} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_{t-1}, c, t-1)$.
Thus, Eq. 16 is equivalent to Eq. 17:

$$\begin{aligned}\hat{x}_0^{t-1} &= \frac{1-\alpha_t}{1-\bar\alpha_t}\,\hat{x}_0^t - \frac{\sqrt{1-\bar\alpha_{t-1}}}{\sqrt{\bar\alpha_{t-1}}}\,\epsilon_\theta(x_{t-1}, c, t-1) + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{(1-\bar\alpha_t)\sqrt{\bar\alpha_{t-1}}}\,x_t + \frac{\sigma_t z}{\sqrt{\bar\alpha_{t-1}}} - \frac{1-\alpha_t}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big) \\ &= \hat{x}_0^t - \Big(\frac{\alpha_t - \bar\alpha_t}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{\sqrt{1-\bar\alpha_{t-1}}}{\sqrt{\bar\alpha_{t-1}}}\,\epsilon_\theta(x_{t-1}, c, t-1) - \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{(1-\bar\alpha_t)\sqrt{\bar\alpha_{t-1}}}\,x_t - \frac{\sigma_t z}{\sqrt{\bar\alpha_{t-1}}}\Big) - \frac{1-\alpha_t}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big) \quad (17)\end{aligned}$$

where $x_{t-1}$ is obtained from Eq. 16. Eq. 17 has a form very close to a Stochastic Gradient Descent update with $\hat{x}_0^t$ as the parameters and two gradients, $\nabla_1 = \frac{\alpha_t - \bar\alpha_t}{1-\bar\alpha_t}\,\hat{x}_0^t + \frac{\sqrt{1-\bar\alpha_{t-1}}}{\sqrt{\bar\alpha_{t-1}}}\,\epsilon_\theta(x_{t-1}, c, t-1) - \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{(1-\bar\alpha_t)\sqrt{\bar\alpha_{t-1}}}\,x_t - \frac{\sigma_t z}{\sqrt{\bar\alpha_{t-1}}}$ and $\nabla_2 = \frac{1-\alpha_t}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big)$. We now show that Eq. 17 has a consistent objective function. From Eq. 1 and two timesteps $t_1 < t_2$, we have:

$$x_{t_1} = \sqrt{\bar\alpha_{t_1}}\,x_0 + \sqrt{1-\bar\alpha_{t_1}}\,\epsilon_1 \quad (18)$$
$$x_{t_2} = \sqrt{\bar\alpha_{t_2}}\,x_0 + \sqrt{1-\bar\alpha_{t_2}}\,\epsilon_2 \quad (19)$$

From $x_{t_1}$ and $x_{t_2}$, the predictions of $x_0$ at $t_1$ and $t_2$ are $\hat{x}_0^{(t_1)}$ and $\hat{x}_0^{(t_2)}$:

$$\hat{x}_0^{(t_1)} = \frac{x_{t_1} - \sqrt{1-\bar\alpha_{t_1}}\,\epsilon_\theta(x_{t_1}, t_1)}{\sqrt{\bar\alpha_{t_1}}} \quad (20)$$
$$\hat{x}_0^{(t_2)} = \frac{x_{t_2} - \sqrt{1-\bar\alpha_{t_2}}\,\epsilon_\theta(x_{t_2}, t_2)}{\sqrt{\bar\alpha_{t_2}}} \quad (21)$$

Substituting Eq. 18 and 19 into Eq. 20 and 21, we have:

$$\hat{x}_0^{(t_1)} = x_0 + \frac{\sqrt{1-\bar\alpha_{t_1}}\,\big(\epsilon_1 - \epsilon_\theta(x_{t_1}, t_1)\big)}{\sqrt{\bar\alpha_{t_1}}} \quad (22)$$
$$\hat{x}_0^{(t_2)} = x_0 + \frac{\sqrt{1-\bar\alpha_{t_2}}\,\big(\epsilon_2 - \epsilon_\theta(x_{t_2}, t_2)\big)}{\sqrt{\bar\alpha_{t_2}}} \quad (23)$$

From Eq. 22 and 23, at any timestep $t$ we have $\hat{x}_0^t - x_0 = \frac{\sqrt{1-\bar\alpha_t}\,(\epsilon - \epsilon_\theta(x_t, t))}{\sqrt{\bar\alpha_t}}$, which means $\|\hat{x}_0^t - x_0\| = \frac{\sqrt{1-\bar\alpha_t}\,\|\epsilon - \epsilon_\theta(x_t, t)\|}{\sqrt{\bar\alpha_t}}$. Assuming $\epsilon_\theta$ is trained to convergence, we assume $\|\epsilon_\theta(x_{t_1}, t_1) - \epsilon_1\| \le \|\epsilon_\theta(x_{t_2}, t_2) - \epsilon_2\|$, because when the image is clearer we also expect the error to be smaller; the extreme case is $\|\epsilon_1 - \epsilon_\theta(x_{t_1}, t_1)\| \approx \|\epsilon_2 - \epsilon_\theta(x_{t_2}, t_2)\|$. As a result, $\|\hat{x}_0^{(t_1)} - x_0\|$ and $\|\hat{x}_0^{(t_2)} - x_0\|$ are proportional to $\frac{\sqrt{1-\bar\alpha_{t_1}}}{\sqrt{\bar\alpha_{t_1}}}$ and $\frac{\sqrt{1-\bar\alpha_{t_2}}}{\sqrt{\bar\alpha_{t_2}}}$, respectively. Since $t_1 < t_2$ implies $\bar\alpha_{t_1} > \bar\alpha_{t_2}$, we have $\|\hat{x}_0^{(t_1)} - x_0\| < \|\hat{x}_0^{(t_2)} - x_0\|$ for all $t_1 < t_2$. This means that, from $T$ to $0$, the sampling process updates $\hat{x}_0^t$ so that $\|\hat{x}_0^t - x_0\|$ decreases; this is the role of the first gradient $\nabla_1$ of Eq. 17. The second gradient $\nabla_2 = \frac{1-\alpha_t}{1-\bar\alpha_t}\,\gamma\,\nabla_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big)$ minimizes the distance $d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big)$. Thus, we can conclude that the sampling process of Eq. 11 is a Stochastic Gradient Descent process optimizing two objectives: the first objective is $\min_{\hat{x}_0^t} \|\hat{x}_0^t - x_0\|$ and the second objective is $\min_{\hat{x}_0^t} d\big(g(x_0|c), f_\phi(\hat{x}_0^t)\big)$.

Full derivation of Eq. 6: Eq. 6 can be fully derived as below:

$$\begin{aligned} x_{t-1} &= \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z \\ &= \Big(\frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{(1-\bar\alpha_t)\sqrt{\bar\alpha_t}}\,x_t + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t\Big) - \frac{1-\alpha_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t) + \sigma_t z \\ &= \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\Big(\frac{x_t}{\sqrt{\bar\alpha_t}} - \frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big) + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z \\ &= \frac{(1-\alpha_t)\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_t}\Big(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\Big) + \frac{(1-\bar\alpha_{t-1})\sqrt{\alpha_t}}{1-\bar\alpha_t}\,x_t + \sigma_t z \quad (24)\end{aligned}$$

E $\hat{x}_0$ DISTRIBUTION

Figure 6 shows the difference in the distributions of $\hat{x}_0^t$ at different timesteps.

Figure 6: Visualization of $\hat{x}_0^t$ at timesteps 250, 200, 150, and 50. $\hat{x}_0^t$ has different distributions as $t$ varies: the earlier timesteps carry less information, while the later stages show clearer views of the images.

F CLASSIFIER GUIDANCE DIVERSITY SUPPRESSION

Similar to Dinh et al. (2023b), we reproduce the diversity suppression of classifier guidance in Figures 8, 9 and 10.

Figure 7: RepG edits details in the image, while classifier guidance (CLSG) generates a different image with a strong discriminative feature as the guidance scale increases. However, the over-exploitation of discriminative features sometimes results in outputs lacking robust features.

G MORE QUALITATIVE RESULTS COMPARISON FOR REPG

Figures 11, 12, 13 and 14 show more examples of how RepG helps to fix details in the generated images.

H MORE SAMPLES WITH REPG

Figure 15 shows several samples generated by DiT combined with RepG.
Figure 8: Classifier guidance repeats the same style of features across all generated images. This is due to the over-exploitation of discriminative features (front-face features), reducing the diversity of the diffusion model.

Figure 9: Classifier guidance repeats the same style of features across all generated images. This is due to the over-exploitation of discriminative features (front-face features), reducing the diversity of the diffusion model.

Figure 10: Classifier guidance repeats the same style of features across all generated images. This is due to the over-exploitation of discriminative features (lie-in-bed features), reducing the diversity of the diffusion model.

Figure 11: ImageNet 256x256 / class: tiger shark. The images on the left, before the arrow, are erroneous outputs generated by ADM, while the images on the right, after the arrow, depict the corrections made by RepG. The examples show that RepG can improve details and fix erroneous features.

Figure 12: ImageNet 256x256 / class: green lizard. The images on the left, before the arrow, are erroneous outputs generated by ADM, while the images on the right, after the arrow, depict the corrections made by RepG. The examples show that RepG can improve details and background, remove unnecessary features, and fix erroneous features.

Figure 13: ImageNet 256x256 / class: crane. The images on the left, before the arrow, are erroneous outputs generated by ADM, while the images on the right, after the arrow, depict the corrections made by RepG. The examples show that RepG can upgrade details and modify faulty features.

Figure 14: ImageNet 256x256 / class: English springer. The images on the left, before the arrow, are erroneous outputs generated by ADM, while the images on the right, after the arrow, depict the corrections made by RepG. The examples show that RepG can improve details and background, remove unnecessary features, and fix erroneous features.

Figure 15: Samples generated by DiT with RepG for several classes.