# DRAGONDIFFUSION: ENABLING DRAG-STYLE MANIPULATION ON DIFFUSION MODELS

Published as a conference paper at ICLR 2024

Chong Mou1,3, Xintao Wang2, Jiechong Song1, Ying Shan2, Jian Zhang1,3
1School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University
2ARC Lab, Tencent PCG
3Peking University Shenzhen Graduate School-Rabbitpre AIGC Joint Research Laboratory
{eechongm, xintao.alpha}@gmail.com, {songjiechong, zhangjian.sz}@pku.edu.cn

Corresponding author. This work was supported by the National Natural Science Foundation of China under Grant 62372016.

Figure 1: The image editing tasks that our DragonDiffusion can achieve without training (object pasting, object moving & resizing, appearance replacing, and content dragging).

## ABSTRACT

Despite the ability of text-to-image (T2I) diffusion models to generate high-quality images, transferring this ability to accurate image editing remains a challenge. In this paper, we propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models. Specifically, we treat image editing as the change of feature correspondence in a pre-trained diffusion model. By leveraging feature correspondence, we develop energy functions that align with the editing target, transforming image editing operations into gradient guidance. Based on this guidance approach, we also construct multi-scale guidance that considers both semantic and geometric alignment. Furthermore, we incorporate a visual cross-attention strategy based on a memory bank design to ensure consistency between the edited result and the original image. Benefiting from these efficient designs, all content editing and consistency operations come from the feature correspondence without extra model fine-tuning. Extensive experiments demonstrate that our method has promising performance on various image editing tasks, including editing within a single image (e.g., object moving, resizing, and content dragging) or across images (e.g., appearance replacing and object pasting). Code is available at https://github.com/MC-E/DragonDiffusion.

## 1 INTRODUCTION

Thanks to large-scale training data and huge computing power, generative models have developed rapidly, especially text-to-image (T2I) diffusion models Saharia et al. (2022); Rombach et al. (2022); Nichol et al. (2022); Ramesh et al. (2022), which aim to generate images conditioned on a given text prompt. However, this generative capability is usually diverse, and it is challenging to design suitable prompts to generate images consistent with what the user has in mind Mou et al. (2023); Zhang et al. (2023), let alone perform fine-grained image editing based on the text condition.

In the image editing community, previous methods are usually designed based on GANs Abdal et al. (2019; 2020); Alaluf et al. (2022) due to their compact and editable latent space, e.g., the W space in StyleGAN Karras et al. (2019). Recently, DragGAN Pan et al. (2023) proposed a point-to-point dragging scheme, which can achieve refined content dragging. However, it is limited by the capacity and generalization of GANs. Compared to GANs, diffusion models Ho et al. (2020) have higher stability and superior generation quality. Due to the lack of a concise and editable latent space, numerous diffusion-based image editing methods Hertz et al. (2022); Feng et al. (2022); Balaji et al.
(2022) are built on T2I diffusion models via the correspondence between text and image features. Recently, Self-Guidance Epstein et al. (2024) proposed a differentiable approach that employs cross-attention maps between text and image to locate objects and calculate their sizes within images; gradient guidance is then utilized to edit these properties. However, the correspondence between text and image features is weak and relies heavily on prompt design. Moreover, in complex or multi-object scenarios, text struggles to build accurate correspondence with a specific object.

In this paper, we aim to investigate whether the diffusion model can achieve drag-style image editing, which is a fine-grained and generalized editing ability not limited to point dragging. In the large-scale T2I diffusion model, besides the correspondence between text features and intermediate image features, there is also a strong correspondence across image features. This characteristic is studied in DIFT Tang et al. (2023), which demonstrates that this correspondence is high-level, enabling point-to-point correspondence of relevant image content. Therefore, we are intrigued by the possibility of utilizing this strong correspondence across image features to achieve image editing.

In this paper, we regard image editing as the change of feature correspondence and convert it into gradient guidance via energy functions Dhariwal & Nichol (2021) in score-based diffusion Song et al. (2020b). Additionally, the content consistency between editing results and original images is also ensured by feature correspondence through a visual cross-attention design. Here, we note that there is a concurrent work, DragDiffusion Shi et al. (2023), studying this issue. It uses LoRA Ryu (2023) to maintain consistency with the original image and optimizes the latent at a specific diffusion step to perform point dragging. Unlike DragDiffusion, our image editing is achieved by energy functions and a visual cross-attention design, without extra model fine-tuning or new blocks. In addition, we can complete various drag-style image editing tasks beyond point dragging, as shown in Fig. 1. In summary, the contributions of this paper are as follows:

- We achieve drag-style image editing via image feature correspondence in the pre-trained diffusion model. We also study the roles of the features in different layers and develop multi-scale guidance that considers both semantic and geometric correspondence.
- We design a memory bank, further utilizing the image feature correspondence to maintain the consistency between editing results and original images. In conjunction with gradient guidance, our method allows a direct transfer of the T2I generation ability of diffusion models to image editing tasks without the need for extra model fine-tuning or new blocks.
- Extensive experiments demonstrate that our method has promising performance in various image editing tasks, including editing within a single image (e.g., object moving, resizing, and content dragging) or across images (e.g., appearance replacing and object pasting).

## 2 RELATED WORK

### 2.1 DIFFUSION MODELS

Recently, diffusion models Ho et al. (2020) have achieved great success in image synthesis. They are designed based on thermodynamics Sohl-Dickstein et al. (2015); Song & Ermon (2019) and include a diffusion process and a reverse process. In the diffusion process, a natural image $x_0$ is converted to a Gaussian distribution $x_T$ by adding random Gaussian noise over $T$ iterations.
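For concreteness, under the standard DDPM formulation (a minimal illustration we add here; the cumulative noise schedule $\bar{\alpha}_t$ is not otherwise introduced in this paper), each noisy sample admits the closed form

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\,\mathbf{I}\right).$$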
The reverse process recovers $x_0$ from $x_T$ through several denoising steps. The diffusion model is therefore trained as a denoiser, conditioned on the current noisy image $x_t$ and the time step $t$:

$$\mathbb{E}_{x_0,\, t,\, \epsilon_t \sim \mathcal{N}(0,1)}\left[\left\|\epsilon_t - \epsilon_\theta(x_t, t)\right\|_2^2\right], \tag{1}$$

where $\epsilon_\theta$ is the denoiser. Recently, some text-conditioned diffusion models (e.g., GLIDE Nichol et al. (2022) and Stable Diffusion (SD) Rombach et al. (2022)) have been proposed. SD in particular, which transfers $x_t$ into the latent space $z_t$, significantly improves generation performance. From the continuous perspective Song et al. (2020b), diffusion models can be viewed as a score function (i.e., $\epsilon_\theta(x_t, t) \propto -\nabla_{x_t}\log q(x_t)$) that samples from the corresponding distribution Song & Ermon (2020) according to Langevin dynamics Sohl-Dickstein et al. (2015); Song & Ermon (2019).

### 2.2 ENERGY FUNCTION IN DIFFUSION MODEL

From the continuous perspective of score-based diffusion, an external condition $y$ can be incorporated through a conditional score function, i.e., $\nabla_{x_t}\log q(x_t|y)$, to sample from a more enriched distribution. The conditional score function can be further decomposed as:

$$\nabla_{x_t}\log q(x_t|y) = \nabla_{x_t}\log\big(q(y|x_t)\,q(x_t)\big) = \nabla_{x_t}\log q(x_t) + \nabla_{x_t}\log q(y|x_t), \tag{2}$$

where the first term is the unconditional denoiser, and the second term refers to the conditional gradient produced by an energy function $E(x_t; t, y) = q(x_t|y)$. $E$ can be selected based on the generation target, such as a classifier Dhariwal & Nichol (2021) that specifies the category of the generation result. Energy functions have been used in various controllable generation tasks, e.g., sketch-guided generation Voynov et al. (2023), mask-guided generation Singh et al. (2023), universal guidance Yu et al. (2023); Bansal et al. (2023), and image editing Epstein et al. (2024). These methods inspire us to transform editing operations into conditional gradients to achieve fine-grained image editing.

### 2.3 IMAGE EDITING

In image editing, numerous previous methods Abdal et al. (2019; 2020); Alaluf et al. (2022) invert images into the latent space of StyleGAN Karras et al. (2019) and then edit the image by manipulating latent vectors. Motivated by the success of diffusion models Ho et al. (2020), various diffusion-based image editing methods Avrahami et al. (2022); Hertz et al. (2022); Kawar et al. (2023); Meng et al. (2021); Brooks et al. (2023) have been proposed. Most of them use text as the editing control. For example, Kawar et al. (2023); Valevski et al. (2023); Kwon & Ye (2022) perform model fine-tuning on a single image and then generate the editing result from a target text. Prompt2Prompt Hertz et al. (2022) achieves specific object editing by exchanging text-image attention maps. SDEdit Meng et al. (2021) performs image editing by adding noise to the original image and then denoising under new text conditions. InstructPix2Pix Brooks et al. (2023) fine-tunes the diffusion model with text as the editing instruction. Recently, Self-Guidance Epstein et al. (2024) transforms image editing operations into gradients through the correspondence between text and image features. However, this correspondence between text and image is weak and cannot support fine-grained editing. Recently, DragGAN Pan et al. (2023) presented a point-to-point dragging scheme; nevertheless, its editing quality and generalization are limited by GANs. How to utilize the high-quality and diverse generation ability of diffusion models for fine-grained image editing is still an open challenge.
### 3.1 PRELIMINARY: HOW TO CONSTRUCT AN ENERGY FUNCTION IN DIFFUSION

Modeling an energy function $E(x_t; t, y)$ to produce the conditional gradient $\nabla_{x_t}\log q(y|x_t)$ in Eq. 2 remains an open question. $E$ measures the distance between $x_t$ and the condition $y$. Some methods Dhariwal & Nichol (2021); Voynov et al. (2023); Zhao et al. (2022) train a time-dependent distance measuring function, e.g., a classifier Dhariwal & Nichol (2021) that predicts the probability that $x_t$ belongs to category $y$. However, the training cost and annotation difficulty are intractable in our image editing task. Some tuning-free methods Yu et al. (2023); Bansal et al. (2023) propose using the clean image $x_{0|t}$ predicted at each time step $t$ to replace $x_t$ for distance measuring, i.e., $E(x_t; t, y) \approx D(x_{0|t}; t, y)$. Nevertheless, there is a bias between $x_{0|t}$ and $x_0$, and there is hardly a suitable $D$ for measuring distance in image editing tasks. Hence, the primary issue is whether we can circumvent the training requirement and construct an energy function that measures the distance between $x_t$ and the editing target. Recent work Tang et al. (2023) has shown that the feature correspondence in the diffusion UNet denoiser $\epsilon_\theta$ is high-level, enabling point-to-point correspondence measuring. Inspired by this characteristic, we propose reusing $\epsilon_\theta$ as a tuning-free energy function to transform image editing operations into the change of feature correspondence.

Figure 2: Overview of our DragonDiffusion, containing a memory bank and score-based gradient guidance on the pre-trained SD Rombach et al. (2022) without extra training or modules.

### 3.2 OVERVIEW

The editing objective of our DragonDiffusion involves two issues: changing the content to be edited and preserving the unedited content. For example, if a user wants to move a cup in an image, the generated result should only change the position of the cup, while the appearance of the cup and all other unedited content should not change. An overview of our method is presented in Fig. 2. It is built on the pre-trained SD Rombach et al. (2022) to support image editing with and without reference images. Since SD is a latent diffusion model (LDM), we first encode the original image $x_0$ into the latent space $z_0$, which is then inverted to $z_T$ by DDIM inversion Song et al. (2020a). If a reference image $x_0^{ref}$ exists, it is also involved in the inversion to produce $z_T^{ref}$. In this process, we store some intermediate features and latents at each time step to build a memory bank, which is used to provide guidance for the subsequent image editing. During generation, we transform the information stored in the memory bank into content editing and consistency guidance through two paths, i.e., visual cross-attention and gradient guidance. Both paths are built on feature correspondence in the pre-trained SD, without extra model fine-tuning or new blocks.

### 3.3 DDIM INVERSION WITH MEMORY BANK

In our image editing process, the starting point $z_T$, produced by DDIM inversion Song et al. (2020a), provides a good generation prior for maintaining consistency with the original image. However, relying solely on the final step of this approximate inversion can hardly provide accurate generation guidance. Therefore, we fully utilize the information in DDIM inversion by building a memory bank that stores the latent $z_t^{gud}$ at each inversion step $t$, as well as the corresponding keys $K_t^{gud}$ and values $V_t^{gud}$ in the self-attention modules of the decoder within the UNet denoiser. Note that some cross-image editing tasks (e.g., appearance replacing and object pasting) require reference images. In these tasks, the memory bank is doubled to also store the information of the reference image, denoted $z_t^{ref}$, $K_t^{ref}$, and $V_t^{ref}$. The information stored in the memory bank provides more accurate guidance for the subsequent image editing process.
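To make this bookkeeping concrete, the following is a minimal sketch (PyTorch-style, under our own naming, not the authors' released code) of how DDIM inversion could populate such a memory bank; `ddim_invert_step` and `grab_decoder_kv` are hypothetical wrappers around the diffusion pipeline and attention hooks.

```python
import torch

class MemoryBank:
    """Per inversion step t, stores the latent z_t^gud and the self-attention
    keys/values of the UNet decoder (reference-image copies can be stored analogously)."""
    def __init__(self):
        self.latents = {}                   # t -> z_t
        self.keys, self.values = {}, {}     # t -> list of K / V per decoder self-attention layer

    def store(self, t, z_t, kv_pairs):
        self.latents[t] = z_t.detach()
        self.keys[t] = [k.detach() for k, _ in kv_pairs]
        self.values[t] = [v.detach() for _, v in kv_pairs]

@torch.no_grad()
def build_memory_bank(unet, z0, timesteps, ddim_invert_step, grab_decoder_kv):
    """DDIM inversion from z_0 toward z_T, caching intermediates at every step.
    `ddim_invert_step` and `grab_decoder_kv` are placeholders, not library functions."""
    bank, z = MemoryBank(), z0
    for t in timesteps:                      # e.g., 50 DDIM steps with increasing noise level
        eps = unet(z, t)                     # noise prediction at step t
        kv_pairs = grab_decoder_kv(unet)     # K/V captured by forward hooks on decoder self-attention
        bank.store(t, z, kv_pairs)
        z = ddim_invert_step(z, eps, t)      # deterministic DDIM inversion update
    return bank, z                           # z is the sampling starting point z_T
```

In practice the hooks would be installed on SD's decoder self-attention modules; we only gesture at that interface here.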
### 3.4 GRADIENT-GUIDANCE-BASED EDITING DESIGN

Figure 3: Illustration of the continuous sampling space in score-based diffusion. Bright colors indicate areas where target data is densely distributed. The orange and green paths respectively refer to the diffusion paths without and with external gradient guidance.

Inspired by classifier guidance Dhariwal & Nichol (2021), we build energy functions to transform image editing operations into gradient guidance during diffusion sampling. An intuitive illustration is presented in Fig. 3, which shows the continuous sampling space of score-based diffusion Song et al. (2020b). The sampling starting point $z_T$, obtained from DDIM inversion, approximately returns to the original point if it only follows the gradient/score predicted by the denoiser. After incorporating the gradient guidance generated by an energy function that matches the editing target, the additional guidance gradient changes the path so that the sampling result meets the editing target.

#### 3.4.1 ENERGY FUNCTION VIA FEATURE CORRESPONDENCE

In our DragonDiffusion, energy functions are designed to provide gradient guidance for image editing, mainly including a content editing term and a consistency term. Specifically, at the $t$-th time step, we reuse the UNet denoiser $\epsilon_\theta$ to extract intermediate features $F_t^{gen}$ from the latent $z_t^{gen}$ at the current time step. The same operation is used to extract guidance features $F_t^{gud}$ from $z_t^{gud}$ in the memory bank. Following DIFT Tang et al. (2023), $F_t^{gen}$ and $F_t^{gud}$ come from intermediate features in the UNet decoder. The image editing operation is represented by two binary masks (i.e., $m^{gud}$ and $m^{gen}$) that locate the original content position and the target dragging position, respectively. Therefore, the energy function is built by constraining the correspondence between these two regions in $F_t^{gud}$ and $F_t^{gen}$. Here, we utilize the cosine distance $\cos(\cdot) \in [-1, 1]$ to measure similarity and normalize it to $[0, 1]$:

$$S_{local}(F_t^{gen}, m^{gen}, F_t^{gud}, m^{gud}) = 0.5 \cdot \cos\!\big(F_t^{gen}[m^{gen}],\ \mathrm{sg}(F_t^{gud}[m^{gud}])\big) + 0.5, \tag{3}$$

where $\mathrm{sg}(\cdot)$ is the gradient clipping operation. Eq. 3 is mainly used for dense constraints on the spatial location of content. In addition, a global appearance similarity is defined as:

$$S_{global}(F_t^{gen}, m^{gen}, F_t^{gud}, m^{gud}) = 0.5 \cdot \cos\!\left(\frac{\sum F_t^{gen}[m^{gen}]}{\sum m^{gen}},\ \mathrm{sg}\!\left(\frac{\sum F_t^{gud}[m^{gud}]}{\sum m^{gud}}\right)\right) + 0.5, \tag{4}$$

which utilizes the mean of the features in a region as a global appearance representation. When we want fine control over the spatial position of an object, or a rough global control over its appearance, we only need to constrain the similarity in Eq. 3 or Eq. 4, respectively, to be as large as possible.
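As a concrete reading of Eq. 3 and Eq. 4, the sketch below (our own PyTorch-style illustration with assumed tensor conventions, not the authors' code) computes the two normalized similarities from decoder features and binary masks.

```python
import torch
import torch.nn.functional as F

def s_local(f_gen, m_gen, f_gud, m_gud):
    """Dense similarity (Eq. 3): cosine between generated and guidance features at the
    masked positions, mapped from [-1, 1] to [0, 1].
    f_*: (C, H, W) decoder features; m_*: (H, W) boolean masks selecting equal pixel counts."""
    a = f_gen[:, m_gen]                       # (C, N) features at the target positions
    b = f_gud[:, m_gud].detach()              # sg(.): block gradients through guidance features
    return 0.5 * F.cosine_similarity(a, b, dim=0).mean() + 0.5

def s_global(f_gen, m_gen, f_gud, m_gud):
    """Global appearance similarity (Eq. 4): cosine between the mean features of the
    two masked regions, mapped to [0, 1]."""
    a = f_gen[:, m_gen].mean(dim=1)           # region-averaged feature, generated branch
    b = f_gud[:, m_gud].mean(dim=1).detach()  # region-averaged guidance feature, gradient-blocked
    return 0.5 * F.cosine_similarity(a, b, dim=0) + 0.5
```

Whether the per-position similarities in Eq. 3 are averaged or summed is an implementation detail not specified in the text; the sketch averages them.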
Therefore, the energy function that produces the editing guidance is defined as:

$$E_{edit} = \frac{1}{\alpha + \beta \cdot S(F_t^{gen}, m^{gen}, F_t^{gud}, m^{gud})}, \quad S \in \{S_{local}, S_{global}\}, \tag{5}$$

where $\alpha$ and $\beta$ are two hyper-parameters, set as 1 and 4, respectively. In addition to editing, we want the unedited content to remain consistent with the original image. We use a mask $m^{share}$ to locate the areas without editing. The similarity between the editing result and the original image within $m^{share}$ can also be calculated by the cosine similarity, as $S_{local}(F_t^{gen}, m^{share}, F_t^{gud}, m^{share})$. Therefore, the energy function that produces the content consistency guidance is defined as:

$$E_{content} = \frac{1}{\alpha + \beta \cdot S_{local}(F_t^{gen}, m^{share}, F_t^{gud}, m^{share})}. \tag{6}$$

In addition to $E_{edit}$ and $E_{content}$, an optional guidance term $E_{opt}$ may need to be added in some tasks to achieve the editing goal. Finally, the base energy function is defined as:

$$E = w_e \cdot E_{edit} + w_c \cdot E_{content} + w_o \cdot E_{opt}, \tag{7}$$

where $w_e$, $w_c$, and $w_o$ are hyper-parameters that balance these guidance terms. They vary slightly across different editing tasks but are fixed within the same task. Finally, regarding $[m^{gen}, m^{share}]$ as the condition, the conditional score function in Eq. 2 can be written as:

$$\nabla_{z_t^{gen}}\log q(z_t^{gen}|y) = \nabla_{z_t^{gen}}\log q(z_t^{gen}) + \nabla_{z_t^{gen}}\log q(y|z_t^{gen}), \quad y = [m^{gen}, m^{share}]. \tag{8}$$

The conditional gradient $\nabla_{z_t^{gen}}\log q(y|z_t^{gen})$ can be computed from $\nabla_{z_t^{gen}} E$, which is further multiplied by a learning rate $\eta$. In experiments, we find that gradient guidance in the later diffusion generation steps hinders the generation of textures. Therefore, we only add gradient guidance in the first $n$ steps of diffusion generation. Empirically, we set $n = 30$ out of 50 sampling steps.

#### 3.4.2 MULTI-SCALE FEATURE CORRESPONDENCE

The decoder of the UNet denoiser contains four blocks of different scales. DIFT Tang et al. (2023) finds that the second layer contains more semantic information, while the third layer contains more geometric information. We also study the role of features from different layers in image editing tasks, as shown in Fig. 4. In this experiment, we set $z_T$ to random Gaussian noise and set $m^{gen}$ and $m^{gud}$ to zero matrices, while $m^{share}$ is set to an all-ones matrix. In this way, generation relies solely on the content consistency guidance (i.e., Eq. 6) to restore the image content.

Figure 4: Illustration of using features from different layers as guidance to restore the original image. $z_T$ is randomly initialized, and the generation is guided solely by the content consistency guidance in Eq. 6.

We find that the guidance from the first layer is too high-level to reconstruct the original image accurately. The guidance from the fourth layer has weak feature correspondence, resulting in significant differences between the reconstructed and original images. The features from the second and third layers are more suitable for producing guidance signals, and each has its own specialty. Concretely, the features in the second layer contain more semantic information and can reconstruct images that are semantically similar to the original image but with some differences in content details. The features in the third layer tend to express low-level characteristics, but they cannot provide effective supervision for high-level texture, resulting in blurry results. In our design, we combine these two levels (i.e., high and low) of guidance through a multi-scale supervision approach. Specifically, we compute gradient guidance on both the second and third layers. The reconstructed results in Fig. 4 also demonstrate that this combination can balance the generation of low-level and high-level visual characteristics.
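The sketch below (again our own PyTorch-style illustration, reusing `s_local` and the memory bank from the earlier sketches; `extract_feats` is a hypothetical hook returning a chosen decoder-layer feature map) shows how the combined energy of Eq. 5-7 could be turned into a latent-space gradient on decoder layers 2 and 3, to be applied only during the first $n$ sampling steps.

```python
import torch

def guidance_gradient(unet, z_gen, t, bank, masks, weights, extract_feats,
                      eta=1.0, layers=(2, 3)):
    """Turn E = w_e*E_edit + w_c*E_content (E_opt omitted here) into a gradient on the
    current latent z_t^gen (Eq. 7 and Eq. 8). `extract_feats(unet, z, t, layer)` is a
    placeholder that returns decoder features for the given layer."""
    alpha, beta = 1.0, 4.0
    z = z_gen.detach().requires_grad_(True)

    energy = 0.0
    for layer in layers:                                   # multi-scale: decoder layers 2 and 3
        f_gen = extract_feats(unet, z, t, layer)
        f_gud = extract_feats(unet, bank.latents[t], t, layer)

        s_edit = s_local(f_gen, masks["gen"], f_gud, masks["gud"])      # or s_global, task-dependent
        s_keep = s_local(f_gen, masks["share"], f_gud, masks["share"])  # unedited-region consistency

        e_edit = 1.0 / (alpha + beta * s_edit)             # Eq. 5
        e_content = 1.0 / (alpha + beta * s_keep)          # Eq. 6
        energy = energy + weights["edit"] * e_edit + weights["content"] * e_content  # Eq. 7

    (grad,) = torch.autograd.grad(energy, z)               # conditional gradient term of Eq. 8
    return eta * grad    # used to nudge z_t^gen toward lower energy during the first n of 50 steps
```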
#### 3.4.3 IMPLEMENTATION DETAILS FOR EACH APPLICATION

Figure 5: Visualization of the effectiveness of the inpainting guidance ($E_{opt}$) in the object moving task, showing that $E_{opt}$ can guide the inpainting of the area where the object was initially located.

Object moving. In the object moving task, $m^{gen}$ and $m^{gud}$ locate the same object at different spatial positions. $m^{share}$ is the complement ($C_u$) of the union ($\cup$) of $m^{gen}$ and $m^{gud}$, i.e., $m^{share} = C_u(m^{gen} \cup m^{gud})$. However, using only the content editing and consistency guidance in Eq. 5 and Eq. 6 can lead to some issues, as shown in the second image of Fig. 5. Concretely, although the bread is moved according to the editing signal, some of the bread content is still preserved at its original position in the generated result. This is because the energy function does not constrain the area where the moved object was initially located, so inpainting easily restores the original object. To rectify this issue, we use the optional energy term (i.e., $E_{opt}$ in Eq. 7) to constrain the inpainted content to be dissimilar to the moved object and similar to a predefined reference region. Here, we use $m^{ref}$ to locate the reference region and define $m^{ipt} = \{p \mid p \in m^{gud}\ \text{and}\ p \notin m^{gen}\}$ to locate the inpainting region. Finally, $E_{opt}$ in this task is defined as:

$$E_{opt} = \frac{w_i}{\alpha + \beta \cdot S_{global}(F_t^{gen}, m^{ipt}, F_t^{gud}, m^{ref})} + S_{local}(F_t^{gen}, m^{ipt}, F_t^{gud}, m^{ipt}), \tag{9}$$

where $w_i$ is a weight parameter, set as 2.5 in our implementation. The third image in Fig. 5 shows that this design effectively achieves the editing goal without introducing noticeable artifacts.

Object resizing. The score function in this task is the same as for object moving, except that a scale factor $\gamma > 0$ is added during feature extraction. Specifically, we use interpolation to transform $m^{gud}$ and $F_t^{gud}$ to the target size, and then extract $F_t^{gud}[m^{gud}]$ as the feature of the resized object. To locate the target object with the same size in $F_t^{gen}$, we resize $m^{gen}$ with the same scale factor $\gamma$ and then extract a new $m^{gen}$ of the original size from the center of the resized $m^{gen}$. Note that if $\gamma < 1$, we pad the vacant area with zeros.

Appearance replacing. This task aims to replace the appearance between objects of the same category across images. Therefore, the capacity of the memory bank is doubled to store extra information from the image containing the reference appearance, i.e., $z_t^{ref}$, $K_t^{ref}$, and $V_t^{ref}$. $m^{gen}$ and $m^{gud}$ respectively locate the editing object in the original image and the reference object in the reference image. $m^{share}$ is set as the complement of $m^{gen}$, i.e., $C_u(m^{gen})$. To constrain the appearance, we choose $S_{global}(F_t^{gen}, m^{gen}, F_t^{ref}, m^{gud})$ in Eq. 5. This task does not need $E_{opt}$.

Figure 6: Visual comparison between our DragonDiffusion and direct copy-paste in cross-image object pasting.

Object pasting. Object pasting aims to paste an object from one image onto any position in another image. Although this can be done by simple copy-paste, it often results in inconsistencies between the pasted area and the rest of the image due to differences in lighting and perspective, as shown in Fig. 6. The result obtained by copy-paste exhibits discontinuities, while the result generated by our DragonDiffusion achieves a more harmonized integration of the scene and the pasted object. In implementation, similar to appearance replacing, the memory bank stores the information of the reference image, which contains the target object. $m^{gen}$ and $m^{gud}$ respectively mark the position of the object in the edited image and in the reference image. $m^{share}$ is set as $C_u(m^{gen})$.

Point dragging. In this task, we drag image content via several points, as in DragGAN Pan et al. (2023). Here, $m^{gen}$ and $m^{gud}$ locate neighboring areas centered around the destination and starting points, respectively. We extract a $3\times 3$ rectangular patch centered around each point as the neighboring area. Unlike the previous tasks, $m^{share}$ is manually defined.
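To ground the mask notation used by the applications above, here is a minimal sketch (our own illustration; the mask semantics follow the definitions in this subsection, while tensor conventions and wrap-around shifting are simplifying assumptions) of how the masks for object moving and resizing could be derived from a user-provided object mask.

```python
import torch
import torch.nn.functional as F

def masks_for_object_moving(obj_mask: torch.Tensor, dy: int, dx: int):
    """Object moving: m_gud marks the object at its original position, m_gen is the same
    mask shifted to the target position, m_share covers everything outside both regions,
    and m_ipt is the vacated region that must be inpainted (m_gud minus m_gen)."""
    m_gud = obj_mask.bool()                                   # (H, W) user-provided object mask
    m_gen = torch.roll(m_gud, shifts=(dy, dx), dims=(0, 1))   # shift by the drag offset (wraps at borders)
    m_share = ~(m_gen | m_gud)                                # Cu(m_gen ∪ m_gud)
    m_ipt = m_gud & ~m_gen                                    # pixels uncovered by the move
    return m_gen, m_gud, m_share, m_ipt

def resize_mask(obj_mask: torch.Tensor, gamma: float):
    """Object resizing helper: rescale an (H, W) mask by factor gamma; the caller then
    crops (gamma > 1) or zero-pads (gamma < 1) back to the original spatial size."""
    scaled = F.interpolate(obj_mask[None, None].float(),
                           scale_factor=gamma, mode="nearest")[0, 0]
    return scaled.bool()
```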
### 3.5 VISUAL CROSS-ATTENTION

As mentioned previously, two strategies are used to ensure consistency between the editing result and the original image: (1) DDIM inversion to initialize $z_T$; (2) the content consistency guidance in Eq. 6. However, it is still challenging to maintain high consistency. Inspired by the consistency-preserving strategies in some video and image editing works Wu et al. (2022); Cao et al. (2023); Wang et al. (2023), we design a visual cross-attention guidance. Instead of generating guidance information through an independent inference branch, we reuse the intermediate features of the inversion process stored in the memory bank. Specifically, similar to the injection of text conditions in SD Rombach et al. (2022), we replace the key and value in the self-attention modules of the UNet decoder with the corresponding key and value collected in the memory bank during DDIM inversion. Note that in the appearance replacing and object pasting tasks, the memory bank stores two sets of keys and values, one from the original image ($K_t^{gud}$, $V_t^{gud}$) and one from the reference image ($K_t^{ref}$, $V_t^{ref}$). In this case, we concatenate the two sets of keys and values along the length dimension. The visual cross-attention at each time step is defined as follows, where ⓒ refers to the concatenation operation:

$$\begin{cases} Q_t = Q_t^{gen}; \quad K_t = K_t^{gud}\ \text{or}\ (K_t^{gud}\,ⓒ\,K_t^{ref}); \quad V_t = V_t^{gud}\ \text{or}\ (V_t^{gud}\,ⓒ\,V_t^{ref}), \\ \mathrm{Att}(Q_t, K_t, V_t) = \mathrm{softmax}\!\left(\dfrac{Q_t K_t^{T}}{\sqrt{d}}\right) V_t. \end{cases} \tag{10}$$
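A minimal sketch of Eq. 10 is given below (PyTorch-style, with an assumed tensor layout; the real implementation would hook SD's decoder self-attention modules rather than defining its own attention).

```python
import math
import torch

def visual_cross_attention(q_gen, k_gud, v_gud, k_ref=None, v_ref=None):
    """Eq. 10: queries come from the current generation branch, while keys/values are taken
    from the memory bank and optionally concatenated with reference-image K/V along the
    token (length) dimension. Expected shapes: (B, L, d)."""
    k = k_gud if k_ref is None else torch.cat([k_gud, k_ref], dim=1)
    v = v_gud if v_ref is None else torch.cat([v_gud, v_ref], dim=1)
    d = q_gen.shape[-1]
    attn = torch.softmax(q_gen @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return attn @ v
```

In practice this would be installed by overriding the self-attention forward pass of the SD UNet decoder (e.g., via an attention-processor hook), which we only gesture at here.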
## 4 EXPERIMENTS

In our experiments, we use Stable Diffusion V1.5 Rombach et al. (2022) as the base model. Inference adopts DDIM sampling with 50 steps, and we set the classifier-free guidance scale to 5.

### 4.1 COMPARISONS

In this part, we compare our DragonDiffusion with other methods on various image editing tasks.

Content dragging. In this task, we compare our method with the recent UserControllableLT Endo (2022), DragGAN Pan et al. (2023), and DragDiffusion Shi et al. (2023). We first present the time complexity of different methods in Tab. 1. Specifically, we divide the time complexity of each method into two parts, i.e., a preparing stage and an inference stage. The preparing stage involves diffusion/GAN inversion and model fine-tuning; the inference stage generates the editing result. The time complexity is tested on one-point dragging, with an image resolution of 512x512.

Table 1: Quantitative evaluation on face manipulation with 68 and 17 points. The accuracy is calculated by the Euclidean distance between edited points and target points. The initial distance (i.e., 57.19 and 36.36) is the upper bound, without editing. FID Seitzer (2020) is utilized to quantify the editing quality of different methods. The time complexity is computed on 1-point dragging.

| Method | Preparing complexity | Inference complexity | Unaligned face | 17 Points (from 57.19) | 68 Points (from 36.36) | FID (17/68 points) |
| --- | --- | --- | --- | --- | --- | --- |
| UserControllableLT | 1.2s | 0.05s | ✗ | 32.32 | 24.15 | 51.20/50.32 |
| DragGAN | 52.40s | 6.71s | ✗ | 15.96 | 10.60 | 39.27/39.50 |
| DragDiffusion | 48.25s | 19.71s | ✓ | 22.95 | 17.32 | 38.06/36.55 |
| DragonDiffusion (ours) | 3.62s | 15.93s | ✓ | 18.51 | 13.94 | 35.75/34.58 |

Figure 7: Qualitative comparison between our DragonDiffusion and other methods in face manipulation (target points are blue), object pasting, appearance replacing, and object moving.

The experiment is conducted on an NVIDIA A100 GPU with Float32 precision. The results show that our method is relatively efficient in the preparing stage, requiring only 3.62s to prepare $z_T$ and the memory bank. The inference complexity is also acceptable for diffusion generation. Following DragGAN Pan et al. (2023), the performance evaluation is conducted on face keypoint manipulation with 17 and 68 points. The test set is randomly formed by 800 aligned faces from the CelebA-HQ Karras et al. (2018) training set. Note that we do not set fixed regions for any of the methods, due to the difficulty of manually providing a mask for each face. In addition to accuracy, we also compute the FID Seitzer (2020) between the face editing results and the CelebA-HQ training set to represent the editing quality. The quantitative and qualitative comparisons are presented in Tab. 1 and Fig. 7, respectively. One can see that our DragonDiffusion achieves promising results in editing accuracy and content consistency. Although DragGAN achieves better editing accuracy, it has limitations in content consistency and robustness in areas outside faces (e.g., the headwear is distorted). GAN-based DragGAN and UserControllableLT are further limited by requiring alignment before editing, as shown in Fig. 8. If editing is performed without alignment, the results of DragGAN suffer from severe degradation. The alignment operation is not friendly to our editing goal, as it changes the original image content, e.g., by filtering out the background.

Figure 9: Effectiveness of different components in our DragonDiffusion in the object moving task.

In comparison, our method has promising editing accuracy, and the generation prior from SD enables better robustness and generalization for different content. In this task, our method also outperforms DragDiffusion. More results are shown in the appendix.

Other applications. For object pasting, we compare our method with Paint-by-Example Yang et al. (2023). For appearance replacing and object moving, we compare our method with Self-Guidance Epstein et al. (2024). The visual comparison in Fig. 7 shows that our method achieves performance comparable to the training-based method (i.e., Paint-by-Example) in object pasting.

Figure 8: Editing comparison between our DragonDiffusion and DragGAN Pan et al. (2023) on the unaligned body and face.
Compared to Self-Guidance, our method has better editing accuracy and content consistency. Due to the lack of consistency constraints, Self-Guidance produces some unexpected artifacts. Moreover, Self-Guidance shows obvious deviations in complex scenes, due to the coarse correspondence between text and image features. More results are presented in the appendix.

### 4.2 ABLATION STUDY

In this part, we demonstrate the effectiveness of several components in our DragonDiffusion, as shown in Fig. 9. We conduct the experiment on the object moving task. Specifically, (1) we verify the importance of the inversion prior by randomly initializing $z_T$ instead of obtaining it from DDIM inversion; the random $z_T$ leads to a significant difference between the editing result and the original image. (2) We remove the content consistency guidance (i.e., $E_{content}$) in Eq. 7, which causes local distortion in the editing result, e.g., the finger is twisted. (3) We remove the visual cross-attention. It can be seen that visual cross-attention plays an important role in maintaining the consistency between the edited object and the original object. Using the memory bank to provide $K_t$ and $V_t$ greatly reduces the additional cost; an ablation study of the memory bank is shown in the appendix. These components work together on both edited and unedited content, forming the fine-grained image editing model DragonDiffusion, which does not require extra training or modules.

## 5 CONCLUSION

Despite the ability of existing large-scale text-to-image (T2I) diffusion models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit generated or real images. In this paper, we aim to develop a drag-style and general image editing scheme based on the strong correspondence of intermediate image features in the pre-trained diffusion model. To this end, we model image editing as the change of feature correspondence and design energy functions to transform the editing operations into gradient guidance. Based on the gradient guidance strategy, we also propose multi-scale guidance that considers both semantic and geometric alignment. Moreover, a visual cross-attention is added based on a memory bank design, which enhances the consistency between the original image and the editing result. Because it reuses intermediate information from the inversion process, this content consistency strategy incurs almost no additional cost. Extensive experiments demonstrate that our proposed DragonDiffusion can perform various image editing tasks, including object moving, resizing, appearance replacing, object pasting, and content dragging. At the same time, the complexity of our DragonDiffusion is acceptable, and it does not require extra model fine-tuning or additional modules.

## REFERENCES

Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441, 2019.

Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305, 2020.

Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18511–18521, 2022.
Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18208–18218, 2022.

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843–852, 2023.

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023.

Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22560–22570, 2023.

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

Yuki Endo. User-controllable latent transformer for StyleGAN image layout editing. In Computer Graphics Forum, volume 41, pp. 395–406. Wiley Online Library, 2022.

Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36, 2024.

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Reddy Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In Proceedings of the International Conference on Learning Representations, 2022.

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In Proceedings of the International Conference on Learning Representations, 2022.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the International Conference on Learning Representations, 2018.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6007–6017, 2023.

Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. In Proceedings of the International Conference on Learning Representations, 2022.

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In Proceedings of the International Conference on Learning Representations, 2021.
Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp. 16784–16804, 2022.

Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag Your GAN: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11, 2023.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning, 2023.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

Maximilian Seitzer. pytorch-fid: FID Score for PyTorch. https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.3.0.

Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.

Jaskirat Singh, Stephen Gould, and Liang Zheng. High-fidelity guided image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5997–6006, 2023.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, 2020a.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, 2020b.

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023.
Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, and Yaniv Leviathan. UniTune: Text-driven image editing by fine tuning a diffusion model on a single image. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.

Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11, 2023.

Wen Wang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, and Chunhua Shen. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023.

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.

Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18381–18391, 2023.

Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. FreeDoM: Training-free energy-guided conditional diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23174–23184, 2023.

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847, 2023.

Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. Advances in Neural Information Processing Systems, 35:3609–3623, 2022.