
Event-Customized Image Generation

Zhen Wang 1 Yilei Jiang 1 Dong Zheng 1 Jun Xiao 1 Long Chen 2

[Figure 1 panels: (a) The subject customization; (b) The action and interaction customization; (c) The event customization.]

Figure 1: Customized Image Generation. (a) Generating customized images with given subjects in new contexts. (b) Generating customized images with co-existing basic action or interaction in given images. (c) Generating customized images for complex events with various target entities. Different colors and numbers show associations between reference entities and corresponding target prompts.

Customized image generation has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works have further explored the customization of actions and interactions beyond entity (e.g., human and object) appearance. However, these approaches focus only on basic actions and interactions between two entities, and their effectiveness is limited by the scarcity of reference images that depict exactly the same action or interaction. To extend customization to more complex scenes, we propose a new task: event-customized image generation. Given a single reference image, we define the event as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims to accurately capture the complex event and generate customized images with various target entities. To solve this task, we propose a novel training-free method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial features and self-attention maps from the reference image into the target image for event generation. We further collect two new evaluation benchmarks. Extensive experiments demonstrate the effectiveness of FreeEvent.

1 Zhejiang University, Hangzhou, China. 2 The Hong Kong University of Science and Technology, Hong Kong, China. Work was done when Zhen Wang visited HKUST. Correspondence to: Long Chen <longchen@ust.hk>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).


1. Introduction

Recently, large-scale pre-trained diffusion models (Dhariwal & Nichol, 2021; Nichol et al., 2022; Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022) have demonstrated remarkable success in generating diverse and photorealistic images from text prompts. Leveraging these creative capabilities, a novel application, customized image generation (Gal et al., 2023a; Ruiz et al., 2023; Chen et al., 2024a), has gained increasing attention for generating user-specified concepts. Significant progress has already been made in subject-customized image generation (Ye et al., 2023; Chen et al., 2024c). As shown in Figure 1(a), given a set of user-provided subject images, existing methods can accurately capture the unique appearance features of each subject (e.g., corgi) with a special identifier token, enabling creative rendering in new and diverse scenarios. Moreover, they can seamlessly integrate multiple subjects into cohesive compositions, preserving distinctive characteristics while adapting them to novel contexts.

Beyond the appearance of different entities (i.e., humans, animals, and objects) in the images, pioneering approaches have been developed to customize user-specified actions (Huang et al., 2024a), interactive relations (Huang et al., 2024b), and poses (Jia et al., 2024) between entities. As shown in Figure 1(b), these methods attempt to capture a single-entity action (e.g., handstand) or an interaction between two entities (e.g., back to back) that co-exists in the given reference images, and transfer it to the synthesis of action- or interaction-specific images with new entities.

However, for real-world scenes that typically involve multiple entities with more complex interactions (e.g., Figure 1(c), row 3: three humans are discussing in front of a computer with different poses), these works (Huang et al., 2024b;a; Jia et al., 2024) still face notable limitations. 1) Simplified Customization. Current action customization (Huang et al., 2024a) focuses solely on the basic actions of a single person. Similarly, interaction customizations (Huang et al., 2024b; Jia et al., 2024) are limited to basic interactive relations or poses between just two entities. More complex and diverse actions or interactions that involve multiple humans, animals, and objects remain unexplored. Additionally, while these methods typically perform well when generating images with the same type of entity (e.g., all monkeys or all cats), they struggle with more diverse and complex entities and their combinations. This narrow focus and the limitation on entity generation severely restrict their ability to customize more complex and diverse scenes with creative content. 2) Insufficient Data. To capture specific actions or interactions, existing methods (Huang et al., 2024b;a; Jia et al., 2024) tend to represent them by learning corresponding identifier tokens, which can then be used for generating new images. However, for each action or interaction, these training-based processes typically require a set of reference images (e.g., 10 images) paired with corresponding textual descriptions across different entities. Unfortunately, each action or interaction is highly unique and distinctive, i.e., gathering images that depict exactly the same action or interaction is challenging. As shown in Figure 1(b), there are still significant differences in the same action (e.g., handstand) between different reference images, which compromises the accuracy of the learned tokens and leads to inconsistent actions across generated images. This lack of data for identical actions or interactions severely limits the practicality and generalizability of these methods.

To address these limitations and extend customized image generation to more complex scenes, we propose a new and meaningful task: event-customized image generation. Given a single reference image, we define the event as all actions and poses of each single entity, together with the relations and interactions between different entities. As shown in Figure 1(c), event customization aims to accurately capture the complex and diverse event from the reference image and generate target images with various combinations of target entities. Since it needs only a single reference image, event customization also eliminates the need to collect multiple reference images that depict exactly the same event.

To solve this challenging task, we propose a novel training-free event customization method, denoted as FreeEvent. Based on the two main components of the reference image, i.e., entity and event, FreeEvent decomposes event customization into two parts: 1) Switching the entities in the reference image to target entities. 2) Transferring the event from the reference image to the target image. Following this idea, alongside the general denoising process of diffusion generation, we design two extra paths: the entity switching path and the event transferring path. Specifically, the entity switching path guides the localized layout of each target entity for entity generation. The event transferring path extracts the event information from the reference image and injects it into the denoising process to generate the specific event. Through this direct guidance and injection, FreeEvent offers a significant advantage over existing methods by eliminating the need for time-consuming training. Furthermore, as shown in Figure 1(c), FreeEvent can also serve as a plug-and-play framework that combines with subject customization methods, generating creative images with both user-specified events and subjects.

Moreover, as a pioneering effort in this direction, we also collect two evaluation benchmarks for event-customized image generation from existing datasets (i.e., SWiG (Pratt et al., 2020) and HICO-DET (Chao et al., 2015)) and the internet, dubbed SWiG-Event and Real-Event, respectively. Both benchmarks include reference images featuring diverse events and entities, along with manually crafted target prompts. Extensive experiments demonstrate that our approach achieves state-of-the-art performance, enabling more complex and creative customization with enhanced practicality and generalizability.

In summary, we make several contributions in this paper: 1) We propose the novel event-customized image generation task, which extends customized image generation to more complex scenes in real-world applications. 2) We propose FreeEvent, the first training-free method for event customization, which can be further combined with subject customization methods for more creative and generalizable customizations. 3) We collect two evaluation benchmarks for event-customized image generation, and FreeEvent achieves outstanding performance compared with existing methods.

2. Related Work

Text-to-Image Diffusion Generation. Diffusion models (Ho et al., 2020; Nichol & Dhariwal, 2021; Song et al., 2021) have emerged as a leading approach for image synthesis. Text-to-image diffusion models (Nichol et al., 2022; Ramesh et al., 2022; Saharia et al., 2022) further inject user-provided text descriptions into the diffusion process via pre-trained text encoders. After being trained on large-scale text-image pairs, they have shown great success in text-to-image generation. Unlike these models, which run the diffusion process in pixel space, latent diffusion models (LDMs) (Rombach et al., 2022) perform it in latent space with enhanced computational efficiency. Besides, existing works (Hertz et al., 2023; Tumanyan et al., 2023; Cao et al., 2023; Alaluf et al., 2024) have found that the spatial features and attention maps in LDMs contain localized semantic information of the image and the layout correspondence with the textual conditions. As a result, these features and attention maps have been used to control the layout, structure, and appearance in text-to-image generation. This can be achieved either through plug-and-play feature injection (Tumanyan et al., 2023; Xu et al., 2024; Lin et al., 2024) or by computing specific diffusion guidance (Epstein et al., 2023; Mo et al., 2024) for generation. In this paper, we use the pre-trained LDM Stable Diffusion (Rombach et al., 2022) as our base model.

Subject Customization. This task aims to generate customized images of user-specified subjects. Current mainstream subject customization works mainly focus on 1) Single-subject customization, including learning specific identifier tokens (Gal et al., 2023a), fine-tuning the text-to-image diffusion model (Ruiz et al., 2023; 2024), introducing layer-wise learnable embeddings (Voynov et al., 2023), and training large-scale multimodal encoders (Gal et al., 2023b; Li et al., 2024). 2) Multi-subject composition, including cross-attention modification (Tewel et al., 2023), constrained model fine-tuning (Kumari et al., 2023), layout guidance (Liu et al., 2023), and gradient fusion of each subject (Gu et al., 2024). In summary, these works are tailored to capture the appearance of the entities in the image, without considering the customization of actions or poses.

Action and Interaction Customization. These methods aim to generate customized images with the actions or interactions that co-exist in user-provided reference images. ReVersion (Huang et al., 2024b) first proposes to customize specific interactive relations by optimizing learnable relation tokens. ADI (Huang et al., 2024a) makes progress in customizing specific actions for a single subject. A follow-up work (Jia et al., 2024) further extends this to learning interactive poses between two individuals. However, all these works focus only on simplified customization of some basic actions and interactions, and their effectiveness is strictly limited by the insufficient reference image data. In contrast, our proposed event customization requires only one reference image, and our training-free framework achieves effective customization of complex events with various creative target entities.

3.1. Preliminary

Latent Diffusion Model (LDM). LDMs generally comprise a pretrained autoencoder and a denoising network. Given an image $x$, the encoder $\mathcal{E}$ maps it to the latent code $z_0 = \mathcal{E}(x)$. The forward process then adds Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to obtain $z_t = \sqrt{\alpha_t}\, z_0 + \sqrt{1 - \alpha_t}\, \epsilon$ at time step $t \in [1, T]$ under a predefined noise schedule $\alpha$. The backward process iteratively removes the added noise from $z_t$ to recover $z_0$, which the decoder maps back to an image, $x = \mathcal{D}(z_0)$. The diffusion model is trained to predict the added noise $\epsilon$ conditioned on the time step $t$ and optional conditions such as a text prompt $P$. The training objective is formulated as:

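$$\mathcal{L}_{LDM} = \mathbb{E}_{z \sim \mathcal{E}(x),\, P,\, \epsilon \sim \mathcal{N}(0, 1),\, t} \left[ \lVert \epsilon - \epsilon_\theta(z_t; t, P) \rVert_2^2 \right], \tag{1}$$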

where ϵθ is the denoising network.
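The forward process and the objective in Eq. (1) can be illustrated with a minimal NumPy sketch. The latent shape, the value of the schedule term, and the function names `forward_noise` and `ldm_loss` are assumptions made for illustration; no real denoising network is involved, and the "perfect predictor" simply returns the true noise.

```python
import numpy as np

def forward_noise(z0, alpha_t, rng):
    """Sample z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps
    return z_t, eps

def ldm_loss(eps_pred, eps):
    """Squared L2 error between the true and predicted noise (one sample)."""
    return float(np.sum((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))  # toy latent; the shape is an assumption
alpha_t = 0.7                        # toy schedule value at some step t
z_t, eps = forward_noise(z0, alpha_t, rng)

# A predictor that returns the true noise attains zero loss.
perfect_loss = ldm_loss(eps, eps)

# With the true noise known, the forward process can be inverted exactly.
z0_rec = (z_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
```

In training, `eps_pred` would come from the denoising network evaluated on `z_t`, the time step, and the prompt; the expectation in Eq. (1) is then approximated by averaging this loss over sampled images, noise, and time steps.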

Diffusion Guidance. The diffusion guidance modifies the sampling process (Ho et al., 2020) with additional score functions to guide it with more specific controls like object layout (Xie et al., 2023; Mo et al., 2024) and attributes (Epstein et al., 2023; Bansal et al., 2023). We express it as

$$\hat{\epsilon}_t = \epsilon_\theta(z_t; t, P) - s\, g(z_t; t, P), \tag{2}$$

where g is the energy function and s is a parameter that controls the guidance strength.
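As a small worked illustration of Eq. (2), the sketch below applies the guidance term to toy noise predictions. The choice of $g$ here (the difference between an unconditional and a conditional prediction) is an assumption made only to show a familiar special case: with that choice, Eq. (2) reduces algebraically to the classifier-free guidance form $(1+s)\,\epsilon_{cond} - s\,\epsilon_{uncond}$. It is not the energy function used later in this paper.

```python
import numpy as np

def guided_eps(eps_pred, g_value, s):
    """Eq. (2): shift the predicted noise by the scaled guidance term."""
    return eps_pred - s * g_value

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((4, 8, 8))    # toy eps_theta(z_t; t, P)
eps_uncond = rng.standard_normal((4, 8, 8))  # toy unconditional prediction
s = 2.0                                      # guidance strength

# Illustrative choice of g (an assumption, not the paper's energy function).
g_cfg = eps_uncond - eps_cond
eps_hat = guided_eps(eps_cond, g_cfg, s)

# The same quantity written in classifier-free guidance form.
cfg_form = (1.0 + s) * eps_cond - s * eps_uncond
```

Setting `s = 0` leaves the prediction unchanged, which matches the role of $s$ as the guidance strength.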


[Figure 2 panels: (a) Entity Switching Path (cross-attention guidance, cross-attention regulation); (b) Event Transferring Path (spatial feature injection, self-attention injection); Denoising Process. Example target prompt P: skeleton, statue, monkey, book.]

Figure 2: Overview of the pipeline. Given the reference image, event customization is a general diffusion denoising process with two extra paths. 1) The entity switching path guides the generation of each target entity through cross-attention guidance and regulation. 2) The event transferring path injects the spatial features and self-attention maps from the reference image into the denoising process. The final $z^G_0$ is then transformed back to the target image $I^G$ by the decoder.

3.2. Task Definition: Event-customized Generation

In this section, we first formally define the event-customized image generation task. Given a reference image $I^R$ involving $N$ reference entities $E^R = \{R_1, \dots, R_N\}$, we define the event as the specific actions and poses of each single reference entity, together with the relations and interactions between different reference entities. We are also given the entity masks $M = \{m_1, \dots, m_N\}$, where $m_i$ is the mask of the corresponding entity $R_i$. The task aims to capture the reference event and generate a target image $I^G$ under the same event but with diverse and novel target entities $E^G = \{G_1, \dots, G_N\}$ specified in the target prompt $P = \{w_0, \dots, w_N\}$, where $w_i$ is the description of the target entity $G_i$, and each target entity $G_i$ should keep the same action or pose as its corresponding reference entity $R_i$. For example, as shown in Figure 2, given a reference image with four reference entities (e.g., three people and one object), event customization aims to capture the complex reference event and generate a target image with a novel combination of different target entities (e.g., skeleton, statue, monkey, book).

3.3. Approach

Overview. We now introduce the proposed training-free event customization framework FreeEvent. Specifically, we decompose event-customized image generation into two parts: 1) generating target entities (i.e., switching each reference entity to a target entity), and 2) generating the same reference event (i.e., transferring the event from the reference image to the target image). Following this idea, we design two extra paths for the diffusion denoising process of event customization, denoted as the entity switching path and the event transferring path, respectively. As shown in Figure 2, the generation of $I^G$ starts by randomly initializing the latent $z^G_T \sim \mathcal{N}(0, I)$ and iteratively denoising it to $z^G_0$. During this denoising process, the entity switching path guides the generation of each target entity through cross-attention guidance and regulation based on the target prompt $P$ and the reference entity masks $M$. The event transferring path extracts the spatial features and self-attention maps from the reference image $I^R$ and injects them into the denoising process. The final $z^G_0$ is then transformed back to the target image $I^G$ by the decoder.
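The control flow described in this overview can be sketched as a plain loop. This is only a structural sketch under stated assumptions: `event_customize` and its three callable arguments (`entity_switch`, `extract_ref`, `denoise_step`) are hypothetical names introduced here, and the toy lambdas in the usage lines are arithmetic stand-ins rather than a diffusion model.

```python
def event_customize(z_T, timesteps, entity_switch, extract_ref, denoise_step):
    """Run a denoising loop with the two extra paths applied at each step.

    All three callables are hypothetical stand-ins:
      entity_switch(z, t)     -> latent after cross-attention guidance
      extract_ref(t)          -> reference features / self-attention at step t
      denoise_step(z, t, ref) -> next latent, with reference features injected
    """
    z = z_T
    for t in timesteps:
        z = entity_switch(z, t)      # entity switching path
        ref = extract_ref(t)         # event transferring path: extraction
        z = denoise_step(z, t, ref)  # denoising with injection
    return z

# Toy usage with scalar "latents" to make the call order visible.
calls = []
result = event_customize(
    10.0,
    [3, 2, 1],
    entity_switch=lambda z, t: (calls.append(("switch", t)), z - 1.0)[1],
    extract_ref=lambda t: (calls.append(("extract", t)), float(t))[1],
    denoise_step=lambda z, t, ref: (calls.append(("denoise", t)), 0.5 * z + ref)[1],
)
```

Passing the paths as callables keeps the loop independent of any particular U-Net implementation; in a real system each callable would wrap a forward pass through the denoising network.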

U-Net Architecture. Stable Diffusion (Rombach et al., 2022) uses the U-Net architecture (Ronneberger et al., 2015) for $\epsilon_\theta$, which contains an encoder and a decoder, each consisting of several basic encoder/decoder blocks, where each block further contains several encoder/decoder layers. Specifically, as shown in Figure 3(a), each U-Net encoder/decoder layer consists of a residual module, a self-attention module, and a cross-attention module. For block $b$, layer $l$, and timestep $t$, the residual module produces the spatial feature of the image as $f$. The self-attention module produces the self-attention map as $SA = \mathrm{Softmax}\left(\frac{Q_s K_s^T}{\sqrt{d}}\right)$, where $Q_s$ and $K_s$ are the query and key features projected from the visual features. For text-to-image generation, the cross-attention module further produces the cross-attention map between the text prompt $P$ and the image as $CA = \mathrm{Softmax}\left(\frac{Q_c K_c^T}{\sqrt{d}}\right)$, where $Q_c$ is the query features projected from the visual features, and $K_c$ is the key features projected from the textual embedding of $P$.
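The two attention maps can be computed in a few lines of NumPy. The sketch below uses random toy features and random projection matrices (`W_qs`, `W_ks`, `W_qc`, `W_kc` are hypothetical names); the sizes are arbitrary assumptions, chosen only so the shapes of $SA$ and $CA$ are easy to read.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    """Softmax(Q K^T / sqrt(d)), normalized over the key dimension."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
hw, n_tokens, d = 16, 5, 8                  # 4x4 feature map, 5 text tokens (toy)
visual = rng.standard_normal((hw, d))       # flattened visual features
text = rng.standard_normal((n_tokens, d))   # textual embedding of the prompt

# Hypothetical projection matrices standing in for learned weights.
W_qs, W_ks, W_qc, W_kc = (rng.standard_normal((d, d)) for _ in range(4))

SA = attention_map(visual @ W_qs, visual @ W_ks)  # (hw, hw): pixel-to-pixel
CA = attention_map(visual @ W_qc, text @ W_kc)    # (hw, n_tokens): pixel-to-token
```

Column `CA[:, i]` reshaped to the feature-map resolution is the spatial cross-attention map of token $w_i$, which is the quantity the entity switching path operates on.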


[Figure 3 panels: (a) U-Net Layer (self-attention, cross-attention); (b) Entity Switching (masks m1 to m4 paired with tokens skeleton, statue, monkey, book); (c) Event Transferring.]

Figure 3: (a) The architecture of the U-Net layer. (b) The process of cross-attention guidance and regulation. (c) The process of spatial feature and self-attention injection.

Entity Switching Path. This path aims to generate the target entities $E^G = \{G_1, \dots, G_N\}$ in $I^G$ by switching each reference entity $R_i$ to $G_i$ based on the target prompt $P$ and the reference entity masks $M$. The key is to ensure that each target entity $G_i$ is generated at the same location as its corresponding reference entity $R_i$ while avoiding appearance leakage between different entities. Inspired by prior works (Hertz et al., 2023; Chen et al., 2024b) that use cross-attention maps to control the layout of text-to-image generation, we apply cross-attention guidance and regulation to achieve entity switching.

As shown in Figure 2(a), at timestep $t$ of the denoising process, we first obtain the latent for entity switching as $z^A_t = z^G_t$. We then input $z^A_t$ together with the target prompt $P$ into the U-Net and compute the cross-attention maps $CA^A$. Next, we introduce an energy function to bias the cross-attention of each token $w_i$ (cf. Figure 3(b)):

$$g(CA^A_i, m_i) = \left(1 - \frac{CA^A_i \odot m_i}{CA^A_i}\right)^2, \tag{3}$$

where $CA^A_i$ is the cross-attention map of token $w_i$. Optimizing this function encourages the cross-attention map of each target entity $G_i$ to take higher values inside the area specified by $m_i$, which in turn guides the localized layout of each target entity. We compute the gradient of this guidance via backpropagation to update the latent $z^G_t$:

$$z^G_t = z^A_t - \sigma_t^2\, \eta\, \nabla_{z^A_t} \sum_{i \in N} g(CA^A_i, m_i), \tag{4}$$

where $\eta$ is the guidance scale and $\sigma_t = \sqrt{(1 - \alpha_t)/\alpha_t}$. Additionally, to avoid appearance leakage between target entities, we further restrict the cross-attention map of each token to its corresponding area. Specifically, for the cross-attention maps $CA^G$ calculated at timestep $t$ during the denoising process, we have:

$$CA^G_i = m_i \odot CA^G_i, \tag{5}$$

where $CA^G_i$ is the cross-attention map of token $w_i$.
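The guidance and regulation steps of Eqs. (3) to (5) can be sketched on a toy problem. Several assumptions are made here and should be read as such: the ratio in Eq. (3) is interpreted as attention mass inside the mask divided by total attention mass (both summed over spatial positions); the "attention map" is a softmax over the latent itself rather than the output of a U-Net; and a finite-difference gradient replaces backpropagation. The names `energy`, `regulate`, `guidance_step`, `toy_attention`, and `numerical_grad` are hypothetical.

```python
import numpy as np

def energy(ca_i, m_i):
    """Eq. (3) with spatial sums (assumed): (1 - mass inside mask / total mass)^2."""
    inside = np.sum(ca_i * m_i)
    total = np.sum(ca_i) + 1e-8
    return (1.0 - inside / total) ** 2

def regulate(ca_i, m_i):
    """Eq. (5): zero the token's attention outside its entity mask."""
    return m_i * ca_i

def guidance_step(z, grad, alpha_t, eta):
    """Eq. (4): z - sigma_t^2 * eta * grad, with sigma_t^2 = (1 - alpha_t) / alpha_t."""
    sigma_sq = (1.0 - alpha_t) / alpha_t
    return z - sigma_sq * eta * grad

def toy_attention(z):
    """Toy stand-in for the U-Net: a softmax over all latent positions."""
    e = np.exp(z - z.max())
    return e / e.sum()

def numerical_grad(f, z, h=1e-5):
    """Central finite differences, used here in place of backpropagation."""
    grad = np.zeros_like(z)
    for idx in np.ndindex(z.shape):
        zp, zm = z.copy(), z.copy()
        zp[idx] += h
        zm[idx] -= h
        grad[idx] = (f(zp) - f(zm)) / (2 * h)
    return grad

m = np.zeros((4, 4))
m[:, :2] = 1.0                      # entity mask: left half of a 4x4 map
z = np.zeros((4, 4))                # toy latent giving uniform attention

objective = lambda v: energy(toy_attention(v), m)
e_before = objective(z)
z_new = guidance_step(z, numerical_grad(objective, z), alpha_t=0.5, eta=1.0)
e_after = objective(z_new)
regulated = regulate(toy_attention(z_new), m)
```

With uniform attention, half of the mass lies inside the mask, so the energy starts near 0.25; one guidance step moves attention toward the masked region and lowers it. Regulation then removes whatever attention remains outside the mask.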

Event Transferring Path. This path aims to extract the specific reference event from the reference image $I^R$, including the actions, poses, relations, and interactions between the reference entities, and transfer it to the target image $I^G$. From the perspective of image spatial information, the event is essentially the structure, semantic layout, and shape details of the image. Thus, based on the observation that spatial features and self-attention maps can be used to control the image layout and structure (Tumanyan et al., 2023; Xu et al., 2024; Lin et al., 2024), we perform spatial feature and self-attention map injection to achieve event transferring.

Specifically, as shown in Figure 2(b), we first obtain the latent code of the reference image, $z^R_0 = \mathcal{E}(I^R)$, and at each time step $t$ of the denoising process we obtain $z^R_t$ via the diffusion forward process. We then input $z^R_t$ into the U-Net to extract the spatial features and self-attention maps of the reference image, $f^R$ and $SA^R$. In parallel, for the denoising process, we input $z^G_t$ together with the target prompt $P$ into the U-Net and compute the spatial features and self-attention maps of the generated image, $f^G$ and $SA^G$. Then, as shown in Figure 3(c), we perform the injection by directly replacing the corresponding spatial features and self-attention maps:

$$f^G \leftarrow f^R \quad \text{and} \quad SA^G \leftarrow SA^R. \tag{6}$$
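The replacement in Eq. (6) amounts to overwriting selected intermediate tensors of the generation pass with those of the reference pass. The sketch below models the intermediate tensors as a dictionary keyed by a hypothetical `(stage, layer, name)` tuple; the `inject` helper, the key layout, and the choice of which layers to inject are all assumptions for illustration, since this section does not specify the injected layers.

```python
import numpy as np

def inject(gen_feats, ref_feats, inject_keys):
    """Eq. (6): replace selected generation tensors with the reference ones.

    Returns a new dict so the original generation features stay untouched.
    """
    out = dict(gen_feats)
    for key in inject_keys:
        out[key] = ref_feats[key]
    return out

rng = np.random.default_rng(0)
keys = [("dec", 0, "f"), ("dec", 0, "SA"), ("dec", 1, "SA")]
ref_feats = {k: rng.standard_normal((16, 16)) for k in keys}  # f^R, SA^R (toy)
gen_feats = {k: rng.standard_normal((16, 16)) for k in keys}  # f^G, SA^G (toy)

# Hypothetical selection: inject only at decoder layer 0.
inject_keys = [("dec", 0, "f"), ("dec", 0, "SA")]
mixed = inject(gen_feats, ref_feats, inject_keys)
```

In a real implementation the same effect is usually obtained by intercepting the module outputs during the forward pass; the dictionary form is used here only to make the replacement explicit.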

Highlights. By applying cross-attention guidance and regulation on each text token, our attention-guided entity switching can also be used to generate target entities of user-specified subjects, i.e., subjects represented by specific identifier tokens. Thus, our framework can be easily combined with subject customization methods to generate creative images with both customized events and subjects.

4. Experiments

4.1. Experimental Setup

Evaluation Benchmarks. To provide sufficient and suitable conditions for both quantitative and qualitative comparisons on this new task, we collect two new benchmarks¹. 1) For quantitative evaluation, we present SWiG-Event, a benchmark derived from the SWiG (Pratt et al., 2020) dataset, which comprises 5,000 samples with various events and entities, i.e., 50 kinds of different actions, poses, and interactions, where each kind of event has 100 reference images, and each reference image contains 1 to 4 entities with labeled bounding boxes and nouns. 2) For qualitative evaluation, we present Real-Event, which comprises 30 high-quality reference images from HICO-DET (Chao et al., 2015) and the internet with a wide range of events and entities (e.g., animal, human, object, and their combinations). We further employ Grounded-SAM (Kirillov et al., 2023; Ren et al., 2024) to extract the mask of each entity.

| Model | Retrieval R@1 | Retrieval R@5 | Retrieval R@10 | Verb T-1 | Verb T-5 | Verb T-10 | CLIP-I | DreamSim | DINO | CLIP-T | FID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ControlNet | 10.64 | 26.12 | 36.82 | 10.66 | 23.98 | 31.28 | 0.6009 | 0.3714 | 0.2599 | 0.2198 | 70.45 |
| MIGC | 10.90 | 27.00 | 37.64 | 10.62 | 26.14 | 35.04 | 0.6456 | 0.3772 | 0.2467 | 0.2145 | 49.81 |
| BoxDiff | 8.60 | 22.48 | 32.08 | 5.58 | 14.52 | 19.42 | 0.5838 | 0.3135 | 0.2099 | 0.2153 | 68.49 |
| FreeEvent | 41.12 | 63.02 | 72.74 | 34.10 | 62.04 | 71.82 | 0.7044 | 0.6282 | 0.5116 | 0.2238 | 29.05 |

Table 1: Performance of our model and state-of-the-art conditional text-to-image generation models on SWiG-Event. The columns are grouped as image retrieval (R@k), verb detection (T-k), image similarity (CLIP-I, DreamSim, DINO), CLIP-T, and FID. For image retrieval, R@k indicates that the corresponding reference image is among the top-k images most similar to the target image. For verb detection, T-k denotes the top-k detection accuracy.

Baselines. We compare against several kinds of SOTA baselines. For conditioned text-to-image generation, we compare with the training-based methods ControlNet (Zhang et al., 2023) and MIGC (Zhou et al., 2024), and the training-free method BoxDiff (Xie et al., 2023). For localized editing, we compare with the training-free methods PnP (Tumanyan et al., 2023) and MAG-Edit (Mao et al., 2024). For customization, we compare with the training-based methods DreamBooth (Ruiz et al., 2023) and ReVersion (Huang et al., 2024b).

Implementation Details. We use Stable Diffusion v2-1-base as the base model for all methods. Images are generated at a resolution of 512×512 on an NVIDIA A100 GPU¹.

4.2. Quantitative Comparisons

We compare FreeEvent with the state-of-the-art conditional text-to-image generation baselines ControlNet (Zhang et al., 2023), MIGC (Zhou et al., 2024), and BoxDiff (Xie et al., 2023) on SWiG-Event.

Setting. Each reference image in SWiG-Event contains reference entities together with a labeled event class, bounding boxes, nouns, and their corresponding masks. We construct the target prompt as the list of reference entity nouns, i.e., we ask every method to reproduce the reference image with the same event and the same entities. Additionally, ControlNet takes the semantic map merged from the masks as the layout condition. MIGC and BoxDiff take the bounding boxes with labeled entity nouns as the layout condition¹.

¹ Due to limited space, more details/results are in the Appendix.

Evaluation. Our evaluation asks whether generated images are aligned with, or similar to, their reference images, and we apply multiple metrics to assess the customization quality of the 5,000 target images from different perspectives. 1) Global image similarity: image retrieval performance and similarity scores. For each target image, we retrieve its corresponding reference image, based on the CLIP score, from the 100 reference images that share the same reference event class. Specifically, we extract the feature of each image with a pre-trained CLIP (Radford et al., 2021) visual encoder and compute cosine similarities for retrieval. We also use the CLIP-I, DreamSim (Fu et al., 2023), and DINO (Oquab et al.) scores to evaluate the alignment between generated images and their reference images. 2) Event similarity: verb detection performance. We use the verb detection model GSRTR (Cho et al., 2021), trained on the SWiG dataset, to detect the verb class of each generated image, and then compute the detection accuracy against the annotations of the reference images (i.e., whether the generated image and its reference image have the same verb class). 3) Entity similarity: CLIP-T (Radford et al., 2021) score. We use the CLIP-T score to evaluate the text alignment between the generated images and the text prompts. 4) Standard image generation metric. For a more comprehensive comparison, we use the FID (Heusel et al., 2017) score to evaluate the overall quality of the generated images.
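The retrieval metric can be sketched as follows, assuming image features have already been extracted. Random vectors stand in for CLIP features here, and the function name `recall_at_k` and the feature dimension are assumptions; the sketch covers a single event class, where target image `i` should retrieve reference image `i` among the 100 references of that class.

```python
import numpy as np

def recall_at_k(target_feats, ref_feats, ks=(1, 5, 10)):
    """R@k for paired features: target i should retrieve reference i.

    Both inputs are (n, d) arrays; similarity is cosine similarity.
    """
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = t @ r.T                         # (n, n) cosine similarities
    order = np.argsort(-sim, axis=1)      # most similar reference first
    # Position of the correct reference in each ranked list (0 = top-1).
    ranks = np.argmax(order == np.arange(len(t))[:, None], axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}

rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 32))                    # toy reference features
target = ref + 0.01 * rng.standard_normal((100, 32))    # near-copies of references
scores = recall_at_k(target, ref)

# Unrelated targets give a chance-level baseline for comparison.
random_scores = recall_at_k(rng.standard_normal((100, 32)), ref)
```

The reported benchmark numbers would correspond to averaging such per-class results over all event classes, expressed as percentages.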

Results. As shown in Table 1, we observe: 1) FreeEvent has better retrieval performance than ControlNet, MIGC, and BoxDiff. This demonstrates that the target images generated by FreeEvent better preserve the overall characteristics of the reference event and entities. 2) FreeEvent also achieves the best verb detection performance, which indicates that our method better preserves the interaction semantics in the generated images. 3) FreeEvent further outperforms the baselines across all similarity scores and the standard image generation metric, indicating that our method generates images with better quality and better alignment with both the reference images and the texts. These results all demonstrate the effectiveness of FreeEvent for event customization.

[Figure 4 columns: Reference Image, ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours. Target prompts shown: ①Spiderman ②Batman; ①robot ②bird ③apple; ①Spiderman ②orange ③strawberry; ①Tarzan ②tiger ③lion; ①bear ②Spiderman ③panther; ①Spiderman ②robot ③bear ④monkey.]

Figure 4: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.

4.3. Qualitative Comparisons

We compare FreeEvent with a wide range of SOTA baselines on Real-Event¹, including the conditioned text-to-image generation methods ControlNet (Zhang et al., 2023) and BoxDiff (Xie et al., 2023), the localized image editing methods PnP (Tumanyan et al., 2023) and MAG-Edit (Mao et al., 2024), and the image customization methods DreamBooth (Ruiz et al., 2023) and ReVersion (Huang et al., 2024b).

Setting. For each reference image in Real-Event, we manually constructed target prompts with various combinations of different target entities. Specifically, ControlNet takes the semantic map and BoxDiff takes the labeled bounding boxes as the layout conditions. MAG-Edit takes the reference entity masks for localized editing. DreamBooth and ReVersion learn event-specific identifier tokens for text-to-image generation.

Results. As shown in Figure 4, we observe: 1) The conditional text-to-image generation models ControlNet and BoxDiff can only maintain the rough layout of each entity and struggle to capture the detailed action, pose, or interaction between different entities. They also fail to match the generated entities with the desired target prompt. 2) The localized image editing methods PnP and MAG-Edit can capture the reference event, but both struggle to accurately generate the target entities and suffer from severe appearance leakage between target entities (e.g., orange and strawberry in row three, tiger and lion in row five); sometimes they even fail to edit and simply output the original content. 3) The subject-customization model DreamBooth and the relation-customization model ReVersion both fail to generate satisfying results. As discussed before, these training-based methods require multiple reference images and are unable to learn the specific event from only one reference image. 4) FreeEvent successfully customizes various complex events with novel combinations of target entities. Meanwhile, ControlNet and the localized image editing models tend to generate target entities that strictly match the masks of their corresponding reference entities (e.g., bird in row two), which looks incongruous. In contrast, the entities generated by FreeEvent not only match the layout of the reference entities but also remain harmonious. Although we use the reference entity mask to guide the generation of each target entity, the cross-attention guidance directs only the overall layout of each target entity and does not restrict its detailed appearance, allowing for more diverse generation of target entities¹.

[Figure 5 panels: (a) Ablation on two paths (w/o guidance, w/o regulation, w/o injection) for the target prompts ①skeleton ②statue ③monkey ④book, ①tiger ②lion ③meat, and ①ape ②robot; (b) Ablation on target prompt (+ verb, + background, + style), with additions such as "are discussing", "are playing", "on the beach", "in the library", "in the snow", "cartoon", and "Chinese painting".]

Figure 5: Ablations of the proposed paths and the target prompt. The guidance and regulation denote the cross-attention guidance and cross-attention regulation in the entity switching path, respectively. The injection denotes the event transferring path.

4.4. Ablations

Effectiveness of Entity Switching Path and Event Transferring Path. We first run ablations to verify the effect of the two proposed paths during event customization.

Results. As the results are shown in Figure 5(a), we can observe: 1) For the entity switching path, removing the

cross-attention guidance results in the failure of target entities generation (e.g., the ape, the meat), and removing cross-attention regulation leads to the appearance leakage between entities (e.g., the tiger and lion, the skeleton and statue). 2) After removing the event transferring path, although the target entities can be generated, the reference events are completely lost (i.e., the pose, action, relations, and interactions between each entity). These results all corroborates the effect of two paths in event customization.

Influence of Different Target Prompts. Notably, in our paper, the target prompt contains only the nouns of the target entities. We therefore run ablations to analyze the influence of additional descriptions (i.e., verb, background, style) in the target prompt on event customization.

Results. From Figure 5(b) we can observe: 1) Adding a verb description has a certain negative impact on entity appearance (e.g., the head of the ape, the face of the monkey), since these verbs may not be aligned with the model. Besides, accurately describing events in complex scenes can be challenging for users. Therefore, since FreeEvent can already precisely extract and transfer the reference events, users do not need to describe the specific events in the target prompt, which further demonstrates FreeEvent's practicality. 2) FreeEvent can accurately generate extra content for the background and style. Although there may be some detailed changes in the entity's appearance compared to the original output, these do not affect the entity's characteristics or the event. This also demonstrates FreeEvent's strong generalization capability.

[Figure 6 image. Panels: Subjects (<V0> <V1> <V2> <V3> <V4> <V5>), Reference Image, and Event-Subject Customization. Prompts and labels shown: ①<V0> ②<V5>; ①<V1> ②<V3>; ①<v0> ②<v1> ③<v2> ④typewriter; ①old man ②old woman; ①<V4> ②bear; ①monkey ②astronaut.]

Figure 6: Results of Event-Subject Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.

Model   Ours   ControlNet   BoxDiff   PnP   MAG-Edit   DreamBooth   ReVersion
HJ      50     25           7         26    23         8            3

Table 2: Results of the user studies on Real-Event.

Combination of Event and Subject Customization. We further validate the ability of our framework to combine with subject customization methods to generate target entities with user-specified subjects, i.e., represented by identifier tokens. We took the Break-A-Scene model (Avrahami et al., 2023) to learn identifier tokens for subjects and replaced the Stable Diffusion models in Figure 2 with the fine-tuned one.

Results. As shown in Figure 6, FreeEvent can effectively generate various given subjects in specific events. Specifically, FreeEvent enables the flexible generation of a wide range of subject concepts (e.g., humans, regular objects, and backgrounds) and their combinations. These results demonstrate the great potential of our framework for event-subject customization.

4.5. User Study

Setting. We conducted user studies on Real-Event to further evaluate the effectiveness of FreeEvent. Specifically, we invited 10 experts and gave them a reference image, a target prompt, and seven target images generated by different models. They were asked to select at least one and up to three images that they believed demonstrated the best results in event customization, taking into account the generation quality of the events and entities, as well as the overall coherence of the images. We prepared 50 trials and asked the experts to give their judgments. A target image that received more than six votes was counted as a human judgment (HJ) for the corresponding model.
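The vote-counting rule above can be illustrated with a minimal sketch. It assumes each expert's ballot is stored as a set of one to three model names. The function name `human_judgments` and the toy ballots are hypothetical and are not taken from the paper's data.

```python
from collections import Counter

VOTE_THRESHOLD = 6  # "more than six votes" from the 10 experts, as stated in the setting


def human_judgments(trials, threshold=VOTE_THRESHOLD):
    """Count, per model, the trials in which its image got more than `threshold` votes.

    `trials` is a list of trials; each trial is a list of ballots, and each
    ballot is the set of 1-3 model names that one expert selected.
    """
    hj = Counter()
    for ballots in trials:
        votes = Counter()
        for ballot in ballots:
            if not 1 <= len(ballot) <= 3:
                raise ValueError("each expert selects between one and three images")
            votes.update(set(ballot))
        for model, n_votes in votes.items():
            if n_votes > threshold:
                hj[model] += 1
    return hj


# Hypothetical toy trial with 10 experts (illustrative only).
toy_trial = [{"Ours", "PnP"}] * 7 + [{"ControlNet"}] * 3
print(human_judgments([toy_trial]))
```

Because each expert may pick up to three images, several models can pass the threshold in the same trial, so the HJ counts in Table 2 need not sum to the number of trials.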

Results. As shown in Table 2, FreeEvent achieves better performance on human judgments (HJ) compared with all the baseline models.

5. Conclusion

In this paper, we proposed a new task: Event-Customized Image Generation. It focuses on the customization of complex events with various target entities. Meanwhile, we proposed the first training-free event-customization framework, FreeEvent. To facilitate this new task, we also collected two evaluation benchmarks from existing datasets and the internet, dubbed SWiG-Event and Real-Event, respectively. We validated the effectiveness of FreeEvent with extensive comparative and ablative experiments. Moving forward, we plan to 1) extend event customization to other modalities, e.g., video generation; 2) explore advanced techniques for a finer combination of different customization works, e.g., subject, event, and style customizations.


Acknowledgements

This work was supported by the National Key Research & Development Project of China (2024YFB3312900), Key R&D Program of Zhejiang (2025C01128), and the Fundamental Research Funds for the Central Universities. Long Chen was supported by the Hong Kong SAR RGC Early Career Scheme (26208924), the National Natural Science Foundation of China Young Scholar Fund (62402408), Huawei Gift Fund, and the HKUST Sports Science and Technology Research Grant (SSTRG24EG04).

Impact Statement

Since FreeEvent can seamlessly integrate with subject customization methods to generate target entities based on user-specified subjects, this capability also raises the same concerns about the potential misuse of pretrained SD models for malicious applications (e.g., Deepfakes) involving real human figures. To address this, it is essential to implement robust safeguards and ethical guidelines, similar to the security measures and NSFW content detection mechanisms already present in existing diffusion models.

References

Alaluf, Y., Garibi, D., Patashnik, O., Averbuch-Elor, H., and Cohen-Or, D. Cross-image attention for zero-shot appearance transfer. In ACM SIGGRAPH 2024 Conference Papers, pp. 1-12, 2024.

Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., and Lischinski, D. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia 2023 Conference Papers, pp. 1-12, 2023.

Bansal, A., Chu, H.-M., Schwarzschild, A., Sengupta, S., Goldblum, M., Geiping, J., and Goldstein, T. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 843-852, 2023.

Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., and Zheng, Y. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22560-22570, 2023.

Chao, Y.-W., Wang, Z., He, Y., Wang, J., and Deng, J. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1017-1025, 2015.

Chen, H., Zhang, Y., Wu, S., Wang, X., Duan, X., Zhou, Y., and Zhu, W. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations, 2024a.

Chen, M., Laina, I., and Vedaldi, A. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5343-5353, 2024b.

Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., and Zhao, H. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6593-6602, 2024c.

Cho, J., Yoon, Y., Lee, H., and Kwak, S. Grounded situation recognition with transformers. In British Machine Vision Conference (BMVC), 2021.

Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Epstein, D., Jabri, A., Poole, B., Efros, A., and Holynski, A. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36:16222-16239, 2023.

Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., and Isola, P. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. Advances in Neural Information Processing Systems, 36:50742-50768, 2023.

Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations, 2023a.

Gal, R., Arar, M., Atzmon, Y., Bermano, A. H., Chechik, G., and Cohen-Or, D. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1-13, 2023b.

Gu, J., Wang, Y., Zhao, N., Fu, T.-J., Xiong, W., Liu, Q., Zhang, Z., Zhang, H., Zhang, J., Jung, H., et al. Photoswap: Personalized subject swapping in images. Advances in Neural Information Processing Systems, 36:35202-35217, 2023.

Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W., et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., and Cohen-Or, D. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023.


Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Huang, S., Gong, B., Feng, Y., Chen, X., Fu, Y., Liu, Y., and Wang, D. Learning disentangled identifiers for action-customized text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7797-7806, 2024a.

Huang, Z., Wu, T., Jiang, Y., Chan, K. C., and Liu, Z. Reversion: Diffusion-based relation inversion from images. In SIGGRAPH Asia 2024 Conference Papers, pp. 1-11, 2024b.

Jia, X., Isobe, T., Li, X., Wang, Q., Mu, J., Zhou, D., Lu, H., Tian, L., Sirasao, A., Barsoum, E., et al. Customizing text-to-image generation with inverted interaction. In ACM Multimedia 2024, 2024.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4026, 2023.

Kumari, N., Zhang, B., Zhang, R., Shechtman, E., and Zhu, J.-Y. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1931-1941, 2023.

Li, D., Li, J., and Hoi, S. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36, 2024.

Lin, K. H., Mo, S., Klingher, B., Mu, F., and Zhou, B. Ctrl-x: Controlling structure and appearance for text-to-image generation without guidance. Advances in Neural Information Processing Systems, 37:128911-128939, 2024.

Liu, Z., Feng, R., Zhu, K., Zhang, Y., Zheng, K., Liu, Y., Zhao, D., Zhou, J., and Cao, Y. Cones: Concept neurons in diffusion models for customized generation. In Proceedings of the 40th International Conference on Machine Learning, pp. 21548-21566, 2023.

Mao, Q., Chen, L., Gu, Y., Fang, Z., and Shou, M. Z. Mag-edit: Localized image editing in complex scenarios via mask-based attention-adjusted guidance. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 6842-6850, 2024.

Mo, S., Mu, F., Lin, K. H., Liu, Y., Guan, B., Li, Y., and Zhou, B. Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7465-7475, 2024.

Nichol, A. Q. and Dhariwal, P. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021.

Nichol, A. Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pp. 16784-16804. PMLR, 2022.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research.

Pratt, S., Yatskar, M., Weihs, L., Farhadi, A., and Kembhavi, A. Grounded situation recognition. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16, pp. 314-332. Springer, 2020.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234-241. Springer, 2015.


Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22500-22510, 2023.

Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., and Aberman, K. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6527-6536, 2024.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479-36494, 2022.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.

Tewel, Y., Gal, R., Chechik, G., and Atzmon, Y. Key-locked rank one editing for text-to-image personalization. In ACM SIGGRAPH 2023 Conference Proceedings, pp. 1-11, 2023.

Tumanyan, N., Geyer, M., Bagon, S., and Dekel, T. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1921-1930, 2023.

Voynov, A., Chu, Q., Cohen-Or, D., and Aberman, K. p+: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522, 2023.

Wang, X., Fu, S., Huang, Q., He, W., and Jiang, H. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. In ICLR, 2025.

Xie, J., Li, Y., Huang, Y., Liu, H., Zhang, W., Zheng, Y., and Shou, M. Z. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7452-7461, 2023.

Xu, S., Huang, Y., Pan, J., Ma, Z., and Chai, J. Inversion-free image editing with natural language. In Conference on Computer Vision and Pattern Recognition 2024, 2024.

Ye, H., Zhang, J., Liu, S., Han, X., and Yang, W. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836-3847, 2023.

Zhou, D., Li, Y., Ma, F., Zhang, X., and Yang, Y. Migc: Multi-instance generation controller for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6818-6828, 2024.


[Figure 7 image. (a) SWiG-Event sample: a reference image with bounding boxes and masks; event class: encouraging; nouns: boy, woman. (b) Retrieval-based evaluation: target prompt "boy, woman"; models shown: ControlNet, MIGC, BoxDiff; image retrieval over 100 reference images with event class "encouraging".]

Figure 7: (a) The SWiG-Event sample. (b) The process of quantitative evaluation and image retrieval.

The Appendix is organized as follows:

In Sec. A, we show more implementation details.

In Sec. B, we show more details of the SWiG-Event benchmark and the process of quantitative evaluation and image retrieval.

In Sec. C, we show quantitative results for path effectiveness.

In Sec. D, we show the results for attribute generation during event customization.

In Sec. E, we show more results for event-subject customization comparisons.

In Sec. F, we provide more discussion about the hyperparameter setting.

In Sec. G, we provide the discussion of our work's limitations.

In Sec. H, we show more qualitative comparison results of event customization on the Real-Event.

A. Implementation Details.

The denoising process was set to 50 steps. For the entity switching path, we apply the cross-attention guidance during the first 10 steps and the cross-attention regulation during all 50 steps, in all blocks and layers that contain a cross-attention module. For the event transferring path, we perform spatial feature injection at {decoder block 1: [layer 1]} during all 50 steps, and self-attention injection at {decoder block 1: [layer 1, 2], decoder block 2: [layer 0, 1, 2], decoder block 3: [layer 0, 1, 2]} during the first 25 steps. We set the classifier-free guidance scale to 15.0.
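To make the step schedule easier to read, the settings above can be collected into a small configuration object. This is only a sketch: the numeric values are the ones stated in this section, while the class name `FreeEventSchedule`, the field names, and the 0-based step indexing are assumptions made for illustration and do not come from the authors' code.

```python
from dataclasses import dataclass

# Decoder layers that receive injections, as listed in this section (block -> layers).
FEATURE_INJECTION_LAYERS = {1: [1]}
SELF_ATTN_INJECTION_LAYERS = {1: [1, 2], 2: [0, 1, 2], 3: [0, 1, 2]}


@dataclass(frozen=True)
class FreeEventSchedule:
    """Hypothetical container for the step counts given in the text."""
    num_steps: int = 50
    guidance_steps: int = 10             # cross-attention guidance: first 10 steps
    regulation_steps: int = 50           # cross-attention regulation: all steps
    feature_injection_steps: int = 50    # spatial feature injection: all steps
    self_attn_injection_steps: int = 25  # self-attention injection: first 25 steps
    cfg_scale: float = 15.0              # classifier-free guidance scale

    def active_ops(self, step: int) -> set:
        """Return the operations applied at denoising step `step` (0-based, assumed)."""
        if not 0 <= step < self.num_steps:
            raise ValueError("step out of range")
        limits = {
            "guidance": self.guidance_steps,
            "regulation": self.regulation_steps,
            "feature_injection": self.feature_injection_steps,
            "self_attn_injection": self.self_attn_injection_steps,
        }
        return {name for name, limit in limits.items() if step < limit}


schedule = FreeEventSchedule()
print(sorted(schedule.active_ops(0)))   # all four operations are active early on
print(sorted(schedule.active_ops(30)))  # only regulation and feature injection remain
```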

B. Details of SWiG-Event and process of image retrieval.

As shown in Figure 7(a), each SWiG-Event sample consists of a reference image with labeled bounding boxes and masks for each reference entity, the nouns of each reference entity, and the event class. As shown in Figure 7(b), we constructed the target prompt as a list of reference entity nouns. ControlNet takes the semantic map merged from the masks as the layout condition, and BoxDiff takes the bounding boxes with labeled entity nouns as the layout condition.

To compare image retrieval performance, we used each target image to retrieve its corresponding reference image from among the 100 reference images that share the same event class.
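A retrieval evaluation of this kind can be sketched as follows. The sketch assumes that similarity scores between each generated target image and the candidate reference images are already available, and that performance is summarized as Recall@k. Neither the similarity function nor the choice of Recall@k is specified in this section, so both are assumptions, and the function names are hypothetical.

```python
def retrieval_rank(similarities, true_index):
    """1-based rank of the true reference among all candidates (higher score is better)."""
    true_score = similarities[true_index]
    # Every candidate scoring strictly higher pushes the true reference down one place.
    return 1 + sum(1 for score in similarities if score > true_score)


def recall_at_k(all_similarities, true_indices, k):
    """Fraction of queries whose true reference is ranked within the top k."""
    if len(all_similarities) != len(true_indices) or not true_indices:
        raise ValueError("need one true index per query and at least one query")
    hits = sum(
        retrieval_rank(sims, idx) <= k
        for sims, idx in zip(all_similarities, true_indices)
    )
    return hits / len(true_indices)


# Toy example with 3 candidates per query (the paper uses 100 per event class).
toy_similarities = [
    [0.2, 0.9, 0.5],  # query 0: true reference is candidate 1
    [0.7, 0.1, 0.8],  # query 1: true reference is candidate 0
]
toy_truth = [1, 0]
print(recall_at_k(toy_similarities, toy_truth, k=1))  # 0.5
print(recall_at_k(toy_similarities, toy_truth, k=2))  # 1.0
```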


Model                      CLIP-I   DINO     DreamSim   CLIP-T
FreeEvent                  0.5771   0.2865   0.3877     0.3445
FreeEvent w/o guidance     0.5140   0.1058   0.2079     0.2979
FreeEvent w/o regulation   0.5459   0.1297   0.3011     0.3205
FreeEvent w/o injection    0.4945   0.0825   0.2694     0.3311

Table 3: Quantitative results for the ablation study of the two paths on Real-Event.

[Figure 8 image. Reference images with target prompts ①Spiderman ②red apple; ①Spiderman ②green apple; ①Spiderman ②crystal apple; and ①old lady ②cake; ①blonde lady ②cake; ①noble lady ②cake.]

Figure 8: The results of attribute generation during event customization.

C. Quantitative results for Path Effectiveness.

We ran ablations of the two paths on Real-Event. We evaluated the CLIP-I, DreamSim, and DINO scores for image similarity between the reference images and the generated images. We also evaluated the CLIP-T score for image-text similarity between the text prompts and the generated images. Table 3 demonstrates the effectiveness of the two paths: removing the guidance, the regulation, or the injection lowers all four scores.
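Embedding-based scores such as CLIP-I, DINO, and CLIP-T are commonly computed as the cosine similarity between two embedding vectors, averaged over the evaluation set. The sketch below shows only that averaging step, in pure Python. It assumes the embeddings have already been extracted by the respective encoder. DreamSim is a learned perceptual metric and may be computed differently, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_score(embeddings_a, embeddings_b):
    """Average cosine similarity over aligned pairs (e.g., reference vs. generated)."""
    if len(embeddings_a) != len(embeddings_b) or not embeddings_a:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(cosine_similarity(a, b) for a, b in zip(embeddings_a, embeddings_b))
    return total / len(embeddings_a)


# Toy 2-D embeddings (illustrative only): one identical pair, one orthogonal pair.
reference = [[1.0, 0.0], [1.0, 0.0]]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(mean_pairwise_score(reference, generated))  # 0.5
```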

D. Attribute Generation Results.

In this paper, we did not explicitly model attributes during generation. However, since FreeEvent can generate extra content for background and style from the corresponding text descriptions, we explored modeling attributes by simply adding adjectives to the target prompt. To ensure accurate generation of the attributes, we applied the cross-attention guidance and regulation to each attribute using the mask of the entity it describes. As shown in Figure 8, our method successfully generates the attributes of the corresponding entities (e.g., colors, materials, and ages). Although attributes are not the primary focus of our work, our approach shows potential in addressing them, and we leave further research to future work.
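The idea that an attribute shares the mask of the entity it describes can be sketched as a simple token-to-mask assignment. The sketch assumes the target prompt is given as one phrase per reference entity and splits phrases on whitespace. A real text encoder tokenizes differently, so the word-level tokens and the function name `token_mask_assignment` are illustrative assumptions only.

```python
def token_mask_assignment(entity_phrases):
    """Pair each word of each entity phrase with the index of that entity's mask.

    `entity_phrases[i]` is the target phrase for reference entity i, optionally
    with leading adjectives (e.g., "green apple"). Adjectives and the noun of
    the same phrase all point to mask i.
    """
    assignment = []
    for mask_id, phrase in enumerate(entity_phrases):
        words = phrase.split()
        if not words:
            raise ValueError("each entity needs a non-empty phrase")
        for word in words:
            assignment.append((word, mask_id))
    return assignment


# Prompt taken from Figure 8: the adjective "green" is guided by the apple's mask.
print(token_mask_assignment(["Spiderman", "green apple"]))
# [('Spiderman', 0), ('green', 1), ('apple', 1)]
```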

E. More Event-Subject Customization Comparisons.

In this section, we provide more event-subject customization comparisons with existing subject swapping and multi-subject customization methods, including AnyDoor (Chen et al., 2024c), PhotoSwap (Gu et al., 2023), and MS-Diffusion (Wang et al., 2025). As shown in Figure 9, FreeEvent outperforms all other methods.

We need to clarify that the primary focus of this paper is event customization, while event-subject combined customization is only a potential capability of FreeEvent, rather than a key aspect we intend to emphasize or compare with existing methods. Moreover, FreeEvent serves as a plug-and-play framework for event-subject combined customization, making it unsuitable for direct comparison with subject customization methods, as their settings and applicable scenarios differ. We will provide further discussions and results on event-subject customization in future work.

[Figure 9 image. Columns: Reference Image, Target Prompt, Ours, PhotoSwap, MS-Diffusion. Subjects shown: <V3> <V4> <V5>. Target prompts: ①<V0> ②<V5>; ①<V1> ②<V3>; ①<V4> ②bear on <V6>; ①<v0> ②<v1> ③<v2> ④typewriter.]

Figure 9: Comparisons of Event-Subject Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.

F. Discussion about Hyperparameter Setting.

As the first work on event customization, our goal is to enable FreeEvent to perform high-quality event customization across diverse reference images using a unified set of hyperparameters (see Sec. A). This default setting ensures faithful event transferring while allowing flexibility in target entity generation.

However, when there is a large shape discrepancy between the target entity and the reference entity, the layout information from the reference entity may undesirably affect the appearance of the target entity. This reflects a fundamental trade-off between event transferring and entity switching: prioritizing accurate event customization based on the reference image may lead to some compromise in the generation of the target entity. As shown in the example in Figure 10, when the shape differences are significant (horse vs. dinosaur), the default setup may result in suboptimal generation (Target Image 1).

A straightforward solution is to adjust the parameters of the event transferring and entity switching paths. Specifically, we enhance the entity switching to emphasize the generation of the dinosaur, while slightly reducing the strength of event transferring to mitigate layout constraints from the reference entity.

[Figure 10 image. Reference Image, Target Prompt (①knight ②dinosaur), Target Image 1, and Target Image 2, generated with the following settings.]

                 Self-attention   Spatial Feature   Cross-attention   Cross-attention
                 Injection        Injection         Regulation        Guidance
Target Image 1   25 steps         50 steps          50 steps          10 steps
Target Image 2   15 steps         30 steps          50 steps          15 steps

Figure 10: Comparisons of Event Customization with different hyperparameter settings. Different colors and numbers show the associations between reference entities and their corresponding target prompts.

As shown in Figure 10, we provide an updated result (Target Image 2) obtained by increasing the number of cross-attention guidance steps (from 10 to 15) and reducing the number of injection steps to 60% of the original (from 25 to 15 for self-attention injection and from 50 to 30 for spatial feature injection). This rebalancing enables a more suitable trade-off for this case, resulting in more prominent dinosaur features (e.g., shorter front claws and upright back legs) while still preserving the core event structure.
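The arithmetic of this rebalancing can be checked with a short sketch. The default step counts are those of Sec. A, and the result matches the settings reported for Target Image 2 in Figure 10. The function name `rebalance`, the dictionary keys, and the use of an integer percentage are assumptions made for illustration.

```python
DEFAULT_STEPS = {
    "self_attn_injection": 25,
    "feature_injection": 50,
    "regulation": 50,
    "guidance": 10,
}


def rebalance(steps, injection_percent=60, extra_guidance_steps=5):
    """Scale both injection step counts and lengthen the guidance phase.

    Integer arithmetic is used so that 60% of 25 is exactly 15 steps.
    """
    adjusted = dict(steps)
    for key in ("self_attn_injection", "feature_injection"):
        adjusted[key] = steps[key] * injection_percent // 100
    adjusted["guidance"] = steps["guidance"] + extra_guidance_steps
    return adjusted


print(rebalance(DEFAULT_STEPS))
# {'self_attn_injection': 15, 'feature_injection': 30, 'regulation': 50, 'guidance': 15}
```

Users facing a different shape mismatch could try other values of `injection_percent` and `extra_guidance_steps`. The values above are simply the ones used for the dinosaur example.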

This case study also demonstrates a practical and accessible way for users to adjust the trade-off between event transferring and entity switching according to their own customization needs. Looking ahead, we plan to explore more flexible and general solutions, such as adaptive parameter scheduling during generation or more explicit entity switching mechanisms, to further improve the controllability and diversity of target entities while maintaining event fidelity.

G. Limitation.

The main limitation of FreeEvent lies in the complexity of events and the number of entities. The customization effect may be compromised when there are too many entities in an image, especially if they are too small. As the first work in this direction, we hope our method can unveil new possibilities for more complex customization and the generation of a greater number of richer, more diverse entities. Additionally, since our model is built on pretrained Stable Diffusion (SD) models, our performance depends on the generative capabilities of SD. This can lead to suboptimal results for entities that the current SD struggles with, such as human faces and hands.

H. More Qualitative Comparison Results.

We show more comparisons on Real-Event in Figures 11 to 15, ordered by the number of entities. We use different combinations of target entities for the same reference image to generate diverse target images.


[Figure 11 image. Columns: ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours, Reference Image. Target prompts: ①warrior ②sword; ①monkey ②otter; ①tiger ②otter; ①Spiderman ②Batman; ①ape ②robot, cartoon; ①Spiderman ②Batman; ①ape ②robot; ①Jedi ②lightsaber; ①girl ②carrot; ①Spiderman ②batman.]

Figure 11: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.


[Figure 12 image. Columns: ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours, Reference Image. Target prompts: ①player ②baseball; ①Egyptian ②camel; ①Spiderman ②cow; ①old man ②tennis ball; ①woman ②sheep; ①old lady ②cake; ①robot ②monkey; ①Superman ②Ironman; ①policeman ②female soldier; ①Batman ②orange.]

Figure 12: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.


[Figure 13 image. Columns: ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours, Reference Image. Target prompts: ①Spiderman ②Batman ③cat; ①book ②sheep; ①princess ②unicorn; ①soldier ②tank; ①monkey ②horse; ①Spiderman ②horse; ①knight ②dinosaur; ①Egyptian ②camel; ①flower ②backpack; ①bear ②cake.]

Figure 13: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.


[Figure 14 image. Columns: ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours, Reference Image. Target prompts: ①tiger ②lion ③meat; ①cat ②dog ③orange; ①bear ②fox ③polar bear; ①tiger ②cat ③lion; ①bear ②Spiderman ③panther; ①dog ②dog ③apple; ①monkey ②lemon ③orange; ①robot ②cake ③donut; ①Spiderman ②orange ③strawberry; ①Batman ②orange ③strawberry.]

Figure 14: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.


[Figure 15 image. Columns: ControlNet, BoxDiff, PnP, MAG-Edit, DreamBooth, ReVersion, Ours, Reference Image. Target prompts: ①Batman ②wolf ③bear, cartoon; ①Tarzan ②tiger ③lion; ①bear ②tiger ③lion; ①robot ②bird ③apple; ①Spiderman ②cat ③tiger; ①robot ②tiger ③lion; ①Wolverine ②Spiderman ③Deadpool ④MacBook; ①skeleton ②statue ③monkey ④book; ①Spiderman ②robot ③bear ④monkey; ①Spiderman ②robot ③bear ④monkey.]

Figure 15: Comparison of Event Customization. Different colors and numbers show the associations between reference entities and their corresponding target prompts.