# Subject-driven Text-to-Image Generation via Apprenticeship Learning

Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, William W. Cohen
Google DeepMind, Google Research
{wenhuchen,hexiang,mingweichang,wcohen}@google.com

Figure 1: We train a single SuTI model to generate novel scenes faithfully reflecting given subjects (unseen in training, specified only by 3-5 in-context text-image demonstrations), without any optimization.

Abstract

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an expert model for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice SuTI model then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, human evaluation shows that SuTI significantly outperforms other existing models.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## 1 Introduction

Recent text-to-image generation models [1] have shown great progress in generating highly realistic, accurate, and diverse images from a given text prompt. These models are pre-trained on web-crawled image-text pairs like LAION [2] with autoregressive [3, 4] or diffusion [5, 1] backbones. Though achieving unprecedented success in generating highly accurate images, these models cannot customize their output to a given subject, like a specific dog, shoe, or backpack. Therefore, subject-driven text-to-image generation, the task of generating highly customized images with respect to a target subject, has attracted significant attention from the community. Subject-driven image generation is related to text-driven image editing, but it often needs to perform more sophisticated transformations to the source images (e.g., rotating the view, zooming in/out, changing the pose of the subject),
so existing image editing methods are generally not suitable for this new task.

Current subject-driven text-to-image generation approaches are slow and expensive. While different approaches like DreamBooth [6], Imagic [7], and Textual Inversion [8] have been proposed, they all require fine-tuning a specific model for a given subject on one or a few demonstrated examples, which typically takes at least 10-20 minutes[^2] to specialize the text-to-image model checkpoint for the given subject. These approaches are time-consuming, as they require back-propagating gradients over the entire model for hundreds or even thousands of steps per customization. Moreover, they are space-consuming, as they require storing a subject-specific checkpoint per subject. To avoid this excessive cost, Re-Imagen [9] proposed a retrieval-augmented text-to-image framework to train a subject-driven generation model in a weakly-supervised fashion. Since the retrieved neighbor images are not guaranteed to contain the same subjects, the model does not perform as well as DreamBooth [6] on the task of subject-driven image generation.

To avoid excessive computation and memory costs, we propose to train a single subject-driven text-to-image generation model that can perform on-the-fly subject customization. Our method is dubbed Subject-driven Text-to-Image generator (SuTI), and it is trained with a novel apprenticeship learning algorithm. Unlike standard apprenticeship learning, which only focuses on learning from one expert, our apprentice model imitates the behaviors of a massive number of specialized expert models. After such training, SuTI can instantly adapt to unseen subjects and unseen or even compositional descriptions with only 3-5 in-context demonstrations, within 30 seconds (on a Cloud TPU v4).

Figure 2: Conceptual diagram of the learning and data preparation pipeline.

Figure 2 presents a conceptual diagram of the learning and data preparation pipeline. We first group the images in WebLI [10] by their source URL to form tiny image clusters, because images from the same URL are likely to contain the same subject. We then perform extensive image-to-image and image-to-text similarity filtering to retain image clusters that contain highly similar content. For each subject image cluster, we fine-tune an expert model to specialize in the given subject. Then, we use the fine-tuned experts to synthesize new images given unseen creative captions proposed by large language models. However, the tuned expert models are not perfect and are prone to errors; therefore, we adopt a quality validation metric to filter out a large portion of degraded outputs. The remaining high-quality images are provided as a training signal to teach the apprentice model SuTI to perform subject-driven image generation with high fidelity. During inference, the trained SuTI can attend to a few in-context demonstrations to synthesize new images on the fly.

We evaluate SuTI on various tasks such as subject re-contextualization, attribute editing, artistic style transfer, and accessorization. We compare SuTI with existing models on DreamBench [6], which contains diverse subjects from a wide range of categories accompanied by a set of prompt templates.
We compute the CLIP-I/CLIP-T and DINO scores of SuTI's generated images on this dataset and compare them with DreamBooth. The results indicate that SuTI can outperform DreamBooth while having 20x faster inference speed and a significantly smaller memory footprint.

[^2]: Running on an A100, according to the public Colab notebooks: https://huggingface.co/sd-dreambooth-library and https://huggingface.co/docs/diffusers/training/text_inversion.

Further, we manually created 220 diverse and compositional prompts for the subjects in DreamBench for human evaluation, which we dub the DreamBench-v2 dataset. We then comprehensively compare with other baselines like InstructPix2Pix [11], Null-Text Inversion [12], Imagic [7], Textual Inversion [8], Re-Imagen [9], and DreamBooth [6] on DreamBench-v2. Our human evaluation results indicate that SuTI scores 5% higher than DreamBooth and at least 30% higher than the other baselines. A detailed fine-grained analysis shows that SuTI's textual alignment is significantly better than DreamBooth's, while its subject alignment is slightly better. However, DreamBooth's outputs are still better in the photorealism aspect, especially in terms of fine-grained detail preservation. We summarize our contributions as follows:

- We introduce the SuTI model, a subject-driven text-to-image generator that performs instant and customized generation for a visual subject from a few (image, text) exemplars, all in context.
- We employ apprenticeship learning to train a single apprentice SuTI model to imitate half a million fine-tuned subject-specific experts on a large-scale seed dataset, leading to a generator that generalizes to unseen subjects and unseen compositional descriptions.
- We perform a comprehensive set of automatic and human evaluations to show the capability of our model in generating highly faithful and creative images on DreamBench and DreamBench-v2.

To facilitate the reproducibility of our model performance, we release the SuTI model API as a Google Cloud Vertex AI model service, under the production name "Instant tuning"[^3].

## 2 Preliminary

In this section, we introduce the key concepts and notations about subject-driven image-text data, then discuss the basics of text-to-image diffusion models.

Diffusion Models. Diffusion models [13] are latent variable models, parameterized by $\Theta$, in the form of $p_\Theta(x_0) := \int p_\Theta(x_{0:T})\,dx_{1:T}$, where $x_1, \dots, x_T$ are noised latent versions of the input image $x_0 \sim q(x_0)$. Note that the dimensionality of both the latents and the image is the same throughout the entire process, with $x_{0:T} \in \mathbb{R}^d$, where $d$ is the product of the image's height, width, and channel dimensions. The process that computes the posterior distribution $q(x_{1:T}|x_0)$ is also called the forward (or diffusion) process, and is implemented as a predefined Markov chain that gradually adds Gaussian noise to the data according to a schedule $\beta_t$:

$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}); \quad q(x_t|x_{t-1}) := \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big) \tag{1}$$

Diffusion models are trained to learn the image distribution by reversing the diffusion Markov chain. Theoretically, this reduces to learning to denoise $x_t \sim q(x_t|x_0)$ into $x_0$, with a time re-weighted squared error loss (see [14] for the complete proof):

$$\mathbb{E}_{(x_0,c)\sim D}\big\{\mathbb{E}_{\epsilon,t}\big[w_t\,\|\hat{x}_\theta(x_t, c) - x_0\|_2^2\big]\big\} \tag{2}$$

where $D$ is the training dataset containing (image, condition) $= (x_0, c)$ pairs; the condition normally refers to the input text prompt. In practice, $w_t$ can be simplified to 1 according to [14, 15].
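To make this objective concrete, here is a minimal PyTorch-style sketch of the closed-form forward noising implied by Eqn. 1 and the re-weighted denoising loss of Eqn. 2. It is our illustration rather than the authors' implementation: the linear beta schedule, the tiny `ToyDenoiser`, and the 16-dimensional condition embedding are placeholders for Imagen's text-conditioned UNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # illustrative linear noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative products used by q(x_t | x_0)

def noise_to_t(x0, t):
    """Closed-form sample from q(x_t | x_0) implied by the Markov chain in Eqn. 1."""
    a_bar = alphas_bar[t].unsqueeze(-1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

class ToyDenoiser(nn.Module):
    """Tiny stand-in for x_hat_theta(x_t, c); the real model is a text-conditioned UNet."""
    def __init__(self, dim, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, c_emb):
        t_feat = (t.float() / T).unsqueeze(-1)    # scalar timestep feature
        return self.net(torch.cat([x_t, t_feat, c_emb], dim=-1))

def denoising_loss(model, x0, c_emb, w_t=1.0):
    """Eqn. 2 with w_t = 1: sample a timestep and noise, then regress x_0."""
    t = torch.randint(0, T, (x0.shape[0],))
    return w_t * F.mse_loss(model(noise_to_t(x0, t), t, c_emb), x0)

# usage on dummy flattened 64x64x3 "images" with dummy 16-d text embeddings
DIM = 64 * 64 * 3
model, x0, c_emb = ToyDenoiser(DIM), torch.randn(4, DIM), torch.randn(4, 16)
denoising_loss(model, x0, c_emb).backward()
```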
Subject-Driven Text-to-Image Generation. Existing subject-driven generation models [6, 16, 8] often fine-tune a pre-trained text-to-image diffusion model on a set of provided demonstrations $C_s$ about a specific subject $s$. Formally, such a demonstration set contains text and image pairs $C_s = \{(x_k, c_k)\}_{k=1}^{K_s}$ centered around the subject $s$: each $x_k$ is an image of the subject $s$, while $c_k$ is a short description of $x_k$. DreamBooth [6] also requires an additional set $\bar{C}_s$, which contains images of different subjects from the same category as $s$, for prior preservation. To obtain a customized diffusion model $\hat{x}_{\theta_s}(x_t, c)$, we need to optimize the following loss function:

$$\theta_s = \arg\min_{\theta}\,\mathbb{E}_{(x_0,c)\sim C_s \cup \bar{C}_s}\big\{\mathbb{E}_{\epsilon,t}\big[\|\hat{x}_\theta(x_t, c) - x_0\|_2^2\big]\big\} \tag{3}$$

The customized diffusion model $\hat{x}_{\theta_s}(x_t, c)$ has shown impressive capabilities in generating highly faithful images of the specified subject $s$.

[^3]: Generally available at https://cloud.google.com/vertex-ai/docs/generative-ai/image/fine-tune-model

Figure 3: Overview of the apprenticeship learning pipeline for SuTI. The left part shows the customization procedure for the expert models, and the right part shows the SuTI model that imitates the behaviors of differently customized experts. Note that this framework can cope with expert models of arbitrary architecture and model family.

## 3 Apprenticeship Learning from Subject-specific Experts

Notation. Figure 3 presents the concrete workflow of learning. Our method follows apprenticeship learning [17] with two major components, i.e., the expert diffusion models $\hat{x}_{\theta_s}(x_t, c)$ parameterized by $\theta_s$ for each subject $s \in S$, and the apprentice diffusion model $\hat{x}_\Theta(x_t, c, C_s)$ parameterized by $\Theta$. The apprentice model takes an additional set of image-text demonstrations $C_s$ as input. We use $S$ to denote the superset of subjects included in the training set.

Dataset. The training set $D_S$ contains a collection of $\{(C_s, p_s)\}_{s \in S}$, where each entry contains an image-text cluster $C_s$ accompanied by an unseen prompt $p_s$. The image-text cluster $C_s$ contains a set of 3-10 image-text pairs. The unseen prompt is an imaginary caption proposed by PaLM [18]. For example, if $c$ is "a photo of berry bowl", then $p_s$ would be an imaginary caption like "a photo of berry bowl floating on the river". We describe the dataset construction process in Section 4.

Learning. To obtain an expert $\hat{x}_{\theta_s}(x_t, c)$ for a subject $s$, we fine-tune a pre-trained diffusion model [1] on the image cluster $C_s$ with the denoising loss:

$$\theta_s = \arg\min_{\theta}\,\mathbb{E}_{(x_s,c)\sim C_s}\big\{\mathbb{E}_{\epsilon,t}\big[\|\hat{x}_\theta(x_t, c) - x_s\|_2^2\big]\big\} \tag{4}$$

where $x_t \sim q(x_t|x_s)$. The training is similar to Eqn. 3, except that we do not use negative examples for prior preservation, because finding negative examples from the same class is expensive. Once an expert model is trained, we use it to sample images $y_s$ for the unseen text description $p_s$ to guide the apprentice SuTI model. We gather the outputs from the massive number of expert models and then use CLIP filtering to construct a dataset $G$. Similarly, we fine-tune the apprentice model $\hat{x}_\Theta(x_t, p_s, C_s)$ with the denoising loss on the pseudo targets generated by the experts:

$$\Theta = \arg\min_{\Theta}\,\mathbb{E}_{(y_s,p_s,C_s)\sim G}\big\{\mathbb{E}_{\epsilon,t}\big[\|\hat{x}_\Theta(x_t, p_s, C_s) - y_s\|_2^2\big]\big\} \tag{5}$$

where $x_t \sim q(x_t|y_s)$, and the training triples $(y_s, p_s, C_s)$ are drawn from $G$.

Algorithm. We formally introduce our learning algorithm in Algorithm 1.
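Before detailing the distributed procedure, here is a minimal sketch of how the two objectives above differ, reusing the toy pieces (`T`, `DIM`, `noise_to_t`, `ToyDenoiser`) from the sketch in Section 2. It is illustrative only: the expert of Eqn. 4 is a plain text-conditioned denoiser, while the apprentice of Eqn. 5 additionally conditions on an encoding of the demonstration set; in SuTI proper this conditioning is cross-attention over UNet features, not the concatenated vector used below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# Reuses T, DIM, noise_to_t and ToyDenoiser from the sketch in Section 2.

class ToyApprentice(nn.Module):
    """Stand-in for x_hat_Theta(x_t, p_s, C_s): also sees demonstration features demo_emb."""
    def __init__(self, dim=DIM, cond_dim=16, demo_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim + demo_dim, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, p_emb, demo_emb):
        t_feat = (t.float() / T).unsqueeze(-1)
        return self.net(torch.cat([x_t, t_feat, p_emb, demo_emb], dim=-1))

def expert_loss(expert, x_s, c_emb):
    """Eqn. 4: plain denoising on subject-cluster images, no prior-preservation set."""
    t = torch.randint(0, T, (x_s.shape[0],))
    return F.mse_loss(expert(noise_to_t(x_s, t), t, c_emb), x_s)

def apprentice_loss(apprentice, y_s, p_emb, demo_emb):
    """Eqn. 5: regress the expert's filtered sample y_s, conditioned on the unseen
    prompt p_s and the demonstration set C_s."""
    t = torch.randint(0, T, (y_s.shape[0],))
    return F.mse_loss(apprentice(noise_to_t(y_s, t), t, p_emb, demo_emb), y_s)

# usage with dummy tensors (embeddings stand in for the text / demonstration encoders)
expert, apprentice = ToyDenoiser(DIM), ToyApprentice()
expert_loss(expert, torch.randn(3, DIM), torch.randn(3, 16)).backward()
apprentice_loss(apprentice, torch.randn(3, DIM), torch.randn(3, 16), torch.randn(3, 16)).backward()
```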
To improve training efficiency, we use a distributed training algorithm to accelerate the process. At each training step, we randomly sample a batch $\{B_{s_i}\}_{i=1}^{K}$ of size $K$ from the dataset $D_S$, with $B_{s_i} = (C_{s_i}, p_{s_i})$. We then fine-tune $K$ expert models separately w.r.t. Eqn. 4 in parallel, across $K$ different TPU cores. For every subject $s$ inside the batch, we use the corresponding expert model $\theta_s$ to synthesize an image $y_s$ given the unseen prompt $p_s$. As not all expert models can generate highly faithful images, we introduce a quality assurance step to validate the synthesized images. In particular, we measure the quality of an expert's generation by the delta CLIP score [19] $\Delta(y_s, C_s, p_s)$, which is used to decide whether a sample should be included in the dataset $G$. This ensures the high quality of the text-to-image training signal for SuTI. Specifically, the delta CLIP score is computed as the increment of the CLIP score of $y_s$ over the demonstrated images $x \in C_s$:

$$\Delta(y_s, C_s, p_s) = \mathrm{CLIP}(y_s, p_s) - \max_{x \in C_s} \mathrm{CLIP}(x, p_s) \tag{6}$$

We then feed $G$ as a training batch to update the parameters $\Theta$ of the apprentice model using Eqn. 5. In all our experiments, we set $K = 400$, with each TPU core training one expert model.

Algorithm 1: Apprenticeship Learning from a Large Crowd of Specialized Expert Models
1: Input: Dataset $D_S = \{(C_s, p_s)\}_{s \in S}$ containing subject image clusters $C_s$ and unseen prompts $p_s$
2: Input: Pre-trained diffusion model parameterized by $\theta$
3: Output: Apprentice diffusion model parameterized by $\Theta$
4: Initialize a buffer $G = \emptyset$
5: while $D_S \neq \emptyset$ do
6:   $\{B_{s_i}\}_{i=1}^{K}$ = Dequeue($D_S$, $K$), where $B_{s_i} = (C_{s_i}, p_{s_i})$
7:   Fine-tune $K$ expert models $\theta_{s_1}, \dots, \theta_{s_K}$ on $\{B_{s_i}\}_{i=1}^{K}$ in parallel, based on Eqn. 4
8:   for $i = 1$ to $K$ do
9:     Sample a subject-specific generation $y_{s_i}$ with DDPM using $\hat{x}_{\theta_{s_i}}(x_t, p_{s_i})$
10:    if $\Delta(y_{s_i}, C_{s_i}, p_{s_i}) > \lambda$ then
11:      $G$ = Enqueue($G$, $(y_{s_i}, C_{s_i}, p_{s_i})$)
12:    end if
13:  end for
14: end while
15: Train $\hat{x}_\Theta$ on the generated demonstrations $G$, based on Eqn. 5

Inference. To perform subject-driven text-to-image generation, the trained SuTI takes 3-5 image-text pairs as the demonstration to generate new images based on the given text description. No optimization is needed at inference time. The only overhead of SuTI is the cost of encoding these 3-5 image-text pairs and the extra attention computation, which is far more affordable. Our inference speed is roughly of the same order as the original text-to-image generator [1].

## 4 Mining and Generating Subject-driven Text-to-Image Demonstrations

In this section, we discuss how we created the seed dataset $D_S$ by mining images and text from the Internet. We construct the seed dataset using the WebLI [10, 20] dataset. In particular, we derive our initial image clusters by subsampling the Episodic WebLI data [20], which groups web images from the same URL. We then filter the clusters to ensure high intra-cluster visual similarity, using image matching models. The filtered set of image-text clusters is denoted $\{C_s\}_{s \in S}$. After obtaining the subject-driven image clusters, we further prompt a large language model [18] to generate a description about each subject, with the goal of creating descriptions of plausible imaginary visual scenes. Generating instances of these descriptions requires skills like subject re-contextualization, attribute editing, artistic style transfer, and accessorization. We denote the generated unseen captions as $p_s$. Together with $C_s$, this forms the final dataset $D_S$. More details regarding the grouping and filtering are provided in the Appendix.
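As a concrete reference for the delta CLIP filter of Eqn. 6 that decides which expert samples enter $G$, here is a minimal sketch. The Hugging Face `transformers` CLIP checkpoint is a stand-in of our choosing; the paper does not specify which CLIP variant [19] is used for filtering.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")        # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def delta_clip(y_s: Image.Image, demos: list, p_s: str) -> float:
    """Eqn. 6: CLIP(y_s, p_s) minus the best CLIP score of any demonstration image."""
    return clip_score(y_s, p_s) - max(clip_score(x, p_s) for x in demos)

def keep_sample(y_s, demos, p_s, lam=0.02) -> bool:
    """Keep an expert sample only if it improves text alignment over the demonstrations."""
    return delta_clip(y_s, demos, p_s) > lam
```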
The dataset $D_S$ contains a total of 2M $(C_s, p_s)$ pairs. Using the aforementioned delta CLIP score filtering (with a high threshold $\lambda = 0.02$), we remove low-quality synthesized images $y_s$ from the expert models, finally obtaining a dataset $G$ with 500K effective training examples for the subsequent apprenticeship learning.

## 5 Experiment

In this paper, we only train SuTI on the text-to-64x64 base diffusion model and retain the original 256x256 and 1024x1024 super-resolution models as they are from Imagen [1].

Expert Models. The expert model is initialized from the original 2.1B Imagen 64x64 model. We tune each model on a single TPU core (32 GB) for 500 steps using the Adafactor optimizer with a learning rate of 1e-5, which only takes 5 minutes. We use classifier-free guidance to sample new images, with the guidance weight set to 30. To avoid excessive memory costs, we use the fine-tuned experts to sample pseudo-target images and then write the samples to separate files; SuTI reads these files asynchronously to maximize training speed. Our expert models have a few distinctions from DreamBooth [6]: 1) we adopt the Adafactor instead of the Adam optimizer; 2) we do not include any class word token like "[DOG] dog" in the prompt; 3) we do not include in-class negatives for prior preservation. Though our expert model is weaker than DreamBooth, these design choices significantly reduce time/space costs and enable us to train millions of experts with reasonable resources.

Figure 4: Comparison with other image editing and image personalization models.

Apprentice Model. The apprentice model contains 2.5B parameters, which is 400M parameters more than the original 2.1B Imagen 64x64 model. The added parameters come from the extra attention layers over the demonstrated image-text inputs. We adopt the same architecture as Re-Imagen [9], where the additional image-text pairs are encoded by re-using the UNet DownStack, and the attention layers are added to the UNet DownStack and UpStack at different resolutions. We initialize our model from Imagen's checkpoint; the additional attention layers are randomly initialized. The apprentice training is performed on 128 Cloud TPU v4 chips. We train the model for a total of 150K steps, using an Adafactor optimizer with a learning rate of 1e-4. We use 3 demonstrations during training, while the model can generalize to leverage any number of demonstrations during inference. We show our ablation studies in the following section.

Inference. We normally provide 4 demonstration image-text pairs to SuTI during inference. Increasing the number of demonstrations does not improve the generation quality much. We use a lower classifier-free guidance weight of 15 with the DDPM [14] sampling strategy.

### 5.1 Datasets and Metrics

DreamBench. In this paper, we use the DreamBench dataset proposed by DreamBooth [6]. The dataset contains 30 subjects like backpacks, stuffed animals, dogs, cats, clocks, etc. These images are downloaded from Unsplash. The original dataset contains 25 prompt templates covering different skills like re-contextualization, property modification, accessorization, etc. In total, there are 750 unique prompts generated from the templates. We follow the original paper and generate 4 images for each prompt, forming 3000 images for robust evaluation. We follow DreamBooth in adopting DINO and CLIP-I to evaluate subject fidelity, and CLIP-T to evaluate text fidelity.
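For reference, here is a minimal sketch of how these three automatic metrics are commonly computed with publicly available checkpoints from Hugging Face `transformers`. The checkpoint names and the use of the ViT [CLS] embedding for the DINO score are our assumptions; the paper does not pin down these implementation details.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # assumed checkpoints
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dino-vits16")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dino-vits16")

def _norm(x):
    return x / x.norm(dim=-1, keepdim=True)

@torch.no_grad()
def clip_i(gen: Image.Image, real: Image.Image) -> float:
    """CLIP-I: cosine similarity between CLIP image embeddings of generated and real images."""
    px = clip_proc(images=[gen, real], return_tensors="pt")["pixel_values"]
    f = _norm(clip.get_image_features(pixel_values=px))
    return float(f[0] @ f[1])

@torch.no_grad()
def clip_t(gen: Image.Image, prompt: str) -> float:
    """CLIP-T: cosine similarity between the generated image and its prompt."""
    inputs = clip_proc(text=[prompt], images=[gen], return_tensors="pt", padding=True)
    img = _norm(clip.get_image_features(pixel_values=inputs["pixel_values"]))
    txt = _norm(clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"]))
    return float((img * txt).sum())

@torch.no_grad()
def dino_score(gen: Image.Image, real: Image.Image) -> float:
    """DINO: cosine similarity between self-supervised ViT (DINO) [CLS] embeddings."""
    px = dino_proc(images=[gen, real], return_tensors="pt")["pixel_values"]
    f = _norm(dino(pixel_values=px).last_hidden_state[:, 0])
    return float(f[0] @ f[1])
```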
DreamBench-v2. To further increase the difficulty and diversity of DreamBench, we annotate 220 prompts for the 30 subjects in DreamBench, which we call DreamBench-v2. We gradually increase the compositional level of the prompts to increase the difficulty, e.g., "back view of [dog]" → "back view of [dog] watching TV" → "back view of [dog] watching TV about birds". This enables us to perform a breakdown analysis to understand the model's compositional capabilities. We use human evaluation to measure generation quality on DreamBench-v2. Specifically, we measure the following three aspects: (1) the subject fidelity score $s_s$ measures whether the subject is preserved, (2) the textual fidelity score $s_t$ measures whether the image is aligned with the text description, and (3) the photorealism score $s_p$ measures whether the image contains artifacts or blurry subjects. These are all binary scores, averaged over the entire dataset. We combine them into an overall score $s_o = s_s \cdot s_t \cdot s_p$, which is the most stringent metric.

### 5.2 Main Results

Baselines. We provide a comprehensive list of baselines to compare with the proposed SuTI model:

- DreamBooth [6]: a fine-tuning method whose space consumption is $|M| \cdot |S|$, where $|M|$ is the model size and $|S|$ is the number of subjects.
- Textual Inversion [8]: a fine-tuning method whose space consumption is $|E| \cdot |S|$ with embedding size $|E|$; note that $|E| \ll |M|$.
- Null-Text Inversion [12]: a fine-tuning method whose space consumption is $|T| \cdot |E| \cdot |S|$.
- Imagic [7]: a fine-tuning-based method with the largest space consumption among all models, as it requires training $|M| \cdot |S| \cdot |P|$, where $|P|$ is the number of text prompts $P = \{p_s\}$ for the subject set $S$.
- InstructPix2Pix [11]: a non-tuning method, which can generate an edited version of a given image within a few seconds. There is no additional space consumption.
- Re-Imagen [9]: a non-tuning method, which takes a few images as input and then attends to those retrievals to generate a new image. There is no additional space consumption.

Experimental Results. We show our automatic evaluation results on DreamBench in Table 1. We observe that SuTI performs better than or on par with DreamBooth on all of the metrics. Specifically, SuTI outperforms DreamBooth on the DINO score by 5%, which indicates that our method is better at preserving the subject's visual appearance. In terms of the CLIP-T score, our method is almost the same as DreamBooth, indicating an equivalent capability in terms of textual alignment. These results indicate that SuTI achieves promising generalization to a wide variety of visual subjects, without being trained on the exact instances.

| Methods | Backbone | DINO | CLIP-I | CLIP-T |
| --- | --- | --- | --- | --- |
| Real Image (Oracle) | - | 0.774 | 0.885 | - |
| DreamBooth [6] | Imagen [1] | 0.696 | 0.812 | 0.306 |
| DreamBooth [6] | SD [21] | 0.668 | 0.803 | 0.305 |
| Textual Inversion [8] | SD [21] | 0.569 | 0.780 | 0.255 |
| Re-Imagen [9] | Imagen [1] | 0.600 | 0.740 | 0.270 |
| Ours: SuTI | Imagen [1] | 0.741 | 0.819 | 0.304 |

Table 1: Automatic evaluation on DreamBench.

We further show our human evaluation results on DreamBench-v2 in Table 2, which also reports the relative rankings of the additional storage cost and the average inference time measured per subject. As can be seen, SuTI outperforms DreamBooth by 5% on the overall score, mainly due to much higher textual alignment. In contrast, all the other existing baselines obtain much lower human evaluation scores (< 42%).

Comparisons. We compare our generation results with other methods in Figure 4.
As can be seen, SuTI can generate images highly faithful to the demonstrated subjects. Though SuTI still misses some local textual details (the words on the bowl get blurred) or colorization details (the dog's hair color gets darker), the nuance is almost imperceptible to humans.

| Methods | Backbone | Space | Time | Subject | Text | Photorealism | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Models requiring test-time tuning* | | | | | | | |
| Textual Inversion [8] | SD [21] | $ | 30 mins | 0.22 | 0.64 | 0.90 | 0.14 |
| Null-Text Inversion [12] | Imagen [1] | $$ | 5 mins | 0.20 | 0.46 | 0.70 | 0.10 |
| Imagic [7] | Imagen [1] | $$$$ | 70 mins | 0.78 | 0.34 | 0.68 | 0.28 |
| DreamBooth [6] | SD [21] | $$$ | 6 mins | 0.74 | 0.53 | 0.85 | 0.47 |
| DreamBooth [6] | Imagen [1] | $$$ | 10 mins | 0.88 | 0.82 | 0.98 | 0.77 |
| *Models without test-time tuning* | | | | | | | |
| InstructPix2Pix [11] | SD [21] | - | 10 secs | 0.14 | 0.46 | 0.42 | 0.10 |
| Re-Imagen [9] | Imagen [1] | - | 20 secs | 0.70 | 0.65 | 0.64 | 0.42 |
| Ours: SuTI | Imagen [1] | - | 20 secs | 0.90 | 0.90 | 0.92 | 0.82 |

Table 2: Human evaluation on DreamBench-v2. We report an approximate average inference time (averaged over subjects) and the relative rankings of the space cost (more $: more expensive). Methods that do not fine-tune at test time require no additional storage (denoted by -).

The other baselines like InstructPix2Pix [11] and Null-Text Inversion [12] are not able to perform very sophisticated transformations. Textual Inversion [8] cannot achieve satisfactory results even with 30 minutes of tuning. Re-Imagen [9] gives reasonable outputs, but its subject preservation is much weaker than SuTI's. Imagic [7] also generates reasonable outputs; however, its failure rate is still much higher than ours. DreamBooth [6] generates almost perfect images except for the blurry text on the berry bowl. Through this comparison, we observe a remarkable improvement in output image quality.

Skillset. We showcase SuTI's ability in re-contextualization, novel view synthesis, art rendition, property modification, and accessorization. We demonstrate these different skills in Appendix Figure 8. In the first row, we show that SuTI is able to synthesize the subjects in different art styles. In the second row, we show that SuTI is able to synthesize different view angles of the given subject. In the third row, we show that SuTI can modify the subjects' facial expressions, like "sad", "screaming", etc. In the fourth row, we show that SuTI can alter the color of a given toy. In the last two rows, we show that SuTI can add different accessories (hats, clothes, etc.) to the given subjects. Further, we find that SuTI can even compose two skills together to perform highly complex image generation. As depicted in Figure 5, SuTI can combine re-contextualization with editing/accessorization/stylization to generate high-quality images.

Figure 5: SuTI not only re-contextualizes subjects but also composes multiple transformations, all in-context.

### 5.3 Model Analysis and Ablation Study

We further conducted a set of ablation studies to show the factors that impact the performance of SuTI.

Impact of # Demonstrations.
Figure 6 presents SuTI's in-context generation with respect to an increasing number of subject-specific image examples. Interestingly, we observe a transition in the model's behavior as the number of in-context examples increases. When $C_s = \emptyset$, SuTI generates images using only its prior knowledge given the text, similar to traditional text-to-image generation models such as Imagen [1]. When $|C_s| = 1$, SuTI behaves similarly to an image editing model, attempting to edit the generation while preserving the foreground subject and avoiding sophisticated transformations. When $|C_s| = 5$, SuTI unlocks the capability of rendering novel poses and shapes of the demonstrated subject naturally in the target scene. In addition, we also observe that a bigger $|C_s|$ results in more robust generations with high text and subject alignment, and better photorealism. We also performed a human evaluation of SuTI's generations with respect to different numbers of demonstrations and visualize the results in Figure 6 (right). It shows that as the number of demonstrations increases, the human evaluation score first increases drastically and then gradually converges.

Quality of the expert dataset matters. We found that the delta CLIP score is critical to ensure the quality of the synthesized target images. Such a filtering mechanism is highly influential on SuTI's final performance. We evaluated several versions, increasing the threshold from None to 0.0 to 0.025, and observe that the human evaluation score steadily increases from 0.54 to 0.82.

Figure 6: (Left) In-context generation by the SuTI model, with an increasing number of demonstrations (k = 1 to 5). (Right) Human evaluation score with respect to the increasing number of demonstrations.

Without such intensive filtering, the model's overall human score drops to a very low level (54%). With an increasing $\lambda$, although the size of the dataset $G$ keeps decreasing from 1.8M to around 500K, the model's generation quality keeps improving until saturation. The empirical study indicates that $\lambda = 0.02$ strikes a good balance between the quality and quantity of the expert-generated dataset $G$.

| Methods | Inference Time | Subject | Text | Photorealism | Overall |
| --- | --- | --- | --- | --- | --- |
| DreamBooth | 10 secs | 0.88 | 0.82 | 0.98 | 0.77 |
| SuTI | 20 secs | 0.90 | 0.90 | 0.92 | 0.82 |
| Dream-SuTI | 15 secs | 0.92 | 0.92 | 0.94 | 0.87 |

Table 3: Quantitative human evaluation of the Dream-SuTI model on DreamBench-v2.

Further fine-tuning SuTI improves generation quality. We note that our model is not exclusive of methods that require further fine-tuning, such as DreamBooth [6]. Instead, SuTI can be combined with DreamBooth [6] naturally to achieve better-quality subject-driven generation (dubbed Dream-SuTI). Specifically, given K reference images of a subject, we can randomly feed one image as the condition and use another, differently sampled image as the target output. By fine-tuning the SuTI model for 500 steps (without any auxiliary loss), the Dream-SuTI model can generate aligned and faithful results for a given subject. Table 3 shows a comparison of Dream-SuTI against SuTI and DreamBooth, suggesting that Dream-SuTI further improves generation quality. In particular, it improves the overall score from 0.82 to 0.87, yielding a 5% improvement over SuTI and a 10% improvement over DreamBooth.
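For illustration, here is a minimal sketch of one Dream-SuTI fine-tuning step as we read the description above, reusing the toy `ToyApprentice`, `noise_to_t`, `T` and `DIM` from the earlier sketches. The toy demonstration encoding and the Adam optimizer (standing in for Adafactor) are assumptions; the real model encodes the condition image with the UNet DownStack.

```python
import random
import torch
import torch.nn.functional as F
# Reuses ToyApprentice, noise_to_t, T and DIM from the sketches in Sections 2-3.

def dream_suti_step(apprentice, ref_images, prompt_emb):
    """One fine-tuning step: one reference image is the in-context condition,
    a different one is the denoising target; no auxiliary loss is used."""
    cond_idx, tgt_idx = random.sample(range(ref_images.shape[0]), 2)
    target = ref_images[tgt_idx].unsqueeze(0)
    demo_emb = ref_images[cond_idx, :16].unsqueeze(0)    # toy stand-in for a demo encoder
    t = torch.randint(0, T, (1,))
    pred = apprentice(noise_to_t(target, t), t, prompt_emb, demo_emb)
    return F.mse_loss(pred, target)

# usage: K = 4 reference images of one subject, fine-tuned for 500 steps
apprentice = ToyApprentice()
opt = torch.optim.Adam(apprentice.parameters(), lr=1e-4)  # Adam as a stand-in for Adafactor
refs, prompt_emb = torch.randn(4, DIM), torch.randn(1, 16)
for _ in range(500):
    opt.zero_grad()
    dream_suti_step(apprentice, refs, prompt_emb).backward()
    opt.step()
```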
Since the fine-tuned Dream-SuTI model has already been trained on all subject images, only one subject image needs to be presented at inference time, which can further reduce the inference cost.

Figure 7: Comparison between DreamBooth, SuTI and Dream-SuTI (reference images and generations for the prompt "A robot is standing on the street of a neon-lit city with high-rise buildings.").

To gain a better understanding of the quality, we show an example in Figure 7, where we pick a failure case of SuTI to investigate whether Dream-SuTI improves it. We observe that DreamBooth does not have strong text alignment, while SuTI's subject lacks fidelity (the generated robot uses legs instead of wheels). With further fine-tuning on subject images, Dream-SuTI is able to generate images faithful not only to the subject but also to the text description. However, we would like to note that such a subject-driven fine-tuned model shares the same drawback as a typical DreamBooth model: it can no longer generalize well to a general distribution of subjects, and hence requires a copy of the model parameters per subject.

## 6 Related Work

Text-Guided Image Editing. With the surge of diffusion-based models, [22, 8] have demonstrated the possibility of manipulating a given image without human intervention. Blended Diffusion [23] and SDEdit [24] propose to blend noise with the input image to guide the image synthesis process while maintaining the original layout. Text2LIVE [25] generates an edit layer on top of the original image/video input. Prompt-to-Prompt [26] and Null-Text Inversion [12] aim at manipulating the attention maps in the diffusion model to maintain the layout of the image while changing certain subjects. Imagic [7] proposes an optimization-based approach that achieves significant progress in manipulating visual details of a given image. InstructPix2Pix [11] proposes to distill image editing training pairs synthesized from Prompt-to-Prompt into a single diffusion model to perform instruction-driven image editing. Our method resembles InstructPix2Pix in the sense that we also train the model on expert-generated images. However, our synthesized data is generated by fine-tuned experts and consists mostly of natural-looking images, whereas the images from InstructPix2Pix are synthetic. In the experiment section, we comprehensively compare with these existing models to show the advantage of our model, especially on more challenging prompts.

Subject-Driven Text-to-Image Generation. Subject-driven image generation tackles a new challenge, where the model needs to understand the visual subject contained in the demonstrations to synthesize a totally new scene. Several GAN-based models [27, 28] pioneered the work on personalizing image generation models to a particular instance. Later on, DreamBooth [6] and Textual Inversion [8, 29] proposed optimization-based approaches to adapt image generation to a specific unseen subject. However, these two methods are time- and space-consuming, which makes them unrealistic in real-world applications. Another line of work adopts retrieval-augmented architectures for subject-driven generation, including KNN-Diffusion [30] and Re-Imagen [9]; however, these methods are trained with weakly-supervised data, leading to much worse faithfulness. In this paper, we aim at developing an apprenticeship learning paradigm to train the image generation model with stronger supervision demonstrated by fine-tuned experts.
As a result, SuTI can generate customized images of a specified subject without requiring any test-time fine-tuning. There are some concurrent and related works [31, 32] focusing on specific visual domains such as human faces and/or animals. To the best of our knowledge, SuTI is the first subject-driven text-to-image generator that operates fully in-context, generalizing across various visual domains.

## 7 Conclusion

Our method SuTI has shown strong capabilities to generate personalized images instantly, without test-time optimization. Our human evaluation indicates that SuTI is already better than DreamBooth in the overall score; however, we do identify a few weaknesses of our model: (1) SuTI's generations are less diverse than DreamBooth's, and our model is less inclined to transform the subjects' poses or views in the new image. (2) SuTI is less faithful to low-level visual details than DreamBooth, especially for more complex and often manufactured subjects such as robots or RC cars, where the subjects contain highly sophisticated visual details that can be arbitrarily different from the examples inside the training dataset. In the future, we plan to investigate how to further improve these two aspects to make SuTI's generations more diverse and detail-preserving.

## Acknowledgement

We thank Boqing Gong and Kaifeng Chen for reviewing an early version of this paper in depth, with valuable comments and suggestions. We thank Neil Houlsby, Xiao Wang and the PaLI-X team for providing early access to their Episodic WebLI data. We also thank Jason Baldridge, Andrew Bunner, and Nicole Brichtova for discussions and feedback on the project.

## Broader Impact

Subject-driven text-to-image generation has wide downstream applications, like adapting given subjects into different contexts. Previously, this process was mostly done manually by experts specialized in photo creation software, and such manual modification is time-consuming. We hope that our model can shed light on how to automate this process and save a huge amount of labor and training. The current model is still highly immature and can fall into several failure modes, as demonstrated in the paper. For example, the model is still prone to certain priors present in certain subject classes, and some low-level visual details in subjects are not perfectly preserved. However, it could still be used as an intermediate aid to help accelerate the creation process. On the flip side, there are risks with such models, including misinformation, abuse, and bias. See [1, 4] for more discussion of broader impacts.

## References

[1] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.

[2] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.

[3] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821-8831. PMLR, 2021.
[4] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.

[5] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

[6] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. CVPR, 2023.

[7] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. CVPR, 2023.

[8] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

[9] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-Imagen: Retrieval-augmented text-to-image generator. International Conference on Learning Representations, 2023.

[10] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.

[11] Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.

[12] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.

[13] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

[15] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.

[16] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[17] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.

[18] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

[19] Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. CLIP-Event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16420-16429, 2022.
[20] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023.

[21] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.

[22] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085-2094, 2021.

[23] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208-18218, 2022.

[24] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.

[25] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XV, pages 707-723. Springer, 2022.

[26] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

[27] Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero Soriano. Instance-conditioned GAN. Advances in Neural Information Processing Systems, 34:27517-27529, 2021.

[28] Yotam Nitzan, Kfir Aberman, Qiurui He, Orly Liba, Michal Yarom, Yossi Gandelsman, Inbar Mosseri, Yael Pritch, and Daniel Cohen-Or. MyStyle: A personalized generative prior. ACM Transactions on Graphics (TOG), 41(6):1-10, 2022.

[29] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228, 2023.

[30] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.

[31] Xuhui Jia, Yang Zhao, Kelvin CK Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642, 2023.

[32] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. InstantBooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411, 2023.

[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.

## A Supplementary Material

### A.1 Dataset Construction

To validate the effectiveness of the delta CLIP filtering, we provide an ablation study showing that higher precision is more important than recall in training the apprentice model. In particular, when the threshold is set to a lower number (e.g., 0.01 or 0.015), SuTI becomes less stable.

As our goal is to collect images of the same subject, we create initial subject clusters by grouping all (image, alt-text) pairs that come from the same URL (roughly 45M clusters), and filter out clusters with fewer than 3 instances (roughly 77.8% of the clusters). This leaves us with about 10M image clusters. We then apply the pre-trained CLIP ViT-L/14 model [33] to filter the clusters based on their average intra-cluster visual similarity (keeping those between 0.82 and 0.98), which removes 81.1% of the clusters and ensures their quality. Though the mined clusters already contain (image, alt-text) information, the alt-text's noise level is too high. Therefore, we apply a state-of-the-art image captioning model [10] to generate descriptive text captions for every image in all image clusters, which forms data triples of (image, alt-text, caption). However, current image captioning models tend to generate generic descriptions of the visual scene, which often omit detailed entity information about the subject. For example, generic captions like "a pair of red shoes" would greatly decrease the expert model's capability to preserve the subject's visual appearance. To increase the specificity of the visual captions, we propose to merge the alt-text, which normally contains specific meta information like brands, names, etc., with the model-generated caption. For example, given an alt-text of "duggee talking puppet hey duggee chicco 12m" and a caption of "a toy on the table", we aim to combine them into a more concrete caption: "Hey Duggee toy on the table". To achieve this, we prompt a pre-trained large language model [18] to read all (alt-text, caption) pairs inside each image cluster and output a short descriptive text about the visual subject. These refined captions, together with the mined images, are used as the image-text cluster $C_s$ w.r.t. subject $s$, which is then used to fine-tune the expert models. A minimal illustrative sketch of the clustering and filtering step appears after A.3 below.

### A.2 SuTI Skillset

We demonstrate the complete view of SuTI's skillset in Figure 8, including styled subject generation, multi-view subject rendering, subject expression modification, subject colorization, and subject accessorization.

### A.3 Failure Examples

Figure 9 shows some failure examples of SuTI. We observe several types of failure modes: (1) the model has a strong prior about the subject and hallucinates visual details based on its prior knowledge; for example, the generation model believes the teapot should contain a lift handle. (2) Some artifacts from the demonstration images are transferred to the generated images; for example, the bed from the demonstration is brought into the generation. (3) The subject's visual appearance is modified, mostly influenced by the context; for example, the candle contains non-existing artifacts when contextualized in the toilet. These three failure modes constitute most of the generation errors. (4) The model is also not particularly good at handling compositional prompts, like the bear plushie and sunglasses example. In the future, we plan to work on improving these aspects.
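To make the clustering and filtering of A.1 concrete, here is a minimal illustrative sketch, our reconstruction rather than the released pipeline. The record schema, the helper name `build_subject_clusters`, and the use of pre-normalized CLIP image embeddings are assumptions; the size and similarity thresholds follow the numbers quoted in A.1, under the reading that clusters whose average similarity falls inside [0.82, 0.98] are the ones kept.

```python
from collections import defaultdict
import torch

def build_subject_clusters(records, image_embs, min_size=3, lo=0.82, hi=0.98):
    """Toy version of the A.1 pipeline. `records` is a list of dicts with keys
    'url', 'image_id' and 'alt_text'; `image_embs` maps image_id to an
    L2-normalized CLIP image embedding."""
    clusters = defaultdict(list)
    for r in records:                        # 1) group (image, alt-text) pairs by source URL
        clusters[r["url"]].append(r)

    kept = {}
    for url, items in clusters.items():
        if len(items) < min_size:            # 2) drop tiny clusters (< 3 instances)
            continue
        embs = torch.stack([image_embs[r["image_id"]] for r in items])
        sim = embs @ embs.T                  # pairwise cosine similarities
        n = sim.shape[0]
        avg_sim = float((sim.sum() - n) / (n * (n - 1)))   # mean over off-diagonal pairs
        if lo <= avg_sim <= hi:              # 3) coherent but not near-duplicate clusters
            kept[url] = items
    return kept
```

The captioning and alt-text merging steps described in A.1 would then run on each kept cluster.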
### A.4 More Qualitative Examples

We demonstrate more examples from DreamBench-v2 in the following figures.

Figure 8: SuTI's in-context generation demonstrating its skill set, with results generated from a single model. First row: art renditions of the subject. Second row: multi-view synthesis of the subject. Third row: modifying the expression of the subject. Fourth row: editing the color of the subject. Fifth row: adding accessories to the subject. Subject (image, text) and editing keywords are annotated, with the detailed template in the Appendix.

Figure 9: SuTI's failure examples on DreamBench-v2.

Figure 10: Visualization of SuTI's generations on DreamBench-v2 (Part 1).

Figure 11: Visualization of SuTI's generations on DreamBench-v2 (Part 2).
Figure 12: Visualization of SuTI's generations on DreamBench-v2 (Part 3).

Figure 13: In-context generation by the SuTI model, with an increasing number of demonstrations (more examples).