# Optimizing Prompts for Text-to-Image Generation

Yaru Hao, Zewen Chi, Li Dong, Furu Wei
Microsoft Research
https://github.com/microsoft/LMOps

Well-designed prompts can guide text-to-image models to generate amazing images. However, performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts. The pretrained checkpoints are available at https://aka.ms/promptist. The demo can be found at https://aka.ms/promptist-demo.

1 Introduction

Generative foundation models can be prompted to follow user instructions, including language models [Brown et al., 2020, Chowdhery et al., 2022, Smith et al., 2022] and text-to-image models [Ramesh et al., 2021a, 2022, Saharia et al., 2022, Rombach et al., 2022]. It has been recognized that prompt design plays an essential role in generation quality: prompts need to be adjusted so that the model better understands our intentions and produces higher-quality results [Reynolds and McDonell, 2021, Zhou et al., 2022b]. The problem is especially severe for text-to-image models because the capacity of their text encoders, such as the CLIP text encoder [Radford et al., 2021] in Stable Diffusion [Rombach et al., 2022], is relatively small. Empirical observations also confirm that common user input is often insufficient to produce aesthetically pleasing images with current models.

Prior efforts perform manual prompt engineering for specific text-to-image models [Liu and Chilton, 2021, Oppenlaender, 2022, Parsons, 2022], typically by adding modifiers to the original input. However, manual prompt engineering is laborious and sometimes infeasible, and the manually engineered prompts often cannot be transferred across model versions. Therefore, it is necessary to find a systematic way to automatically align user intentions with model-preferred prompts.

In this work, we propose a prompt adaptation framework for automatic prompt engineering via reinforcement learning. Specifically, we first perform supervised fine-tuning with a pretrained language model (e.g., GPT) on a small collection of manually engineered prompts. The finetuned model is used to initialize the prompt policy network for reinforcement learning. Next, the model is trained by exploring optimized prompts for user inputs, where diverse beam search [Vijayakumar et al., 2016] is used to ensure generation quality and diversity. The training objective is to maximize the reward, which is defined as a combination of relevance scores and aesthetic scores of the generated images. The relevance score reflects how much of the original user intention is retained after prompt adaptation. The aesthetic score indicates to what degree the generated images are aesthetically pleasing.
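For illustration, the sketch below shows how diverse beam search (group beam search with a diversity penalty) can draw several candidate optimized prompts from a causal language model using the Hugging Face transformers generate API. The gpt2 checkpoint, the example input, and the generation hyperparameters are placeholders for illustration, not the settings of the released PROMPTIST model.

```python
# Minimal sketch (not the official PROMPTIST inference code): generate diverse
# candidate prompts with diverse beam search, as used for exploration during RL.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; PROMPTIST is initialized from a GPT-style LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

user_input = "a cat sitting on a windowsill"
inputs = tokenizer(user_input, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=8,             # total beams
    num_beam_groups=4,       # diverse beam search: beams are split into groups
    diversity_penalty=1.0,   # penalize tokens already chosen by other groups
    num_return_sequences=8,  # return one candidate prompt per beam
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for candidate in candidates:
    print(candidate)
```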
We conduct experiments with the publicly available Stable Diffusion models [Rombach et al., 2022]. We evaluate our method using both the automatic reward metric and human preference ratings. Experimental results show that the optimized prompts outperform human-engineered ones as well as the original inputs. Human preference ratings also show consistent improvements across in-domain and out-of-domain prompts. Moreover, we find that reinforcement learning is more favorable than supervised fine-tuning alone, especially on out-of-domain user inputs. Overall, we show that language models can serve as a prompt interface that optimizes user input into model-preferred prompts. Our contributions are as follows:

- We propose a general prompt optimization framework that adapts user input to model-preferred prompts.
- We collect user queries and conduct extensive experiments on text-to-image generation. Experimental results show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings.

2 Methods

The goal of our prompt adaptation framework is to automatically perform prompt engineering. Given user input to the text-to-image generator, our model learns to generate model-preferred prompts that produce better output images while preserving the original intentions. Figure 1 presents an overview of our method. The prompt optimization model, named PROMPTIST, is built upon a pretrained language model, such as GPT [Brown et al., 2020]. We first collect a set of human-engineered examples and use them to conduct supervised fine-tuning (Section 2.1). Next, we perform reinforcement learning (Section 2.3) to maximize the target reward (Section 2.2), which improves both the relevance and the quality of the generated images.

Figure 1: Overview of PROMPTIST training: (1) supervised fine-tuning (SFT) on manually engineered prompts; (2) reinforcement learning (RL) to increase the rewards of the images generated after prompt optimization.

2.1 Supervised fine-tuning

Initialized with a pretrained generative language model, the policy model is first finetuned on a set of prompt pairs before reinforcement learning. A parallel prompt corpus $\mathcal{D} = \{(x, y)\}$ contains pairs of original user inputs $x$ and manually engineered prompts $y$. The training objective is to maximize the log-likelihood with teacher forcing:

$$ \mathcal{L}_{\text{SFT}} = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \log p(y \mid x) \right] \qquad (1) $$

The finetuned weights are then used to initialize the policy network in reinforcement learning.
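As a concrete illustration of Eq. (1), the following sketch computes the teacher-forced loss for a single (user input, engineered prompt) pair with Hugging Face transformers. The separator string, the masking of source tokens, the gpt2 checkpoint, and the example pair are assumptions made for illustration, not a specification of the actual training setup.

```python
# Minimal sketch of the SFT objective in Eq. (1): maximize log p(y | x) with
# teacher forcing. The data format below is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_loss(user_input: str, engineered_prompt: str) -> torch.Tensor:
    # Concatenate source and target with an illustrative separator; only the
    # target (engineered prompt) tokens contribute to the loss.
    src_ids = tokenizer(user_input + " ->", return_tensors="pt").input_ids
    tgt_ids = tokenizer(" " + engineered_prompt + tokenizer.eos_token,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([src_ids, tgt_ids], dim=-1)

    labels = input_ids.clone()
    labels[:, : src_ids.shape[-1]] = -100  # ignore source positions in the loss

    # The model shifts labels internally, so this returns -E[log p(y | x)].
    return model(input_ids=input_ids, labels=labels).loss

loss = sft_loss("a cat sitting on a windowsill",
                "a cat sitting on a windowsill, highly detailed, concept art, 4k")
loss.backward()
```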
**Collect human demonstrations.** We collect human-engineered prompts from Lexica (https://lexica.art). Most prompts are composed of two parts: the main content, which describes the user's intention, and modifiers that customize the art style, such as artist names and popular elements. We use the crawled human-engineered prompts as targets. In order to obtain parallel data, we construct the corresponding source inputs in three ways. First, we extract the main content by trimming the modifiers and regard it as the original user input. Second, we randomly remove or shuffle some modifiers and use the remaining text as the source input. Third, we use the OpenAI API text-davinci-002 to rephrase the main content and the human-engineered prompt, respectively. We find that the template "[Input] Rephrase:" works well in practice and translates the input into a more user-friendly version. As shown in Table 1, given the target prompt "dieselpunk blue wolf with fuzzy tail, concept art, dramatic, fantasy, pixiv", four source prompts are collected.

Table 1: An example of a human-engineered prompt and the four types of constructed source prompts.

| Prompt type | Text |
| --- | --- |
| Human-engineered target prompt | dieselpunk blue wolf with fuzzy tail, concept art, dramatic, fantasy, pixiv |
| Main content | dieselpunk blue wolf with fuzzy tail |
| Main content with random modifiers | dieselpunk blue wolf with fuzzy tail, dramatic |
| Rephrasing of main content | A blue wolf with a fuzzy tail that looks like it belongs in a dieselpunk setting. |
| Rephrasing of target prompt | This is a dieselpunk-style blue wolf with a fuzzy tail. It looks like it could be from a fantasy or dramatic piece of artwork. |

2.2 Reward definition

We measure the quality of optimized prompts from two aspects, namely relevance and aesthetics, which motivates us to define the reward function $R(\cdot)$ from these two perspectives.

First, we measure whether the generated images remain relevant to the original input prompt after prompt adaptation. To be specific, we sample images from the text-to-image model conditioned on the optimized prompt and compute CLIP [Radford et al., 2021] similarity scores between the generated images and the original input prompt. The resulting relevance score is defined as:

$$ f_{\text{rel}}(x, y) = \mathbb{E}_{i_y \sim G(y)} \left[ f_{\text{rel}}(x, i_y) \right] \qquad (2) $$

$$ f_{\text{rel}}(x, i_y) = \min\left( 20 \cdot g_{\text{CLIP}}(x, i_y) - 5.6,\; 0 \right) \qquad (3) $$

where $i_y \sim G(y)$ means sampling an image $i_y$ from the text-to-image model $G$ with $y$ as the input prompt, and $g_{\text{CLIP}}(\cdot, \cdot)$ denotes the CLIP similarity function. Notice that we always compute the similarity between the generated images and the original input prompt, which ensures that the relevance score reflects the user's preferences. The specific form of the relevance score is determined by the approximate range of the CLIP score, and experiments show that this form works well in reinforcement learning: once the CLIP similarity is reasonably high (larger than 0.28, where the term in Eq. (3) saturates at 0), the reward instead encourages the model to generate more aesthetically pleasing images.

Second, we employ an aesthetic predictor (https://github.com/christophschuhmann/improved-aesthetic-predictor) to quantify aesthetic preferences. The predictor builds a linear estimator on top of a frozen CLIP model and is trained on human ratings from the Aesthetic Visual Analysis dataset [Murray et al., 2012]. The aesthetic score is defined as:

$$ f_{\text{aes}}(x, y) = \mathbb{E}_{i_x \sim G(x),\, i_y \sim G(y)} \left[ g_{\text{aes}}(i_y) - g_{\text{aes}}(i_x) \right] \qquad (4) $$

where $g_{\text{aes}}(\cdot)$ denotes the aesthetic predictor, and $i_y$, $i_x$ are images generated from the prompts $y$ and $x$, respectively. Notice that both $g_{\text{CLIP}}(\cdot)$ and $g_{\text{aes}}(\cdot)$ rely on the CLIP model, so we can share the CLIP forward pass during reward computation.

Finally, we define the overall reward by combining the above scores with an additional KL penalty between the policy model $\pi_\theta$ and the supervised finetuned model $\pi_{\text{SFT}}$, weighted by a coefficient $\eta$:

$$ R(x, y) = f_{\text{aes}}(x, y) + f_{\text{rel}}(x, y) - \eta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)} \qquad (5) $$

The KL term is added to mitigate the overoptimization issue [Ouyang et al., 2022].
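The sketch below illustrates one way to assemble the reward of Eqs. (2)-(5), assuming a CLIP model from Hugging Face transformers. The aesthetic_head is an untrained placeholder standing in for the improved-aesthetic-predictor weights, and the helper names (clip_features, reward) as well as the eta value are illustrative, not taken from the released implementation.

```python
# Sketch of the reward in Eqs. (2)-(5). The aesthetic head is a placeholder for
# the improved-aesthetic-predictor weights; loading them is omitted here.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
aesthetic_head = torch.nn.Linear(clip.config.projection_dim, 1)  # placeholder weights

def clip_features(text, images):
    # Shared CLIP forward pass: normalized text and image embeddings.
    inputs = processor(text=[text], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = img / img.norm(dim=-1, keepdim=True)
    return txt, img

def reward(x, images_x, images_y, logp_policy, logp_sft, eta=0.02):
    # images_x / images_y: images generated by the text-to-image model from the
    # original prompt x and the optimized prompt y, respectively.
    # Relevance (Eqs. 2-3): CLIP similarity of images from y to the ORIGINAL
    # prompt x, clipped so that similarities above 0.28 saturate at zero.
    txt_x, feat_y = clip_features(x, images_y)
    sim = (feat_y @ txt_x.T).squeeze(-1)          # cosine similarity per image
    f_rel = torch.clamp(20.0 * sim - 5.6, max=0.0).mean()

    # Aesthetics (Eq. 4): score difference between images from y and from x.
    _, feat_x = clip_features(x, images_x)
    f_aes = aesthetic_head(feat_y).mean() - aesthetic_head(feat_x).mean()

    # KL penalty (Eq. 5) on the generated prompt; eta is an illustrative value.
    return f_aes + f_rel - eta * (logp_policy - logp_sft)
```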
2.3 Reinforcement learning

Starting from the supervised finetuned model, we further finetune the policy with reinforcement learning. We employ proximal policy optimization (PPO) [Schulman et al., 2017], which is empirically data-efficient and offers reliable performance. As a text generation problem, prompt optimization can be viewed as a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, r, f_{\text{st}}, \gamma \rangle$ with a finite state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r$, state-transition probability function $f_{\text{st}}$, and discount factor $\gamma$. In an episode of prompt adaptation, the initial state $x \in \mathcal{S}$ is the input prompt with $n$ tokens, $x = (x_1, \ldots, x_n)$, where each token is from a finite vocabulary $\mathcal{V}$. At the $t$-th time step, the agent selects an action $y_t \in \mathcal{V}$ according to the current policy model, $y_t \sim \pi(y \mid x, y_{<t})$.

[Table fragment: Proportion 0.21 / 0.48 / 0.20 / 0.07 / 0.04; User Input -0.48 / -0.33 / -0.18 / -0.28 / -0.22; PROMPTIST (Ours) -0.02 / 0.11 / 0.08 / 0.05 / 0.06; column headers missing except ">40".]
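To make the PPO update concrete, the following is a minimal sketch of the clipped surrogate objective applied to the per-token log-probabilities of a generated prompt. The function name, the advantage estimates, and the omission of the value head and generalized advantage estimation are simplifications for illustration, not the full PROMPTIST training loop.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective of PPO over the tokens of a generated prompt.

    logp_new:   log pi_theta(y_t | x, y_<t) under the current policy
    logp_old:   the same log-probabilities under the policy that sampled y
    advantages: per-token advantage estimates derived from the reward R(x, y)
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum; negate to obtain a loss to minimize.
    return -torch.min(unclipped, clipped).mean()

# Usage: after sampling a prompt y ~ pi_old(. | x) and scoring it with R(x, y),
# recompute the log-probabilities under the current policy and take a gradient step:
# loss = ppo_clipped_loss(logp_new, logp_old, advantages)
# loss.backward(); optimizer.step()
```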