Published as a conference paper at ICLR 2024

# FINETUNING TEXT-TO-IMAGE DIFFUSION MODELS FOR FAIRNESS

Xudong Shen¹, Chao Du², Tianyu Pang², Min Lin², Yongkang Wong³, Mohan Kankanhalli³
¹ISEP programme, NUS Graduate School, National University of Singapore
²Sea AI Lab, Singapore
³School of Computing, National University of Singapore
xudong.shen@u.nus.edu; {duchao, tianyupang, linmin}@sea.com; yongkang.wong@nus.edu.sg; mohan@comp.nus.edu.sg

## ABSTRACT

The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, we frame fairness as a distributional alignment problem. Our solution consists of two main technical contributions: (1) a distributional alignment loss that steers specific characteristics of the generated images towards a user-defined target distribution, and (2) adjusted direct finetuning of the diffusion model's sampling process (adjusted DFT), which leverages an adjusted gradient to directly optimize losses defined on the generated images. Empirically, our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. Gender bias is significantly reduced even when finetuning just five soft tokens. Crucially, our method supports diverse perspectives of fairness beyond absolute equality, which we demonstrate by controlling age to a 75% young and 25% old distribution while simultaneously debiasing gender and race. Finally, our method is scalable: it can debias multiple concepts at once by simply including their prompts in the finetuning data. We share code and various fair diffusion model adaptors at https://sail-sg.github.io/finetune-fair-diffusion/.

## 1 INTRODUCTION

Text-to-image (T2I) diffusion models (Nichol et al., 2021; Saharia et al., 2022) have witnessed an accelerated adoption by corporations and individuals alike. The scale of images generated by these models is staggering. To provide a perspective, DALL-E 2 (Ramesh et al., 2022) is used by over one million users (Bastian, 2022), while the open-access Stable Diffusion (SD) (Rombach et al., 2022) is utilized by over ten million users (Fatunde & Tse, 2022). These figures will continue to rise. However, this influx of content from diffusion models into society underscores an urgent need to address their biases. Recent scholarship has demonstrated the existence of occupational biases (Seshadri et al., 2023), a concentrated spectrum of skin tones (Cho et al., 2023), and stereotypical associations (Schramowski et al., 2023) within diffusion models.

While existing diffusion debiasing methods (Friedrich et al., 2023; Bansal et al., 2022; Chuang et al., 2023; Orgad et al., 2023) offer some advantages, such as being lightweight, they struggle to adapt to a wide range of prompts. Furthermore, they only approximately remove biased associations and do not offer a way to control the distribution of the generated images. This is concerning because perceptions of fairness can vary across specific issues and contexts; absolute equality might not always be the ideal outcome. We frame fairness as a distributional alignment problem, where the objective is to align particular attributes of the generated images, such as gender, with a user-defined target distribution. Our solution consists of two main technical contributions.
First, we design a loss function that steers the generated images towards the desired distribution while preserving image semantics. A key component is the distributional alignment loss (DAL). For a batch of generated images, DAL uses pre-trained classifiers to estimate class probabilities (e.g., male and female probabilities) and dynamically generates target classes that match the target distribution and have the minimum transport distance. To preserve image semantics, we regularize CLIP (Radford et al., 2021) and DINO (Oquab et al., 2023) similarities between images generated by the original and finetuned models.

Second, we propose adjusted direct finetuning of diffusion models, adjusted DFT for short, illustrated in Fig. 2. While most diffusion finetuning methods (Gal et al., 2023; Zhang & Agrawala, 2023; Brooks et al., 2023; Dai et al., 2023) use the same denoising diffusion loss from pre-training, DFT aims to directly finetune the diffusion model's sampling process to minimize any loss defined on the generated images, such as ours. However, we show that the exact gradient of the sampling process has exploding norm and variance, rendering naive DFT ineffective (illustrated in Fig. 1). Adjusted DFT leverages an adjusted gradient to overcome these issues. It opens avenues for more refined and targeted diffusion model finetuning and can be applied to objectives beyond fairness.

Empirically, we show our method markedly reduces gender, racial, and their intersectional biases for occupational prompts. The debiasing is effective even for prompts with unseen styles and contexts, such as "A philosopher reading. Oil painting" and "bartender at willard intercontinental makes mint julep" (Fig. 3). Our method is adaptable to any component of the diffusion model being finetuned. An ablation study shows that finetuning the text encoder while keeping the U-Net unchanged hits a sweet spot that effectively mitigates biases and lessens potential negative effects on image quality. Surprisingly, finetuning as few as five soft tokens as a prompt prefix largely reduces gender bias, demonstrating the effectiveness of soft prompt tuning (Lester et al., 2021; Li & Liang, 2021) for fairness. These results underscore the robustness of our method and the efficacy of debiasing T2I diffusion models by finetuning their language understanding components.

A salient feature of our method is its flexibility, allowing users to specify the desired target distribution. For example, we can effectively adjust the age distribution to achieve a 75% young and 25% old ratio (Fig. 4) while simultaneously debiasing gender and race (Tab. 5). We also show the scalability of our method: it can debias multiple concepts at once, such as occupations, sports, and personal descriptors, by expanding the set of prompts used for finetuning.

Generative AI is set to profoundly influence society. It is well recognized that LLMs require social alignment finetuning after pre-training (Christiano et al., 2017; Bai et al., 2022). However, the analogous process has received less attention for T2I models, or multimedia generative AI overall. Biases and stereotypes can manifest more subtly within visual outputs, yet their influence on human perception and behavior is substantial and long-lasting (Goff et al., 2008). We hope our work inspires further development in promoting social alignment across multimedia generative AI.
## 2 RELATED WORK

**Bias in diffusion models.** T2I diffusion models are known to produce biased and stereotypical images from neutral prompts. Cho et al. (2023) observe that Stable Diffusion (SD) has an overall tendency to generate males when prompted with occupations, and that the generated skin tones concentrate on the center few tones of the Monk Skin Tone Scale (Monk, 2023). Seshadri et al. (2023) observe that SD amplifies gender-occupation biases from its training data. Besides occupations, Bianchi et al. (2023) find that simple prompts containing character traits and other descriptors also generate stereotypical images. Luccioni et al. (2023) develop a tool to compare collections of generated images with varying gender and ethnicity. Wang et al. (2023a) propose a text-to-image association test and find that SD associates females more with family and males more with career.

**Bias mitigation in diffusion models.** Existing techniques for mitigating bias in T2I diffusion models remain limited and predominantly focus on prompting. Friedrich et al. (2023) propose to randomly include additional text cues such as "male" or "female" if a known occupation is detected in the prompt, so as to generate images with a more balanced gender distribution. However, this approach is ineffective for debiasing occupations that are not known in advance. Bansal et al. (2022) suggest incorporating ethical interventions into the prompts, such as appending "if all individuals can be a lawyer irrespective of their gender" to "a photo of a lawyer". Kim et al. (2023) propose to optimize a soft token V* such that the prompt "V* a photo of a doctor" generates doctor images with a balanced gender distribution. Nevertheless, the efficacy of their method lacks robust validation, as they only train the soft token for one specific occupation and test it on two unseen ones. Besides prompting, debiasVL (Chuang et al., 2023) proposes to project out biased directions in the text embeddings, and Concept Algebra (Wang et al., 2023b) projects out biased directions in the score predictions. The TIME (Orgad et al., 2023) and UCE (Gandikota et al., 2023) methods, which modify the attention weights, can also be used for debiasing. Similar issues of fairness and distributional control have also been explored in other image generative models (Wu et al., 2022).

**Finetuning diffusion models.** Finetuning is a powerful way to enhance a pre-trained diffusion model's specific capabilities, such as adaptability (Gal et al., 2023), controllability (Zhang & Agrawala, 2023), instruction following (Brooks et al., 2023), and image aesthetics (Dai et al., 2023). Concurrent works (Clark et al., 2023; Wallace et al., 2023) also explore the direct finetuning of diffusion models, albeit with goals diverging from fairness and solutions different from ours. Adjusted DFT complements them because we identify and address shared challenges inherent in DFT.

## 3 BACKGROUND ON DIFFUSION MODELS

Diffusion models (Ho et al., 2020) assume a forward diffusion process that gradually injects Gaussian noise into a data distribution $q(x_0)$ according to a variance schedule $\beta_1, \dots, \beta_T$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t \,\middle|\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right), \tag{1}$$

where $T$ is a predefined total number of steps (typically 1000). The schedule $\{\beta_t\}_{t \in [T]}$ is chosen such that the data distribution $q(x_0)$ is gradually transformed into an approximately Gaussian distribution $q_T(x_T) \approx \mathcal{N}(x_T \mid 0, I)$.
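A standard consequence of Eq. (1) is the closed-form marginal $q(x_t \mid x_0) = \mathcal{N}\big(x_t \mid \sqrt{\bar\alpha_t}\, x_0,\, (1-\bar\alpha_t) I\big)$ with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$, so a noisy sample at any step can be drawn in one shot. The snippet below is a minimal illustrative sketch (not the authors' code) of this forward noising, assuming a linear $\beta$ schedule and toy tensor shapes:

```python
# Minimal sketch of the forward noising process of Eq. (1), using the standard
# closed-form marginal q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_1..beta_T (assumed linear)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) by injecting Gaussian noise at step t (1-indexed)."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t - 1]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# By step T, x_T is approximately standard Gaussian regardless of x_0.
x0 = torch.randn(1, 3, 64, 64)               # stand-in for a data sample
xT = q_sample(x0, T)
```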
Diffusion models then learn to approximate the data distribution by reversing this diffusion process, starting from a Gaussian distribution $p(x_T) = \mathcal{N}(x_T \mid 0, I)$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1} \,\middle|\, \mu_\theta(x_t, t),\, \sigma_t I\right), \tag{2}$$

where $\mu_\theta(x_t, t)$ is parameterized using a noise prediction network $\epsilon_\theta(x_t, t)$ with $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right)$, $\alpha_t = 1 - \beta_t$, $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, and $\{\sigma_t\}_{t \in [T]}$ are pre-determined noise variances. After training, generating from diffusion models involves sampling from the reverse process $p_\theta(x_{0:T})$, which begins by sampling a noise variable $x_T \sim p(x_T)$ and then proceeds to obtain $x_0$ as follows:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t w_t, \qquad w_t \sim \mathcal{N}(0, I). \tag{3}$$

**Latent diffusion models.** Rombach et al. (2022) introduce latent diffusion models (LDM), whose forward/reverse diffusion processes are defined in the latent space. With image encoder $f_{\text{Enc}}$ and decoder $f_{\text{Dec}}$, LDMs are trained on latent representations $z_0 = f_{\text{Enc}}(x_0)$. To generate an image, LDMs first sample a latent noise $z_T$, run the reverse process to obtain $z_0$, and decode it with $x_0 = f_{\text{Dec}}(z_0)$.

**Text-to-image diffusion models.** In T2I diffusion models, the noise prediction network $\epsilon_\theta$ accepts an additional text prompt $P$, i.e., $\epsilon_\theta(g_\phi(P), x_t, t)$, where $g_\phi$ represents a pretrained text encoder parameterized by $\phi$. Most T2I models, including Stable Diffusion (Rombach et al., 2022), further employ LDM and thus use a text-conditional noise prediction model in the latent space, denoted as $\epsilon_\theta(g_\phi(P), z_t, t)$, which serves as the central focus of our work. Sampling from T2I diffusion models additionally utilizes the classifier-free guidance technique (Ho & Salimans, 2021).

## 4 METHOD

Our method consists of (i) a loss design that steers specific attributes of the generated images towards a target distribution while preserving image semantics, and (ii) adjusted direct finetuning of the diffusion model's sampling process.

### 4.1 LOSS DESIGN

**General case.** For a clearer introduction, we first present the loss design for a general case, which consists of the distributional alignment loss $\mathcal{L}_{\text{align}}$ and the image semantics preserving loss $\mathcal{L}_{\text{img}}$.

We start with the distributional alignment loss (DAL) $\mathcal{L}_{\text{align}}$. Suppose we want to control a categorical attribute of the generated images that has $K$ classes and align it towards a target distribution $\mathcal{D}$. Each class is represented as a one-hot vector of length $K$, and $\mathcal{D}$ is a discrete distribution over these classes. We first generate a batch of images $I = \{x^{(i)}\}_{i \in [N]}$ using the diffusion model being finetuned and some prompt $P$. For every generated image $x^{(i)}$, we use a pre-trained classifier $h$ to produce a class probability vector $p^{(i)} = [p^{(i)}_1, \dots, p^{(i)}_K] = h(x^{(i)})$, with $p^{(i)}_k$ denoting the estimated probability that $x^{(i)}$ is from class $k$. Assume we have another set of vectors $\{u^{(i)}\}_{i \in [N]}$ that represents the target distribution, where every $u^{(i)}$ is a one-hot vector representing a class. We can then compute the optimal transport (OT) (Monge, 1781) from $\{p^{(i)}\}_{i \in [N]}$ to $\{u^{(i)}\}_{i \in [N]}$:

$$\sigma^* = \operatorname*{arg\,min}_{\sigma \in S_N} \sum_{i=1}^{N} \left\| p^{(i)} - u^{(\sigma_i)} \right\|^2, \tag{4}$$

where $S_N$ denotes all permutations of $[N]$, $\sigma = [\sigma_1, \dots, \sigma_N]$, and $\sigma_i \in [N]$. Intuitively, $\sigma^*$ finds, in the class probability space, the most efficient modification of the current images to match the target distribution. We construct $\{u^{(i)}\}_{i \in [N]}$ to be i.i.d. samples from the target distribution and compute the expectation of OT:

$$q^{(i)} = \mathbb{E}_{u^{(1)}, \dots, u^{(N)} \sim \mathcal{D}}\left[ u^{(\sigma^*_i)} \right], \quad i \in [N]. \tag{5}$$
Here, $q^{(i)}$ is a probability vector whose $k$-th element is the probability that image $x^{(i)}$ should have target class $k$, had the batch of generated images indeed followed the target distribution $\mathcal{D}$. The expectation of OT can be computed analytically when the number of classes $K$ is small, or approximated by an empirical average as $K$ increases. We note one can also construct a fixed set of $\{u^{(i)}\}_{i \in [N]}$, for example half male and half female to represent a balanced gender distribution. But a fixed split poses a stronger finite-sample alignment objective and neglects the sensitivity of OT. Finally, we generate target classes $\{y^{(i)}\}_{i \in [N]}$ and the confidence of these targets $\{c^{(i)}\}_{i \in [N]}$ by $y^{(i)} = \arg\max(q^{(i)}) \in [K]$ and $c^{(i)} = \max(q^{(i)}) \in [0, 1]$, $i \in [N]$. We define DAL as the cross-entropy loss w.r.t. these dynamically generated targets, with a confidence threshold $C$:

$$\mathcal{L}_{\text{align}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[c^{(i)} \geq C\right] \mathcal{L}_{\text{CE}}\!\left(h(x^{(i)}),\, y^{(i)}\right). \tag{6}$$

We also use an image semantics preserving loss $\mathcal{L}_{\text{img}}$. We keep a copy of the frozen, not finetuned diffusion model and penalize the image dissimilarity measured by CLIP and DINO:

$$\mathcal{L}_{\text{img}} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \big(1 - \cos(\text{CLIP}(x^{(i)}), \text{CLIP}(o^{(i)}))\big) + \big(1 - \cos(\text{DINO}(x^{(i)}), \text{DINO}(o^{(i)}))\big) \Big], \tag{7}$$

where $\{o^{(i)}\}_{i \in [N]}$ is the batch of images generated by the frozen model using the same prompt $P$. We call them original images. We require that every pair of finetuned image $x^{(i)}$ and original image $o^{(i)}$ is generated using the same initial noise. We use both CLIP and DINO because CLIP is pretrained with text supervision and DINO is pretrained with image self-supervision. In implementation, we use laion/CLIP-ViT-H-14-laion2B-s32B-b79K and dinov2-vitb14 (Oquab et al., 2023). We caution that CLIP and DINO can have their own biases (Wolfe et al., 2023).

**Adaptation for face-centric attributes.** In this work, we focus on face-centric attributes such as gender, race, and age. We find the following adaptation from the general case yields the best results. First, we use a face detector $d_{\text{face}}$ to retrieve the face region $d_{\text{face}}(x^{(i)})$ from every generated image $x^{(i)}$. We apply the classifier $h$ and the DAL $\mathcal{L}_{\text{align}}$ only on the face regions. Second, we introduce another face realism preserving loss $\mathcal{L}_{\text{face}}$, which penalizes the dissimilarity between the generated face $d_{\text{face}}(x^{(i)})$ and the closest face from a set of external real faces $\mathcal{D}_F$:

$$\mathcal{L}_{\text{face}} = \frac{1}{N} \sum_{i=1}^{N} \min_{F \in \mathcal{D}_F} \Big(1 - \cos\big(\text{emb}(d_{\text{face}}(x^{(i)})),\, \text{emb}(F)\big)\Big), \tag{8}$$

where $\text{emb}(\cdot)$ is a face embedding model. $\mathcal{L}_{\text{face}}$ helps retain the realism of the faces, which can be substantially edited by the DAL. In our implementation, we use the CelebA (Liu et al., 2015) and FairFace (Karkkainen & Joo, 2021) datasets as external faces. We use SFNet-20 (Wen et al., 2022) as the face embedding model.

Our final loss $\mathcal{L}$ is a weighted sum: $\mathcal{L} = \mathcal{L}_{\text{align}} + \lambda_{\text{img}} \mathcal{L}_{\text{img}} + \lambda_{\text{face}} \mathcal{L}_{\text{face}}$. Notably, we use a dynamic weight $\lambda_{\text{img}}$. We use a larger $\lambda_{\text{img},1}$ if the generated image $x^{(i)}$'s target class $y^{(i)}$ agrees with the original image $o^{(i)}$'s class $h(d_{\text{face}}(o^{(i)}))$. Intuitively, we encourage minimal change between $x^{(i)}$ and $o^{(i)}$ if the original image $o^{(i)}$ already satisfies the distributional alignment objective. For other images $x^{(i)}$ whose target class $y^{(i)}$ does not agree with the corresponding original image $o^{(i)}$'s class $h(d_{\text{face}}(o^{(i)}))$, we use a smaller weight $\lambda_{\text{img},2}$ for the non-face region and the smallest weight $\lambda_{\text{img},3}$ for the face region. Intuitively, these images do require editing, particularly on the face regions. Finally, if an image does not contain any face, we only apply $\mathcal{L}_{\text{img}}$ but not $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{face}}$. If an image contains multiple faces, we focus on the one occupying the largest area.
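To make the dynamic target generation concrete, the sketch below is our own illustration (not the authors' released code): it solves the assignment problem of Eq. (4) with the Hungarian algorithm and approximates the expectation in Eq. (5) by averaging over resampled target batches. The batch size, class count, classifier outputs, and confidence threshold are placeholder values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm for the OT in Eq. (4)

def dal_targets(probs, target_dist, n_samples=256, conf_threshold=0.8, rng=None):
    """Dynamically generate target classes y^(i) and confidences c^(i) for a batch.

    probs:       (N, K) class probabilities p^(i) from the pre-trained classifier h.
    target_dist: (K,) user-defined target distribution D.
    Returns target classes y (N,), confidences c (N,), and a mask of confident targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, K = probs.shape
    q = np.zeros((N, K))
    for _ in range(n_samples):
        # Draw i.i.d. one-hot targets u^(1..N) ~ D and solve the assignment problem.
        classes = rng.choice(K, size=N, p=target_dist)
        u = np.eye(K)[classes]                                      # (N, K) one-hot targets
        cost = ((probs[:, None, :] - u[None, :, :]) ** 2).sum(-1)   # squared distances
        row, col = linear_sum_assignment(cost)                      # optimal permutation sigma*
        q[row] += u[col]
    q /= n_samples                                                  # empirical estimate of Eq. (5)
    y = q.argmax(axis=1)                                            # target classes
    c = q.max(axis=1)                                               # confidence of the targets
    return y, c, c >= conf_threshold                                # confident targets enter L_align

# Example: push a batch of 8 gender predictions towards a 50/50 target distribution.
probs = np.random.dirichlet([1.0, 1.0], size=8)
y, c, keep = dal_targets(probs, target_dist=np.array([0.5, 0.5]))
```

Averaging over resampled target batches yields the soft targets $q^{(i)}$; only images whose target confidence $c^{(i)}$ clears the threshold $C$ contribute to $\mathcal{L}_{\text{align}}$, matching Eq. (6).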
Figure 1: The left figure plots the training loss during direct finetuning with three distinct gradients, each reported with 3 random runs. The right figure estimates the scale of these gradients at different time steps; mean and 90% CI are computed from 20 random runs. Read Section 4.2 for details.

### 4.2 ADJUSTED DIRECT FINETUNING OF DIFFUSION MODEL'S SAMPLING PROCESS

Consider that the T2I diffusion model generates an image $x_0 = f_{\text{Dec}}(z_0)$ using a prompt $P$ and an initial noise $z_T$. Our goal is to finetune the diffusion model to minimize a differentiable loss $L(x_0)$. We begin by considering naive DFT, which computes the exact gradient of $L(x_0)$ through the sampling process, followed by gradient-based optimization. To see whether naive DFT works, we test it for the image semantics preserving loss $\mathcal{L}_{\text{img}}$ using a fixed image as the target and optimize a soft prompt. This resembles a textual inversion task (Gal et al., 2023). Fig. 1a shows the training loss does not decrease after 1000 iterations, suggesting that naive DFT of diffusion models is not effective.

By explicitly writing down the gradient, we are able to detect why naive DFT fails. To simplify the presentation, we analyze the gradient w.r.t. the U-Net parameters, $\frac{dL(x_0)}{d\theta}$; the same issue arises when finetuning the text encoder or the prompt:

$$\frac{dL(x_0)}{d\theta} = \frac{dL(x_0)}{dx_0} \frac{dx_0}{dz_0} \frac{dz_0}{d\theta}, \qquad \text{with} \quad \frac{dz_0}{d\theta} = \sum_{t=1}^{T} \underbrace{\frac{1}{\sqrt{\bar\alpha_t}} \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}}_{A_t} \, B_t \, \frac{\partial \epsilon^{(t)}}{\partial \theta}, \tag{9}$$

where $\epsilon^{(t)}$ denotes the U-Net function $\epsilon_\theta(g_\phi(P), z_t, t)$ evaluated at time step $t$, and $B_t$ collects the coupling between time steps through the Jacobians $\{\frac{\partial \epsilon^{(s)}}{\partial z_s}\}_{s \leq t-1}$. Importantly, the recurrent evaluations of the U-Net in the reverse diffusion process lead to a factor $B_t$ that scales exponentially in $t$. This leads to two issues. First, $\frac{dz_0}{d\theta}$ becomes dominated by the components $A_t B_t \frac{\partial \epsilon^{(t)}}{\partial \theta}$ for values of $t$ close to $T = 1000$. Second, because $B_t$ encompasses all possible products between $\{\frac{\partial \epsilon^{(s)}}{\partial z_s}\}_{s \leq t-1}$, this coupling between partial gradients of different time steps introduces substantial variance into $\frac{dz_0}{d\theta}$. Fig. 2a illustrates this issue.

We empirically show these problems indeed exist in naive DFT. Since directly computing the Jacobian matrices $\frac{\partial \epsilon^{(t)}}{\partial z_s}$ is too expensive, we assume $\frac{dL(x_0)}{dx_0} \frac{dx_0}{dz_0}$ is a random Gaussian matrix $R \sim \mathcal{N}(0, 10^{-4} I)$ and plot the values of $|R A_t B_t \frac{\partial \epsilon^{(t)}}{\partial \theta}|$, $|R A_t \frac{\partial \epsilon^{(t)}}{\partial \theta}|$, and $|R \frac{\partial \epsilon^{(t)}}{\partial \theta}|$ in Fig. 1b. It is apparent that both the scale and the variance of $R A_t B_t \frac{\partial \epsilon^{(t)}}{\partial \theta}$ explode as $t \to 1000$, but neither $R A_t \frac{\partial \epsilon^{(t)}}{\partial \theta}$ nor $R \frac{\partial \epsilon^{(t)}}{\partial \theta}$ does.

Having detected the cause of the issue, we propose adjusted DFT, which uses an adjusted gradient that sets $A_t = 1$ and $B_t = I$:

$$\left(\frac{dz_0}{d\theta}\right)_{\text{adjusted}} = \sum_{t=1}^{T} \frac{\partial \epsilon^{(t)}}{\partial \theta}.$$

It is motivated by the unrolled expression of the reverse process:

$$z_0 = \sum_{t=1}^{T} A_t \, \epsilon_\theta(g_\phi(P), z_t, t) + \frac{1}{\sqrt{\bar\alpha_T}}\, z_T + \sum_{t=2}^{T} \frac{\sigma_t}{\sqrt{\bar\alpha_{t-1}}}\, w_t, \qquad w_t \sim \mathcal{N}(0, I). \tag{10}$$

When we set $B_t = I$, we are essentially considering $z_t$ as an external variable that is independent of the U-Net parameters $\theta$, rather than recursively dependent on $\theta$; otherwise, by the chain rule, this recursion generates all the coupling between partial gradients of different time steps in $B_t$. Setting $B_t = I$ does preserve all uncoupled gradients, i.e., $\frac{\partial \epsilon^{(t)}}{\partial \theta}$, $t \in [T]$. When we set $A_t = 1$, we standardize the influence of $\epsilon_\theta(g_\phi(P), z_t, t)$ from different time steps $t$ on $z_0$. It is known that weighting different time steps properly can accelerate diffusion training (Ho et al., 2020; Hang et al., 2023). Finally, we implement adjusted DFT in Appendix Algorithm A.1. Fig. 2b provides a schematic illustration.
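In implementation terms, setting $B_t = I$ amounts to detaching $z_t$ from the computation graph before every U-Net call, so that only the uncoupled terms $\frac{\partial \epsilon^{(t)}}{\partial \theta}$ survive the backward pass, and setting $A_t = 1$ means each noise prediction feeds the loss with unit weight. The sketch below is our own schematic rendering of this idea, not the paper's Algorithm A.1; it assumes a plain DDPM update rather than DPM-Solver++, a placeholder `unet(z, t)` signature without text conditioning, and no classifier-free guidance.

```python
import torch

def sample_with_adjusted_gradient(unet, z_T, alphas, alpha_bars, betas, sigmas):
    """Reverse sampling whose backward pass realizes the adjusted gradient:
    every z_t is detached before the U-Net call (B_t = I) and each noise
    prediction enters a surrogate output with unit weight (A_t = 1)."""
    T = len(betas)
    z = z_T
    surrogate = torch.zeros_like(z_T)  # accumulates the eps terms that carry gradient
    for t in range(T, 0, -1):
        z_in = z.detach()              # cut the recursion through z_t: B_t = I
        eps = unet(z_in, t)            # d eps / d theta is retained
        with torch.no_grad():          # the sampling update itself is treated as a constant
            z = (z_in - betas[t - 1] / (1 - alpha_bars[t - 1]).sqrt() * eps) / alphas[t - 1].sqrt()
            if t > 1:
                z = z + sigmas[t - 1] * torch.randn_like(z)
        surrogate = surrogate + eps    # unit weight: A_t = 1
    # Same value as the sampled z_0, but its gradient w.r.t. theta is sum_t d eps(t)/d theta.
    z0_adjusted = z + (surrogate - surrogate.detach())
    return z0_adjusted
```

Because every U-Net evaluation is kept for the backward pass, memory grows with the number of sampling steps, which is why the memory-saving measures noted below (float16 weights, gradient checkpointing, and a roughly 20-step DPM-Solver++ schedule) matter in practice.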
We test the proposed adjusted gradient and a variant that does not standardize $A_t$ for the same image semantics preserving loss with the same fixed target image. The results are shown in Fig. 1a.

Figure 2: Comparison of naive and adjusted direct finetuning (DFT) of the diffusion model. (a) Naive DFT. (b) Adjusted DFT, which also standardizes $A_t$ to 1. Gray solid lines denote the sampling process. Red dashed lines highlight the gradient computation w.r.t. the model parameters ($\theta$). Variables $z_t$ and $\epsilon^{(t)}$ represent the data and the noise prediction at time step $t$. $D_i$ and $I_i$ denote the direct and indirect gradient paths between adjacent time steps. For instance, at $t = 3$, naive DFT computes the exact gradient $A_3 B_3 \frac{\partial \epsilon^{(3)}}{\partial \theta}$ (defined in Eq. 9), which involves other time steps' noise predictions (through the gradient paths $I_1 I_2 I_3 I_4 I_5$, $I_1 I_2 D_2 I_5$, and $D_1 I_3 I_4 I_5$). Adjusted DFT leverages an adjusted gradient, which removes the coupling with other time steps and standardizes $A_t$ to 1, for more effective finetuning. Read Section 4.2 for details.

We find that both adjusted gradients effectively reduce the training loss, suggesting $B_t$ is indeed the underlying issue. Moreover, standardizing $A_t$ further stabilizes the optimization process. We note that, to reduce the memory footprint, in all experiments we (i) quantize the diffusion model to float16, (ii) apply gradient checkpointing (Chen et al., 2016), and (iii) use DPM-Solver++ (Lu et al., 2022) as the diffusion scheduler, which only requires around 20 steps for T2I generation.

## 5 EXPERIMENTS

### 5.1 MITIGATING GENDER, RACIAL, AND THEIR INTERSECTIONAL BIASES

We apply our method to runwayml/stable-diffusion-v1-5 (SD for short), a T2I diffusion model openly accessible from Hugging Face, to reduce gender, racial, and their intersectional biases. We consider binary gender and recognize its limitations; enhancing the representation of non-binary identities faces additional challenges, from the intricacies of visually representing non-binary identities to the lack of public datasets, which are beyond the scope of this work. We adopt the eight race categories from the FairFace dataset but find trained classifiers struggle to distinguish between certain categories. Therefore, we consolidate them into four broader classes: WMELH = {White, Middle Eastern, Latino Hispanic}, Asian = {East Asian, Southeast Asian}, Black, and Indian. The gender and race classifiers used in DAL are trained on the CelebA or FairFace datasets. We consider a uniform distribution over gender, race, or their intersection as the target distribution. We employ the prompt template "a photo of the face of a {occupation}, a person" and use 1000/50 occupations for training/test. For the main experiments and except where otherwise stated, we finetune LoRA (Hu et al., 2021) with rank 50 applied on the text encoder. Appendix A.2 provides other experiment details.

**Evaluation.** We train separate gender and race classifiers for evaluation. We generate 60, 80, or 160 images for each prompt to evaluate gender, racial, or intersectional biases, respectively. For every prompt $P$, we compute the following metric: $\text{bias}(P) = \frac{1}{K(K-1)/2} \sum_{i,j \in [K]:\, i<j}$