Published in Transactions on Machine Learning Research (05/2024)

Reproducibility Study of ITI-GEN: Inclusive Text-to-Image Generation

Daniel Gallo Fernández, University of Amsterdam
Răzvan-Andrei Matișan, University of Amsterdam
Alejandro Monroy Muñoz, University of Amsterdam
Janusz Partyka, University of Amsterdam

Reviewed on OpenReview: https://openreview.net/forum?id=d3Vj360Wi2

Abstract

Text-to-image generative models often present fairness issues with respect to sensitive attributes such as gender or skin tone. This study aims to reproduce the results presented in ITI-Gen: Inclusive Text-to-Image Generation by Zhang et al. (2023a), which introduces a model to improve inclusiveness in these kinds of models. We show that most of the claims made by the authors about ITI-Gen hold: it improves the diversity and quality of generated images, it scales to different domains, it has plug-and-play capabilities, and it is computationally efficient. However, ITI-Gen sometimes uses undesired attributes as proxy features, and it is unable to disentangle some pairs of correlated attributes such as gender and baldness. In addition, as the number of considered attributes increases, the training time grows exponentially and ITI-Gen struggles to generate inclusive images for all elements of the joint distribution. To address these issues, we propose using Hard Prompt Search with negative prompting, a method that requires no training and that handles negation better than vanilla Hard Prompt Search. Nonetheless, Hard Prompt Search (with or without negative prompting) cannot be used for continuous attributes that are hard to express in natural language, an area where ITI-Gen excels because it is guided by images during training. Finally, we propose combining ITI-Gen and Hard Prompt Search with negative prompting.
1 Introduction

Generative AI models that solve text-to-image tasks pose a series of societal risks related to fairness. Some of these risks stem from biases in the training data, where certain categories are unevenly represented. As a consequence, the model may ignore some of these categories when generating images, which leads to societal biases against minority groups. To tackle this issue, Zhang et al. (2023a) introduce Inclusive Text-to-Image Generation (ITI-Gen), a method that generates inclusive tokens that can be appended to text prompts. By concatenating these fair tokens to the text prompts, they are able to generate diverse images with respect to a predefined set of attributes (e.g. gender, race, age). For example, we can add a woman token to the text prompt "a headshot of a person" to ensure that the person in the generated image is a woman. In this work, we focus on:

- [Reproducibility study] Reproducing the results from the original paper. We verify the claims listed in Section 2 by reproducing some of the experiments of Zhang et al. (2023a).
- [Extended Work] Proxy features. Motivated by attributes for which ITI-Gen does not perform well, we carry out experiments to study the influence of diversity and entanglement in the reference image datasets.
- [Extended Work] Generating images using negative prompts. Hard Prompt Search (HPS) (Ding et al., 2021) with Stable Diffusion (Rombach et al., 2022) is used as a baseline in the paper. We study the effect of adding negative prompts to Stable Diffusion as an alternative way of handling negations in natural language and compare the results with ITI-Gen. We also highlight the potential of combining negative prompting with ITI-Gen, using each method for different types of attributes.
- [Extended Work] Modifications to the original code.
We improve inference performance by fixing a bug that prevented the use of large batch sizes, integrate ITI-Gen with ControlNet (Zhang et al., 2023b), and provide the code to run our proposed method for handling negations. We also include bash scripts to make it easy to reproduce our experiments.

In the next section, we introduce the main claims made in the original paper. Then, we describe the methodology of our study, highlighting the models and datasets that we used, as well as the experimental setup and computational requirements. An analysis and discussion of the results follow.

2 Scope of reproducibility

In the original paper, the authors make the following main claims:

1. Inclusive and high-quality generation. ITI-Gen improves inclusiveness while preserving image quality, using a small number of reference images during training. The authors support this claim using KL divergence and the FID score (Heusel et al., 2017) as metrics.
2. Scalability to different domains. ITI-Gen can learn inclusive prompts in different scenarios, such as human faces and landscapes.
3. Plug-and-play capabilities. Trained fair tokens can be used with other similar text prompts in a plug-and-play manner. The tokens can also be used in other text-to-image generative models such as ControlNet (Zhang et al., 2023b).
4. Data and computational efficiency. Only a few dozen images per category are required, and training and inference take only a couple of minutes.
5. Scalability to multiple attributes. ITI-Gen obtains strong results when used with several attributes at the same time.

In this work, we run experiments to check the authors' statements above. Additionally, we study some failure cases of ITI-Gen and propose other methods that handle negations in natural language.

3 Methodology

The authors provide an open-source implementation on GitHub¹ that we have used as the starting point.
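As an illustration of the KL-divergence metric used to measure inclusiveness, the following is a minimal sketch of our own (not the authors' implementation): the empirical distribution of attribute categories among generated images is compared against a uniform target, so a perfectly inclusive generator scores 0. In practice the category counts would come from running an attribute classifier over the generated images.

```python
import numpy as np

def inclusiveness_kl(category_counts):
    """KL divergence between the empirical distribution of generated
    attribute categories and a uniform target (0 = perfectly inclusive).

    Sketch only: `category_counts` is assumed to come from an external
    attribute classifier applied to the generated images.
    """
    counts = np.asarray(category_counts, dtype=float)
    p = counts / counts.sum()           # empirical category distribution
    q = np.full_like(p, 1.0 / len(p))   # uniform target distribution
    eps = 1e-12                         # avoid log(0) for empty categories
    return float(np.sum(p * np.log((p + eps) / q)))

print(inclusiveness_kl([50, 50]))  # balanced categories -> ~0.0
print(inclusiveness_kl([95, 5]))   # skewed categories -> clearly positive
```

Lower values indicate a category distribution closer to uniform, which is the sense in which the original paper reports KL divergence.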
To make our experiments completely reproducible, we provide a series of bash scripts for launching them. Finally, since the authors did not provide any code for integrating ITI-Gen with ControlNet (Zhang et al., 2023b), we implement the integration ourselves to check the compatibility of ITI-Gen with other generative models. All of this is detailed in Section 3.4.

¹ https://github.com/humansensinglab/ITI-GEN

3.1 Model description

ITI-Gen is a method that improves inclusiveness in text-to-image generation. It outputs a set of fair tokens that are appended to a text prompt in order to guide the model to generate images in a certain way. It achieves this by using reference images for each attribute category. For example, if we use the prompt "a headshot of a person" and provide images of men and women, the model will learn two tokens, one for each gender.

Given a set of M attributes, where each attribute m has K_m attribute categories, ITI-Gen learns a set of fair tokens that represent each attribute category. For each combination of categories, the corresponding fair tokens are aggregated and concatenated to the original text prompt T to build an inclusive prompt P. We denote the set of all inclusive prompts by P, and the set of all inclusive prompts that correspond to category i of attribute m by P^m_i.

The training procedure involves two losses. The first is the directional alignment loss, which relates the provided reference images to the inclusive prompts. In order to compare images and text, CLIP (Radford et al., 2021) is used to map them to a common embedding space with its text encoder E_text and image encoder E_img. Following the original code, the ViT-L/14 pre-trained model is used. The directional alignment loss is defined by 1 i
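The inclusive-prompt construction described above (one prompt per element of the Cartesian product of attribute categories) can be sketched as follows. The attribute names and token strings are illustrative placeholders of our own: in ITI-Gen the appended fair tokens are learned embeddings, not plain words.

```python
from itertools import product

def build_inclusive_prompts(base_prompt, attributes):
    """Build one inclusive prompt per combination of attribute categories.

    `attributes` maps each attribute name to its list of category tokens.
    Plain word tokens are used here only to illustrate the combinatorial
    structure; ITI-Gen concatenates learned token embeddings instead.
    """
    names = list(attributes)
    prompts = []
    for combo in product(*(attributes[n] for n in names)):
        prompts.append(base_prompt + " " + " ".join(combo))
    return prompts

# Hypothetical example with M = 2 attributes (K_1 = 2, K_2 = 3 categories):
prompts = build_inclusive_prompts(
    "a headshot of a person",
    {"gender": ["man", "woman"], "age": ["young", "middle-aged", "old"]},
)
print(len(prompts))  # 2 * 3 = 6 combinations
```

Note that the number of inclusive prompts is the product of all K_m, which grows exponentially with the number of attributes M; this is the combinatorial growth behind the scalability issue discussed in the abstract.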