# LANGUAGE-INFORMED VISUAL CONCEPT LEARNING

Published as a conference paper at ICLR 2024

Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
Stanford University
(Equal contribution; alphabetically ordered.)

ABSTRACT

Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g., color, the exact visual nuances along each axis often exceed the limitations of linguistic articulation, e.g., a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training. Project page at https://cs.stanford.edu/~yzzhang/projects/concept-axes.

1 INTRODUCTION

In order to make sense of the myriad visual entities in the world, humans develop an abstracted generative model of them and organize the underlying sources of variation into visual concepts, such as different colors or different types of objects. Designing systems that can recognize visual concepts within images as humans do has been a longstanding goal in the fields of computer vision and artificial intelligence (Russakovsky et al., 2015; Krizhevsky et al., 2012; Girshick et al., 2014).

To facilitate efficient reasoning and communication of these concepts, humans created symbolic depictions that have evolved into natural language. Such natural language grounding of visual data has been instrumental in the recent proliferation of powerful large vision-language models that are capable of semantically identifying objects in images (Radford et al., 2021; Kirillov et al., 2023) or generating photo-realistic images from arbitrary text prompts (Ramesh et al., 2021; Rombach et al., 2022; Saharia et al., 2022; Yu et al., 2022). While different concept axes can be easily specified by words, such as category and style, it is much less intuitive to delineate the subtleties of low-level visual nuances along each axis using language, such as one particular style of a painting.

In this work, our goal is to distill from large pre-trained vision-language models a function that extracts visual concepts along a set of language-specified concept axes from images. As illustrated in Figure 1, once these concepts are extracted, we can recompose them across different image instances at inference time to produce new images with novel concept combinations. To learn this function, rather than collecting a large-scale dataset of human annotations for each specific visual concept, we design a language-informed visual concept representation, and simply distill from a pre-trained Text-to-Image (T2I) generation model. There are three fundamental properties we seek in this visual concept representation.
First, unlike T2I generation, which relies on generic words as visual concept descriptors, we would like to capture fine-grained visual nuances using continuous concept embeddings. One common technique is to invert the text-to-image generation process by optimizing an embedding with the objective of reproducing a given input image using a pre-trained T2I model, often referred to as Textual Inversion (Gal et al., 2022). However, most existing Textual Inversion methods (Gal et al., 2022) optimize embeddings for individual image instances independently, overlooking the shared nature of visual concepts across instances. For instance, the concept of "red" is shared between a red apple and a red dress. Moreover, the concepts of "red" and "yellow" are both instances of the property of color. Hence, the second desired property of the visual concept representation is to preserve such common concept structures among various visual instances. Instead of optimizing on individual image instances independently, we design a set of concept encoders, where each encoder learns to encode the visual characteristics of an input image pertaining to one concept axis specified by language. This ensures that the inverted concept embeddings can be shared across different instances and remixed to generate new images.

Figure 1: Language-Informed Visual Concept Learning. Our goal is to learn a visual concept representation grounded on a set of language-informed concept axes, e.g., category, color, and material, by simply distilling from pre-trained text-to-image generation models without manual annotations. After training, the concept encoders extract disentangled axis-specific embeddings from an image, which can be remixed to generate new images with novel concept compositions.

Figure 2: Learned Disentangled Concept Embeddings Improve Compositionality. Left: a vanilla text-to-image model may fail to adhere to text prompts with uncommon combinations of concepts even with prompt engineering, e.g., "red banana". Right: with the same backbone T2I generator, our learned disentangled concept embeddings greatly enhance concept compositionality.

The third crucial aspect of this representation is to ascertain that different concept axes are disentangled, allowing for changes to be made specifically on a single concept axis without modifying other axes. To do so, we reuse the disentangled nature of linguistic concepts and ground the predictions to a set of discrete text anchors in the concept embedding space, which can be obtained by querying a pre-trained generic Visual Question Answering (VQA) model, e.g., BLIP-2 (Li et al., 2023b). This soft anchoring constraint significantly improves the disentanglement of concept embeddings across different axes while still retaining sufficient leeway to capture nuanced visual variations that BLIP-2 struggles to discern, e.g., the style of an art piece in Figure 6. Putting these ideas together, we design a generic framework for learning disentangled and compositional visual concepts grounded to linguistic structures by exploiting pre-trained text-to-image generation and visual question answering models.
We show that these concept encoders can be trained purely on synthetic images generated by a pre-trained T2I model, and extract concept embeddings from real images at test time, capturing their fine-grained visual nuances. Our contributions can be summarized as follows:
1. We propose a generic framework for learning language-informed visual concepts by simply distilling pre-trained vision-language models.
2. At inference time, the trained concept encoders extract concept embeddings from a test image, which can be remixed to generate images with novel compositions of concepts.
3. Using a lightweight test-time finetuning procedure, these encoders can also be quickly adapted to extract novel concepts unseen during training.
4. Experiments show that this visual concept representation achieves better disentanglement and compositionality, compared to text-based prompting baselines, as shown in Figures 2 and 6.

2 RELATED WORK

2.1 VISUAL CONCEPT LEARNING

Designing learning-based systems to discover various visual concepts in natural images has been a long-standing goal in machine perception and intelligence. Early attempts typically rely on extensive semantic annotations provided by humans, such as object classification (Barnard et al., 2003; Fei-Fei et al., 2006; Fergus et al., 2005), an effort later epitomized by ImageNet (Russakovsky et al., 2015). Visual concepts are intrinsically linked to concepts in language, and such end-to-end supervised learning paradigms can be seen as learning a direct mapping between visual concepts and discrete linguistic concepts. Other approaches attempt to better exploit this inherent structure in language by constructing structured representations of visual concepts such as scene graphs (Zhong et al., 2021) and symbolic programs (Mao et al., 2019; Han et al., 2019).

More recently, the success of natural language modeling (Devlin et al., 2018; Brown et al., 2020; Raffel et al., 2020) has paved the way for grounding visual concepts to open vocabularies, rather than category labels or fixed symbolic programs, by training large Vision-Language Models (VLMs) on massive image captioning datasets (Schuhmann et al., 2022). This has powered recent Text-to-Image (T2I) generation models to turn linguistic concepts from free-form text prompts into photo-realistic images (Rombach et al., 2022; Saharia et al., 2022). These T2I models have been leveraged by personalization methods for extracting individual visual concepts from one or a few images, either by optimizing token embeddings (Gal et al., 2022; Vinker et al., 2023; Avrahami et al., 2023; Chefer et al., 2023b; Liu et al., 2023), by finetuning the backbone denoiser (Ruiz et al., 2023), or by training additional encoders for amortized optimization (Gal et al., 2023; Arar et al., 2023; Li et al., 2023a). We also distill visual concepts from a pre-trained T2I model, but unlike existing works, we train encoders to adhere to a set of language-specified concept axes, preserving the disentangled and compositional nature of language. Ranasinghe & Ryoo (2023) also explore language-defined concepts but focus on video action recognition, whereas we focus on image generation.
A separate line of work focuses on unsupervised visual concept disentanglement without explicitly leveraging language, typically by imposing information constraints in the latent space of a generative model such as a VAE or GAN (Higgins et al., 2017; Chen et al., 2016; Hsu et al., 2023). Here, we are interested in learning visual concepts that are explicitly grounded to language.

2.2 CONTROLLABLE IMAGE GENERATION

The success of GAN-based image generation (Goodfellow et al., 2014; Brock et al., 2018; Karras et al., 2019) has spawned a series of works that discover controllable directions in the GAN latent space (Voynov & Babenko, 2020; Härkönen et al., 2020). More recently, the advancements of diffusion-based T2I models have unlocked new possibilities for controllable image generation, where photo-realistic images can be generated from free-form text prompts. Recent works propose to further improve the alignment of image samples and input text conditions by manipulating attention maps within T2I models (Chefer et al., 2023a; Epstein et al., 2023). Another form of controllable image generation is compositional generation. Liu et al. (2022) proposes to improve the quality of T2I diffusion models for composing multiple given concepts, specified via text prompts, by modifying the inference procedure. In this work, instead of assuming that concepts are given and expressed in text form, we tackle the task of identifying disentangled concepts which can then be used for composition. Image generation can also be controlled with image analogies (Šubrtová et al., 2023; Hertzmann et al., 2001), a form of visual prompting. Unlike ours, these works typically do not explicitly extract visual concepts from their inputs. In this work, we amalgamate both visual prompts and text queries, employing them as the editing interface.

3 METHOD

Figure 3: Training Pipeline. During training, an input image is processed by a set of concept encoders that predict concept embeddings specific to given concept axes. These embeddings are trained to (1) retain information in order to reproduce visual inputs via a pre-trained Text-to-Image model given an axis-informed text template, and (2) ensure disentanglement across different axes by anchoring to text embeddings obtained from a pre-trained Visual Question Answering model.

Fig. 3 gives an overview of our proposed learning framework. Our goal in this work is to extract visual concepts from images along a number of concept axes specified by language, such as category, color, and material, so as to enable the flexible composition of concepts into high-quality image generations. To achieve this, we train a set of visual concept encoders by distilling concept guidance from pre-trained vision-language models. Specifically, the encoders are trained to extract concept embeddings from an image in order to fulfill two objectives. First, the embeddings should be recomposable to explain the input image through a pre-trained text-to-image (T2I) generation model, given a concept-axis-informed text prompt.
Second, these visual concept embeddings should be anchored to the corresponding text embeddings obtained from a pre-trained visual question answering (VQA) model, further exploiting the disentangled nature of linguistic concepts for better disentanglement of visual concepts.

3.1 VISUAL CONCEPT ENCODING BY INVERTING TEXT-TO-IMAGE GENERATION

Our understanding of the visual world is centered around various concept axes, to which we have often assigned words due to their significance in communication and reasoning. This vision-language grounding has fueled the recent explosion of text-to-image generation models (Rombach et al., 2022; Saharia et al., 2022; Ramesh et al., 2022), allowing them to generate photo-realistic images with various combinations of concepts defined by words. Here, we are interested in the reverse direction of text-to-image generation, where the goal is to extract language-grounded visual concepts present in natural images. Specifically, given $K$ concept axes of interest defined by language, we would like to learn $K$ concept encoders $\{f_k(\cdot)\}_{k=1}^{K}$, each of which extracts a concept representation $e_k = f_k(x)$ along one concept axis from an input image $x$.

In order to train these concept encoders $\{f_k(\cdot)\}$, instead of relying on extensive human labeling, we opt to exploit the vision-language grounding embedded within large pre-trained T2I generation models. Using the technique of Textual Inversion (Gal et al., 2022), one can optimize a token embedding <*> to capture a visual entity in a given image, through the objective of regenerating the image with the T2I model from a text template such as "a photo of <*>". Here, we adopt a similar objective, but instead of inverting a specific embedding capturing the overall identity of an individual image instance, we predict embeddings $e_k$ that are grounded to a number of meaningful concept axes, using an axis-informed text template, such as "a photo of <e_category> with <e_color> color and <e_material> material", where each slot is filled by the corresponding predicted concept embedding. This allows the extracted concept embeddings to be shared across different images, encapsulating the common visual characteristics pertinent to one concept axis.

Specifically, given an image $x$, the concept encoders $\{f_k(\cdot)\}$ extract a set of concept embeddings $\{e_k \in \mathbb{R}^D\}$, which have the same dimension $D$ as the text embeddings so that they can be directly inserted into the text embeddings of the axis-informed text template. To simplify notation, let $f_\gamma(\cdot)$ denote the function that takes in the image and produces the final sequence of embeddings of the template and the predicted concept embeddings, and let $\gamma$ be the parameters of all the encoders, which are optimized during training. Let $c_\theta$ be the part of the T2I model's text encoder that takes in a sequence of text embeddings and outputs a conditioning vector for the T2I model's denoising network $\hat{\epsilon}_\theta$, where $\theta$ denotes the network parameters. We use DeepFloyd (Stability AI; Saharia et al., 2022) as the backbone T2I model, which utilizes a pre-trained T5 model (Raffel et al., 2020) as the text encoder, and we keep the parameters $\theta$ frozen in all experiments. To train the encoders, we reuse the training objective of the backbone diffusion model:

$$\mathcal{L}_{\text{recon}}(x; \gamma) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\; t \sim U([0,1])} \big\| \hat{\epsilon}_\theta(x, t, c_\theta(f_\gamma(x))) - \epsilon \big\|_2^2, \quad (1)$$

where the noise $\epsilon$ is sampled from a standard multivariate Gaussian distribution and the timestep $t$ is sampled from a uniform distribution on $[0, 1]$.

Minimizing $\mathcal{L}_{\text{recon}}$ amounts to finding concept embeddings within the space of the pre-trained T2I model that best reproduce the input image $x$, resembling a reconstruction objective. Compared to per-instance token optimization in vanilla Textual Inversion, the advantages of training these concept encoders are two-fold. First, the concept embedding space is naturally shared across different image instances, encapsulating the common understanding of the corresponding concept axes. Second, it makes training more efficient by amortizing the optimization across all instances, and, more crucially, it allows for test-time inference in a feed-forward pass.
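To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the reconstruction loss, assuming hypothetical handles `template_embeds`, `slot_indices`, `text_encoder`, `denoiser`, and `add_noise` that stand in for the frozen DeepFloyd components and the axis-informed template; these names are illustrative assumptions, not the authors' actual interfaces.

```python
import torch
import torch.nn.functional as F

def recon_loss(image, template_embeds, slot_indices,
               concept_encoders, text_encoder, denoiser, add_noise):
    """Sketch of Eq. (1): denoising loss with predicted concept embeddings in the prompt."""
    # Predict one embedding e_k per concept axis from the input image.
    concept_embeds = [f_k(image) for f_k in concept_encoders]       # each (B, D)

    # Insert the embeddings into the axis-informed template at the placeholder slots.
    # This realizes f_gamma(x): template embeddings plus predicted concept embeddings.
    seq = template_embeds.clone()                                   # (B, L, D)
    for idx, e_k in zip(slot_indices, concept_embeds):
        seq[:, idx, :] = e_k

    # Standard diffusion training objective; the T2I backbone stays frozen.
    cond = text_encoder(seq)                                        # c_theta(f_gamma(x))
    t = torch.rand(image.shape[0], device=image.device)             # t ~ U([0, 1])
    eps = torch.randn_like(image)                                   # eps ~ N(0, I)
    x_t = add_noise(image, eps, t)                                  # forward diffusion
    eps_pred = denoiser(x_t, t, cond)                               # eps_hat_theta
    return F.mse_loss(eps_pred, eps)
```

Only the concept encoders receive gradients from this loss; the text encoder and denoiser are treated as fixed functions.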
3.2 CONCEPT DISENTANGLEMENT USING TEXT ANCHORS

The objective $\mathcal{L}_{\text{recon}}$ ensures that the extracted concept embeddings can sufficiently reconstruct the concept of a given image through a pre-trained text-to-image generation model. However, with this loss alone, there is little guarantee that each embedding encodes only the information pertinent to a particular concept axis. In practice, we found that this baseline results in poor disentanglement of different concept axes when remixing the concept embeddings to generate new images, potentially due to the imprecise vision-language grounding in the pre-trained T2I model. For instance, as shown in Figure 8, the extracted category embedding for red berries cannot be remixed with various color embeddings, e.g., orange, as it is highly entangled with the concept of a red color due to the bias in natural images.

To encourage better disentanglement of different concept axes, we further incorporate a sparse set of text anchors into the concept embedding space. Along each concept axis, such as color, we have often named some prominent modes, such as "red" or "yellow", and these text labels entail clearly disentangled concepts. Therefore, we would like to reuse this disentangled nature of linguistic concepts to improve the disentanglement of visual concepts. To this end, we make use of the text predictions from a pre-trained Visual Question Answering (VQA) model, BLIP-2 (Li et al., 2023b), as pseudo-ground-truth anchors for the concept embeddings. Specifically, for each training image $x$ and for each concept axis of interest (e.g., color) indexed by $k$, we query the BLIP-2 model $\Psi$ with the image $x$ and a question $q_k$ in natural language that is specific to this concept axis, e.g., "what is the color of the object in the image". Denote the answer from BLIP-2, also in the form of natural language, as $\Psi(x, q_k)$. We encode this answer with the pre-trained text encoder $c_\theta$ to obtain a text embedding $\bar{e}_k = c_\theta(\Psi(x, q_k))$. The prediction of our concept encoders $f_{k,\gamma}$ is encouraged to stay close to this anchor text embedding:

$$\mathcal{L}^{\text{anchor}}_k(x; \gamma) = \big\| f_{k,\gamma}(x) - \bar{e}_k \big\|_2^2, \quad \text{where } \bar{e}_k = c_\theta(\Psi(x, q_k)). \quad (2)$$

It is crucial to highlight that we use these BLIP-2 predictions only as anchors, by assigning a small weight to this anchor loss $\mathcal{L}^{\text{anchor}}_k$ during training. Otherwise, the embeddings predicted by the concept encoders would easily collapse to a set of discrete text embeddings and fail to capture the visual nuances in images.

3.3 TRAINING AND INFERENCE

Training. Given a collection of training images $\mathcal{D}$ containing various combinations of concepts along each axis, the final objective for training the concept encoders consists of two parts:

$$\mathcal{L}_{\text{total}}(\gamma) = \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathcal{L}_{\text{recon}}(x; \gamma) + \sum_{k=1}^{K} \lambda_k \, \mathcal{L}^{\text{anchor}}_k(x; \gamma) \Big]. \quad (3)$$
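Continuing the sketch above (and reusing `recon_loss` from it), Eqs. (2) and (3) might be combined as follows. Here `vqa_anchor_embeds` is a hypothetical list holding one anchor $\bar{e}_k = c_\theta(\Psi(x, q_k))$ per axis, i.e. BLIP-2 answers already passed through the frozen text encoder; in practice these could be precomputed once per training image.

```python
def anchor_loss(image, concept_encoders, vqa_anchor_embeds):
    """Sketch of Eq. (2): squared L2 distance between each prediction and its text anchor."""
    losses = []
    for f_k, anchor_k in zip(concept_encoders, vqa_anchor_embeds):
        e_k = f_k(image)                                          # (B, D)
        losses.append(((e_k - anchor_k) ** 2).sum(dim=-1).mean())
    return losses                                                 # one scalar per axis k

def total_loss(image, lambdas, concept_encoders, vqa_anchor_embeds, recon_kwargs):
    """Sketch of Eq. (3): reconstruction loss plus small weighted anchor terms."""
    l_recon = recon_loss(image, concept_encoders=concept_encoders, **recon_kwargs)
    per_axis = anchor_loss(image, concept_encoders, vqa_anchor_embeds)
    return l_recon + sum(lam * l_k for lam, l_k in zip(lambdas, per_axis))
```

The per-axis weights `lambdas` correspond to the small $\lambda_k$ discussed above, so the anchors guide the embeddings without collapsing them onto discrete text embeddings.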
Inference. At inference time, given a new test image, the concept encoders extract embeddings $\{e_k\}$ capturing its characteristics along each concept axis of interest. These embeddings can be remixed across different images, or replaced by embeddings converted from explicit words, to produce images with new compositions of visual concepts through the backbone T2I generator.

Generalization to Unseen Concepts via Test-Time Finetuning. While the encoders can precisely extract an axis-specific concept that has been seen during training from a new test image, they tend to be less robust to concepts unseen at training. However, with a lightweight test-time optimization procedure, in which we use only the reconstruction objective $\mathcal{L}_{\text{recon}}$ to update the parameters $\gamma$ of all encoders, these encoders can generalize to novel concepts unseen during training. Note that $\mathcal{L}^{\text{anchor}}$ is omitted here in order to capture the visual nuances without over-committing to the coarse text anchors. After training, the encoders have learned to produce outputs within a relatively narrow region of the embedding space, which allows the model to adapt to the test images shown in Figure 5 within around 600 iterations while maintaining disentanglement and compositional capability.

4 EXPERIMENTS

4.1 EXPERIMENT SETUP

Training Data Generation. We train the concept encoders using only synthetic images generated by DeepFloyd from 5 different domains: fruits, figurines, furniture, art, and clothing. More details of our datasets can be found in Appendix A.2. For each dataset, we consider 2-3 concept axes, such as category, color, material, style, and season. For example, considering category and color for the fruits dataset, we generate training images by prompting DeepFloyd with text prompts describing varying combinations of categories and colors, e.g., "a photo of an apple which is red in color". Note that these text prompts are used only for data generation and not for training, as they may not be reliable (Figure 2). On average, we obtain 669 training images for each dataset.

Implementation Details. Inspired by Gal et al. (2023), we leverage a pre-trained CLIP ViT-L/14 model for image encoding (Radford et al., 2021; Dosovitskiy et al., 2021), which was trained with a contrastive objective aligning image and text features and is hence well-suited for our task. We extract image features from the CLIP ViT and train $K$ separate concept encoders $f_k$ on top of these features; the encoders share the same architecture but maintain separate weights. Specifically, we take the [CLS] token from each CLIP layer and process each token with a distinct linear layer. This differs from Gal et al. (2023), which extracts [CLS] tokens from every other layer and uses a single shared linear layer for all token features. The transformed features are then aggregated with average pooling followed by a LeakyReLU (Xu et al., 2015) activation, and passed into another linear layer that produces the final predicted concept embeddings. To ground the concept embeddings to the concept axes, we adapt the text templates from CLIP (Radford et al., 2021), which were originally used to assemble captions with class categories from ImageNet (Russakovsky et al., 2015). For training, we use the AdamW (Loshchilov & Hutter, 2017) optimizer with learning rate 0.02, and randomly flip the images horizontally. For test-time finetuning, we use the AdamW optimizer with learning rate 0.001. We set $\lambda_k = 0.0001$ (Equation (3)) for the category axis and $\lambda_k = 0.001$ for the other axes. We use IF-I-XL from DeepFloyd as the backbone model, with a training resolution of 64×64. Training on one dataset takes approximately 12 hours on one NVIDIA GeForce RTX 3090 GPU. Generated images are upsampled to 256×256 using IF-II-L for visualization purposes.
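The per-axis encoder head described under Implementation Details might look roughly like the following sketch. The hidden dimensions (CLIP ViT-L/14 features of width 1024, text-embedding width 4096) and the way the [CLS] tokens are gathered from the frozen CLIP backbone are assumptions for illustration, not the exact configuration.

```python
import torch
import torch.nn as nn

class ConceptEncoderHead(nn.Module):
    """One per-axis encoder head on top of frozen CLIP ViT [CLS] features (sketch)."""

    def __init__(self, num_clip_layers: int, clip_dim: int = 1024, embed_dim: int = 4096):
        super().__init__()
        # A distinct linear layer for the [CLS] token of each CLIP layer.
        self.per_layer = nn.ModuleList(
            [nn.Linear(clip_dim, embed_dim) for _ in range(num_clip_layers)]
        )
        self.act = nn.LeakyReLU()
        self.out = nn.Linear(embed_dim, embed_dim)  # final predicted concept embedding

    def forward(self, cls_tokens):
        # cls_tokens: list of (B, clip_dim) [CLS] features, one per CLIP layer.
        transformed = [lin(tok) for lin, tok in zip(self.per_layer, cls_tokens)]
        pooled = torch.stack(transformed, dim=0).mean(dim=0)   # average pooling over layers
        return self.out(self.act(pooled))                       # (B, embed_dim)
```

One such head would be instantiated per concept axis, sharing the architecture but keeping separate weights, consistent with the description above.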
4.2 QUALITATIVE RESULTS

Visual Concept Extraction, Recomposition and Extrapolation. Once trained, the concept encoders can extract disentangled concept embeddings specific to each concept axis from different test images, which can be recomposed to generate new images with various concept compositions. As shown in Figure 4, across various datasets, our method is able to recompose axis-specific visual concepts from different images and consistently generate new images depicting the recomposed concepts. More examples can be found in Appendix A.1, where images generated from individual decomposed concept embeddings are also presented.

This disentangled concept representation also allows us to extrapolate along a particular concept axis for visual exploration, as shown in Figure 7. For instance, we can ask BLIP-2 what is the style