Published as a conference paper at ICLR 2024

COBIT: A CONTRASTIVE BI-DIRECTIONAL IMAGE-TEXT GENERATION MODEL

Haoxuan You1, Mandy Guo2, Zhecan Wang1, Kai-Wei Chang3, Jason Baldridge2, Jiahui Yu2
1Columbia University, 2Google Research, 3UCLA
haoxuan.you@cs.columbia.edu, {xyguo,jasonbaldridge,jiahuiyu}@google.com

[Figure 1 shows zero-shot bi-directional generation (T2I and I2T) examples: "A racoon astronaut under helmet dreaming of stars." -> "rocket raccoon as an astronaut"; "An alien octopus reading a newspaper." -> "octopus with a green face under the sea"; "A bunny rabbit delivering letters door to door, colorized 1840s photograph" -> "illustration of a rabbit knocking". The left side lists the covered task families: image generation, image understanding, multimodal understanding, and retrieval.]

Figure 1: CoBIT can address a variety of vision and vision-language tasks in zero-shot and fine-tuning settings. The right side displays the zero-shot images generated by CoBIT given novel prompts, and the zero-shot captions generated by CoBIT given the previously generated images as input.

ABSTRACT

The field of Vision-and-Language (VL) has witnessed a proliferation of pre-trained foundation models. Current techniques typically employ only one type of training objective, whether that is (1) contrastive objectives (like CLIP), (2) image-to-text generative objectives (like PaLI), or (3) text-to-image generative objectives (like Parti). However, all three objectives are mutually relevant and are all based on image-text pairs. Intuitively, the two generative objectives can be considered complementary projections between the two modalities; contrastive learning preserves global alignment, while generation facilitates fine-grained understanding. Inspired by this, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which unifies the three pre-training objectives in one framework for the first time. Specifically, CoBIT employs a novel unicoder-decoder structure consisting of an image unicoder, a text unicoder, and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE), and text-based content creation, particularly in zero-shot scenarios.

1 INTRODUCTION

Recently, there has been rising interest in developing multimodal foundation models for vision-language tasks. By mapping text and image representations into the same space, such models can (1) generate images from text (Ramesh et al., 2021; Yu et al., 2022b; Chang et al., 2022; 2023), (2) generate captions from images (Wang et al., 2022a; Chen et al., 2022; Wang et al., 2021; Alayrac et al., 2022), and (3) retrieve images from text and vice versa (Radford et al., 2021; Yao et al., 2021; Mu et al., 2022; You et al., 2022). Although these tasks are highly relevant and can be operationalized on the same set of image-text pairs, they are often considered separately, and the corresponding foundation models are trained with different pre-training losses designed for each task.

This work was done when Haoxuan was an intern at Google.
Specifically, three pre-training losses are widely used in the literature: (1) contrastive objectives, (2) image-to-text generative objectives, and (3) text-to-image generative objectives. Most models are trained with only one of these objectives, while some are trained with two. For example, CoCa (Yu et al., 2022a) combines contrastive learning and image-to-text generation. OFA (Wang et al., 2022b) and Unified-IO (Lu et al., 2022) integrate image-to-text and text-to-image generation. However, none of these approaches has considered using all three losses, although they are highly relevant and can be trained on the same set of image-text pairs. Intuitively, these pre-training objectives complement each other. Specifically, contrastive learning drives high-level image-text matching, whereas image/text generation encourages the model to learn fine-grained image and text representations. Therefore, it is natural to utilize them in the same framework. It is worth noting that the three pre-training losses can share part of the computational graph. Therefore, optimizing them jointly does not add much overhead compared to optimizing only one.

In this paper, we propose to unify the three commonly used VL pre-training objectives, cross-modal contrastive learning, image-to-text generation, and text-to-image generation, and consolidate their strengths in one framework. Our key innovation is a simple and unified Contrastive Bi-directional Image-Text generation model (CoBIT), which consists of an image unicoder and a text unicoder, as well as a cross-attention decoder. The proposed image/text unicoders use the Transformer architecture and alternate between two modes, unimodal encoding and decoding, depending on the pre-training task. Importantly, the same set of Transformer parameters is used for both encoding and decoding, with only the input embedding and the attention mask differing. As shown in Fig. 2, when optimizing the contrastive objective, the image unicoder and text unicoder work as two encoders. When optimizing the text/image generation loss, the image/text unicoder extracts features in encoding mode while the text/image unicoder works in autoregressive decoding mode; the cross-attention decoder then lets the autoregressive text/image features cross-attend to the encoded image/text features, serving as a fuser and generator. Each unicoder efficiently shares knowledge between encoding and decoding and can therefore jointly improve both T2I and I2T generation without increasing the number of parameters, exhibiting excellent parameter efficiency. In this way, all three pre-training paradigms are unified in our framework.

Our extensive experiments demonstrate CoBIT's superior performance and, more importantly, verify the compatibility of the three objectives for the first time. Benefiting from the compatible objectives, CoBIT subsumes strong zero-shot and transferable capacities in unimodal visual understanding, image-text matching, image-text understanding, and text-to-image generation. For example, CoBIT achieves 82.7% accuracy in zero-shot ImageNet classification, 9.37 FID in zero-shot text-to-image generation, and a 44.8 CIDEr score in zero-shot image-to-text captioning. After fine-tuning, CoBIT further achieves 86.44% linear-probing accuracy on ImageNet, 4.62 FID on text-to-image generation, and a 78.3 VQA score.

2 RELATED WORK
Learning Visual Representation from Text. Recent works have studied pre-training a visual backbone supervised by paired text data to produce transferable visual representations. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are prominent examples of global contrasting between image-text pairs. Florence (Yuan et al., 2021), BASIC (Pham et al., 2021), and LiT (Zhai et al., 2022b) further scale up both datasets and models. FILIP (Yao et al., 2021) proposes to employ local token features from images and text for fine-grained contrastive learning. MS-CLIP (You et al., 2022) and CLIPPO (Tschannen et al., 2022) study sharing model parameters between vision and text.

VL Pre-training. Another line of research focuses on learning a solid joint multimodal embedding through pre-training. Some pre-train with a mask-reconstruction loss (Li et al., 2019; Wang et al., 2022c; Li et al., 2022; Chen et al., 2019; Shen et al., 2021; Li et al., 2021), i.e., masking part of the image and text tokens in the input and requiring the model to predict the masked tokens. Others pre-train models by generating text autoregressively (Wang et al., 2021; Chen et al., 2022; Wang et al., 2022a; Alayrac et al., 2022). Both perform strongly in downstream VL understanding tasks, such as VQA (Antol et al., 2015) and captioning.

[Figure 2 diagrams the pre-training pipeline; its downstream-task annotations indicate that the contrastive objective supports retrieval and image understanding, the I2T objective supports image captioning and vision-language understanding, and the T2I objective supports text-to-image creation.]

Figure 2: (a) Overview of the CoBIT pre-training pipeline; (b) when optimizing the contrastive objective, the image unicoder and text unicoder work as two encoders; (c) and (d) when optimizing the image/text generation loss, the text/image unicoder extracts features in encoding mode and the image/text unicoder works in autoregressive decoding mode; the cross-attention decoder then lets the autoregressive image/text features cross-attend to the encoded text/image features.

Text-to-Image Generation. Text-guided image creation is a challenging problem that has attracted intense interest in the past two years. Two families of methods are widely studied: diffusion-based and token-based. Diffusion-based models (Rombach et al., 2022; Saharia et al., 2022; Ramesh et al., 2022) are based on a process that iteratively adds noise to images and then learns to reverse the noising process while conditioning on textual descriptions of the image. With token-based methods, raw images are quantized into image tokens by an image tokenizer; then, given a text input, Transformer models predict image tokens either autoregressively, as in machine translation (Ramesh et al., 2021; Yu et al., 2022b), or by iteratively predicting image tokens in parallel (Chang et al., 2022; 2023).
As these three broad lines of research have demonstrated great transferable ability on various downstream tasks, there have been many efforts to unify some of them (Yu et al., 2022a; Wang et al., 2022b; Lu et al., 2022; Zhang et al., 2021a; Kim et al., 2022; Huang et al., 2021). Our work, CoBIT, serves as the first effort to integrate contrastive learning, image-to-text generation, and text-to-image generation under one unified pre-training framework.

3 METHOD

We begin by describing the input processing and then present the model architecture, which includes the proposed unicoder module that shares the merits of both unimodal encoding and decoding. Finally, we explain the pre-training of CoBIT and discuss a comparison with other unified models.

3.1 INPUT PROCESSING

To cover various tasks, CoBIT supports three types of input: text tokens, discrete image tokens, and raw images.

Text Tokens. Following the default process in past works (Raffel et al., 2020; Jia et al., 2021; Yu et al., 2022a), we tokenize text inputs using a SentencePiece model with a 64k vocabulary trained on the sampled pre-training datasets. The maximum text token length is 64.

Discrete Image Tokens. CoBIT generates images in an autoregressive manner, which requires tokenizing 2D images into a sequence of image tokens (Ramesh et al., 2021; Ding et al., 2021; 2022; Gafni et al., 2022; Yu et al., 2022b). Following Parti (Yu et al., 2022b), we employ a pre-trained and frozen ViT-VQGAN (Yu et al., 2021) as the tokenizer. Specifically, each 256x256 image is tokenized into a 32x32 grid of image tokens, with 8192 image token classes in the codebook.

Model | Image Unicoder (Layers / Dims) | Text Unicoder (Layers / Dims) | Cross-modal Decoder (Layers / Dims) | Total Params
CoBIT-Base | 12 / 768 | 12 / 768 | 18 / 1024 | 626M
CoBIT-Large | 20 / 1024 | 12 / 1024 | 30 / 1024 | 1082M

Table 1: Size variants of CoBIT.

We append the codebook to the text vocabulary as additional tokens. At inference time, to generate images, we decode the image tokens one by one and feed them into the ViT-VQGAN decoder to reconstruct the raw image.

Raw Image. For image and image-text understanding tasks, we input raw images, and each image is divided into non-overlapping patches following the de facto process in ViTs. By default, unless otherwise specified, the image resolution is 288x288 and the patch size is 18x18.
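As a quick sanity check on these numbers, the following minimal Python sketch spells out the sequence lengths and vocabulary size they imply; the constant names are ours and purely illustrative, not taken from any released code.

```python
# Back-of-the-envelope bookkeeping for the input processing described in Sec. 3.1.
TEXT_VOCAB_SIZE = 64_000      # SentencePiece vocabulary
MAX_TEXT_TOKENS = 64          # maximum text sequence length

IMAGE_SIZE_T2I = 256          # resolution fed to the ViT-VQGAN tokenizer
VQ_GRID = 32                  # 256x256 image -> 32x32 grid of discrete tokens
VQ_CODEBOOK_SIZE = 8_192      # image-token classes, appended to the text vocab

IMAGE_SIZE_ENC = 288          # resolution for raw-image (understanding) inputs
PATCH_SIZE = 18               # ViT patch size

num_image_tokens = VQ_GRID * VQ_GRID                  # 1024 image tokens per image
combined_vocab = TEXT_VOCAB_SIZE + VQ_CODEBOOK_SIZE   # 72,192 token classes in total
num_patches = (IMAGE_SIZE_ENC // PATCH_SIZE) ** 2     # 16 x 16 = 256 patches

print(num_image_tokens, combined_vocab, num_patches)  # 1024 72192 256
```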
3.2 ARCHITECTURE

As shown in Fig. 2, CoBIT comprises one image unicoder, one text unicoder, and one cross-attention decoder. We term them unicoders because they can act as either encoders or decoders, depending on the role they play in each task. The incorporation of the text/image unicoder is inspired by Dong et al. (2019); Bao et al. (2020); Zhou et al. (2020), which demonstrated that one Transformer model can perform both bidirectional encoding for understanding tasks and autoregressive decoding for generation tasks. In our scenario, compared with plain image/text encoders, unicoders in decoding mode can take advantage of the common knowledge shared with encoding to produce unimodal autoregressive features that serve as a decent prior for the cross-modal generative objectives. Experimental ablation also validates that unicoders boost both T2I generation and multimodal understanding.

Image Unicoder. Recently, Vision Transformers (ViT) (Dosovitskiy et al., 2020; Touvron et al., 2021; Liu et al., 2021) have been established as the strongest approach for image feature encoding. As decoders, Transformers are used in autoregressive image token generation (Ramesh et al., 2021; Gafni et al., 2022; Yu et al., 2022b). We combine these two functionalities into a single image unicoder with two working modes. (1) In encoding mode, following ViT, each 2D patch of the raw image is projected into a feature vector by a trainable linear projection layer. The projected feature sequence is then input into cascaded Transformer layers to obtain the encoded image features, where the attention mask is bi-directional. (2) In decoding mode, the input processing differs: as described in Sec. 3.1, we tokenize the raw image into image tokens and initialize an embedding layer from which token embeddings are indexed. The same Transformer layers as in encoding mode are then reused to process the features; however, to guarantee causal decoding, we use a causal conv-shaped attention mask (Ramesh et al., 2021; Yu et al., 2022b; Child et al., 2019) instead. Overall, the two modes share the Transformer layers' parameters and only differ in input processing and attention masks. We assume that, compared with the plain image encoders used in previous works (Yu et al., 2022a; Wang et al., 2022a), the additional decoding mode can exploit the common knowledge learned in image encoding to generate autoregressive image features, which we hypothesize should boost the (text-to-)image generation capacity.

Text Unicoder. Similar to the image unicoder, the text unicoder also has both encoding and decoding modes, which reuse the Transformer parameters. In both modes, the same tokenizer and embedding layer are utilized to obtain token features, given that the two modes share the same input format. A causal attention mask is applied in decoding mode. For encoding text, there are two options in previous works: a bi-directional mask (Devlin et al., 2018; Raffel et al., 2020; Yu et al., 2022b) or a causal mask (Brown et al., 2020; Radford et al., 2021; Yao et al., 2021). We empirically found that the two masks make no difference in performance and use causal masking as the default in the reported experiments.

Cross-modal Decoder. The cross-modal decoder acts as a fusion-and-generation module whose structure follows the cross-attention decoder (Vaswani et al., 2017; Yu et al., 2022a). When generating text, the input is the autoregressive text feature from the text unicoder in decoding mode; the encoded image features are treated as cross-attention information, i.e., keys and values in the cross-attention layers. When generating an image, symmetrically, the autoregressive image-token feature from the image unicoder in decoding mode is the input and cross-attends to the encoded text features. Also, unlike text generation, where a plain causal (autoregressive) mask is used in the cross-modal decoder, image generation employs conv-shaped masked sparse attention (Ramesh et al., 2021; Yu et al., 2022b; Child et al., 2019), which saves the memory and computation incurred by long sequences of image tokens.
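To make the mode-switching idea concrete, here is a minimal PyTorch sketch of an image unicoder that reuses one Transformer stack for both modes, differing only in input embedding and attention mask. It is our simplified illustration, not the released implementation: positional embeddings are omitted, a plain causal mask stands in for the conv-shaped sparse mask, and all class and argument names are hypothetical.

```python
# Minimal sketch (ours) of the "unicoder" idea in Sec. 3.2: one set of Transformer
# layers reused for bidirectional encoding and causal decoding.
import torch
import torch.nn as nn


class ImageUnicoder(nn.Module):
    def __init__(self, dim=768, layers=12, heads=12, patch_dim=18 * 18 * 3,
                 codebook_size=8192):
        super().__init__()
        # Encoding mode: linear projection of raw image patches.
        self.patch_proj = nn.Linear(patch_dim, dim)
        # Decoding mode: embedding table for discrete ViT-VQGAN tokens.
        self.token_emb = nn.Embedding(codebook_size, dim)
        # The SAME Transformer layers serve both modes.
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x, mode="encode"):
        if mode == "encode":
            h = self.patch_proj(x)      # (B, N_patches, dim), bidirectional attention
            mask = None
        else:
            h = self.token_emb(x)       # (B, N_tokens, dim)
            n = h.size(1)
            # Plain causal mask; the paper uses a conv-shaped sparse variant.
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.blocks(h, mask=mask)


# Toy usage (small stack for speed): the same parameters produce both encoded
# patch features and autoregressive token features.
uni = ImageUnicoder(layers=2)
patches = torch.randn(2, 256, 18 * 18 * 3)    # 288x288 image, 18x18 patches
tokens = torch.randint(0, 8192, (2, 1024))    # 32x32 grid of image tokens
enc_feats = uni(patches, mode="encode")       # (2, 256, 768)
dec_feats = uni(tokens, mode="decode")        # (2, 1024, 768)
```

The text unicoder can be sketched analogously, swapping the patch projection for the shared text embedding table and using a plain causal mask in decoding mode.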
3.3 PRE-TRAINING

The pre-training of CoBIT subsumes three fundamental objectives: an image-text contrastive loss, an I2T generation loss, and a T2I generation loss. Here, we provide details on the losses and also clarify the scaling and initialization strategy.

Contrastive Loss. We input the raw image and the text into the image unicoder and the text unicoder, respectively (both in encoding mode), to get encoded image and text features. For text, as with CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), we take the feature vector of the CLS token appended at the end of the input sequence as the global representation. For images, however, the unicoder outputs a sequence of features. To aggregate them, following Yu et al. (2022a), we apply an attention pooler, which is a single multi-head attention layer with one learnable query and the unicoder output features as keys and values. After obtaining the two global features of image and text, a contrastive loss is applied to optimize the paired image and text against others in the same batch:

L_{Con} = -\frac{1}{N}\Big(\sum_{i}\log\frac{\exp(x_i^{\top} y_i/\tau)}{\sum_{j=1}^{N}\exp(x_i^{\top} y_j/\tau)} + \sum_{i}\log\frac{\exp(y_i^{\top} x_i/\tau)}{\sum_{j=1}^{N}\exp(y_i^{\top} x_j/\tau)}\Big),   (1)

where x_i and y_j denote the normalized global embeddings of the i-th image and the j-th text, and \tau is a learnable temperature that adjusts the scale of the loss.

I2T and T2I Generation Loss. We formulate the two generation tasks as token generation problems. As shown in Fig. 2, by cascading the image unicoder, text unicoder, and cross-modal decoder, we can perform the two tasks seamlessly by only switching the working modes of the unicoders. A cross-entropy loss is applied on top of the cross-modal decoder to maximize the conditional likelihood of the ground-truth tokens under the forward autoregressive factorization:

L_{Gen} = -\sum_{t=1}^{T}\log P_{\theta}(y_t \mid y_{<t}, x),   (2)

where y denotes the target token sequence (text tokens for I2T, image tokens for T2I) and x denotes the conditioning features from the other modality's unicoder in encoding mode.

Figure 4: Diagram of the three compared models in the ablation of unicoder vs. encoder. Top: replacing both the image unicoder and the text unicoder with an image encoder and a text encoder, respectively. Middle: replacing the text unicoder with a text encoder while keeping the image unicoder. Bottom: replacing the image unicoder with an image encoder while keeping the text unicoder.

With the other weights fixed, we first ablate the loss weight of I2T; then, given the T2I and I2T weights both fixed, the weight of the contrastive loss is ablated. As we can see in Tab. 9, a high I2T weight such as 1 heavily hurts image generation but also improves VQA. On the other hand, a high contrastive-loss weight such as 0.4 does not essentially improve image recognition and hurts both VQA and image generation. Overall, we chose Con.:T2I:I2T = 0.1:0.2:1 as our default setting, as it achieves a good trade-off between the three losses.
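For concreteness, the sketch below combines Eq. (1) and Eq. (2) with the default weights into a single training loss; the function names and tensor shapes are our illustrative assumptions rather than the authors' code.

```python
# Sketch of the joint pre-training objective with the default loss weights
# Con:T2I:I2T = 0.1:0.2:1 (names are illustrative, not from the CoBIT codebase).
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature):
    """Symmetric image-text contrastive loss, Eq. (1); embeddings are (N, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text and text-to-image directions; each cross-entropy term is
    # already averaged over the batch, matching the 1/N factor in Eq. (1).
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)


def generation_loss(decoder_logits, target_tokens):
    """Autoregressive cross-entropy loss, Eq. (2); logits are (B, T, V)."""
    return F.cross_entropy(decoder_logits.flatten(0, 1), target_tokens.flatten())


def cobit_total_loss(img_emb, txt_emb, i2t_logits, text_targets,
                     t2i_logits, image_targets, temperature,
                     w_con=0.1, w_t2i=0.2, w_i2t=1.0):
    return (w_con * contrastive_loss(img_emb, txt_emb, temperature)
            + w_t2i * generation_loss(t2i_logits, image_targets)
            + w_i2t * generation_loss(i2t_logits, text_targets))
```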
6.3 ABLATION ON PRE-TRAINING DATA

Table 9: Ablation on the three pre-training datasets. ZS IN. denotes zero-shot ImageNet classification, LP. IN. denotes linear probing on ImageNet, and ZS IG. denotes zero-shot text-to-image generation on MS-COCO, which is evaluated by FID (lower FID is better).

Datasets | ZS IN. | LP. IN. | VQA | ZS IG. (↓)
JFT | 71.6 | 81.4 | 64.8 | 14.6
ALIGN | 70.9 | 81 | 67.2 | 13.8
WebLI | 70.0 | 80.2 | 66.2 | 13.4

Here we ablate the three pre-training datasets. To make a fair comparison, the batch size is kept the same and training is conducted for 200k steps on the base model. As we can see in Tab. 9, JFT is beneficial to classification tasks, focusing on basic and precise image semantics; WebLI has higher-quality image data and is specifically beneficial to text-to-image generation; ALIGN is relatively noisy but covers broad semantics. That is why we mix them for more balanced training.

6.4 DETAILS OF EXPERIMENTS

6.4.1 HYPERPARAMETERS IN FINE-TUNING

In Tab. 7, we present the hyperparameters used in fine-tuning/linear probing of CoBIT.

6.4.2 ZERO-SHOT IMAGE CLASSIFICATION

We apply the same set of prompts to convert class labels into sentences, such as "a photo of {class}". Similar to the contrastive loss computed in Sec. 3.3, we input the raw image/text into the image/text unicoders in encoding mode to obtain the global image and text features. Then, we compute their similarity to match images and labels (a minimal illustration of this matching step is sketched at the end of Sec. 6.5).

6.4.3 ZERO-SHOT TEXT-TO-IMAGE GENERATION

In decoding, we employ top-k sampling to sample 16 images for each text and use a reranker to select the best image for evaluation. Following the de facto process, we compute the FID score (Heusel et al., 2017) on MS-COCO (lower FID is better).

6.4.4 VQA FINE-TUNING

We use VQA v2, and the task is formulated as a classification problem over the 3,129 most frequent answers in the training set. To accomplish this, the raw image is fed into the image unicoder in encoding mode, while the question is processed by the text unicoder in decoding mode. Subsequently, the cross-modal decoder takes the text decoding features as input and cross-attends to the encoded image features. The output feature of the final token of the cross-modal decoder is taken as the fused global feature, and a linear classifier is trained on top of it to predict the answer.

6.4.5 SETUP OF ABLATION TRAINING

Specifically, the total batch size is 4,352, containing 4,096 for the contrastive and I2T losses and 256 for the T2I loss, and the total number of training steps is 200k without high-resolution pre-training.

6.5 DETAILED COMPARISON WITH OTHER UNIFIED WORKS

v.s. Unified Diffusion-based Models. Some recent works utilize diffusion models to jointly learn text-to-image and image-to-text generation, such as Versatile Diffusion (Xu et al., 2022), CoDi (Tang et al., 2023), Hu et al. (2022), and UniDiffuser (Bao et al., 2023). Although they work well in image generation, they tend to perform worse in text generation and fail to handle image-text understanding tasks such as VQA, retrieval, etc. Moreover, they all initialize from Stable Diffusion, while CoBIT is mostly trained from scratch and learns superior image understanding capability.

v.s. Unified Auto-Regressive Models. Unified-IO (Lu et al., 2022), OFA (Wang et al., 2022b), and CM3Leon (Yu et al., 2023) also train text-to-image and image-to-text jointly. However, they use a plain decoder or encoder-decoder model without considering the contrastive alignment used in CoBIT. BEIT-3 (Wang et al., 2022c) performs image-to-text generation via mask-and-reconstruction with an encoder model but cannot handle the text-to-image generation task.
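Returning to the zero-shot classification protocol of Sec. 6.4.2, the following minimal sketch illustrates the prompt-and-match step; `cobit`, `encode_text`, and `encode_image` are hypothetical stand-ins for the pre-trained model's interface, not real API names.

```python
# Sketch of zero-shot classification (Sec. 6.4.2): class labels are turned into
# prompts, embedded by the text unicoder in encoding mode, and matched to the
# image embedding by cosine similarity.
import torch.nn.functional as F


def zero_shot_classify(cobit, image, class_names, prompt="a photo of {}."):
    # Embed every class prompt once; assumed shape (C, D).
    texts = [prompt.format(name) for name in class_names]
    txt_emb = F.normalize(cobit.encode_text(texts), dim=-1)

    # Embed the query image with the image unicoder in encoding mode; shape (1, D).
    img_emb = F.normalize(cobit.encode_image(image), dim=-1)

    # Cosine similarity between the image and every class prompt; pick the best.
    scores = img_emb @ txt_emb.t()                     # (1, C)
    return class_names[scores.argmax(dim=-1).item()]
```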
6.6 ILLUSTRATION OF REPLACING UNICODERS WITH ENCODERS IN COBIT

In Sec. 4.4, we ablate unicoders vs. encoders and demonstrate the effectiveness of the proposed unicoders. In Fig. 4 we show the diagrams of using an image encoder and a text encoder, an image encoder plus a text unicoder, and an image unicoder plus a text encoder. As we can see, encoders can only encode visual or textual features, while unicoders can perform both encoding and decoding, which shares knowledge and boosts the generation results, as shown in the previous ablation. Note that in pre-training, a unicoder does not add extra parameters compared to an encoder, because encoding and decoding in the unicoder reuse the same set of parameters. In fine-tuning on text-to-image and image-to-text tasks, however, the unicoder design does bring more parameters than an encoder. Therefore, we mainly evaluate zero-shot captioning and zero-shot image generation in this ablation (Tab. 5) to eliminate the difference in parameter counts.

6.7 MORE VISUALIZATION

In Fig. 5, we show more visualizations of zero-shot text-to-image generation from CoBIT-Large with novel prompts from PartiPrompts (Yu et al., 2022b). For better visualization when zooming in, we employ Sahak et al. (2023) as a super-resolution module to upsample the generated 256x256 images to 1024x1024. Note that when computing FID, we still use the 256x256 images; the high-resolution ones are only used for visualization.

Figure 5: Qualitative results of zero-shot text-to-image generation from CoBIT-Large, with both good cases (e.g., "Space cruise", "An oil painting of a robot hanging from a hot air balloon and a snowy mountain range in the background", "Flying car logo", "An elephant under the sea", "Photo of a living corgi made of stained glass. The corgi is walking outside in a park", "A giraffe looking into an airplane window from the outside", "A very beautiful painting of a silhouette of a moose looking over a dramatic alaska mountain landscape") and failure cases ("A geico that looks like a cat"; "a portrait of a british shorthair wearing a colorful bow tie and sunglasses holding a sign that says drawit"; "A red cube beside a smaller yellow sphere").

In the failure cases, we find that: (1) CoBIT sometimes confuses the size attributes of two objects; for example, in the last failure case, the yellow sphere ought to be smaller. (2) CoBIT sometimes cannot render the details of words in text very well; in the second-to-last failure case, "DRAWIT" is rendered as "DRAWMI?". (3) CoBIT occasionally misunderstands the text; in the third-to-last failure case, we expect a geico that looks like a cat, whereas CoBIT first renders "GEICO THAT LOOK" and then generates a cat. It is indeed a new way to interpret the text, but not the way a human intends.