# CoCa: Contrastive Captioners are Image-Text Foundation Models

Published in Transactions on Machine Learning Research (08/2022)

Jiahui Yu (jiahuiyu@google.com), Zirui Wang (ziruiw@google.com), Vijay Vasudevan, Mojtaba Seyedhosseini. Google Research. Equal contribution.

Reviewed on OpenReview: https://openreview.net/forum?id=Ee277P3AYC

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning.
Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and 91.0% with a finetuned encoder.

## 1 Introduction

Deep learning has recently witnessed the rise of foundation language models (Bommasani et al., 2021) such as BERT (Devlin et al., 2018), T5 (Raffel et al., 2019) and GPT-3 (Brown et al., 2020), where models are pretrained on web-scale data and demonstrate generic multi-tasking capabilities through zero-shot, few-shot or transfer learning.

Figure 1: Overview of Contrastive Captioners (CoCa) pretraining as image-text foundation models. (Panels: pretraining; visual recognition with single-encoder models; crossmodal alignment with dual-encoder models; image captioning & multimodal understanding with encoder-decoder models.) The pretrained CoCa can be used for downstream tasks including visual recognition, vision-language alignment, image captioning and multimodal understanding with zero-shot transfer, frozen-feature evaluation or end-to-end finetuning.

Compared with specialized individual models, pretraining foundation models for massive downstream
tasks can amortize training costs, providing opportunities to push the limits of model scale (Barham et al., 2022) for human-level intelligence. For vision and vision-language problems, several foundation model candidates have been explored: (1) Pioneering works (Girshick et al., 2014; Long et al., 2015; Simonyan & Zisserman, 2014) have shown the effectiveness of single-encoder models pretrained with a cross-entropy loss on image classification datasets such as ImageNet (Deng et al., 2009). The image encoder provides generic visual representations that can be adapted for various downstream tasks including image and video understanding (Dai et al., 2021; Zhang et al., 2021a). However, these models rely heavily on image annotations as labeled vectors and do not bake in knowledge of free-form human natural language, hindering their application to downstream tasks involving both vision and language modalities. (2) Recently, a line of research (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021) has shown the feasibility of image-text foundation model candidates by pretraining two parallel encoders with a contrastive loss on web-scale noisy image-text pairs. In addition to visual embeddings for vision-only tasks, the resulting dual-encoder models can additionally encode textual embeddings into the same latent space, enabling new crossmodal alignment capabilities such as zero-shot image classification and image-text retrieval. Nonetheless, these models are not directly applicable to joint vision-language understanding tasks such as visual question answering (VQA), due to the missing joint components needed to learn fused image and text representations. (3) Another line of research (Vinyals et al., 2015; Wang et al., 2021b; 2022) has explored generative pretraining with encoder-decoder models to learn generic vision and multimodal representations.
During pretraining, the model takes images on the encoder side and applies a Language Modeling (LM) loss (or PrefixLM (Raffel et al., 2019; Wang et al., 2021b)) on the decoder outputs. For downstream tasks, the decoder outputs can then be used as joint representations for multimodal understanding tasks. While superior vision-language results (Wang et al., 2021b) have been attained with pretrained encoder-decoder models, they do not produce text-only representations aligned with image embeddings, making them less feasible and efficient for crossmodal alignment tasks.

In this work, we unify the single-encoder, dual-encoder and encoder-decoder paradigms, and train one image-text foundation model that subsumes the capabilities of all three approaches. We propose a simple model family named Contrastive Captioners (CoCa) with a modified encoder-decoder architecture trained with both a contrastive loss and a captioning (generative) loss. As shown in Figure 1, we decouple the decoder transformer into two parts, a unimodal decoder and a multimodal decoder. We omit cross-attention in the unimodal decoder layers to encode text-only representations, and cascade multimodal decoder layers that cross-attend to image encoder outputs to learn multimodal image-text representations. We apply both the contrastive objective, between outputs of the image encoder and the unimodal text decoder, and the captioning objective, at the output of the multimodal decoder. Furthermore, CoCa is trained on both image annotation data and noisy image-text data by treating all labels simply as text. The generative loss on image annotation text provides a fine-grained training signal similar to the single-encoder cross-entropy loss approach, effectively subsuming all three pretraining paradigms into a single unified method.
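The decoupled-decoder dataflow can be sketched as below. This is an illustrative toy, not the released model: `image_encoder`, `unimodal_decoder` and `multimodal_decoder` are placeholder random projections standing in for a ViT encoder and transformer decoder layers, and mean-pooling the image tokens is a simplification of the paper's pooling. Only the routing of representations to the two objectives follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding width

def image_encoder(image):
    """Stand-in for a vision encoder (e.g. a ViT): returns image tokens."""
    return rng.standard_normal((16, D))

def unimodal_decoder(text_ids):
    """First half of the decoder: no cross-attention, text-only features."""
    return rng.standard_normal((len(text_ids), D))

def multimodal_decoder(text_feats, image_tokens):
    """Second half of the decoder: cross-attends to image encoder outputs.
    Toy single-head cross-attention with no learned projections."""
    attn = text_feats @ image_tokens.T
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return text_feats + attn @ image_tokens

def coca_forward(image, text_ids):
    image_tokens = image_encoder(image)
    text_feats = unimodal_decoder(text_ids)
    img_emb = image_tokens.mean(axis=0)   # pooled image embedding (contrastive branch)
    txt_emb = text_feats[-1]              # unimodal text embedding (contrastive branch)
    mm_feats = multimodal_decoder(text_feats, image_tokens)  # captioning branch
    return img_emb, txt_emb, mm_feats
```

Both objectives read from one forward pass of this graph, which is what lets the two losses share computation with minimal overhead.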
The design of CoCa leverages contrastive learning to learn global representations and captioning to learn fine-grained region-level features, thereby benefiting tasks across all three categories shown in Figure 1. CoCa shows that a single pretrained model can outperform many specialized models using zero-shot transfer or minimal task-specific adaptation. For example, CoCa obtains 86.3% zero-shot accuracy on ImageNet and better zero-shot crossmodal retrieval on MSCOCO and Flickr30K. With a frozen encoder, CoCa achieves 90.6% on ImageNet classification, 88.0%/88.5%/81.1% on Kinetics-400/600/700 and 47.4% on Moments-in-Time. After lightweight finetuning, CoCa further achieves 91.0% on ImageNet, 82.3% on VQA and a 120.6 CIDEr score on NoCaps.

## 2 Related Work

**Vision Pretraining.** Pretraining ConvNets (Krizhevsky et al., 2012) or Transformers (Vaswani et al., 2017) on large-scale annotated data such as ImageNet (Girshick et al., 2014; Long et al., 2015; Simonyan & Zisserman, 2014), Instagram (Mahajan et al., 2018) or JFT (Zhai et al., 2021a) has become a popular strategy for solving visual recognition problems including classification, localization, segmentation, video recognition, tracking and many others. Recently, self-supervised pretraining approaches have also been explored. BEiT (Bao et al., 2021) proposes a masked image modeling task following BERT (Devlin et al., 2018) in natural language processing, and uses quantized visual token ids as prediction targets. MAE (He et al., 2021) and SimMIM (Xie et al., 2021) remove the need for an image tokenizer and directly use a lightweight decoder or projection layer to regress pixel values. Nonetheless, these methods only learn models for the vision modality, and thus they are not applicable to tasks that require joint reasoning over both image and text inputs.

**Vision-Language Pretraining.**
In recent years, rapid progress has been made in vision-language pretraining (VLP), which aims to jointly encode vision and language in a fusion model. Early work in this direction (e.g. LXMERT (Tan & Bansal, 2019), UNITER (Chen et al., 2020), VinVL (Zhang et al., 2021b), VL-T5 (Cho et al., 2021)) relies on pretrained object detection modules such as Fast(er) R-CNN (Ren et al., 2015) to extract visual representations. Later efforts such as ViLT (Kim et al., 2021) and VLMo (Wang et al., 2021a) unify the vision and language transformers, and train a multimodal transformer from scratch. More recently, a line of work has also explored zero-shot/few-shot learning on vision-language tasks by re-using pretrained large language models (Yang et al., 2022b; Jin et al., 2021; Tsimpoukelli et al., 2021). Compared to prior methods, this paper focuses on training a unified model from scratch that subsumes the capabilities of multimodal understanding and generation.

**Image-Text Foundation Models.** Recent work has proposed image-text foundation models that subsume both vision and vision-language pretraining. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) demonstrate that dual-encoder models pretrained with contrastive objectives on noisy image-text pairs can learn strong image and text representations for crossmodal alignment tasks and zero-shot image classification. Florence (Yuan et al., 2021) further develops this method with a unified contrastive objective (Yang et al., 2022a), training foundation models that can be adapted for a wide range of vision and image-text benchmarks. To further improve zero-shot image classification accuracy, LiT (Zhai et al., 2021b) and BASIC (Pham et al., 2021a) first pretrain a model on a large-scale image annotation dataset with a cross-entropy loss and then finetune with a contrastive loss on a noisy alt-text image dataset.
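The zero-shot classification capability of the CLIP-style dual-encoder models discussed above can be sketched as follows: embed one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The function name and the prompt template mentioned in the docstring are illustrative, not from any specific codebase.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose (prompted) text embedding has the highest
    cosine similarity with the image embedding.

    image_emb:       (D,) embedding from the image encoder.
    class_text_embs: (C, D) embeddings of prompts such as
                     "a photo of a {label}", one per class.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # index of the most similar class
```

No task-specific head is trained; the aligned latent space alone provides the classifier, which is why new label sets can be handled at inference time.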
Another line of research (Wang et al., 2021b; 2022; Piergiovanni et al., 2022) proposes encoder-decoder models trained with generative losses and shows strong results on vision-language benchmarks while the visual encoder still performs competitively on image classification. In this work, we focus on training an image-text foundation model from scratch, in a single pretraining stage, to unify these approaches. While recent works (Singh et al., 2021; Li et al., 2021; 2022) have also explored image-text unification, they require multiple pretraining stages of unimodal and multimodal modules to attain good performance. For example, ALBEF (Li et al., 2021) combines a contrastive loss with masked language modeling (MLM) on top of a dual-encoder design. Our approach, however, is simpler and more efficient to train while also enabling more model capabilities: (1) CoCa performs only one forward and backward propagation for a batch of image-text pairs, while ALBEF requires two (one on corrupted inputs and another without corruption); (2) CoCa is trained from scratch on the two objectives only, while ALBEF is initialized from pretrained visual and textual encoders with additional training signals including momentum modules; (3) the decoder architecture with a generative loss is preferred for natural language generation and thus directly enables image captioning.

## 3 Approach

We begin with a review of three foundation model families that utilize natural language supervision differently: single-encoder classification pretraining, dual-encoder contrastive learning, and encoder-decoder image captioning. We then introduce Contrastive Captioners (CoCa), which share the merits of both contrastive learning and image-to-caption generation under a simple architecture. We further discuss how CoCa models can quickly transfer to downstream tasks with zero-shot transfer or minimal task adaptation.
### 3.1 Natural Language Supervision

**Single-Encoder Classification.** The classic single-encoder approach pretrains a visual encoder through image classification on a large crowd-sourced image annotation dataset (e.g., ImageNet (Deng et al., 2009), Instagram (Mahajan et al., 2018) or JFT (Zhai et al., 2021a)), where the vocabulary of the annotation texts is usually fixed. These image annotations are usually mapped into discrete class vectors to learn with a cross-entropy loss:

$$\mathcal{L}_{\mathrm{Cls}} = -\,p(y)\log q_\theta(x), \tag{1}$$

where $p(y)$ is a one-hot, multi-hot or smoothed label distribution from the ground-truth label $y$. The learned image encoder is then used as a generic visual representation extractor for downstream tasks.

**Dual-Encoder Contrastive Learning.** Compared to pretraining with single-encoder classification, which requires human-annotated labels and data cleaning, the dual-encoder approach exploits noisy web-scale text descriptions and introduces a learnable text tower to encode free-form texts. The two encoders are jointly optimized by contrasting the paired text against others in the sampled batch:

$$\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\Bigg(\underbrace{\sum_{i=1}^{N}\log\frac{\exp(x_i^\top y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^\top y_j/\sigma)}}_{\text{image-to-text}} + \underbrace{\sum_{i=1}^{N}\log\frac{\exp(y_i^\top x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^\top x_j/\sigma)}}_{\text{text-to-image}}\Bigg), \tag{2}$$

where $x_i$ and $y_j$ are normalized embeddings of the image in the $i$-th pair and of the text in the $j$-th pair, $N$ is the batch size, and $\sigma$ is the temperature used to scale the logits. In addition to the image encoder, the dual-encoder approach also learns an aligned text encoder that enables crossmodal alignment applications such as image-text retrieval and zero-shot image classification. Empirical evidence shows that zero-shot classification is more robust (Radford et al., 2021; Jia et al., 2021; Andreassen et al., 2021) on corrupted or out-of-distribution images.

**Encoder-Decoder Captioning.** While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a.
captioner) aims for detailed granularity and requires the model to predict the exact tokenized texts of $y$ autoregressively. Following a standard encoder-decoder architecture, the image encoder provides latent encoded features (e.g., from a Vision Transformer (Dosovitskiy et al., 2021) or ConvNets (He et al., 2016)), and the text decoder learns to maximize the conditional likelihood of the paired text $y$ under the forward autoregressive factorization:

$$\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_\theta(y_t \mid y_{<t}, x). \tag{3}$$
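The contrastive and captioning objectives defined above can be sketched in NumPy. This is an illustrative reimplementation, not the authors' training code; the temperature default is an arbitrary choice for the sketch, and the captioning loss assumes teacher forcing (the decoder is given the ground-truth prefix at every position).

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, numerically stabilized."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(x, y, sigma=0.07):
    """Symmetric contrastive loss: x, y are (N, D) image/text embeddings,
    where matched pairs share an index; sigma is the temperature."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / sigma                        # (N, N) pairwise similarities
    diag = np.arange(len(x))
    i2t = -log_softmax(logits)[diag, diag].sum()    # image-to-text term
    t2i = -log_softmax(logits.T)[diag, diag].sum()  # text-to-image term
    return (i2t + t2i) / len(x)

def captioning_loss(token_logits, target_ids):
    """Autoregressive captioning loss: negative log-likelihood of each
    ground-truth token given all previous tokens (teacher forcing).
    token_logits: (T, V) decoder outputs; target_ids: (T,) token ids."""
    lp = log_softmax(token_logits)
    return -lp[np.arange(len(target_ids)), target_ids].sum()
```

A model trained on both losses simply sums them (optionally with weights), since both are computed from the same forward pass over an image-text batch.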