# CoCa: Contrastive Captioners are Image-Text Foundation Models

Published in Transactions on Machine Learning Research (08/2022)

Jiahui Yu (jiahuiyu@google.com), Zirui Wang (ziruiw@google.com), Vijay Vasudevan, Mojtaba Seyedhosseini. Google Research. Equal contribution.

Reviewed on OpenReview: https://openreview.net/forum?id=Ee277P3AYC

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning.
Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and 91.0% with a finetuned encoder.

## 1 Introduction

Deep learning has recently witnessed the rise of foundation language models (Bommasani et al., 2021) such as BERT (Devlin et al., 2018), T5 (Raffel et al., 2019) and GPT-3 (Brown et al., 2020), where models are pretrained on web-scale data and demonstrate generic multi-tasking capabilities through zero-shot, few-shot or transfer learning.

Figure 1: Overview of Contrastive Captioners (CoCa) pretraining as image-text foundation models. (Panels: pretraining; visual recognition with single-encoder models; crossmodal alignment with dual-encoder models; image captioning & multimodal understanding with encoder-decoder models.) The pretrained CoCa can be used for downstream tasks including visual recognition, vision-language alignment, image captioning and multimodal understanding with zero-shot transfer, frozen-feature evaluation or end-to-end finetuning.

Compared with specialized individual models, pretraining foundation models for massive downstream
tasks can amortize training costs, providing opportunities to push the limits of model scale (Barham et al., 2022) for human-level intelligence. For vision and vision-language problems, several foundation model candidates have been explored: (1) Pioneering works (Girshick et al., 2014; Long et al., 2015; Simonyan & Zisserman, 2014) have shown the effectiveness of single-encoder models pretrained with a cross-entropy loss on image classification datasets such as ImageNet (Deng et al., 2009). The image encoder provides generic visual representations that can be adapted for various downstream tasks including image and video understanding (Dai et al., 2021; Zhang et al., 2021a). However, these models rely heavily on image annotations as labeled vectors and do not bake in knowledge of free-form human natural language, hindering their application to downstream tasks involving both vision and language modalities. (2) Recently, a line of research (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021) has shown the feasibility of image-text foundation model candidates by pretraining two parallel encoders with a contrastive loss on web-scale noisy image-text pairs. In addition to visual embeddings for vision-only tasks, the resulting dual-encoder models can additionally encode textual embeddings into the same latent space, enabling new crossmodal alignment capabilities such as zero-shot image classification and image-text retrieval. Nonetheless, these models are not directly applicable to joint vision-language understanding tasks such as visual question answering (VQA), due to the missing joint components needed to learn fused image and text representations. (3) Another line of research (Vinyals et al., 2015; Wang et al., 2021b; 2022) has explored generative pretraining with encoder-decoder models to learn generic vision and multimodal representations.
During pretraining, the model takes images on the encoder side and applies a Language Modeling (LM) loss (or PrefixLM (Raffel et al., 2019; Wang et al., 2021b)) on the decoder outputs. For downstream tasks, the decoder outputs can then be used as joint representations for multimodal understanding tasks. While superior vision-language results (Wang et al., 2021b) have been attained with pretrained encoder-decoder models, they do not produce text-only representations aligned with image embeddings, making them less feasible and efficient for crossmodal alignment tasks.

In this work, we unify the single-encoder, dual-encoder and encoder-decoder paradigms, and train one image-text foundation model that subsumes the capabilities of all three approaches. We propose a simple model family named Contrastive Captioners (CoCa) with a modified encoder-decoder architecture trained with both a contrastive loss and a captioning (generative) loss. As shown in Figure 1, we decouple the decoder transformer into two parts, a unimodal decoder and a multimodal decoder. We omit cross-attention in the unimodal decoder layers to encode text-only representations, and cascade multimodal decoder layers that cross-attend to image encoder outputs to learn multimodal image-text representations. We apply both the contrastive objective, between outputs of the image encoder and the unimodal text decoder, and the captioning objective, at the output of the multimodal decoder. Furthermore, CoCa is trained on both image annotation data and noisy image-text data by treating all labels simply as text. The generative loss on image annotation text provides a fine-grained training signal similar to the single-encoder cross-entropy loss approach, effectively subsuming all three pretraining paradigms into a single unified method.
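The decoupled-decoder dataflow can be sketched as below. This is an illustrative toy, not the released model: `image_encoder`, `unimodal_decoder` and `multimodal_decoder` are placeholder random projections standing in for a ViT encoder and transformer decoder layers, and mean-pooling the image tokens is a simplification of the paper's pooling. Only the routing of representations to the two objectives follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding width

def image_encoder(image):
    """Stand-in for a vision encoder (e.g. a ViT): returns image tokens."""
    return rng.standard_normal((16, D))

def unimodal_decoder(text_ids):
    """First half of the decoder: no cross-attention, text-only features."""
    return rng.standard_normal((len(text_ids), D))

def multimodal_decoder(text_feats, image_tokens):
    """Second half of the decoder: cross-attends to image encoder outputs.
    Toy single-head cross-attention with no learned projections."""
    attn = text_feats @ image_tokens.T
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return text_feats + attn @ image_tokens

def coca_forward(image, text_ids):
    image_tokens = image_encoder(image)
    text_feats = unimodal_decoder(text_ids)
    img_emb = image_tokens.mean(axis=0)   # pooled image embedding (contrastive branch)
    txt_emb = text_feats[-1]              # unimodal text embedding (contrastive branch)
    mm_feats = multimodal_decoder(text_feats, image_tokens)  # captioning branch
    return img_emb, txt_emb, mm_feats
```

Both objectives read from one forward pass of this graph, which is what lets the two losses share computation with minimal overhead.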
The design of CoCa leverages contrastive learning to learn global representations and captioning to learn fine-grained region-level features, thereby benefiting tasks across all three categories shown in Figure 1. CoCa shows that a single pretrained model can outperform many specialized models using zero-shot transfer or minimal task-specific adaptation. For example, CoCa obtains 86.3% zero-shot accuracy on ImageNet and better zero-shot crossmodal retrieval on MSCOCO and Flickr30K. With a frozen encoder, CoCa achieves 90.6% on ImageNet classification, 88.0%/88.5%/81.1% on Kinetics-400/600/700 and 47.4% on Moments-in-Time. After lightweight finetuning, CoCa further achieves 91.0% on ImageNet, 82.3% on VQA and a 120.6 CIDEr score on NoCaps.

## 2 Related Work

**Vision Pretraining.** Pretraining ConvNets (Krizhevsky et al., 2012) or Transformers (Vaswani et al., 2017) on large-scale annotated data such as ImageNet (Girshick et al., 2014; Long et al., 2015; Simonyan & Zisserman, 2014), Instagram (Mahajan et al., 2018) or JFT (Zhai et al., 2021a) has become a popular strategy for solving visual recognition problems including classification, localization, segmentation, video recognition, tracking and many others. Recently, self-supervised pretraining approaches have also been explored. BEiT (Bao et al., 2021) proposes a masked image modeling task following BERT (Devlin et al., 2018) in natural language processing, and uses quantized visual token ids as prediction targets. MAE (He et al., 2021) and SimMIM (Xie et al., 2021) remove the need for an image tokenizer and directly use a lightweight decoder or projection layer to regress pixel values. Nonetheless, these methods only learn models for the vision modality, and thus they are not applicable to tasks that require joint reasoning over both image and text inputs.

**Vision-Language Pretraining.**
In recent years, rapid progress has been made in vision-language pretraining (VLP), which aims to jointly encode vision and language in a fusion model. Early work in this direction (e.g. LXMERT (Tan & Bansal, 2019), UNITER (Chen et al., 2020), VinVL (Zhang et al., 2021b), VL-T5 (Cho et al., 2021)) relies on pretrained object detection modules such as Fast(er) R-CNN (Ren et al., 2015) to extract visual representations. Later efforts such as ViLT (Kim et al., 2021) and VLMo (Wang et al., 2021a) unify the vision and language transformers, and train a multimodal transformer from scratch. More recently, a line of work has also explored zero-shot/few-shot learning on vision-language tasks by re-using pretrained large language models (Yang et al., 2022b; Jin et al., 2021; Tsimpoukelli et al., 2021). Compared to prior methods, this paper focuses on training a unified model from scratch that subsumes the capabilities of multimodal understanding and generation.

**Image-Text Foundation Models.** Recent work has proposed image-text foundation models that subsume both vision and vision-language pretraining. CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) demonstrate that dual-encoder models pretrained with contrastive objectives on noisy image-text pairs can learn strong image and text representations for crossmodal alignment tasks and zero-shot image classification. Florence (Yuan et al., 2021) further develops this method with a unified contrastive objective (Yang et al., 2022a), training foundation models that can be adapted for a wide range of vision and image-text benchmarks. To further improve zero-shot image classification accuracy, LiT (Zhai et al., 2021b) and BASIC (Pham et al., 2021a) first pretrain a model on a large-scale image annotation dataset with a cross-entropy loss and then finetune with a contrastive loss on a noisy alt-text image dataset.
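The zero-shot classification capability of the CLIP-style dual-encoder models discussed above can be sketched as follows: embed one text prompt per class, then pick the class whose text embedding is most similar to the image embedding. The function name and the prompt template mentioned in the docstring are illustrative, not from any specific codebase.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose (prompted) text embedding has the highest
    cosine similarity with the image embedding.

    image_emb:       (D,) embedding from the image encoder.
    class_text_embs: (C, D) embeddings of prompts such as
                     "a photo of a {label}", one per class.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # index of the most similar class
```

No task-specific head is trained; the aligned latent space alone provides the classifier, which is why new label sets can be handled at inference time.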
Another line of research (Wang et al., 2021b; 2022; Piergiovanni et al., 2022) proposes encoder-decoder models trained with generative losses and shows strong results on vision-language benchmarks while the visual encoder still performs competitively on image classification. In this work, we focus on training an image-text foundation model from scratch, in a single pretraining stage, to unify these approaches. While recent works (Singh et al., 2021; Li et al., 2021; 2022) have also explored image-text unification, they require multiple pretraining stages of unimodal and multimodal modules to attain good performance. For example, ALBEF (Li et al., 2021) combines a contrastive loss with masked language modeling (MLM) on top of a dual-encoder design. Our approach, however, is simpler and more efficient to train while also enabling more model capabilities: (1) CoCa performs only one forward and backward propagation for a batch of image-text pairs, while ALBEF requires two (one on corrupted inputs and another without corruption); (2) CoCa is trained from scratch on the two objectives only, while ALBEF is initialized from pretrained visual and textual encoders with additional training signals including momentum modules; (3) the decoder architecture with a generative loss is preferred for natural language generation and thus directly enables image captioning.

## 3 Approach

We begin with a review of three foundation model families that utilize natural language supervision differently: single-encoder classification pretraining, dual-encoder contrastive learning, and encoder-decoder image captioning. We then introduce Contrastive Captioners (CoCa), which share the merits of both contrastive learning and image-to-caption generation under a simple architecture. We further discuss how CoCa models can quickly transfer to downstream tasks with zero-shot transfer or minimal task adaptation.
### 3.1 Natural Language Supervision

**Single-Encoder Classification.** The classic single-encoder approach pretrains a visual encoder through image classification on a large crowd-sourced image annotation dataset (e.g., ImageNet (Deng et al., 2009), Instagram (Mahajan et al., 2018) or JFT (Zhai et al., 2021a)), where the vocabulary of the annotation texts is usually fixed. These image annotations are usually mapped into discrete class vectors to learn with a cross-entropy loss:

$$\mathcal{L}_{\mathrm{Cls}} = -\,p(y)\log q_\theta(x), \tag{1}$$

where $p(y)$ is a one-hot, multi-hot or smoothed label distribution from the ground-truth label $y$. The learned image encoder is then used as a generic visual representation extractor for downstream tasks.

**Dual-Encoder Contrastive Learning.** Compared to pretraining with single-encoder classification, which requires human-annotated labels and data cleaning, the dual-encoder approach exploits noisy web-scale text descriptions and introduces a learnable text tower to encode free-form texts. The two encoders are jointly optimized by contrasting the paired text against others in the sampled batch:

$$\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\Bigg(\underbrace{\sum_{i=1}^{N}\log\frac{\exp(x_i^\top y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^\top y_j/\sigma)}}_{\text{image-to-text}} + \underbrace{\sum_{i=1}^{N}\log\frac{\exp(y_i^\top x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^\top x_j/\sigma)}}_{\text{text-to-image}}\Bigg), \tag{2}$$

where $x_i$ and $y_j$ are normalized embeddings of the image in the $i$-th pair and of the text in the $j$-th pair, $N$ is the batch size, and $\sigma$ is the temperature used to scale the logits. In addition to the image encoder, the dual-encoder approach also learns an aligned text encoder that enables crossmodal alignment applications such as image-text retrieval and zero-shot image classification. Empirical evidence shows that zero-shot classification is more robust (Radford et al., 2021; Jia et al., 2021; Andreassen et al., 2021) on corrupted or out-of-distribution images.

**Encoder-Decoder Captioning.** While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a.
captioner) aims for detailed granularity and requires the model to predict the exact tokenized texts of $y$ autoregressively. Following a standard encoder-decoder architecture, the image encoder provides latent encoded features (e.g., from a Vision Transformer (Dosovitskiy et al., 2021) or ConvNets (He et al., 2016)), and the text decoder learns to maximize the conditional likelihood of the paired text $y$ under the forward autoregressive factorization:

$$\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_\theta(y_t \mid y_{<t}, x). \tag{3}$$
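The contrastive and captioning objectives defined above can be sketched in NumPy. This is an illustrative reimplementation, not the authors' training code; the temperature default is an arbitrary choice for the sketch, and the captioning loss assumes teacher forcing (the decoder is given the ground-truth prefix at every position).

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, numerically stabilized."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def contrastive_loss(x, y, sigma=0.07):
    """Symmetric contrastive loss: x, y are (N, D) image/text embeddings,
    where matched pairs share an index; sigma is the temperature."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = x @ y.T / sigma                        # (N, N) pairwise similarities
    diag = np.arange(len(x))
    i2t = -log_softmax(logits)[diag, diag].sum()    # image-to-text term
    t2i = -log_softmax(logits.T)[diag, diag].sum()  # text-to-image term
    return (i2t + t2i) / len(x)

def captioning_loss(token_logits, target_ids):
    """Autoregressive captioning loss: negative log-likelihood of each
    ground-truth token given all previous tokens (teacher forcing).
    token_logits: (T, V) decoder outputs; target_ids: (T,) token ids."""
    lp = log_softmax(token_logits)
    return -lp[np.arange(len(target_ids)), target_ids].sum()
```

A model trained on both losses simply sums them (optionally with weights), since both are computed from the same forward pass over an image-text batch.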