# VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Hangbo Bao1, Wenhui Wang2, Li Dong2, Qiang Liu2, Owais Khan Mohammed2, Kriti Aggarwal2, Subhojit Som2, Songhao Piao1, Furu Wei2
1Harbin Institute of Technology, 2Microsoft Corporation
https://aka.ms/msragi

*Contribution during internship at Microsoft. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).*

## Abstract

We present a unified Vision-Language pretrained Model (VLMO) that jointly learns a dual encooder and a fusion encoder with a modular Transformer network. Specifically, we introduce Multiway Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer. Because of the modeling flexibility of Multiway Transformer, pretrained VLMO can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs. Experimental results show that VLMO achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2 and image-text retrieval. The code and pretrained models are available at http://aka.ms/vlmo.

## 1 Introduction

Vision-Language (VL) pre-training [31, 42, 36, 27, 21, 24] learns generic cross-modal representations from large-scale image-text pairs. Previous models usually employ image-text matching, image-text contrastive learning, masked region classification/feature regression, word-region/patch alignment and masked language modeling to aggregate and align visual and linguistic information. The pretrained models can then be directly fine-tuned on downstream vision-language tasks, such as VL retrieval and classification (visual question answering, visual reasoning, etc.).

Two mainstream architectures are widely used in previous work. CLIP [36] and ALIGN [19] adopt a dual-encoder architecture to encode images and text separately. Modality interaction is handled by the cosine similarity of the image and text feature vectors. The dual-encoder architecture is effective for retrieval tasks, especially when a large number of images and texts must be handled: feature vectors of images and text can be pre-computed and stored. However, the shallow interaction between images and text is not enough to handle complex VL classification tasks. ViLT [21] finds that CLIP gives a relatively low accuracy on the visual reasoning task. Another line of work [31, 42, 44, 4, 21, 24] relies on a fusion encoder with cross-modal attention to model image-text pairs. Multi-layer Transformer [46] networks are usually employed to fuse image and text representations. The fusion-encoder architecture achieves superior performance on VL classification tasks, but it requires jointly encoding all possible image-text pairs to compute similarity scores for retrieval tasks. The quadratic time complexity leads to a much slower inference speed than that of dual-encoder models, whose time complexity is linear.

In order to take advantage of both types of architectures, we propose a unified Vision-Language pretrained Model (VLMO) that can be used either as a dual encoder to separately encode images and text for retrieval tasks, or as a fusion encoder to model the deep interaction of image-text pairs for classification tasks.
This is achieved by introducing Multiway Transformer, which can encode various modalities (images, text, and image-text pairs) within a single Transformer block. Multiway Transformer employs a pool of modality experts to replace the feed-forward network of the standard Transformer. It captures modality-specific information by switching to different modality experts, and uses self-attention shared across modalities to align visual and linguistic information. Specifically, Multiway Transformer consists of three modality experts, namely a vision expert for image encoding, a language expert for text encoding, and a vision-language expert for image-text fusion. Thanks to the modeling flexibility, we can reuse Multiway Transformer with shared parameters for different purposes, i.e., as a text-only encoder, an image-only encoder, and an image-text fusion encoder.

VLMO is jointly learned with three pre-training tasks, namely image-text contrastive learning, image-text matching, and masked language modeling. In addition, we propose a stagewise pre-training strategy to effectively leverage large-scale image-only and text-only corpora besides image-text pairs in VLMO pre-training. We first pretrain the vision experts and self-attention modules of Multiway Transformer on image-only data using masked image modeling proposed in BEIT [3]. We then pretrain the language experts on text-only data using masked language modeling [11]. Finally, the model is used to initialize vision-language pre-training. By getting rid of the limited size of image-text pairs and their simple and short captions, stagewise pre-training on large amounts of image-only and text-only data helps VLMO learn more generalizable representations.

Experimental results demonstrate that VLMO achieves state-of-the-art results on vision-language retrieval and classification tasks. Our model, used as a dual encoder, outperforms fusion-encoder-based models [4, 15, 21, 24] while enjoying a much faster inference speed on retrieval tasks. Moreover, our model also achieves state-of-the-art results on visual question answering (VQA) and natural language for visual reasoning (NLVR2), where VLMO is used as a fusion encoder.

Our main contributions are summarized as follows:

- We propose a unified vision-language pretrained model VLMO that can be used as a fusion encoder for classification tasks, or fine-tuned as a dual encoder for retrieval tasks.
- We introduce a general-purpose multimodal Transformer for vision-language tasks, namely Multiway Transformer, to encode different modalities. It captures modality-specific information by modality experts, and aligns the contents of different modalities by the self-attention module shared across modalities.
- We show that stagewise pre-training using large amounts of image-only and text-only data greatly improves our vision-language pretrained model.

## 2 Related Work

Pre-training with Transformer [46] backbone networks has substantially advanced the state of the art across natural language processing [35, 11, 29, 23, 12, 37, 2, 8, 9, 5-7, 32], computer vision [13, 45, 3] and vision-language [44, 42, 4, 51, 36, 19, 21, 24, 47] tasks.

The approaches to vision-language pre-training can be divided into two categories. The first category utilizes a dual encoder to encode images and text separately, and uses cosine similarity or a linear projection layer to model the interaction between images and text [36, 19]. Image-text contrastive learning is usually employed to optimize the model. Dual-encoder models are effective for vision-language retrieval tasks.
However, the simple interaction is not enough to handle tasks that require complex reasoning, such as visual reasoning and visual question answering (VL classification tasks). The second category models the interaction of images and text using a deep fusion encoder with cross-modal attention [44, 31, 42, 25, 53, 4, 27, 26, 15, 51, 17, 18, 21, 24, 48]. Image-text matching, masked language modeling, word-region/patch alignment, masked region classification and feature regression are widely used to train fusion-encoder-based models. These models achieve better performance on vision-language classification tasks, while the joint encoding of all image-text pairs leads to a slow inference speed for retrieval tasks.

A large portion of fusion-encoder-based models rely on an off-the-shelf object detector like Faster R-CNN [38] to obtain image region features. Generating region features slows down the inference speed and renders the approach less scalable. Recently, Pixel-BERT [17] removes the object detector and encodes images into grid features using convolutional neural networks. ALBEF [24] employs an image Transformer [13, 45] to obtain the representations of images, and uses a text Transformer [11] to learn the contextualized representations of text. These representations are then fused by cross-modal attention. ViLT [21] encodes images into patch embeddings, and then feeds the concatenation of image patch embeddings and word embeddings into a Transformer network to learn contextualized representations and model the interaction of images and text.

Different from previous work, our unified pre-training using a shared Multiway Transformer enables the model to perform separate encoding for retrieval tasks, and to jointly encode image-text pairs to capture deeper interaction for classification tasks. Our model achieves competitive performance, while enjoying a faster inference speed for both retrieval and classification tasks.

Figure 1: Overview of VLMO pre-training. We introduce Multiway Transformer to encode different modality input by modality-specific experts (V-FFN, L-FFN, and VL-FFN on top of shared multi-head self-attention). The model parameters are shared across the image-text contrastive learning, masked language modeling, and image-text matching pre-training tasks. During fine-tuning, the flexible modeling enables us to use VLMO as either a dual encoder (i.e., separately encode images and text for retrieval tasks) or a fusion encoder (i.e., jointly encode image-text pairs for better interaction across modalities).

Given image-text pairs, VLMO obtains image-only, text-only and image-text pair representations with the Multiway Transformer network. As shown in Figure 1, the unified pre-training optimizes the shared Multiway Transformer with image-text contrastive learning on the image-only and text-only representations, and with image-text matching and masked language modeling on the image-text pair representations.
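To make the interplay of the three objectives concrete, the following is a minimal PyTorch sketch of one unified pre-training step. The `model` wrapper with `encode_image`/`encode_text`/`encode_pair` methods and the `mlm_head`/`itm_head` projections are hypothetical names introduced for illustration (not from the released code), and the fixed temperature stands in for the learned temperature used in the paper.

```python
import torch
import torch.nn.functional as F

def vlmo_pretraining_step(model, images, text_ids, masked_text_ids, mlm_labels,
                          itm_images, itm_text_ids, itm_labels, temperature=0.07):
    """One unified step combining image-text contrast (ITC), masked language
    modeling (MLM) and image-text matching (ITM) on a shared Multiway backbone."""
    # ITC: dual-encoder mode; projected and normalized [I_CLS]/[T_CLS] vectors.
    img_emb = F.normalize(model.encode_image(images), dim=-1)
    txt_emb = F.normalize(model.encode_text(text_ids), dim=-1)
    sim = img_emb @ txt_emb.t() / temperature                    # image-to-text similarities
    targets = torch.arange(images.size(0), device=sim.device)    # matched pairs on the diagonal
    itc_loss = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # MLM: fusion-encoder mode on (image, masked text); labels are -100 at
    # positions that are not masked text tokens, so they are ignored by the loss.
    hidden = model.encode_pair(images, masked_text_ids)
    mlm_logits = model.mlm_head(hidden)                          # per-position vocabulary logits
    mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)

    # ITM: fusion-encoder mode on a mix of matched and (hard) negative pairs; the
    # vector at the [T_CLS] position (index 0, text precedes image patches) is
    # classified as matched / not matched.
    itm_logits = model.itm_head(model.encode_pair(itm_images, itm_text_ids)[:, 0])
    itm_loss = F.cross_entropy(itm_logits, itm_labels)

    return itc_loss + mlm_loss + itm_loss
```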
Thanks to the modeling flexibility, the model can be used as a dual encoder for retrieval tasks, encoding images and text separately during fine-tuning. It can also be fine-tuned as a fusion encoder to model deeper modality interaction of images and text for classification tasks.

Figure 2: Stagewise pre-training using image-only and text-only corpora. We first pretrain the vision expert (V-FFN) and self-attention module on large-scale image-only data as in BEIT [3]. Then the parameters of the vision expert and self-attention module are frozen, and we train the language expert (L-FFN) by masked language modeling on large amounts of text-only data. Finally, we train the whole model with vision-language pre-training.

### 3.1 Input Representations

Given an image-text pair, we encode the pair into image, text and image-text vector representations. These representations are then fed into the Multiway Transformer to learn contextualized representations and align the image and text feature vectors.

**Image Representations** Following vision Transformers [13, 45, 3], the 2D image $\mathbf{v} \in \mathbb{R}^{H \times W \times C}$ is split and reshaped into $N = HW/P^2$ patches $\mathbf{v}^p \in \mathbb{R}^{N \times (P^2 C)}$, where $C$ is the number of channels, $(H, W)$ is the resolution of the input image, and $(P, P)$ is the patch resolution. The image patches are then flattened into vectors and linearly projected to obtain patch embeddings. We also prepend a learnable special token [I_CLS] to the sequence. Finally, image input representations are obtained by summing the patch embeddings, learnable 1D position embeddings $V_{pos} \in \mathbb{R}^{(N+1) \times D}$ and image type embedding $V_{type} \in \mathbb{R}^{D}$:

$$H^v_0 = [\mathbf{v}_{[\texttt{I\_CLS}]}, V\mathbf{v}^p_1, \dots, V\mathbf{v}^p_N] + V_{pos} + V_{type},$$

where $H^v_0 \in \mathbb{R}^{(N+1) \times D}$ and $V \in \mathbb{R}^{(P^2 C) \times D}$ is the linear projection.

**Text Representations** Following BERT [11], we tokenize the text into subword units by WordPiece [49]. A start-of-sequence token ([T_CLS]) and a special boundary token ([T_SEP]) are added to the text sequence. Text input representations $H^w_0 \in \mathbb{R}^{(M+2) \times D}$ are computed by summing the corresponding word embeddings, text position embeddings and text type embeddings:

$$H^w_0 = [\mathbf{w}_{[\texttt{T\_CLS}]}, \mathbf{w}_1, \dots, \mathbf{w}_M, \mathbf{w}_{[\texttt{T\_SEP}]}] + T_{pos} + T_{type},$$

where $M$ indicates the length of the tokenized subword units.

**Image-Text Representations** We concatenate the image and text input vectors to form the image-text input representations $H^{vl}_0 = [H^w_0; H^v_0]$.

### 3.2 Multiway Transformer

Inspired by mixture-of-experts networks [41, 14], we propose a general-purpose multimodal Transformer for vision-language tasks, namely Multiway Transformer, to encode different modalities. Multiway Transformer introduces a mixture of modality experts as a substitute for the feed-forward network of the standard Transformer. Each modality expert is itself a feed-forward network consisting of two linear transformations and an activation. Given the previous layer's output vectors $H_{l-1}$, $l \in [1, L]$, each Multiway Transformer block captures modality-specific information by switching to a different modality expert, and employs multi-head self-attention (MSA) shared across modalities to align visual and linguistic contents. LN is short for layer normalization.
$$H'_l = \text{MSA}(\text{LN}(H_{l-1})) + H_{l-1} \tag{1}$$

$$H_l = \text{Multiway-FFN}(\text{LN}(H'_l)) + H'_l \tag{2}$$

Multiway-FFN selects an expert among multiple modality experts to process the input, according to the modality of the input vectors $H'_l$ and the index of the Transformer layer. Specifically, there are three modality experts: a vision expert (V-FFN), a language expert (L-FFN) and a vision-language expert (VL-FFN). If the input is image-only or text-only vectors, we use the vision expert to encode images and the language expert to encode text. If the input consists of vectors of multiple modalities, such as the vectors of an image-text pair, we employ the vision expert and the language expert to encode the respective modality vectors at the bottom Transformer layers. The vision-language expert is then used at the top layers to capture more modality interaction. Compared with conventional mixture-of-experts networks [41, 14], Multiway Transformer conducts hard routing according to the input modality. Given the three types of input vectors, we obtain image-only, text-only and image-text contextualized representations.

Figure 3: Fine-tuning VLMO on vision-language retrieval tasks (a) and classification tasks (b). The model can be fine-tuned as a dual encoder to separately encode image and text for retrieval tasks. VLMO can also be used as a fusion encoder, followed by a task-specific classification layer, to handle the interaction of image-text pairs for classification tasks.

### 3.3 Pre-Training Tasks

VLMO is jointly pretrained with image-text contrastive learning on the image and text representations, and with masked language modeling and image-text matching on the image-text pair representations, using shared parameters.

**Image-Text Contrast** Given a batch of $N$ image-text pairs, image-text contrastive learning aims to predict the matched pairs from the $N \times N$ possible image-text pairs. There are $N^2 - N$ negative image-text pairs within a training batch. The final output vectors of the [I_CLS] token and the [T_CLS] token are used as the aggregated representations of the image and text, respectively. After a linear projection and normalization, we obtain image vectors $\{\hat{h}^v_i\}_{i=1}^N$ and text vectors $\{\hat{h}^w_i\}_{i=1}^N$ in a training batch to compute image-to-text and text-to-image similarities:

$$s^{i2t}_{i,j} = \hat{h}^{v\top}_i \hat{h}^w_j, \qquad s^{t2i}_{i,j} = \hat{h}^{w\top}_i \hat{h}^v_j \tag{3}$$

$$p^{i2t}_i = \frac{\exp(s^{i2t}_{i,i}/\sigma)}{\sum_{j=1}^{N} \exp(s^{i2t}_{i,j}/\sigma)}, \qquad p^{t2i}_i = \frac{\exp(s^{t2i}_{i,i}/\sigma)}{\sum_{j=1}^{N} \exp(s^{t2i}_{i,j}/\sigma)} \tag{4}$$

where $s^{i2t}_{i,j}$ represents the image-to-text similarity between the image of the $i$-th pair and the text of the $j$-th pair, and $s^{t2i}_{i,j}$ is the corresponding text-to-image similarity. $\hat{h}^w_i \in \mathbb{R}^D$ and $\hat{h}^v_j \in \mathbb{R}^D$ denote the normalized vectors of the $i$-th text and the $j$-th image, and $\sigma$ is a learned temperature parameter. $p^{i2t}_i$ and $p^{t2i}_i$ are the softmax-normalized similarities. Cross-entropy losses over the image-to-text and text-to-image similarities are used to train the model.

**Masked Language Modeling** Following BERT [11], we randomly choose tokens in the text sequence and replace them with the [MASK] token. The model is trained to predict these masked tokens from all the other unmasked tokens and the visual clues. We use a 15% masking probability as in BERT. The final output vectors of the masked tokens are fed into a classifier over the whole text vocabulary with a cross-entropy loss.
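The masking step described above can be sketched as follows. The token ids and the `ignore_index=-100` label convention are illustrative assumptions rather than details from the paper (whole word masking, used later in pre-training, is omitted for brevity), while the straight [MASK] replacement and 15% rate follow the description above.

```python
import torch

def mask_text_tokens(text_ids, mask_token_id, special_token_ids,
                     mask_prob=0.15, ignore_index=-100):
    """Randomly replace 15% of the (non-special) text tokens with [MASK];
    returns the corrupted ids and per-position labels for the MLM loss."""
    labels = text_ids.clone()
    candidates = torch.ones_like(text_ids, dtype=torch.bool)
    for sid in special_token_ids:                       # never mask [T_CLS]/[T_SEP]/padding
        candidates &= text_ids != sid
    masked = torch.bernoulli(torch.full(text_ids.shape, mask_prob)).bool() & candidates
    masked_text_ids = text_ids.clone()
    masked_text_ids[masked] = mask_token_id
    labels[~masked] = ignore_index                      # only masked positions contribute to the loss
    return masked_text_ids, labels

# Example with hypothetical WordPiece ids for two short captions.
text_ids = torch.tensor([[101, 2338, 3608, 1037, 3608, 102],
                         [101, 1037, 2158, 2003, 5559, 102]])
masked_ids, labels = mask_text_tokens(text_ids, mask_token_id=103,
                                      special_token_ids=[101, 102, 0])
```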
**Image-Text Matching** Image-text matching aims to predict whether an image and a text are matched. We use the final hidden vector of the [T_CLS] token to represent the image-text pair, and feed the vector into a classifier with a cross-entropy loss for binary classification. Inspired by ALBEF [24], we sample hard negative image-text pairs based on the contrastive image-to-text and text-to-image similarities.

### 3.4 Stagewise Pre-Training

We introduce a stagewise pre-training strategy, which leverages large-scale image-only and text-only corpora to improve the vision-language model. As presented in Figure 2, we first perform vision pre-training on image-only data, and then perform language pre-training on text-only data to learn general image and text representations. The resulting model is used to initialize the vision-language pre-training that learns the alignment of visual and linguistic information.

For vision pre-training, we train the attention module and vision expert of Multiway Transformer as in BEIT [3] on image-only data. We directly utilize the pretrained parameters of BEIT to initialize the attention module and vision expert. For language pre-training, we freeze the parameters of the attention module and vision expert to avoid catastrophic forgetting of the vision knowledge learned in the first stage, and utilize masked language modeling [11] to optimize the language expert on text-only data. Compared with image-text pairs, image-only and text-only data are easier to collect. In addition, the text of image-text pairs is usually short and simple. Pre-training on image-only and text-only corpora improves generalization to complex image-text pairs.

### 3.5 Fine-Tuning VLMO on Downstream Tasks

As presented in Figure 3, our model can be fine-tuned to adapt to various vision-language retrieval and classification tasks.

**Vision-Language Classification** For classification tasks such as visual question answering and visual reasoning, VLMO is used as a fusion encoder to model the modality interaction of images and text. We use the final encoding vector of the [T_CLS] token as the representation of the image-text pair, and feed it to a task-specific classifier layer to predict the label.

**Vision-Language Retrieval** For retrieval tasks, VLMO can be used as a dual encoder to encode images and text separately. During fine-tuning, our model is optimized with the image-text contrastive loss. During inference, we compute the representations of all images and texts, and then use the dot product to obtain image-to-text and text-to-image similarity scores for all possible image-text pairs. Separate encoding enables a much faster inference speed than fusion-encoder-based models.

## 4 Experiments

We pretrain our model using large-scale image-text pairs and evaluate it on visual-linguistic classification and retrieval tasks.

### 4.1 Pre-Training Setup

Following previous work [4, 21], our pre-training data consists of four image captioning datasets: Conceptual Captions (CC) [40], SBU Captions [33], COCO [28] and Visual Genome (VG) [22]. There are about 4M images and 10M image-text pairs in the pre-training data.

Our models adopt the same network configuration as ViT [13] and BEIT [3]. VLMO-Base is a 12-layer Transformer network with 768 hidden size and 12 attention heads. VLMO-Large is a 24-layer Transformer network with 1024 hidden size and 16 attention heads. VLMO-Base uses the vision-language expert in the top two Transformer layers, and VLMO-Large introduces the vision-language expert in the top three layers.
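To make this expert placement concrete, the following is a minimal single-block sketch following Equations (1)-(2): every block owns a vision expert and a language expert, only the top blocks additionally own a vision-language expert, and all experts sit on top of one shared self-attention module. The dimensions, names, and the `modality`/`text_len` arguments are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """One Multiway Transformer block: shared self-attention plus a pool of
    modality experts (feed-forward networks) selected by hard routing."""
    def __init__(self, dim=768, heads=12, ffn_dim=3072, has_vl_expert=False):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm2 = nn.LayerNorm(dim)
        experts = {"v": self._ffn(dim, ffn_dim), "l": self._ffn(dim, ffn_dim)}
        if has_vl_expert:                            # only the top layers carry a VL expert
            experts["vl"] = self._ffn(dim, ffn_dim)
        self.experts = nn.ModuleDict(experts)

    @staticmethod
    def _ffn(dim, ffn_dim):
        return nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, x, modality, text_len=None):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]        # Eq. (1)
        h = self.norm2(x)
        if modality in ("v", "l"):                               # image-only or text-only input
            return x + self.experts[modality](h)                 # Eq. (2), hard routing
        if "vl" in self.experts:                                 # top layers on image-text input
            return x + self.experts["vl"](h)
        # bottom layers on image-text input: route text and image segments separately
        out = torch.cat([self.experts["l"](h[:, :text_len]),
                         self.experts["v"](h[:, text_len:])], dim=1)
        return x + out
```

For a base-size model one would stack 12 such blocks and set `has_vl_expert=True` only for the top two, mirroring the configuration described above.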
VLMO-Base consists of 175M parameters and VLMO-Large contains 562M parameters. For images, the input resolution is 224×224 and the patch size is 16×16 during pre-training. We apply RandAugment [10] to the input images. The tokenizer of the uncased version of BERT is employed to tokenize the text. The maximum text sequence length is set to 40. We also employ whole word masking for the masked language modeling pre-training task. We pretrain the models for 200k steps with a batch size of 1024. We utilize the AdamW [30] optimizer with β1 = 0.9, β2 = 0.98. The peak learning rate is 2e-4 for the base-size model and 5e-5 for the large-size model. Weight decay is set to 0.01. We use linear warmup over the first 2.5k steps and linear learning rate decay.

| Model | # Pretrain Images | VQA test-dev | VQA test-std | NLVR2 dev | NLVR2 test-P |
|---|---|---|---|---|---|
| **Base-size models pretrained on COCO, VG, SBU and CC** | | | | | |
| UNITER-Base [4] | 4M | 72.70 | 72.91 | 77.18 | 77.85 |
| VILLA-Base [15] | 4M | 73.59 | 73.67 | 78.39 | 79.30 |
| UNIMO-Base [26] | 4M | 73.79 | 74.02 | - | - |
| ViLT-Base [21] | 4M | 71.26 | - | 75.70 | 76.13 |
| ALBEF-Base [24] | 4M | 74.54 | 74.70 | 80.24 | 80.50 |
| VLMO-Base | 4M | 76.64 | 76.89 | 82.77 | 83.34 |
| **Large-size models pretrained on COCO, VG, SBU and CC** | | | | | |
| UNITER-Large [4] | 4M | 73.82 | 74.02 | 79.12 | 79.98 |
| VILLA-Large [15] | 4M | 74.69 | 74.87 | 79.76 | 81.47 |
| UNIMO-Large [26] | 4M | 75.06 | 75.27 | - | - |
| VLMO-Large | 4M | 79.94 | 79.98 | 85.64 | 86.86 |
| **Models pretrained on more data** | | | | | |
| VinVL-Large [51] | 5.7M | 76.52 | 76.60 | 82.67 | 83.98 |
| SimVLM-Large [48] | 1.8B | 79.32 | 79.56 | 84.13 | 84.84 |
| SimVLM-Huge [48] | 1.8B | 80.03 | 80.34 | 84.53 | 85.15 |
| Florence-Huge [50] | 900M | 80.16 | 80.36 | - | - |
| Flamingo [1] | 2.3B | 82.00 | 82.10 | - | - |
| VLMO-Large++ | 1.0B | 82.88 | 82.78 | 88.62 | 89.54 |

Table 1: Fine-tuning results of base-size and large-size VLMO on vision-language classification datasets. VLMO-Large++ is the model trained on one billion noisy image-text pairs with a larger batch size. We report the vqa-score on the VQA test-dev and test-standard splits, and report accuracy on the NLVR2 development and public test set (test-P).

### 4.2 Training on Larger-Scale Datasets

We scale up vision-language representation learning by training VLMO-Large on one billion noisy web image-text pairs with a larger batch size. We first pretrain the model for 200k steps with a 16k batch size, and then continue training for 100k steps with a 32k batch size. The other hyperparameters are the same as for training on the 4M data. Please refer to the supplementary material for more details of the hyperparameters used for pre-training and fine-tuning.

### 4.3 Evaluation on Vision-Language Classification Tasks

We first conduct fine-tuning experiments on two widely used classification datasets: visual question answering [16] and natural language for visual reasoning [43]. The model is fine-tuned as a fusion encoder to model deeper interaction.

**Visual Question Answering (VQA)** For VQA, given a natural image and a question, the task is to generate/choose the correct answer. We train and evaluate the model on the VQA 2.0 dataset [16]. Following common practice, we convert VQA 2.0 into a classification task, and choose the answer from a shared set consisting of 3,129 answers. We use the final encoding vector of the [T_CLS] token as the representation of the image-question pair and feed it to a classifier layer to predict the answer.

**Natural Language for Visual Reasoning (NLVR2)** The NLVR2 [43] dataset requires the model to predict whether a text description is true about a pair of images. Following OSCAR [27] and VinVL [51], we convert the triplet input into two image-text pairs, each containing the text description and one image (see the fine-tuning sketch below).
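A minimal sketch of this NLVR2 setup, assuming a fusion encoder that exposes an `encode_pair` method whose output carries the [T_CLS] vector at position 0 (the interface and names are hypothetical):

```python
import torch
import torch.nn as nn

class Nlvr2Head(nn.Module):
    """NLVR2 fine-tuning: each (image1, image2, text) triplet is split into two
    image-text pairs; their [T_CLS] vectors are concatenated and classified."""
    def __init__(self, encoder, dim=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                       # VLMO used as a fusion encoder
        self.classifier = nn.Linear(2 * dim, num_labels)

    def forward(self, image1, image2, text_ids):
        cls1 = self.encoder.encode_pair(image1, text_ids)[:, 0]   # [T_CLS] of pair 1
        cls2 = self.encoder.encode_pair(image2, text_ids)[:, 0]   # [T_CLS] of pair 2
        return self.classifier(torch.cat([cls1, cls2], dim=-1))   # true/false logits
```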
We concatenate the final output vectors of the [T_CLS] token of the two input pairs. The concatenated vector is then fed into a classification layer to predict the label.

We present the results on the VL classification tasks in Table 1. VLMO achieves state-of-the-art performance and substantially outperforms previous methods. Our large-size model even outperforms SimVLM-Huge [48] and Florence-Huge [50], which consist of more parameters and are also trained on larger-scale image-text pairs, by a large margin. Our model uses a simple linear projection to embed images, as in ViLT [21]. This leads to a significant speedup compared with previous models using image region features, which are extracted by an off-the-shelf object detector [31, 42, 4, 15, 26, 51].

| Model | # Pretrain Images | MSCOCO (5K test set) TR R@1/R@5/R@10 | MSCOCO (5K test set) IR R@1/R@5/R@10 | Flickr30K (1K test set) TR R@1/R@5/R@10 | Flickr30K (1K test set) IR R@1/R@5/R@10 |
|---|---|---|---|---|---|
| **Base-size models pretrained on COCO, VG, SBU and CC** | | | | | |
| UNITER-Base | 4M | 64.4 / 87.4 / 93.1 | 50.3 / 78.5 / 87.2 | 85.9 / 97.1 / 98.8 | 72.5 / 92.4 / 96.1 |
| VILLA-Base | 4M | - | - | 86.6 / 97.9 / 99.2 | 74.7 / 92.9 / 95.8 |
| ViLT-Base | 4M | 61.5 / 86.3 / 92.7 | 42.7 / 72.9 / 83.1 | 83.5 / 96.7 / 98.6 | 64.4 / 88.7 / 93.8 |
| ALBEF-Base | 4M | 73.1 / 91.4 / 96.0 | 56.8 / 81.5 / 89.2 | 94.3 / 99.4 / 99.8 | 82.8 / 96.7 / 98.4 |
| VLMO-Base | 4M | 74.8 / 93.1 / 96.9 | 57.2 / 82.6 / 89.8 | 92.3 / 99.4 / 99.9 | 79.3 / 95.7 / 97.8 |
| **Large-size models pretrained on COCO, VG, SBU and CC** | | | | | |
| UNITER-Large | 4M | 65.7 / 88.6 / 93.8 | 52.9 / 79.9 / 88.0 | 87.3 / 98.0 / 99.2 | 75.6 / 94.1 / 96.8 |
| VILLA-Large | 4M | - | - | 87.9 / 97.5 / 98.8 | 76.3 / 94.2 / 96.8 |
| VLMO-Large | 4M | 78.2 / 94.4 / 97.4 | 60.6 / 84.4 / 91.0 | 95.3 / 99.9 / 100.0 | 84.5 / 97.3 / 98.6 |
| **Models pretrained on more data** | | | | | |
| VinVL-Large | 5.7M | 75.4 / 92.9 / 96.2 | 58.8 / 83.5 / 90.3 | - | - |
| ALIGN-Large | 1.8B | 77.0 / 93.5 / 96.9 | 59.9 / 83.3 / 89.8 | 95.3 / 99.8 / 100.0 | 84.9 / 97.4 / 98.6 |
| Florence-Huge | 900M | 81.8 / 95.2 / - | 63.2 / 85.7 / - | 97.2 / 99.9 / - | 87.9 / 98.1 / - |
| VLMO-Large++ | 1.0B | 83.1 / 96.0 / 98.2 | 65.2 / 86.5 / 92.2 | 96.8 / 100.0 / 100.0 | 88.1 / 98.4 / 99.3 |

Table 2: Fine-tuning results for text retrieval (TR) and image retrieval (IR) on COCO and Flickr30K. ALIGN, Florence and our model encode images and text separately, and then employ a shallow interaction (dot product) to obtain the similarity scores. ALBEF first encodes images and text separately to obtain the top-k candidates, and then feeds these representations into a fusion encoder to rerank the candidates. The other models require encoding all image-text combinations with a fusion encoder. VLMO-Large++ represents the model trained on one billion noisy image-text pairs with a larger batch size.

| Stagewise Pre-Training | NLVR2 dev | NLVR2 test-P | Flickr30K TR | Flickr30K IR |
|---|---|---|---|---|
| Image-Only Pre-Training | 80.33 | 81.06 | 95.60 | 87.69 |
| Image-Only + Text-Only Pre-Training | 82.09 | 82.49 | 95.67 | 88.52 |

Table 3: Ablation studies of stagewise pre-training, i.e., different initializations for vision-language pre-training. We report the average of R@1, R@5 and R@10 for Flickr30K. Results on NLVR2 are averaged over three runs.

### 4.4 Evaluation on Vision-Language Retrieval Tasks

The retrieval tasks consist of image-to-text retrieval and text-to-image retrieval. We evaluate the model on the widely used COCO [28] and Flickr30K [34] datasets, and use the Karpathy split [20] for both. The model is used as a dual encoder for retrieval tasks: we encode images and text separately and compute their similarity scores by the dot product of the image and text vectors. As presented in Table 2, VLMO achieves competitive performance with previous fusion-encoder-based models while having a much faster inference speed, as sketched below.
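Once the image and text embeddings have been pre-computed (normalized [I_CLS]/[T_CLS] projections), retrieval over a whole test set reduces to a single matrix product. A minimal sketch under these assumptions:

```python
import torch

@torch.no_grad()
def retrieve(image_embs, text_embs, k=10):
    """Dual-encoder retrieval: image_embs is [num_images, D] and text_embs is
    [num_texts, D], both pre-computed and L2-normalized. Returns top-k indices."""
    sim = image_embs @ text_embs.t()              # all image-to-text similarity scores
    topk_texts = sim.topk(k, dim=1).indices       # text retrieval: best texts per image
    topk_images = sim.t().topk(k, dim=1).indices  # image retrieval: best images per text
    return topk_texts, topk_images
```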
Fusion-encoder-based models need to jointly encode all possible image-text pairs to compute their similarity scores, which requires quadratic time complexity. Moreover, our large-size model even outperforms the huge-size model of Florence [50], which is also trained on massive image-text pairs with a larger batch size. VLMO pre-training can effectively leverage larger-scale noisy pairs and benefits from large-batch training.

### 4.5 Evaluation on Vision Tasks

As shown in Table 4, we use VLMO as an image-only encoder and evaluate it on image classification (ImageNet [39]) and semantic segmentation (ADE20K [52]) tasks. The model also achieves competitive performance, even slightly better than the BEIT model used for the initialization of VLMO. The image resolution is 224×224 for ImageNet, and 512×512 for ADE20K. We perform intermediate fine-tuning [3] on ImageNet-21k for all three models.

| Model | ImageNet (acc@1) | ADE20K (mIoU) |
|---|---|---|
| ViT-Base | 83.6 | - |
| BEIT-Base | 85.2 | 52.8 |
| VLMO-Base | 85.5 | 53.4 |

Table 4: Results on image classification and semantic segmentation.

| Pre-Training Tasks (ITC / ITM / MLM) | Transformer (Std TRM / Multiway / Multiway−VLExp) | NLVR2 dev | NLVR2 test-P | Flickr30K TR | Flickr30K IR |
|---|---|---|---|---|---|
|  |  | 58.51 | 58.83 | 92.23 | 84.24 |
|  |  | 73.91 | 73.75 | 94.07 | 85.82 |
|  |  | 76.46 | 76.19 | 94.37 | 85.67 |
|  |  | 78.81 | 79.27 | 93.37 | 85.73 |
|  |  | 79.58 | 80.11 | 94.50 | 86.69 |
|  |  | 80.13 | 80.31 | 95.17 | 87.25 |

Table 5: Ablation studies of Multiway Transformer and vision-language pre-training tasks. ITC is short for image-text contrastive loss, ITM is image-text matching, and MLM is masked language modeling. Std TRM is short for standard Transformer, and Multiway−VLExp is Multiway Transformer without VL experts. The average of R@1, R@5 and R@10 is reported for Flickr30K. Results on NLVR2 are averaged over three runs.

### 4.6 Ablation Studies

**Stagewise Pre-Training** We first conduct ablation experiments on stagewise pre-training. ViLT [21] shows that using the ViT [13] model pretrained on image-only data as the initialization achieves better performance than using the BERT model pretrained on text-only data. Therefore we start our experiments with image-only pre-training. We compare using image-only pre-training, and image-only plus text-only pre-training, as the initialization. For image-only pre-training, we directly use the parameters of BEIT-Base to initialize the self-attention module and all modality experts. For image-only plus text-only pre-training, we use the pretrained parameters of BEIT-Base to initialize the vision expert and self-attention module of Multiway Transformer, and then pretrain its language expert on text corpora (see the initialization sketch below). As shown in Table 3, image-only plus text-only pre-training improves our vision-language model. We also tried performing vision-language pre-training with random initialization, but obtained a relatively low accuracy on downstream tasks. Stagewise pre-training effectively leverages large-scale image-only and text-only corpora, and improves our vision-language pre-training. Moreover, given the limited size of the image-text pairs we used during pre-training, stagewise pre-training on image-only and text-only data alleviates the need for image-text pair data. We also tried multitask training on image-only and text-only data to combine the first two stages, and observed similar performance. Stagewise pre-training effectively leverages the pretrained weights to reduce the computation cost.

**Multiway Transformer** We also conduct ablation experiments on Multiway Transformer. We employ ViT-Base to initialize the models for these ablation experiments.
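For reference, here is a minimal sketch of the stagewise initialization compared in the ablation above: copy the BEiT weights into the self-attention modules and vision experts, freeze them, and leave only the language-specific parameters trainable for the text-only MLM stage. The parameter-name substrings are hypothetical, not the actual checkpoint keys.

```python
import torch

def prepare_language_stage(model, beit_state_dict):
    """Stage 2 of stagewise pre-training: start from BEiT weights and train
    only the language-specific parameters on text-only data."""
    # Stage 1 result: copy BEiT parameters into self-attention + vision experts
    # (missing keys such as L-FFN weights stay randomly initialized).
    model.load_state_dict(beit_state_dict, strict=False)

    trainable = []
    for name, param in model.named_parameters():
        # Freeze self-attention and V-FFN to avoid forgetting vision knowledge;
        # the substrings below are hypothetical parameter-name conventions.
        if "attn" in name or "v_ffn" in name or "patch_embed" in name:
            param.requires_grad_(False)
        else:                                  # L-FFN, word embeddings, MLM head, ...
            trainable.append(param)
    return trainable                           # e.g. handed to AdamW for text-only MLM
```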
As presented in Table 5, using Multiway Transformer achieves better performance than the standard Transformer for both retrieval and classification tasks. In addition, we also analyse the contribution of the vision-language expert (VL-FFN) used in Multiway Transformer. We remove the vision-language expert used in the top Transformer layers. Experimental results demonstrate that the introduction of the vision-language expert improves the model: using the vision-language expert captures more modality interaction.

**Pre-Training Tasks** We perform ablation studies to analyse the contribution of the different pre-training tasks, and the results are presented in Table 5. Compared with the model trained only with the image-text contrastive loss, our unified training performs much better across classification and retrieval tasks. Introducing image-text matching with hard negative mining also greatly improves the model. This demonstrates the effectiveness of our unified-training framework with Multiway Transformer. In addition, the experimental results show that masked language modeling positively contributes to our model. Please refer to the supplementary material for more ablation studies.

| Model | NLVR2 dev | NLVR2 test-P |
|---|---|---|
| Local hard negative mining [24] | 77.70 | 77.95 |
| Global hard negative mining (ours) | 79.54 | 79.48 |

Table 6: Global hard negative mining improves the model. We perform experiments using 32 V100 GPUs for the base-size model. The batch size per GPU is 32, and the total batch size is 1024. Local hard negative mining samples hard negatives from the training examples on a single GPU (32 examples), while global hard negative mining uses the training examples gathered from all GPUs as the candidates (1024 examples).

**Global Hard Negative Mining** Different from ALBEF [24], which samples hard negatives from the training examples on a single GPU (named local hard negative mining), we perform hard negative mining over more candidates by gathering the training examples from all GPUs (named global hard negative mining). As shown in Table 6, our global hard negative mining brings significant improvements.

## 5 Conclusion

In this work, we propose a unified vision-language pretrained model VLMO, which jointly learns a dual encoder and a fusion encoder with a shared Multiway Transformer backbone. Multiway Transformer introduces a pool of modality experts to encode modality-specific information, and aligns different modalities using the shared self-attention module. The unified pre-training with Multiway Transformer enables the model to be used as a dual encoder for efficient vision-language retrieval, or as a fusion encoder to model cross-modal interactions for classification tasks. We also show that stagewise pre-training that leverages large-scale image-only and text-only corpora greatly improves vision-language pre-training. Experimental results demonstrate that VLMO outperforms previous state-of-the-art models on various vision-language classification and retrieval benchmarks.

In the future, we would like to work on improving VLMO from the following perspectives. We will scale up the model size used in VLMO pre-training. We are also interested in fine-tuning VLMO for vision-language generation tasks, such as image captioning, following the method proposed in UniLM [12]. We are going to explore to what extent vision-language pre-training can help each modality, especially as the shared Multiway Transformer backbone naturally blends text and image representations.
We can extend the proposed model to integrate more modalities (e.g., speech, video, and structured knowledge), supporting general-purpose multimodal pre-training. [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Co RR, abs/2204.14198, 2022. doi: 10.48550/ar Xiv.2204.14198. URL https://doi.org/10.48550/ar Xiv.2204.14198. [2] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, and Hsiao-Wuen Hon. Uni LMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 642 652. PMLR, 2020. URL http: //proceedings.mlr.press/v119/bao20a.html. [3] Hangbo Bao, Li Dong, and Furu Wei. BEi T: BERT pre-training of image transformers. Co RR, abs/2106.08254, 2021. URL https://arxiv.org/abs/2106.08254. [4] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: universal image-text representation learning. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 104 120. Springer, 2020. doi: 10.1007/ 978-3-030-58577-8\_7. URL https://doi.org/10.1007/978-3-030-58577-8_7. [5] Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xianling Mao, and Heyan Huang. Cross-lingual natural language generation via pre-training. Co RR, abs/1909.10481, 2019. [6] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian Ling Mao, Heyan Huang, and Ming Zhou. Info XLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576 3588, Online, June 2021. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2021.naacl-main.280. [7] Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. XLM-E: Cross-lingual language model pre-training via ELECTRA. Ar Xiv, abs/2106.16138, 2021. [8] Alexis Conneau and Guillaume Lample. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf. [9] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440 8451, Online, July 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/ 2020.acl-main.747. [10] Ekin D. 
Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pages 3008 3017. Computer Vision Foundation / IEEE, 2020. [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171 4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423. [12] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13042 13054, 2019. [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. preprint ar Xiv:2010.11929, 2020. [14] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Co RR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961. [15] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. [16] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325 6334. IEEE Computer Society, 2017. doi: 10.1109/ CVPR.2017.670. URL https://doi.org/10.1109/CVPR.2017.670. [17] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. Co RR, abs/2004.00849, 2020. URL https://arxiv.org/abs/2004.00849. [18] Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 12976 12985. Computer Vision Foundation / IEEE, 2021. [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. 
In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904 4916. PMLR, 2021. URL http://proceedings.mlr.press/v139/jia21b.html. [20] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128 3137. IEEE Computer Society, 2015. [21] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 5583 5594. PMLR, 2021. URL http://proceedings.mlr.press/v139/kim21k.html. [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32 73, 2017. [23] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ar Xiv preprint ar Xiv:1910.13461, 2019. [24] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Co RR, abs/2107.07651, 2021. URL https://arxiv.org/abs/ 2107.07651. [25] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. Co RR, abs/1908.03557, 2019. URL http://arxiv.org/abs/1908.03557. [26] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 2592 2607. Association for Computational Linguistics, 2021. [27] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, volume 12375 of Lecture Notes in Computer Science, pages 121 137. Springer, 2020. doi: 10.1007/978-3-030-58577-8\_8. URL https://doi.org/10.1007/978-3-030-58577-8_8. [28] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In David J. 
Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740 755. Springer, 2014. [29] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. Co RR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907. 11692. [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. [31] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13 23, 2019. URL https://proceedings.neurips.cc/paper/2019/ hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html. [32] Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, and Furu Wei. Deltalm: Encoder-decoder pre-training for language generation and translation by augmenting pretrained multilingual encoders. Co RR, abs/2106.13736, 2021. URL https://arxiv.org/abs/2106.13736. [33] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1143 1151, 2011. URL https://proceedings.neurips.cc/paper/2011/hash/ 5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html. [34] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2641 2649. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.303. URL https://doi.org/10.1109/ICCV.2015.303. [35] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openaiassets/research-covers/ language-unsupervised/languageunderstandingpaper.pdf. [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748 8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html. 
[37] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1 140:67, 2020. URL http://jmlr. org/papers/v21/20-074.html. [38] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6): 1137 1149, 2017. doi: 10.1109/TPAMI.2016.2577031. URL https://doi.org/10.1109/ TPAMI.2016.2577031. [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 2015. [40] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2556 2565. Association for Computational Linguistics, 2018. URL https://aclanthology.org/P18-1238/. [41] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixtureof-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Open Review.net, 2017. URL https://openreview.net/forum?id=B1ck MDqlg. [42] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: pretraining of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview.net/forum?id=Syg XPa EYv H. [43] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28August 2, 2019, Volume 1: Long Papers, pages 6418 6428. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1644. URL https://doi.org/10.18653/v1/p19-1644. [44] Hao Tan and Mohit Bansal. LXMERT: learning cross-modality encoder representations from transformers. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 5099 5110. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1514. URL https://doi.org/10.18653/v1/D19-1514. [45] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. preprint ar Xiv:2012.12877, 2020. [46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. 
Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998 6008, 2017. URL http://papers.nips.cc/paper/ 7181-attention-is-all-you-need. [47] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. Co RR, abs/2202.03052, 2022. URL https://arxiv.org/abs/2202.03052. [48] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. Co RR, abs/2108.10904, 2021. URL https://arxiv.org/abs/2108.10904. [49] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google s neural machine translation system: Bridging the gap between human and machine translation. Co RR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609. 08144. [50] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision. Co RR, abs/2111.11432, 2021. URL https://arxiv.org/abs/2111.11432. [51] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 5579 5588. Computer Vision Foundation / IEEE, 2021. URL https: //openaccess.thecvf.com/content/CVPR2021/html/Zhang_Vin VL_Revisiting_ Visual_Representations_in_Vision-Language_Models_CVPR_2021_paper.html. [52] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis., 127(3):302 321, 2019. doi: 10.1007/s11263-018-1140-0. URL https://doi.org/10.1007/ s11263-018-1140-0. [53] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and VQA. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13041 13049. AAAI Press, 2020. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] Please refer to the conclusion. (c) Did you discuss any potential negative societal impacts of your work? [N/A] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? 
[Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A] 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We report average results of multiple runs. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [N/A] (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]