FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING

Published as a conference paper at ICLR 2022

Lewei Yao1,2, Runhui Huang3, Lu Hou1, Guansong Lu1, Minzhe Niu1, Hang Xu1, Xiaodan Liang3, Zhenguo Li1, Xin Jiang1, Chunjing Xu1
1Huawei Noah's Ark Lab, 2Hong Kong University of Science and Technology, 3Sun Yat-sen University
Equal contribution. Corresponding authors: xu.hang@huawei.com, xdliang328@gmail.com

ABSTRACT

Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality, which lacks sufficient information, or via finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) framework to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval. The visualization of word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.

1 INTRODUCTION

Large-scale Vision-Language Pre-training (VLP) models like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) have recently demonstrated success across various downstream tasks. They learn visual and textual representations from millions of image-text pairs collected from the Internet and show superior zero-shot ability and robustness. The core technique of these models lies in the global contrastive alignment of images and texts through a dual-stream model. Such an architecture is inference-efficient for downstream tasks like retrieval, because the encoders of the two modalities can be decoupled and the image or text representations can be pre-computed offline. However, CLIP and ALIGN model the cross-modal interaction solely via the similarity of the global feature of each modality, lacking the ability to capture finer-level information such as the relationship between visual objects and textual words. In this paper, we develop a simple yet efficient cross-modal finer-grained interaction mechanism for large-scale VLP.

To achieve finer-grained cross-modal interaction, previous methods mainly exploit two kinds of approaches. (1) One line of work (Chen et al., 2020; Li et al., 2020b; Dong et al., 2021; Li et al., 2021b; Zhang et al., 2021; Zhan et al., 2021) uses a pre-trained object detector to extract region-of-interest (ROI) features from images, and then fuses them with the paired text through a VLP model. This design complicates pre-training due to pre-computing and storing a large number of ROI features.
In addition, the zero-shot ability of these approaches is usually limited by the predefined number of classes, and their performance is also restricted by the quality of the detector. (2) Another line of work (Li et al., 2021a; Kim et al., 2021) enforces the token-wise or patch-wise representations of both modalities into the same space and models these finer-grained interactions via cross-attention (Li et al., 2021a) or self-attention (Kim et al., 2021). However, these methods are usually less efficient in terms of both training and inference. In particular, during training, the cross-attention in Li et al. (2021a) has to be performed in an encoder-decoder structure, while the complexity of the self-attention in Kim et al. (2021) grows quadratically with the length of the prolonged concatenated sequences of both modalities. During inference, the data from both modalities are intertwined to compute the cross-attention or self-attention and cannot be pre-computed offline as in dual-stream models like CLIP and ALIGN. This can be less efficient for downstream tasks like image/text retrieval and image classification.

In this paper, we propose a large-scale Fine-grained Interactive Language-Image Pre-training framework named FILIP to address these limitations. Inspired by Khattab & Zaharia (2020), we model the fine-grained semantic alignment through a novel cross-modal late interaction mechanism in the contrastive loss, instead of using cross- or self-attention. Specifically, our fine-grained contrastive learning uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. In this way, FILIP successfully leverages the finer-grained expressiveness among image patches and textual words while simultaneously gaining the ability to pre-compute image and text representations offline. Unlike Khattab & Zaharia (2020), we discard the padded tokens and use the average instead of the summation of token-wise maximum similarities when computing the image-text alignment, which enhances the cross-modal representation learning and stabilizes training. Furthermore, we construct a large-scale pre-training dataset named FILIP300M from the Internet. Data cleaning and image-text data augmentation are also explored and proved useful in this work.

Extensive experiments show that, by effectively learning fine-grained representations, FILIP achieves state-of-the-art performance on multiple downstream tasks, including zero-shot image classification and image-text retrieval. For example, FILIP reaches 77.1% top-1 accuracy for zero-shot ImageNet classification, surpassing CLIP with less training data. Visualizations of word-patch alignment further show that FILIP learns meaningful finer-grained features with promising localization ability.

2 RELATED WORK

Vision-Language Pre-training Models. The pre-train-and-fine-tune scheme has achieved great success in the domains of natural language processing (Devlin et al., 2019; Brown et al., 2020) and computer vision (Dosovitskiy et al., 2020). It has naturally been extended to the joint cross-modal domain of Vision-and-Language Pre-training (VLP).
The pre-training datasets of recent VLP models include publicly available datasets like YFCC100M (Thomee et al., 2016) and CC12M (Changpinyo et al., 2021), as well as larger-scale datasets with more than 100M samples in CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), which are shown to be even more powerful. The pre-training tasks of VLP models fall into two categories: image-text contrastive learning tasks and Language Modeling (LM) based tasks. (i) CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021) and UNIMO (Li et al., 2021b) make use of cross-modal contrastive learning, which aligns the textual and visual information into a unified semantic space. (ii) VisualBERT (Li et al., 2019), UNITER (Chen et al., 2020), M6 (Lin et al., 2021), and DALL-E (Ramesh et al., 2021) employ LM-like objectives, including both masked LM (e.g., Masked Language/Region Modeling) and autoregressive LM (e.g., image captioning, text-grounded image generation). On the other hand, some methods rely on a pre-trained object detection model such as Faster R-CNN (Ren et al., 2015) to extract regional image features offline, which requires extra labeled bounding-box data and makes the approach less scalable. Recent efforts such as SOHO (Huang et al., 2021) and SimVLM (Wang et al., 2021) try to eliminate this burden via a visual dictionary or PrefixLM (Raffel et al., 2020). In this paper, we directly learn fine-grained vision-language representations in an end-to-end and simpler manner while maintaining the benefit of inference efficiency.

Multi-Modality Interaction Mechanism. The core of vision-language pre-training models lies in modeling the interaction between the two modalities. There are mainly two types of cross-modal interaction architectures. Single-stream models like VisualBERT (Li et al., 2019) and ViLT (Kim et al., 2021) directly concatenate the patch-wise or regional visual features with the textual embeddings and feed them to a Transformer-based model. Dual-stream models such as ViLBERT (Lu et al., 2019) and CLIP (Radford et al., 2021) have separate encoders for the different modalities. This allows flexible use of different models for different modalities, and efficient inference for downstream tasks like image-text retrieval, since the encoders can be decoupled and the image/text features pre-computed offline. SCAN (Lee et al., 2018) considers latent alignments between image regions and words. However, it is based on a triplet loss with bottom-up attention via a Faster R-CNN to extract object features, whereas we directly learn to localize fine-grained objects from patches.

Figure 1: Overall architecture of FILIP, a dual-stream model with Transformer-based image and text encoders. On top of the image and text encoders, the representations of textual tokens and visual tokens are linearly projected to the multi-modal joint space. A novel fine-grained contrastive learning equipped with cross-modal late interaction is proposed, which uses a token-wise maximum similarity between visual and textual tokens.
In this paper, while following the dual-stream approach for its flexible and efficient inference, we further propose a new multi-modal interaction mechanism to capture the fine-grained representations.

3 METHOD

We propose a new cross-modal pre-training model, named FILIP, that excels in fine-grained interaction between the image encoder and the text encoder for mining more detailed semantic alignment, as shown in Figure 1. In particular, FILIP is a dual-stream model with Transformer-based image and text encoders. For the visual modality, the image encoder is a Vision Transformer (Dosovitskiy et al., 2020) which takes the concatenation of an extra [CLS] token embedding and the linearly projected image patches as input. For the textual modality, following Radford et al. (2021), we use lower-cased byte pair encoding (BPE) (Sennrich et al., 2016b) with a vocabulary size of 49,408 to tokenize the text. Each text sequence starts with a [BOS] token and ends with an [EOS] token. After the word embedding layer, the token embeddings are fed into a modified decoder-only Transformer model as in Radford et al. (2019). On top of the image and text encoders, the representations of textual tokens and visual tokens are linearly projected to the multi-modal common space and are separately L2-normalized. Different from existing dual-stream models (e.g., CLIP and ALIGN), which model cross-modal interaction via only the global features of the entire image and text sequence, we introduce a novel fine-grained contrastive learning objective equipped with cross-modal late interaction, which takes into account the fine-grained interaction between image patches and textual tokens, detailed in Section 3.1.

3.1 FINE-GRAINED CONTRASTIVE LEARNING

Contrastive representation learning has recently been found to learn better representations than its predictive counterpart in both visual (Tian et al., 2020) and vision-language cross-modal pre-training (Radford et al., 2021). Under a general formulation of cross-modal contrastive learning (Radford et al., 2021), we want to learn encoders $f_\theta$ for image data $\mathcal{I}$ and $g_\phi$ for text data $\mathcal{T}$ such that, given an image $x^I \in \mathcal{I}$ and a text $x^T \in \mathcal{T}$, the encoded representations $f_\theta(x^I)$ and $g_\phi(x^T)$ are close under a distance metric if they are related, and far apart if not. In each training batch, we sample $b$ image-text pairs $\{x_k^I, x_k^T\}_{k=1}^{b}$. For the image $x_k^I$ in pair $\{x_k^I, x_k^T\}$, $x_k^T$ is its positive, while the other texts in the batch are used as in-batch negatives. The image-to-text contrastive loss $\mathcal{L}_k^I$ for $x_k^I$ can then be formulated as

$$\mathcal{L}_k^I(x_k^I, \{x_j^T\}_{j=1}^{b}) = -\frac{1}{b} \log \frac{\exp(s_{k,k}^I)}{\sum_j \exp(s_{k,j}^I)},$$

where $s_{k,j}^I$ denotes the similarity of the $k$-th image to the $j$-th text. Similarly, the text-to-image contrastive loss for $x_k^T$ is

$$\mathcal{L}_k^T(x_k^T, \{x_j^I\}_{j=1}^{b}) = -\frac{1}{b} \log \frac{\exp(s_{k,k}^T)}{\sum_j \exp(s_{j,k}^T)}.$$

The total loss of this mini-batch can be represented as

$$\mathcal{L} = \frac{1}{2} \sum_{k=1}^{b} \left( \mathcal{L}_k^I + \mathcal{L}_k^T \right). \qquad (1)$$

3.1.1 CROSS-MODAL LATE INTERACTION

From the contrastive loss (1), the cross-modal interaction is reflected in how we compute the similarities $s_{i,j}^I$ and $s_{i,j}^T$ for the $i$-th image and the $j$-th text. Previous methods like CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) simply encode each image or text separately into a global feature, i.e., $f_\theta(x_i^I) \in \mathbb{R}^d$ and $g_\phi(x_j^T) \in \mathbb{R}^d$, and compute these two similarities as

$$s_{i,j}^I = s_{i,j}^T = f_\theta(x_i^I)^\top g_\phi(x_j^T), \qquad (2)$$

neglecting finer-grained interactions (e.g., word-patch alignment) between the two modalities.
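To make Equations (1) and (2) concrete, below is a minimal PyTorch sketch of this global-feature contrastive loss as used by CLIP/ALIGN; the function name, the temperature handling, and the assumption that the inputs are already L2-normalized global features are our own illustrative choices rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """CLIP/ALIGN-style contrastive loss over global features (Eqs. (1)-(2)).

    img_feat, txt_feat: (b, d) L2-normalized global representations of b paired
    images and texts; row k of each tensor corresponds to the k-th pair.
    """
    # s[k, j] = f_theta(x_k^I)^T g_phi(x_j^T): one scalar similarity per image-text pair
    logits = img_feat @ txt_feat.t() / temperature             # (b, b)
    targets = torch.arange(img_feat.size(0), device=img_feat.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

FILIP keeps this overall loss form but replaces the single dot product of Equation (2) with the token-wise late interaction described next.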
To alleviate this problem, while simultaneously maintaining the training and inference efficiency of dual-stream models, we apply a cross-modal late interaction inspired by Khattab & Zaharia (2020) to model the token-wise cross-modal interaction. Specifically, denote by $n_1$ and $n_2$ the number of (non-padded) tokens of the $i$-th image and the $j$-th text, respectively, with corresponding encoded features $f_\theta(x_i^I) \in \mathbb{R}^{n_1 \times d}$ and $g_\phi(x_j^T) \in \mathbb{R}^{n_2 \times d}$. For the $k$-th visual token, we compute its similarities with all textual tokens of $x_j^T$ and use the largest one, $\max_{0 \le r < n_2} [f_\theta(x_i^I)]_k^\top [g_\phi(x_j^T)]_r$, as its token-wise maximum similarity with $x_j^T$.

Besides, we also discard image-text pairs whose text contains sensitive words.

A.2 DATASETS SUMMARY

Table 6 shows the number of image-text pairs of each dataset used in the different pre-training methods.

Table 6: Number of image-text pairs used in the pre-training of FILIP, CLIP and ALIGN.

| Method | Pre-training data | # image-text pairs |
|---|---|---|
| FILIP | CC3M | 3M |
| FILIP | CC12M | 10M |
| FILIP | YFCC100M | 26M |
| FILIP | FILIP300M | 300M |
| CLIP (Radford et al., 2021) | | 400M |
| ALIGN (Jia et al., 2021) | | 1800M |

A.3 DETAILED EXPERIMENTAL SETTINGS

Table 7: The architecture parameters for FILIP models.

| Model | Embedding dimension | Input resolution | Image encoder #layers | Image encoder width | Image encoder #heads | Text encoder #layers | Text encoder width | Text encoder #heads |
|---|---|---|---|---|---|---|---|---|
| FILIPbase | 256 | 224 x 224 | 12 | 768 | 12 | 12 | 512 | 8 |
| FILIPlarge | 256 | 224 x 224 | 24 | 1024 | 16 | 12 | 768 | 12 |

Model Architectures. We follow the same architecture design as CLIP for both FILIPbase and FILIPlarge, except that we reduce the embedding dimension from 512/768 to 256 for the efficiency of the loss computation. Table 7 describes the details of the architectures.

Details for Pre-training and Hyperparameters. For the implementation of the contrastive loss, following CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), we set the temperature in the softmax function to be a learnable parameter and initialize it to 0.07. For pre-training, we use the LAMB optimizer implemented in cybertronai's open-source repository (https://github.com/cybertronai/pytorch-lamb). For the learning rate scheduler, we first assign a base learning rate and then linearly warm it up to the peak learning rate, which is determined by the effective total batch size via a square-root scaling strategy: $\text{peak\_lr} = \text{base\_lr} \times \sqrt{\text{total\_batch\_size} / 512}$.

We note that a large weight decay is crucial to stabilize training and improve generalization. Specifically, we found training stability to be a challenging issue when applying mixed-precision training to large-scale models, i.e., the training is extremely unstable and NaN losses easily occur. Recent works DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021) also notice this issue and provide their own solutions. However, we found that simply increasing the weight decay and applying the trick of removing the weight decay for specific parameters, as described in Section 4.1, are sufficient in our case.
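As a concrete reading of the square-root scaling rule above, the small sketch below computes the peak learning rate from a base learning rate and the effective total batch size; the function name is ours, and the example numbers are taken from Table 9 below (FILIPbase on YFCC100M: batch size 1024 x 8, base LR 6e-3).

```python
import math

def peak_lr(base_lr: float, total_batch_size: int) -> float:
    """Warm-up target under the square-root scaling rule:
    peak_lr = base_lr * sqrt(total_batch_size / 512)."""
    return base_lr * math.sqrt(total_batch_size / 512)

# FILIPbase on YFCC100M (Table 9): 1024 per GPU x 8 GPUs = 8192, base LR 6e-3
print(peak_lr(6e-3, 1024 * 8))   # -> 0.024
```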
The base learning rate and weight decay are selected manually by observing the performance at the early training stage. Table 8 summarizes the common hyperparameters and Table 9 shows the model- and dataset-specific hyperparameters for FILIP pre-training.

Table 8: Common hyperparameters used for FILIP pre-training.

| Hyperparameter | Value |
|---|---|
| Vocabulary size | 49,408 |
| Initial temperature | 0.07 |
| LAMB β1 | 0.9 |
| LAMB β2 | 0.999 |
| LAMB ϵ | $10^{-4}$ |
| Warm-up iterations | 3,000 |
| Training epochs | 30 |

Table 9: Model- and dataset-specific hyperparameters used for FILIP pre-training. Numbers in the batch size column represent the total batch size across all workers, calculated as batch size per GPU x #GPUs. FILIP340M is the combination of FILIP300M, YFCC100M, CC12M and CC3M.

| Model | Dataset | Batch size | Base LR | Weight decay |
|---|---|---|---|---|
| FILIPbase | YFCC100M | 1024 x 8 | $6 \times 10^{-3}$ | $3 \times 10^{-2}$ |
| FILIPbase | FILIP340M | 320 x 128 | $2 \times 10^{-3}$ | $3 \times 10^{-3}$ |
| FILIPlarge | FILIP340M | 160 x 192 | $1.5 \times 10^{-3}$ | $3 \times 10^{-3}$ |

Details for Image-text Retrieval. Following previous works (Jia et al., 2021; Li et al., 2021a), for Flickr30K we test on the 1K test set with or without fine-tuning on the 30K training set, while for MSCOCO we test on the 5K test set with or without fine-tuning on the 113K training set. We use the similarity between image and text for ranking, and use the contrastive loss for fine-tuning. Since there are multiple texts for each image in these two datasets, we change the ground-truth label of the contrastive loss to consider multiple positives, assigning a probability of 1/#positives to each positive, following ALBEF (Li et al., 2021a). Besides, we also use prompts during evaluation for both datasets; see Appendix A.5 for details. Table 10 shows the hyperparameters for image-text retrieval fine-tuning.

Table 10: Hyperparameters used for image-text retrieval fine-tuning.

| Hyperparameter | Value |
|---|---|
| Image size | 392 x 392 |
| Training epochs | 3 |
| Optimizer | LAMB |
| Batch size | 5120 |
| Base LR | $2 \times 10^{-4}$ |
| Weight decay | $3 \times 10^{-4}$ |

A.4 MORE VISUALIZATIONS OF WORD-PATCH ALIGNMENT AND GRAD-CAM HEATMAPS

In Figure 3, we visualize the cross-modal alignment of the proposed method for more images, in terms of both the word-patch alignment described in Section 4.5 and Grad-CAM heatmaps (Selvaraju et al., 2017). We compute the Grad-CAM heatmaps based on the average self-attention maps over the image patches classified to the targeted textual tokens (i.e., the textual token(s) corresponding to the class label in the ImageNet dataset) in the last layer of the image encoder. We average the heatmaps over all attention heads. As can be seen, our proposed model learns meaningful alignment between image patches and textual tokens.

Figure 3: More visualizations on different classes of the ImageNet dataset (bald eagle, killer whale, bee eater, damselfly, bullock cart, fireboat, yellow lady's slipper, necklace), showing for each class the raw image, the patch prediction, and the Grad-CAM heatmap. Numbers in the parentheses after the class label indicate the location indices of the class label in the tokenized textual sequence.

A.5 PROMPT TEMPLATES FOR DOWNSTREAM TASKS

Image Classification. Table 11 shows the prompt templates for the different image classification datasets, in the form "[prefix] {label}, [category description]. [suffix]." of Equation (6). There are three components to be determined in the template: the prefix, the category description, and the suffix. For each component, we select several well-performing options for each dataset, and then use the full combination of all three components as the set of prompt templates for ensembling. For instance, we use 5 prefixes, no category description, and 6 suffixes for ImageNet, so the total number of prompt templates for this dataset is 5 x 1 x 6 = 30.
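The sketch below illustrates this full-combination construction (e.g., 5 prefixes x 1 empty category description x 6 suffixes = 30 ImageNet templates); the formatting helper and the sample prefix/suffix strings (drawn from Table 11 below) are our illustrative assumptions, not the released prompt-generation code.

```python
from itertools import product

def build_templates(prefixes, descriptions, suffixes):
    """Build the full combination of [prefix] {label}, [category description]. [suffix]. templates."""
    templates = []
    for pre, desc, suf in product(prefixes, descriptions, suffixes):
        middle = f", {desc}." if desc else "."
        tail = f" {suf}" if suf else ""
        templates.append(f"{pre} {{label}}{middle}{tail}")
    return templates

# 5 prefixes x 1 (empty) category description x 6 suffixes -> 30 templates
prefixes = ["a photo of a", "a good photo of a", "a bad photo of a",
            "a close-up photo of a", "itap of a"]
suffixes = ["", "I like it.", "It's common in daily life.",
            "It's ugly.", "It's cute.", "It's beautiful."]
print(len(build_templates(prefixes, [""], suffixes)))   # -> 30
```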
Table 11: Prompt templates used for 12 downstream image classification tasks.

| Dataset | Prefix | Category description | Suffix |
|---|---|---|---|
| | "a photo of a", "a jpeg photo of a", "a painting of a", "itap of a", "graffiti of a", "a cartoon", "a doodle" | | None, "It's common in daily life", "It's cute", "It's ugly", "It's weird", "Hope you like it" |
| | "a jpeg photo of a", "a painting of a", "a good photo of a", "a bad photo of a", "a photo of a", "itap of a", "a rendering of a" | | None, "It's common in daily life", "It's beautiful", "It's ugly", "I like it", "I take it today" |
| | "a photo of a", "a cropped photo of a", "a good photo of a", "a bad photo of a" | None | None, "I like it", "I hate it", "It's ugly", "It's cute" |
| Stanford Cars | "a photo of a", "a close-up photo of a", "a good photo of a", "a bad photo of a" | "a type of car", "a type of automobile" | "I like it", "It belongs to my friend", "It's brand new", "It's popular recently", "It's important to me", "I take it today" |
| Flowers102 | "a photo of a (many)", "a rendering of a (many)", "itap of a (many)" | "a type of flower", "a type of bloom" | "It's beautiful", "It's from my best friend", "It gives out a sweet perfume/fragrance" |
| | "a photo of a", "a good photo of a", "a bad photo of a", "a closeup photo of a", "itap of a" | | "I like it", "It's common in daily life", "It's not common in daily life", "It's ugly", "It's cute", "It's beautiful" |
| Food101 | "a photo of my", "a close-up photo of my", "itap of my" | "a type of food", "a type of nourishment" | "I made it today", "I like it", "I hate it", "It's delicious", "It's with nice flavour", "It's with terrible flavour", "It's popular recently" |
| | "a photo of a", "a good photo of a", "a bad photo of a", "a bright photo of a", "a dark photo of a", "a black and white photo of a", "a nice scene of a", "a terrible scene of a" | | None, "I like it", "I hate it", "It's beautiful", "It's common in daily life", "It's important to me" |
| DTD | "itap of a", "a close-up photo of a" | "texture", "surface", "material" | None, "It's out of style", "It's popular in old days", "It's ugly", "It's beautiful" |
| FGVC Aircraft | "a photo of the", "a close-up photo of the", "a good photo of the", "a pixelated photo of the" | "a type of plane", "a type of aircraft", "a type of airliner" | None, "I like it", "It's important to me", "I take it today", "Hope you like it" |
| Oxford Pets | "a photo of my", "a low resolution photo of my", "a good photo of my" | "a type of pet", "a type of dog or cat" | None, "It's cute", "It's important to me", "I like it", "It's beautiful" |
| EuroSAT | "a photo of a", "a painting of a", "a cropped photo of a", "a good photo of a", "a blurry photo of a" | None, "an example of aerial or satellite images" | None, "I like it", "It's taken from an aircraft or some flying object", "It's collected by imaging satellites" |

Table 12: Prompt templates used for zero-shot image-text retrieval on the Flickr30K and MSCOCO datasets.

| Dataset | Task | Prefix | Suffix |
|---|---|---|---|
| Flickr30K | image-to-text retrieval | "a good photo of the" | "I hate it." |
| Flickr30K | text-to-image retrieval | "a good photo of" | None |
| MSCOCO | image-to-text retrieval | "a good photo of" | "It is ugly." |
| MSCOCO | text-to-image retrieval | None | None |

Image-text Retrieval. Following CLIP (Radford et al., 2021), we use prompts in zero-shot image-text retrieval for both the Flickr30K and MSCOCO datasets. The prompt is selected by the same rule as described in Section 3.1.2, except that we do not use the [category description] here. Table 12 shows the prompt templates for zero-shot image-text retrieval on Flickr30K and MSCOCO.
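As a small illustration of how these retrieval prompts could be applied to raw captions before text encoding, here is a hypothetical helper built around the Table 12 entries; the dictionary layout and function are our own, not the paper's evaluation code.

```python
# Zero-shot retrieval prompts from Table 12 (empty strings mean no prefix/suffix).
RETRIEVAL_PROMPTS = {
    ("Flickr30K", "image-to-text"): ("a good photo of the", "I hate it."),
    ("Flickr30K", "text-to-image"): ("a good photo of", ""),
    ("MSCOCO", "image-to-text"): ("a good photo of", "It is ugly."),
    ("MSCOCO", "text-to-image"): ("", ""),
}

def apply_retrieval_prompt(caption: str, dataset: str, task: str) -> str:
    """Wrap a raw caption with the dataset- and task-specific prefix and suffix."""
    prefix, suffix = RETRIEVAL_PROMPTS[(dataset, task)]
    return " ".join(part for part in (prefix, caption, suffix) if part)

print(apply_retrieval_prompt("two dogs play in the snow", "Flickr30K", "image-to-text"))
# -> "a good photo of the two dogs play in the snow I hate it."
```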
A.6 LINEAR PROBE ON IMAGE CLASSIFICATION

In this section, we evaluate FILIP on linear-probe image classification. Following the common linear-probe setting, we freeze the whole backbone network and only fine-tune the last linear classifier. Since we remove the [CLS] token in our vision encoder, we apply mean pooling over all the other visual tokens to aggregate them into a global image representation, which is then fed into the linear classifier.

Setting. Following CLIP, we train the logistic regression classifier using scikit-learn's L-BFGS implementation (Pedregosa et al., 2011), with a maximum of 1,000 iterations, on the 11 datasets other than ImageNet. For ImageNet, we use a PyTorch-based codebase to accelerate training with GPUs. Following Doersch et al. (2015), we adopt a Batch Normalization (Ioffe & Szegedy, 2015) layer before the linear classifier, which is beneficial for stabilizing the mixed-precision training. Random resized cropping and horizontal flipping are used to augment the training data. We use a cosine learning rate scheduler with a linear warm-up of 10 epochs. More hyperparameters used in the linear probe on ImageNet are shown in Table 13.

Table 13: Hyperparameters used for linear-probe image classification on ImageNet.

| Hyperparameter | Value |
|---|---|
| Image size | 224 x 224 |
| Training epochs | 90 |
| Optimizer | SGD |
| Batch size | 4096 |
| Base LR | 0.1 |
| Weight decay | 0 |

Results. Table 14 compares the linear-probe performance of our proposed FILIP with CLIP over 12 datasets. Our FILIPbase (resp. FILIPlarge) achieves 85.5% (resp. 91.0%) average top-1 accuracy over the 12 downstream tasks, a noticeable improvement of 1.8% (resp. 1.2%) over its CLIP counterpart. This implies that FILIP learns more powerful vision features, which may potentially facilitate broader downstream vision tasks.

Table 14: Top-1 accuracy (%) of the linear probe on image classification on 12 datasets (including Stanford Cars and Oxford Pets). Our FILIP outperforms CLIP by 1.2-1.8 points on average.

| Model | Per-dataset top-1 accuracy (%) | Average |
|---|---|---|
| CLIP-ViT-B/32 | 95.1, 80.5, 93.0, 81.8, 96.6, 88.8, 76.6, 76.5, 52.0, 90.0, 97.0, 76.1 | 83.7 |
| FILIPbase | 95.3, 80.3, 95.0, 78.6, 98.7, 86.2, 77.9, 78.1, 76.6, 88.0, 95.9, 75.8 | 85.5 (+1.8) |
| CLIP-ViT-L/14 | 98.0, 87.5, 96.5, 90.9, 99.2, 95.2, 81.8, 82.1, 69.4, 95.1, 98.2, 83.9 | 89.8 |
| FILIPlarge | 97.9, 87.0, 97.2, 89.0, 99.6, 94.6, 83.2, 83.9, 84.8, 93.5, 97.3, 84.5 | 91.0 (+1.2) |

A.7 COMPARISON WITH KHATTAB & ZAHARIA (2020)

As stated in Section 3.1, compared to Khattab & Zaharia (2020), besides being the first to apply late interaction to contrastive learning for vision-language pre-training, we make two other modifications: removing padded tokens, and averaging over non-padded tokens instead of summing. In the following, we show that these two modifications are crucial to both the performance and the quality of the finer-grained word-patch alignment. For comparison, we replace the proposed cross-modal late interaction in FILIPbase with the original late interaction in Khattab & Zaharia (2020). Following the setting in Section 4.4, we pre-train on the filtered YFCC100M with mixed precision using 8 V100 GPUs. The batch size per GPU is 512 and the dimension of the token features is 256. We report results with the top 25% of tokens (selected using the method in Section 3.1) during training. Note that the original late interaction in Khattab & Zaharia (2020) is sensitive to the temperature in the softmax function, and we report the best result among several initialization values of the temperature.

Effect on Performance. Table 15 shows the comparison on zero-shot ImageNet classification. When these two modifications are removed, the zero-shot top-1 accuracy on ImageNet drops from 34.3 to 32.7.
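To make the difference concrete, below is a minimal PyTorch sketch of the image-to-text similarity under the proposed cross-modal late interaction, with the two modifications made explicit: padded textual tokens are masked out, and the token-wise maxima are averaged rather than summed as in a ColBERT-style interaction. Tensor shapes, the masking convention, and the function name are our own assumptions; the text-to-image direction is symmetric (max over image patches for each non-padded textual token, then average).

```python
import torch

def filip_image_to_text_similarity(img_tokens, txt_tokens, txt_mask):
    """Cross-modal late interaction, image-to-text direction.

    img_tokens: (n1, d) L2-normalized visual token features of one image
    txt_tokens: (n2, d) L2-normalized textual token features of one text
    txt_mask:   (n2,)  bool, True for non-padded textual tokens
    """
    sim = img_tokens @ txt_tokens.t()                    # (n1, n2) token-wise similarities
    sim = sim.masked_fill(~txt_mask, float("-inf"))      # modification 1: discard padded tokens
    max_per_patch = sim.max(dim=1).values                # max over textual tokens for each patch
    return max_per_patch.mean()                          # modification 2: average instead of sum
```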
Table 15: Comparison of top-1 accuracy (%) between the proposed cross-modal late interaction loss and that of Khattab & Zaharia (2020) on zero-shot ImageNet classification.

| Ours | Late interaction in Khattab & Zaharia (2020) |
|---|---|
| 34.3 | 32.7 |

Effect on the Word-patch Alignment. In Figure 4, we compare the word-patch alignment of the models trained with the proposed cross-modal late interaction and with the late interaction in Khattab & Zaharia (2020). According to the visualizations, using the original late interaction in Khattab & Zaharia (2020) leads to less accurate word-patch alignment. Specifically, the object patches are often aligned to the padded tokens instead of the class names. We speculate this is because the padded tokens learn representations similar to those of existing key textual tokens, in line with the finding in Section 3.2 of Khattab & Zaharia (2020) that padding with masked tokens (called query augmentation there) tends to "re-weigh existing terms based on their importance for matching the query".

Figure 4: Comparison of word-patch alignment between the proposed cross-modal late interaction and that in ColBERT (Khattab & Zaharia, 2020), on the same ImageNet classes as Figure 3; for each class, the raw image and the patch predictions of FILIP and of the ColBERT-style late interaction are shown. "a photo of a {label}." is the prompt. Numbers in the parentheses after the class label indicate the location indices of the class label in the tokenized textual sequence. Correct predictions of the class labels are highlighted by opaque patches with the class label indices in red. Incorrect predictions of the padded tokens are highlighted by opaque patches with the padded token indices in blue.

A.8 ABLATION ON THE FULL PRE-TRAINING DATASET

In Table 16, we compare the proposed cross-modal late interaction loss with the original CLIP loss (Radford et al., 2021) on the full pre-training dataset introduced in Section 3.3. In Table 16, CLIP denotes the results reported in the CLIP paper, and CLIPrep is our reproduced CLIP with the original contrastive loss, using exactly the same architecture and the same pre-training dataset as FILIPbase. As can be seen, FILIPbase has a 6.7-point higher average accuracy than CLIPrep over the 12 datasets. This further verifies that the performance gain of FILIP comes from the proposed cross-modal late interaction, rather than from the data or the architecture.
Table 16: Top-1 accuracy (%) on image classification on 12 datasets (including Stanford Cars and Oxford Pets). CLIPrep is our reproduced CLIP trained with the same training data and evaluated with the same prompts as our FILIP. With the same backbone architecture, our FILIP significantly improves the zero-shot top-1 average accuracy over the 12 datasets.

| Model | Per-dataset top-1 accuracy (%) | Average |
|---|---|---|
| CLIP | 91.3, 65.1, 87.9, 59.4, 66.7, 84.4, 63.2, 44.5, 21.2, 87.0, 49.4, 63.2 | 65.3 |
| CLIPrep | 82.0, 57.5, 89.9, 45.1, 80.7, 75.1, 63.6, 46.7, 33.7, 82.7, 49.0, 64.2 | 64.2 |
| FILIPbase | 86.9, 65.5, 91.9, 55.4, 85.3, 82.8, 69.1, 49.3, 57.2, 88.1, 49.9, 68.8 | 70.9 |

A.9 INFERENCE TIME OF IMAGE-TEXT RETRIEVAL

Setting. In this section, we test the inference time of both image retrieval and text retrieval on the test sets of Flickr30K and MSCOCO. We compare our proposed FILIPlarge against SCAN (Lee et al., 2018) and CLIP (ViT-L/14) (Radford et al., 2021), testing the inference time of CLIP and SCAN with their released code. For image retrieval, we precompute the image features and report the inference time for one text query, which contains (i) the time to extract the feature of the text query, and (ii) the time of the similarity calculation with all images and the ranking. Similarly, for text retrieval, we precompute the text features and report the inference time for one image query, which contains (i) the time to extract the feature of the image query, and (ii) the time of the similarity calculation with all texts and the ranking. The test set of Flickr30K contains 1,000 images and 5,000 texts, while the test set of MSCOCO contains 5,000 images and 25,000 texts. The reported time is averaged over 1,000 runs.

Table 17: Comparison of performance and inference time for image-text retrieval on the Flickr30K and MSCOCO datasets.

| Method | Flickr30K image-to-text R@1/R@5/R@10 | Flickr30K text-to-image R@1/R@5/R@10 | MSCOCO image-to-text R@1/R@5/R@10 | MSCOCO text-to-image R@1/R@5/R@10 | Time (Flickr30K, i-to-t) | Time (Flickr30K, t-to-i) | Time (MSCOCO, i-to-t) | Time (MSCOCO, t-to-i) |
|---|---|---|---|---|---|---|---|---|
| SCAN | 67.4 / 90.3 / 95.8 | 48.6 / 77.7 / 85.2 | 50.4 / 82.2 / 90.0 | 38.6 / 69.3 / 80.4 | 4.47s | 7ms | 21.3s | 26ms |
| CLIP | 88.0 / 98.7 / 99.4 | 68.7 / 90.6 / 95.2 | 58.4 / 81.5 / 88.1 | 37.8 / 62.4 / 72.2 | 23ms | 8ms | 24ms | 9ms |
| FILIP | 96.6 / 100.0 / 100.0 | 87.1 / 97.7 / 99.1 | 78.9 / 94.4 / 97.4 | 61.2 / 84.3 / 90.6 | 24ms | 8ms | 26ms | 9ms |

Results. The inference time of retrieval is shown in Table 17. Benefiting from the efficiency optimizations (i.e., FP16 quantization and the reduced feature dimension) in Section 3.1, the inference time of FILIP is close to that of CLIP. In image retrieval, SCAN is slightly faster than FILIP on Flickr30K with 1,000 images, because SCAN uses a lightweight GRU as the text encoder. However, SCAN is much slower than FILIP (about 17ms slower per query) on MSCOCO with more (i.e., 5,000) images, because of the slower computation involved in its two-stage stacked cross-attention when computing the similarity. For text retrieval, SCAN is much slower than both FILIP and its own image retrieval, mainly for three reasons: (i) its image encoder is a Faster R-CNN, which is more expensive than its lightweight GRU text encoder; (ii) there are 5 times more text candidates than there are image candidates in image retrieval; and (iii) the similarity computation of SCAN relies on cross-attention, which is not straightforward to parallelize, even in their official code, while FILIP's similarity computation is simply a matrix multiplication and is readily optimized on most modern hardware.
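As a sketch of why this matrix-multiplication formulation keeps per-query inference cheap, the hypothetical helper below ranks pre-computed image token features for a single text query using the averaged max-over-tokens similarity; the tensor layout and function name are our assumptions, not the released retrieval code.

```python
import torch

def rank_images_for_text(txt_tokens, img_token_bank):
    """Rank pre-computed image features for one text query (text-to-image retrieval).

    txt_tokens:     (n2, d)    L2-normalized, non-padded tokens of the text query
    img_token_bank: (N, n1, d) pre-computed, L2-normalized patch features of N images
    Returns image indices sorted by decreasing similarity to the query.
    """
    # Token-wise similarities between every candidate image and the query: (N, n1, n2)
    sim = torch.einsum("npd,td->npt", img_token_bank, txt_tokens)
    # For each textual token, take the max over that image's patches, then average.
    scores = sim.max(dim=1).values.mean(dim=1)           # (N,)
    return scores.argsort(descending=True)
```

Since the image token bank is computed offline once, each query costs one text-encoder forward pass plus this batched similarity and ranking, which is the per-query cost that Table 17 reports.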