# Fine-Grained Semantically Aligned Vision-Language Pre-Training

Juncheng Li 1,2, Xin He 2, Longhui Wei 2, Long Qian 1, Linchao Zhu 1, Lingxi Xie 2, Yueting Zhuang 1, Qi Tian 2, Siliang Tang 1
1 Zhejiang University, 2 Huawei Cloud
{junchengli, qianlong0926, zhulinchao, yzhuang, siliang}@zju.edu.cn
{hexin80, weilonghui1, tian.qi1}@huawei.com, 198808xc@gmail.com

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs. The repository of this work is at https://github.com/YYJMJC/LOUPE.

Work done when interning at Huawei Cloud. Equal Contribution. Corresponding Authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

1 Introduction

Learning transferable cross-modal representations from large-scale vision-language pre-training has exhibited remarkable performance on a wide variety of downstream tasks. Most existing works can be classified into two categories: dual-encoder and fusion-encoder. The dual-encoder methods [17, 26, 35, 49] adopt two separate encoders to embed images and texts, and model the cross-modal alignment by the cosine similarity between the global features of images and texts. While such an architecture is efficient for large-scale image-text retrieval, as image and text representations can be pre-computed offline, it fails to model fine-grained semantic alignment between visual regions and textual phrases. On the other hand, the fusion-encoder methods [7, 19, 20, 25, 30, 34, 43, 42] attempt to use a single multi-modal encoder to jointly model the concatenated sequence of images and texts. These methods simulate soft alignment via advanced cross-modal attention [45]. However, they can only learn implicit alignment by end-to-end training, lacking explicit supervision to encourage semantic alignment between visual regions and textual phrases. Moreover, the learned cross-modal attention matrices are often scattered and uninterpretable. Further, they are inefficient for retrieval since they require jointly encoding every image-text pair during inference.
Learning fine-grained semantic alignment from image-text pre-training is crucial to many cross-modal reasoning tasks (e.g., visual grounding [51], image captioning [48]), but it is particularly challenging as the alignment information between visual regions and textual phrases is not available, making fine-grained semantic alignment learning a weakly-supervised learning problem. In this paper, we address this problem while simultaneously maintaining high retrieval efficiency by proposing LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, from the novel perspective of game theory. We formulate input patch and word tokens as multiple players in a cooperative game and quantify the game-theoretic interactions (i.e., Shapley interaction [12, 40]) among them to investigate the semantic alignment information. LOUPE learns fine-grained semantic alignment in two stages: token-level Shapley interaction modeling and semantics-level Shapley interaction modeling, where we first learn to identify semantic regions of images that correspond to some semantically meaningful entities, and then align these regions with phrases in the paired text.

Specifically, token-level Shapley interaction modeling aims to group patch tokens of images into semantic regions that semantically correspond to some visual instances. From the game-theoretic view, we take patch tokens as players and the similarity score between images and texts as the game function. Intuitively, supposing a set of patch tokens corresponds to a visual instance in the image, then these tokens tend to have strong interaction to form the complete semantics of the corresponding instance, which contributes to a better similarity judgment with the paired text. Based on this insight, we take the token-level Shapley interaction as soft supervision labels to encourage the model to capture semantic regions from images. Then, semantics-level Shapley interaction modeling infers the fine-grained semantic alignment between semantic regions and phrases. We consider every region and phrase as players and define a fine-grained similarity score as the game function. If a region and a phrase have strong correspondence, they tend to interact with each other and contribute to the fine-grained similarity score. By measuring the Shapley interaction between each region-phrase pair, we obtain the alignment information to guide the pre-training model.

As computing the exact Shapley interaction is an NP-hard problem [32], existing methods mainly employ a sampling-based method [6] to obtain unbiased estimates. However, as the number of players grows, they require thousands of model evaluations. To reduce the computational cost, we further propose an efficient hybrid Shapley interaction learning strategy, where an uncertainty-aware neural Shapley interaction learning module cooperates with the sampling-based method. Experimental results show that our hybrid strategy significantly reduces the computational cost while maintaining the estimation accuracy. More analysis is shown in Section 4.5.

Our framework serves as a proxy training objective that explicitly establishes the fine-grained semantic alignment between local region and phrase representations. This proxy objective can be directly removed for downstream tasks, rendering an efficient and semantics-sensitive dual-encoder model. Experiments show that LOUPE achieves new state-of-the-art results on image-text retrieval benchmarks.
For text-to-image retrieval on MSCOCO, LOUPE surpasses its strongest competitor by 4.2% on recall@1. Further, without any fine-tuning, LOUPE successfully transfers to object detection and visual grounding tasks in a zero-shot manner. For object detection, it achieves 12.1% mAP on COCO and 19.5% mAP on PASCAL VOC. For visual grounding, it achieves 26.8% accuracy on RefCOCO and 23.6% accuracy on RefCOCO+. Our contributions are summarized as follows:

- We propose LOUPE, which explicitly learns fine-grained semantic alignment between visual regions and textual phrases while preserving the high retrieval efficiency of dual-encoders.
- We introduce an efficient and effective hybrid Shapley interaction learning strategy, based on an uncertainty-aware neural Shapley interaction learning module and a sampling-based method.
- Pre-trained on image-text data, LOUPE achieves new state-of-the-art results on image-text retrieval and successfully transfers to tasks that require more fine-grained object-level visual understanding (i.e., object detection and visual grounding) without any fine-tuning.
- As manual annotation for masses of object categories is time-consuming and unscalable, our work demonstrates a promising alternative, that is, learning fine-grained semantics from raw texts about images, which are easily available and contain a broader set of visual concepts.

2 Related Work

Vision-Language Pre-Training. The great success of the pre-train-and-fine-tune paradigm in natural language processing [5, 9] and computer vision [10, 14, 47] has been expanded to the joint domain of vision and language [2, 3, 22]. Dominant vision-language pre-training models can be categorized into two groups: dual-encoder and fusion-encoder. The dual-encoder methods [17, 26, 35, 49] adopt two individual encoders to embed images and texts separately, and model the cross-modal interaction by cosine similarity. Such an architecture is efficient for large-scale image-text retrieval as image and text representations can be pre-computed offline. However, simply measuring the cosine similarity between global representations is too shallow to capture fine-grained semantic relationships between regions and phrases. The fusion-encoder methods [7, 15, 16, 19, 20, 24, 25, 30, 34, 42, 43, 50, 55] adopt a single multi-modal encoder to jointly model the concatenated sequence of images and texts, which achieves deeper cross-modal interaction. However, these methods are less efficient as images and texts are intertwined to compute the cross-modal attention and cannot be pre-computed offline. Further, there are no explicit supervision signals to encourage the alignment between regions and phrases. Some works [7, 24, 25, 30, 43, 50, 55, 56] attempt to leverage an off-the-shelf object detector to extract object features for pre-training. However, the detector is usually pre-trained on limited object categories. Furthermore, considering the excessive demand on memory and computation, existing methods usually fix the parameters of detection models and regard region detection as a pre-processing step, disconnected from vision-language pre-training. Thus, the performance is also restricted by the quality of detection models. FILIP [49] uses a token-wise maximum similarity to enhance the cross-modal interaction of dual-encoder methods. To learn explicit fine-grained semantic alignment, GLIP [23] and X-VLM [53] utilize human-annotated datasets, where regions with bounding-box annotations are aligned with text descriptions.
Such a manner is time-consuming and hard to scale to larger raw image-text data from the Internet. In contrast, our proposed framework explicitly learns the fine-grained semantic alignment from raw image-text data and at the same time maintains the high efficiency of dual-encoders. Detailed discussions can be found in Appendix K.

Shapley Values. The Shapley value [40] was initially introduced in game theory. It has been theoretically proven to be the unique metric that fairly estimates the contribution of each player in a cooperative game such that certain desirable axioms are satisfied [46]. With solid theoretical foundations, the Shapley value has recently been studied as a post-hoc explanation method for Deep Neural Networks (DNNs) [8, 31, 54]. Lundberg et al. [31] propose a unified attribution method based on the Shapley value to interpret the predictions of DNNs. Ren et al. [38] propose to explain adversarial attacks by the Shapley value. In this paper, we propose to model fine-grained semantic alignment by game-theoretic interactions, along with an efficient Shapley interaction learning strategy.

In this section, we first introduce the problem formulation of fine-grained semantically aligned vision-language pre-training in Section 3.1. Then, we propose the corresponding LOUPE framework for fine-grained semantic alignment learning in Section 3.2 and an efficient approach for Shapley interaction learning in Section 3.3.

3.1 Problem Formulation and Model Overview

Generally, vision-language pre-training aims to learn an image encoder $f_I$ and a text encoder $f_T$ by cross-modal contrastive learning, where the matched image-text pairs are optimized to get closer and the mismatched pairs are optimized to get farther apart. Let $f_I(I_i)$ and $f_T(T_i)$ denote the global representations of the image and text. Then the cross-modal contrastive loss can be formulated as:

$$\mathcal{L}_{\mathrm{CMC}} = -\log \frac{\exp\big(f_I(I_i)^{\top} f_T(T_i)/\tau\big)}{\sum_{j}^{B} \exp\big(f_I(I_i)^{\top} f_T(T_j)/\tau\big)} - \log \frac{\exp\big(f_I(I_i)^{\top} f_T(T_i)/\tau\big)}{\sum_{j}^{B} \exp\big(f_I(I_j)^{\top} f_T(T_i)/\tau\big)} \tag{1}$$

where $B$ is the batch size and $\tau$ is the temperature hyper-parameter. While intuitive, such a manner can only learn coarse alignment between images and texts but fails to explicitly capture the fine-grained semantic alignment between visual regions and textual phrases. To learn fine-grained semantic alignment while simultaneously maintaining high retrieval efficiency, we propose LOUPE, a fine-grained semantically aligned vision-language pre-training framework that germinates from cooperative game theory.

Figure 1: Overview of LOUPE. Our framework serves as a proxy training objective that encourages the image encoder to capture semantic regions and establishes the semantic alignment between region and phrase representations. The proxy training objective can be easily removed for downstream tasks, rendering an efficient and semantics-sensitive dual-encoder.

As illustrated in Figure 1, LOUPE learns fine-grained semantic alignment in two stages: token-level Shapley interaction modeling and semantics-level Shapley interaction modeling.
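To make Equation 1 concrete, the following is a minimal PyTorch-style sketch of the symmetric cross-modal contrastive loss over a batch of pre-computed global embeddings. The function name and the batch-mean reduction are illustrative choices on our part, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss of Eq. (1).

    img_emb, txt_emb: [B, d] global embeddings f_I(I_i), f_T(T_i).
    tau: temperature hyper-parameter.
    """
    img_emb = F.normalize(img_emb, dim=-1)   # cosine similarity = dot product of unit vectors
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau     # [B, B] pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return loss_i2t + loss_t2i                       # averaged over the batch
```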
For token-level Shapley interaction modeling, we learn to aggregate patch tokens of images into semantic regions that semantically correspond to some visual concepts, under the guidance of the token-based semantic aggregation loss $\mathcal{L}_{\mathrm{TSA}}$. As for semantics-level Shapley interaction modeling, the semantic alignment between the aggregated regions and textual phrases is learned, supervised by the fine-grained semantic alignment loss $\mathcal{L}_{\mathrm{FSA}}$. Combined with the two newly proposed losses, the full objective of fine-grained semantically aligned vision-language pre-training can be formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CMC}} + \mathcal{L}_{\mathrm{TSA}} + \mathcal{L}_{\mathrm{FSA}} \tag{2}$$

Such a new pre-training objective enforces the image encoder to capture semantic regions and establishes fine-grained semantic alignment between visual regions and textual phrases. During inference, it can be directly removed, rendering an efficient and semantics-sensitive dual-encoder.

3.2 Interpreting Fine-Grained Semantic Alignment as Game-Theoretic Interaction

3.2.1 Preliminaries

Shapley Values. The Shapley value [40] is a classic game theory solution for the unbiased estimation of the importance or contribution of each player in a cooperative game. Consider a game with $N = \{1, \dots, n\}$ players, where $S \subseteq N$ denotes a potential subset of players. A game $v(\cdot)$ is implemented as a function that maps each subset $S$ of players to a score, modeling the outcome of a game when the players in $S$ participate. Specifically, $v(N) - v(\emptyset)$ denotes the contribution obtained by all players in the game. The Shapley value $\phi(i|N)$ for player $i$ is defined as the average marginal contribution of player $i$ to all possible coalitions $S$ that are formed without $i$:

$$\phi(i|N) = \sum_{S \subseteq N \setminus \{i\}} p(S)\,\big[v(S \cup \{i\}) - v(S)\big], \qquad p(S) = \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \tag{3}$$

where $p(S)$ is the likelihood of $S$ being sampled. The Shapley value has been proved to be the unique metric that satisfies the following axioms: Linearity, Symmetry, Dummy, and Efficiency [46]. We summarize these axioms in Appendix B.

Shapley Interaction. In game theory, some players tend to form a coalition and always participate in the game together. The players in the coalition might interact or cooperate with each other, which brings an additional contribution to the game. The Shapley interaction [12] measures this additional contribution brought by the coalition compared with the case where the players work individually. For a coalition $S$, we consider $[S]$ as a single hypothetical player, which is the union of the players in $S$. Then, the reduced game is formed by removing the individual players in $S$ from the game and adding $[S]$ to the game. The Shapley value $\phi([S]\,|\,N \setminus S \cup \{[S]\})$ for player $[S]$ can be computed using Equation 3 over the reduced game. Similarly, we can obtain $\phi(i\,|\,N \setminus S \cup \{i\})$, where $i$ is an individual player in $S$. Finally, the Shapley interaction for coalition $S$ is formulated as:

$$I([S]) = \phi([S]\,|\,N \setminus S \cup \{[S]\}) - \sum_{i \in S} \phi(i\,|\,N \setminus S \cup \{i\}) \tag{4}$$

In this way, $I([S])$ reflects the interactions inside $S$. A higher value of $I([S])$ indicates that the players in $S$ cooperate closely with each other.
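To illustrate Equations 3 and 4, the sketch below computes the Shapley value and the Shapley interaction exactly by enumerating all coalitions of a toy game. Exact enumeration is only feasible for a handful of players; it is meant to make the definitions concrete, not to be the estimator used during pre-training.

```python
from itertools import combinations
from math import factorial

def shapley_value(players, v, i):
    """Exact Shapley value phi(i|N) of player i for game v (Eq. 3)."""
    others = [p for p in players if p != i]
    n = len(players)
    phi = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

def shapley_interaction(players, v, coalition):
    """Shapley interaction I([S]) of a coalition (Eq. 4).

    The coalition is treated as one hypothetical player in a reduced game,
    and the Shapley values of its members acting individually are subtracted.
    """
    coalition = frozenset(coalition)
    rest = [p for p in players if p not in coalition]

    def reduced_game(S):  # game where the coalition token expands to its members
        S = set(S)
        return v((S - {coalition}) | (set(coalition) if coalition in S else set()))

    i_coalition = shapley_value(rest + [coalition], reduced_game, coalition)
    i_individual = sum(shapley_value(rest + [p], v, p) for p in coalition)
    return i_coalition - i_individual

# toy game: the value of a set of players is its squared size (strong cooperation)
players = [0, 1, 2, 3]
v = lambda S: float(len(S)) ** 2
print(shapley_interaction(players, v, {0, 1}))  # > 0: players 0 and 1 benefit from cooperating
```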
3.2.2 Token-Level Shapley Interaction Modeling

Due to the inherent semantic unit mismatch between texts and images, it is ineffective to directly compute the alignment between words and pixels (patches). A textual phrase usually refers to a specific image region, which is composed of multiple patches and represents a visual instance. Thus, we first introduce token-level Shapley interaction modeling to aggregate patches into semantic regions.

Input Representations. Given an image-text pair, the input image $I$ is sliced into patches and flattened. After a linear projection layer and the addition of position embeddings, we obtain the patch token sequence $X^I = \{x^I_i\}_{i=1}^{L_1}$ with an additional [CLS_I] token embedding. The input text $T$ is tokenized and embedded into the word token sequence $X^T = \{x^T_i\}_{i=1}^{L_2}$, added with position embeddings. We also prepend a learnable special token [CLS_T] to the word token sequence. Then, we adopt a dual-encoder structure to encode the patch token sequence and word token sequence separately. On top of the image and text encoders, we obtain the representations of the patch token sequence $\bar{X}^I = \{\bar{x}^I_i\}_{i=1}^{L_1}$ and the word token sequence $\bar{X}^T = \{\bar{x}^T_i\}_{i=1}^{L_2}$. We take the learned representations of the [CLS_I] and [CLS_T] tokens as the global representations for images and texts, and the global similarity of an image-text pair is measured by the cosine similarity between them.

Understanding Semantic Regions via Shapley Interaction. Supposing a set of patches represents a complete visual instance in an image, then these patches tend to have a strong Shapley interaction because they work jointly to form a visual instance, which contributes to the better similarity judgment with the text. From the game-theoretic view, we take patch tokens and word tokens as players $X = X^I \cup X^T$, and the global similarity between images and texts as the game score $v_1(\cdot)$. To compute $v_1(S)$, we keep the tokens in $S$ and mask the input tokens in $X \setminus S$ to zeros. Thus, the global similarity only considers the tokens in $S$, which reflects the contribution of the tokens in $S$ to the global similarity judgment.

Semantic Region Generation. Inspired by YOLOv3 [37], we design a lightweight region generation module. It takes each patch token representation $\bar{x}^I_i$ as input and generates a bounding box prediction centered on $x^I_i$, which corresponds to a visual region $R_i = \{x^I_{i,k}\}_{k=1}^{K_i}$ with $K_i$ patch tokens. The region generation module also predicts a confidence score $s(R_i)$ for each region. We select the top-$M$ predictions as the semantic regions. Then, the Shapley interaction of $R_i$ can be defined as:

$$I([R_i]) = \phi([R_i]\,|\,X \setminus R_i \cup \{[R_i]\}) - \sum_{x^I_{i,k} \in R_i} \phi(x^I_{i,k}\,|\,X \setminus R_i \cup \{x^I_{i,k}\}) \tag{5}$$

According to Equation 3, we can reformulate the Shapley value in the form of an expectation:

$$\phi([R_i]\,|\,X \setminus R_i \cup \{[R_i]\}) = \mathbb{E}_{c}\Big\{\mathbb{E}_{\substack{S \subseteq X \setminus R_i \\ |S| = c}}\big[v_1(S \cup R_i) - v_1(S)\big]\Big\} \tag{6}$$

where $c$ represents the coalition size. $\phi(x^I_{i,k}\,|\,X \setminus R_i \cup \{x^I_{i,k}\})$ can be defined in a similar manner, and the Shapley interaction of $R_i$ can be reformulated as (we provide the proof in Appendix C):

$$I([R_i]) = \mathbb{E}_{c}\Big\{\mathbb{E}_{\substack{S \subseteq X \setminus R_i \\ |S| = c}}\Big[v_1(S \cup R_i) - \sum_{x^I_{i,k} \in R_i} v_1(S \cup \{x^I_{i,k}\}) + (K_i - 1)\,v_1(S)\Big]\Big\} \tag{7}$$

Taking the normalized $I^{*}([R_i])$ as the soft supervision label, the token-based semantic aggregation loss is defined as a cross-entropy loss:

$$\mathcal{L}_{\mathrm{TSA}} = -\frac{1}{M}\sum_{i=1}^{M}\Big[I^{*}([R_i])\log\big(s(R_i)\big) + \big(1 - I^{*}([R_i])\big)\log\big(1 - s(R_i)\big)\Big] \tag{8}$$

which propagates gradients to the region generation module and the image encoder to adjust the bounding box predictions such that more accurate semantic regions can be captured.
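A minimal sketch of how the expectation in Equation 7 can be approximated by Monte-Carlo sampling and turned into the soft-label loss of Equation 8. Here `v1` stands for the game function (the global image-text similarity with tokens outside the selected subset masked to zero), and the min-max normalization of the interaction values is one simple choice for the unspecified normalization; all names are illustrative rather than taken from the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def estimate_region_interaction(v1, all_tokens, region_tokens, num_samples=100):
    """Monte-Carlo estimate of the token-level Shapley interaction I([R_i]) in Eq. (7).

    v1(subset): game score, e.g. image-text similarity with tokens outside `subset` zero-masked.
    all_tokens: list of token ids (patch + word tokens); region_tokens: ids belonging to R_i.
    """
    context = [t for t in all_tokens if t not in set(region_tokens)]
    k = len(region_tokens)
    est = 0.0
    for _ in range(num_samples):
        c = random.randint(0, len(context))              # coalition size
        S = random.sample(context, c)                    # random coalition S without R_i
        term = v1(S + list(region_tokens))               # v1(S ∪ R_i)
        term -= sum(v1(S + [t]) for t in region_tokens)  # - Σ_k v1(S ∪ {x_ik})
        term += (k - 1) * v1(S)                          # + (K_i - 1) v1(S)
        est += term
    return est / num_samples

def token_semantic_aggregation_loss(interactions, confidences):
    """Eq. (8): binary cross-entropy with normalized interactions as soft labels.

    interactions: [M] estimated I([R_i]); confidences: [M] region scores s(R_i) in (0, 1).
    Min-max normalization is an assumption; the paper only states the labels are normalized.
    """
    labels = (interactions - interactions.min()) / (interactions.max() - interactions.min() + 1e-6)
    return F.binary_cross_entropy(confidences, labels.detach())
```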
3.2.3 Semantics-Level Shapley Interaction Modeling

After obtaining the inferred semantic regions, we propose semantics-level Shapley interaction modeling to explicitly model the fine-grained semantic alignment between regions and phrases. We first define the fine-grained similarity score and then explain semantic alignment based on game theory.

We adopt Avg-Pooling over the learned patch representations in each $R_i$ to obtain the region representation $h^I_i \in \mathbb{R}^d$. We employ an off-the-shelf constituency parser to extract phrases from the text and obtain the phrase representation $h^T_j \in \mathbb{R}^d$ by Avg-Pooling. In total, we obtain $M$ regions $H^I = \{h^I_i\}_{i=1}^{M}$ and $N$ phrases $H^T = \{h^T_j\}_{j=1}^{N}$. The alignment matrix can then be defined as $A = [a_{ij}]_{M \times N}$, where $a_{ij} = {h^I_i}^{\top} h^T_j$ represents the alignment score between the $i$-th region and the $j$-th phrase. Next, we apply softmax-normalization over each row of $A$, obtaining $\bar{A}$. For the $i$-th region, we calculate its maximum alignment score as $\max_j \bar{a}_{ij}$. Then, we use the average maximum alignment score over all regions as the fine-grained image-to-text similarity $p_1$. Similarly, we can obtain the fine-grained text-to-image similarity $p_2$, and the total fine-grained similarity score can be defined as $p = (p_1 + p_2)/2$.

Understanding Semantic Alignment via Shapley Interaction. If a region and a phrase have strong semantic correspondence, they tend to cooperate with each other and contribute to the fine-grained similarity score. Thus, we can consider $H = H^I \cup H^T$ as the players and the fine-grained similarity score $p$ as the game score $v_2(\cdot)$. The Shapley interaction between the $i$-th region and the $j$-th phrase can be formulated as:

$$I([H_{ij}]) = \phi([H_{ij}]\,|\,H \setminus H_{ij} \cup \{[H_{ij}]\}) - \phi(h^I_i\,|\,H \setminus H_{ij} \cup \{h^I_i\}) - \phi(h^T_j\,|\,H \setminus H_{ij} \cup \{h^T_j\}) \tag{9}$$

$$= \mathbb{E}_{c}\Big\{\mathbb{E}_{\substack{S \subseteq H \setminus H_{ij} \\ |S| = c}}\big[v_2(S \cup H_{ij}) - v_2(S \cup \{h^I_i\}) - v_2(S \cup \{h^T_j\}) + v_2(S)\big]\Big\} \tag{10}$$

where $[H_{ij}]$ represents the single player formed by the coalition of the $i$-th region and the $j$-th phrase. Taking the normalized $I^{*}([H_{ij}])$ as soft labels, the fine-grained semantic alignment loss can be defined as:

$$\mathcal{L}_{\mathrm{FSA}} = -\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} I^{*}([H_{ij}])\,\log(\bar{a}_{ij}) \tag{11}$$
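The fine-grained similarity score p of Section 3.2.3 can be written compactly as below. This is a schematic re-implementation of the textual description (with the text-to-image direction obtained by normalizing over columns, by analogy), assuming region and phrase embeddings are provided as matrices; it is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def fine_grained_similarity(regions, phrases):
    """Fine-grained image-text similarity p from Section 3.2.3.

    regions: [M, d] region embeddings h^I; phrases: [N, d] phrase embeddings h^T.
    """
    regions = F.normalize(regions, dim=-1)
    phrases = F.normalize(phrases, dim=-1)
    A = regions @ phrases.t()                 # alignment matrix [M, N], a_ij = h^I_i · h^T_j

    A_row = A.softmax(dim=1)                  # row-normalized scores (image-to-text)
    p1 = A_row.max(dim=1).values.mean()       # average maximum alignment over regions

    A_col = A.softmax(dim=0)                  # column-normalized scores (text-to-image)
    p2 = A_col.max(dim=0).values.mean()       # average maximum alignment over phrases

    return (p1 + p2) / 2                      # total fine-grained similarity score p
```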
3.3 Uncertainty-Aware Neural Shapley Interaction Learning

According to Equation 3 and Equation 4, computing the exact Shapley value is an NP-hard problem [32]. Previous methods mainly apply a sampling-based method [6] to approximate it. While the sampling-based approximation is unbiased, an accurate approximation requires thousands of model evaluations. To reduce the computational cost, we propose an uncertainty-aware neural Shapley interaction learning (UNSIL) module to cooperate with the sampling-based method, rendering an efficient and effective hybrid strategy.

Specifically, the sampling-based method [6] estimates the expectation terms in Equation 7 and Equation 10 by sampling to compute the Shapley interaction. Inspired by noisy label learning [18], the UNSIL module learns to predict the Shapley interaction and the corresponding uncertainty $\sigma \in (0, 1)$. Intuitively, if the UNSIL module makes a prediction with low uncertainty, we can directly apply its prediction to $\mathcal{L}_{\mathrm{TSA}}$ and $\mathcal{L}_{\mathrm{FSA}}$, avoiding thousands of model evaluations. If the uncertainty is high, we then resort to the sampling-based method for a more accurate estimation.

During training, the UNSIL module first predicts the target interaction with uncertainty $\sigma$. Then, we sample a value $\epsilon$ from a uniform distribution on $(0, 1)$. If $\epsilon > \sigma$, we directly use its prediction. If $\epsilon \leq \sigma$, we then use the sampling-based method to compute the Shapley interaction and update the UNSIL module based on the sampling-based results. Note that, for the first few iterations, we employ the sampling-based method directly, and use its results to train the UNSIL module. Let $I^{*}$ and $\hat{I}$ denote the results from the sampling-based method and the UNSIL module, respectively. Taking $I^{*}$ as the ground truth, the UNSIL module is trained by:

$$\mathcal{L}_{\mathrm{UNSIL}} = \frac{1}{\beta_1 \sigma}\,\mathcal{L}_{\mathrm{MSE}}(\hat{I}, I^{*}) + \beta_2 \sigma \tag{12}$$

where the first term is the mean squared error $\mathcal{L}_{\mathrm{MSE}}$ weighted by the uncertainty, the second term serves as a regularization term for the prediction uncertainty, and $\beta_1$, $\beta_2$ are weight hyper-parameters. The UNSIL module implicitly learns the uncertainty from the regression loss function. We discuss the implementation details of the UNSIL module in Section 4.5 and Appendix D.

4 Experiments

4.1 Pre-training Details

As sufficient data is a prerequisite for vision-language pre-training, we construct a dataset with 240M image-text pairs from the Internet. We implement the image encoder with Swin-L [28] and the text encoder with BERT-Small [9]. The input images are resized to 224 × 224 and the input texts are tokenized by WordPiece with a maximum length of 60. We pre-train the model for 20 epochs using a batch size of 512 on 128 NVIDIA V100 GPUs. We utilize the AdamW [29] optimizer with a learning rate of $2 \times 10^{-4}$ and a weight decay of 0.01. More pre-training and evaluation details are provided in Appendix D, E. We also analyze the image encoder and training efficiency in Appendix G, J.

4.2 Zero-Shot Image-Text Retrieval

We compare LOUPE on the widely used MSCOCO [27] and Flickr30K [33] datasets. First, the results in Table 1 show that LOUPE achieves new state-of-the-art zero-shot performance on most metrics of the two datasets, demonstrating the stronger generalizability of our pre-training framework. Second, while previous works mainly pre-train on larger datasets (CLIP 400M, ALIGN 1800M, FILIP 340M), LOUPE still achieves superior performance using less training data (240M). Third, compared with FILIP, which directly computes token-wise similarity, our model captures semantic alignment between visual regions and textual phrases, which is more semantically meaningful. For text-to-image retrieval on MSCOCO, LOUPE significantly outperforms FILIP by 4.2% on recall@1.

Table 1: Results (%) of zero-shot image-text retrieval on Flickr30K and MSCOCO datasets.

| Method | Flickr30K I2T R@1 | R@5 | R@10 | Flickr30K T2I R@1 | R@5 | R@10 | MSCOCO I2T R@1 | R@5 | R@10 | MSCOCO T2I R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageBERT | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 |
| UNITER | 83.6 | 95.7 | 97.7 | 68.7 | 89.2 | 93.9 | - | - | - | - | - | - |
| CLIP | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 |
| ALIGN | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 |
| FILIP | 89.8 | 99.2 | 99.8 | 75.0 | 93.4 | 96.3 | 61.3 | 84.3 | 90.4 | 45.9 | 70.6 | 79.3 |
| LOUPE | 90.5 | 99.5 | 99.8 | 76.3 | 93.9 | 96.7 | 62.3 | 85.1 | 91.2 | 50.1 | 75.4 | 83.3 |

4.3 Zero-Shot Image Classification

In this section, we evaluate LOUPE on the zero-shot image classification task. We compare LOUPE with CLIP on 11 downstream classification datasets, following the same evaluation setting as CLIP [35].

Table 2: Top-1 accuracy (%) of zero-shot image classification over 11 datasets (including Stanford Cars and Oxford Pets).
CLIP: 96.2, 92.9, 77.3, 67.7, 78.7, 34.9, 57.7, 36.1, 93.5, 92.6, 75.3, 73.0
LOUPE: 95.9, 94.3, 79.9, 69.8, 87.4, 37.8, 53.3, 54.9, 94.1, 93.9, 76.1, 76.1

Table 2 summarizes the results. As shown in Table 2, our LOUPE outperforms CLIP with an average improvement of 3.1%. Notably, on ImageNet, the largest dataset among the 11 datasets, our LOUPE surpasses CLIP by 0.8%. Also, we observe that LOUPE achieves substantial performance gains on several fine-grained image classification datasets (i.e., Flowers102 and Aircrafts). It demonstrates the superiority of our LOUPE on fine-grained semantics understanding. We also evaluate the linear probing performance of our LOUPE on image classification. The detailed results can be found in Appendix I.
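As a concrete illustration of the CLIP-style zero-shot classification protocol used above, a minimal sketch: class names are expanded with a prompt template, encoded by the text encoder, and each image is assigned the class with the highest cosine similarity. The encoder interfaces (`image_encoder`, `text_encoder`, `tokenize`) are placeholders for whatever the pre-trained dual-encoder exposes, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize,
                       template="a photo of a {}."):
    """CLIP-style zero-shot classification with a prompt template."""
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # [C, d], one embedding per class
    img_emb = F.normalize(image_encoder(image), dim=-1)               # [1, d] global image embedding
    logits = img_emb @ text_emb.t()                                   # cosine similarity to every class
    return class_names[logits.argmax(dim=-1).item()]
```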
4.4 Zero-Shot Transfer to Object Detection and Visual Grounding

To answer whether our model has learned fine-grained semantics, we further evaluate LOUPE on object detection [39] and visual grounding [51], which require more fine-grained semantic understanding ability to identify specific visual regions in images according to object labels or referring expressions. Visual grounding can be seen as generalized object detection, where the pre-defined class labels are replaced by referring expression sentences. As LOUPE can generate a set of semantic regions that are aligned with textual phrases, it can be easily applied to object detection and visual grounding without structural modification. For visual grounding, we take referring expressions as the input text. For object detection, as illustrated in Figure 2, we use prompts to expand detection labels into input text (e.g., "A photo of person"). Then, we encode the input text with the learned text encoder, and these tasks can be completed by measuring the similarity between candidate region representations and text representations.

Figure 2: An example of LOUPE zero-shot transferring to object detection using prompt templates.

For comparison, we zero-shot transfer CLIP (ViT-L/14) to object detection and visual grounding by applying several non-parametric approaches on the spatial feature maps of CLIP. We also compare with AdaptCLIP [21], a concurrent unpublished method that leverages classic superpixel (SLIC [1]) and bounding box proposal (selective search [44]) methods to zero-shot transfer CLIP to phrase localization. We use its official public implementation to obtain the experimental results. For object detection, we evaluate the mean Average Precision (mAP) at IoU thresholds of {0.3, 0.5} on COCO [27] (65 classes) and PASCAL VOC [11] (20 classes). For visual grounding, we evaluate the top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [51] and RefCOCO+ [51]. The experiment details of the CLIP variants and LOUPE are provided in Appendix E.

Table 3: Without fine-tuning, zero-shot transfer performance on object detection and visual grounding.

| Method | COCO mAP@0.3 | COCO mAP@0.5 | PASCAL VOC mAP@0.3 | PASCAL VOC mAP@0.5 | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP + Pixel-Wise | 8.5 | 4.5 | 18.2 | 7.3 | 6.7 | 6.2 | 5.8 | 6.1 | 7.0 | 5.7 |
| CLIP + K-Means | 6.4 | 1.9 | 11.7 | 4.8 | 2.1 | 2.3 | 1.7 | 1.7 | 2.0 | 2.8 |
| CLIP + Grad-CAM | 7.1 | 3.2 | 19.1 | 8.2 | 5.5 | 5.2 | 4.8 | 4.4 | 5.6 | 4.9 |
| AdaptCLIP | 14.9 | 6.6 | 28.7 | 12.9 | 16.7 | 18.4 | 18.0 | 17.5 | 18.9 | 19.6 |
| LOUPE | 25.3 | 12.1 | 30.3 | 19.5 | 25.2 | 26.8 | 24.5 | 22.9 | 23.3 | 23.6 |

Table 3 summarizes the results. 1) Overall, LOUPE outperforms all CLIP variants by a large margin. The significantly higher performance illustrates the stronger zero-shot transfer ability of our fine-grained semantically aligned pre-training paradigm. 2) All CLIP variants rely on pre-processing steps over CLIP's feature maps (e.g., AdaptCLIP first uses SLIC to group pixels and then uses selective search to generate a large number of proposals), which is time-consuming. In contrast, our method directly predicts the semantic regions based on the patch token representations. 3) The consistently competitive performance across the four benchmarks validates that LOUPE can learn fine-grained semantics from raw text supervision. LOUPE demonstrates a promising alternative, that is, learning fine-grained semantics from large-scale raw image-text pairs, which are easily available and contain a broader set of visual concepts.
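The zero-shot detection transfer described above can be sketched as follows: each detection label is expanded into a prompt, the region generation module proposes regions with confidences, and every kept region is assigned its most similar prompt. The `propose_regions` helper and the fusion of confidence with similarity are assumptions for illustration; the paper does not specify these implementation details.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_detect(image, labels, text_encoder, tokenize, propose_regions, score_thresh=0.5):
    """Label each predicted semantic region by region-prompt similarity (zero-shot detection).

    propose_regions(image) -> (region_emb [M, d], boxes [M, 4], confidences [M]) stands in for the
    image encoder + region generation module + Avg-Pooling described in Section 3.
    """
    prompts = [f"A photo of {label}" for label in labels]
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # [C, d]

    region_emb, boxes, confidences = propose_regions(image)
    region_emb = F.normalize(region_emb, dim=-1)                      # [M, d]

    sims = region_emb @ text_emb.t()                                  # [M, C] region-label similarities
    best_sim, best_label = sims.max(dim=-1)
    keep = confidences * best_sim > score_thresh                      # simple score fusion (an assumption)
    return [(boxes[i], labels[int(best_label[i])]) for i in torch.nonzero(keep).flatten().tolist()]
```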
As time-consuming human annotations are unscalable for the massive number of object classes in the real world, some recent works [4, 36] aim to train object detectors with annotations on base object classes that generalize to the remaining object classes of the same dataset. The latest works [13, 52] leverage the generalizability of vision-language pre-training models to further improve the zero-shot performance on novel classes. However, these zero-shot approaches still require bounding box annotations on base classes for task-specific supervised learning. In contrast, our LOUPE is trained on large-scale raw image-text pairs, which are already accessible on the Internet and contain more diverse semantics.

4.5 Ablation Study

Table 4: Ablation study of each component across three tasks.

| | MSCOCO I2T | MSCOCO T2I | COCO mAP@0.3 | COCO mAP@0.5 | RefCOCO val | RefCOCO testA | RefCOCO testB | Training Time (sec/iter) |
|---|---|---|---|---|---|---|---|---|
| 1 Backbone | 31.0 | 24.8 | 3.8 | 1.0 | 1.3 | 0.9 | 0.8 | 1.17 |
| 2 Backbone + $\mathcal{L}_{\mathrm{TSA}}$ | 32.4 | 26.2 | 7.6 | 3.3 | 1.8 | 2.0 | 2.6 | 8.38 |
| 3 Backbone + $\mathcal{L}_{\mathrm{TSA}}$ + $\mathcal{L}_{\mathrm{FSA}}$ | 33.5 | 28.3 | 9.4 | 5.9 | 4.1 | 4.6 | 4.3 | 9.90 |
| 4 Backbone + $\mathcal{L}_{\mathrm{TSA}}$ + $\mathcal{L}_{\mathrm{FSA}}$ + UNSIL | 33.3 | 28.1 | 9.0 | 5.6 | 4.5 | 4.9 | 4.4 | 1.93 |

Effectiveness of Individual Components. In this section, we investigate the effectiveness of each component in Table 4. Given the costly training time, all ablation studies are based on a relatively small dataset (Conceptual Captions 3M [41]). We start with the backbone model, which consists of a dual-encoder trained by the cross-modal contrastive loss. We then gradually add the token-level Shapley interaction modeling supervision $\mathcal{L}_{\mathrm{TSA}}$ (Row 2), the semantics-level Shapley interaction modeling supervision $\mathcal{L}_{\mathrm{FSA}}$ (Row 3), and the UNSIL module (Row 4). For Row 2 and Row 3, the Shapley interaction is only computed by the sampling-based method. The results in Table 4 show that both $\mathcal{L}_{\mathrm{TSA}}$ and $\mathcal{L}_{\mathrm{FSA}}$ bring significant improvements on all tasks. We observe that $\mathcal{L}_{\mathrm{TSA}}$ boosts object detection by 3.8%, and the improved fine-grained visual semantic understanding further facilitates the cross-modal retrieval performance (+1.4%). The semantics-level Shapley interaction modeling further improves the performance on all tasks by modeling the semantic alignment between visual regions and textual phrases. Comparing Row 3 and Row 4, we observe that the UNSIL module maintains the estimation accuracy while avoiding intensive computation. The average training time is reduced from 9.90 seconds per iteration to 1.93 seconds per iteration.

Accuracy of the Shapley Interaction Learning. Since we use the sampling-based method [6] to compute the Shapley interaction and train the UNSIL module, we conduct a study to evaluate the accuracy of the sampling-based method and the error of the UNSIL module. As in [54], we compute the interaction multiple times and measure the instability across runs. A lower instability means that we obtain similar interactions from different sampling processes, which indicates high accuracy. Specifically, the instability is defined as $\frac{\mathbb{E}_{u,v:\,u \neq v}|I_u - I_v|}{\mathbb{E}_w |I_w|}$, where $I_w$ denotes the interaction computed in the $w$-th run. We average the instability values over the Shapley interactions of 100 image-text pairs and report the average instability with respect to different sampling numbers. As shown in Figure 3 (a), the instability decreases as the sampling number increases. When the sampling number is larger than 500, the approximated Shapley interaction is stable enough, with an instability of less than 0.06. Further, we try different models (i.e., Conv1D, 3-Layer MLP + Attention, 3-Layer Transformer) to implement the UNSIL module (please see Appendix D for more details). We test them on 1000 samples and report their mean uncertainty and relative error in Figure 3 (b) and (c). We observe that MLP + Attention is good enough to predict the interaction with lower complexity. Thus, we implement the UNSIL module with MLP + Attention.

Figure 3: (a) Instability of the Shapley interaction approximation with respect to different sampling numbers. (b, c) Uncertainty and error of the UNSIL module with different structures.
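The instability metric used in this accuracy study can be computed as below from several independent sampling-based estimates of the same Shapley interaction; the code is a direct transcription of the formula in the text, and the example numbers are made up for illustration.

```python
import itertools

def instability(estimates):
    """E_{u != v} |I_u - I_v| / E_w |I_w| over repeated estimates of one Shapley interaction."""
    pairs = list(itertools.combinations(estimates, 2))
    numerator = sum(abs(a - b) for a, b in pairs) / len(pairs)      # mean pairwise disagreement
    denominator = sum(abs(x) for x in estimates) / len(estimates)   # mean magnitude of the estimates
    return numerator / denominator

# e.g. five independent estimates of the same interaction from different sampling runs
print(instability([0.52, 0.49, 0.55, 0.50, 0.53]))
```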
Figure 4: Qualitative examples of object detection on COCO and visual grounding on RefCOCO+.

Figure 5: Visualization of learned fine-grained semantic alignment and corresponding Shapley interaction values. The values in the red boxes represent the Shapley interaction of regions.

4.6 Qualitative Analysis

Qualitative Examples. As shown in Figure 4, LOUPE successfully captures the regions that correspond to the detected objects, and grounds the referring expressions onto the referred regions.

Visualization of Learned Fine-Grained Semantic Alignment. In Figure 5, we visualize some key semantic regions and the corresponding alignment matrices inferred by LOUPE. We present the regions with top-3 confidence (Region 1–3) and two randomly sampled regions (white boxes). The red boxes at the bottom of the bounding boxes indicate their normalized token-level Shapley interaction values. Comparing their Shapley interaction values, we observe that the token-level Shapley interaction successfully distinguishes semantic regions from randomly sampled regions. The semantically meaningful regions tend to have stronger interaction. It indicates that the token-level Shapley interaction can provide correct supervision for semantic region generation. Further, we show the alignment matrices inferred by the semantics-level Shapley interaction and LOUPE, respectively. As shown in the right case of Figure 5, LOUPE successfully recognizes the leash region and aligns it with the "a leash" phrase. Note that existing object detection datasets do not contain the leash category.
5 Conclusion

This paper introduces a novel vision-language pre-training framework, LOUPE, which models the fine-grained semantic alignment between visual regions and textual phrases by game-theoretic interactions. To efficiently compute the interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Comprehensive experiments show that LOUPE achieves new state-of-the-art performance on image-text retrieval datasets and can transfer to object detection and visual grounding in a zero-shot manner. This work demonstrates a new promising direction of learning fine-grained semantics from large-scale raw image-text data.

Limitations. 1) The phrases are extracted by off-the-shelf constituency parsers, whose predictions might not be completely accurate. 2) The web data might inevitably contain mismatched image-text pairs, leading to noisy supervision.

Social Impacts. Our model is trained on noisy data from the Internet, which may contain unsuitable images, violent text, or private information. Thus, additional analysis of the data is necessary. Further, the use of our model for privacy surveillance or other nefarious purposes should be prohibited.

Acknowledgment. This work has been supported in part by the National Key Research and Development Program of China (2018AAA0101900), Zhejiang NSF (LR21F020004), the Key Research and Development Program of Zhejiang Province, China (No. 2021C01013), and the Chinese Knowledge Center of Engineering Science and Technology (CKCEST). We thank all the reviewers for their valuable comments.

References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[4] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 384–400, 2018.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[6] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726–1730, 2009.
[7] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer, 2020.
[8] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617. IEEE, 2016.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[11] Mark Everingham, Andrew Zisserman, Christopher KI Williams, Luc Van Gool, Moray Allan, Christopher M Bishop, Olivier Chapelle, Navneet Dalal, Thomas Deselaers, Gyuri Dorkó, et al. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) results. 2008.
[12] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28(4):547–565, 1999.
[13] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[15] Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and Jianlong Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12976–12985, 2021.
[16] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
[17] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[18] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
[19] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
[20] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021.
[21] Jiahao Li, Greg Shakhnarovich, and Raymond A Yeh. Adapting CLIP for phrase localization without further training. arXiv preprint arXiv:2204.03647, 2022.
[22] Juncheng Li, Xin Wang, Siliang Tang, Haizhou Shi, Fei Wu, Yueting Zhuang, and William Yang Wang. Unsupervised reinforcement learning of transferable meta-skills for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12123–12132, 2020.
[23] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021.
[24] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409, 2020.
[25] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
[26] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[30] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.
[31] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
[32] Yasuko Matsui and Tomomi Matsui. NP-completeness for calculating power indices of weighted majority games. Theoretical Computer Science, 263(1-2):305–310, 2001.
[33] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
[34] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Shafin Rahman, Salman Khan, and Nick Barnes. Improved visual-semantic alignment for zero-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11932–11939, 2020.
[37] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv e-prints, 2018.
[38] Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang, Meng Zhou, Jie Shi, et al. A unified game-theoretic interpretation of adversarial robustness. arXiv preprint arXiv:2111.03536, 2021.
[39] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[40] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2:307–317, 1953.
[41] P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[42] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
[43] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
[44] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[46] Robert J Weber. Probabilistic values for games. The Shapley Value: Essays in Honor of Lloyd S. Shapley, pages 101–119, 1988.
[47] Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, and Qi Tian. MVP: Multimodality-guided visual pre-training. arXiv preprint arXiv:2203.05175, 2022.
[48] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015.
[49] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. FILIP: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021.
[50] Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934, 2020.
[51] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European Conference on Computer Vision, pages 69–85. Springer, 2016.
[52] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021.
[53] Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021.
[54] Hao Zhang, Yichen Xie, Longjie Zheng, Die Zhang, and Quanshi Zhang. Interpreting multivariate Shapley interactions in DNNs. arXiv preprint arXiv:2010.05045, 2020.
[55] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
[56] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] Please see Section 5.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] Please see Section 5.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Please see Appendix B, C.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The pre-trained LOUPE model is coming soon.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Please see Section 4.1 and Appendix D, E.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Please see Section 4.1 and Appendix D, E, J.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [No] We did not use crowdsourcing or conduct research with human subjects.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [No] We did not use crowdsourcing or conduct research with human subjects.
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [No] We did not use crowdsourcing or conduct research with human subjects.