# Self-Supervised Relationship Probing

Jiuxiang Gu¹, Jason Kuen¹, Shafiq Joty², Jianfei Cai³, Vlad I. Morariu¹, Handong Zhao¹, Tong Sun¹
¹Adobe Research, ²Nanyang Technological University, ³Monash University
{jigu,kuen,morariu,hazhao,tsun}@adobe.com, srjoty@ntu.edu.sg, jianfei.cai@monash.edu

**Abstract.** Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications. However, current human-annotated visual relationship datasets suffer from the long-tailed predicate distribution problem, which limits the potential of visual relationship models. In this work, we introduce a self-supervised method that implicitly learns the visual relationships without relying on any ground-truth visual relationship annotations. Our method relies on 1) intra- and inter-modality encodings to model relationships within each modality separately and jointly, and 2) relationship probing, which seeks to discover the graph structure within each modality. By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, our method learns better object features as well as implicit visual relationships. We verify the effectiveness of our proposed method on various vision-language tasks that benefit from improved visual relationship understanding.

## 1 Introduction

Visual relationships that describe object relationships in images have become more and more important for high-level computer vision (CV) tasks that require complex reasoning [1, 2, 3, 4]. They are often organized in a structured graph representation called a scene graph, where nodes represent objects and edges represent relationships between objects. In recent years, we have witnessed great progress with visual relationship datasets such as Visual Genome [5] and the application of scene graphs to various CV reasoning tasks such as image captioning [6, 7], image retrieval [8], and visual reasoning [9].

Despite this, current visual relationship models still rely on human-annotated relationship labels. Due to the combinatorics involved (two objects and one relationship between them, where objects and relationships each have many possible types), relationships are numerous and have a long-tailed distribution; thus, it is difficult to collect enough annotations to sufficiently represent important but less frequently observed relationships. Consequently, current visual relationship models tend to focus on modeling only a few relationships that have a large number of human annotations [10], and they ignore relationship categories with few annotations. We have seen some research attempts that use external knowledge bases to help enrich visual relationships; however, the total number of relationships modeled is still relatively small [11].

On the other hand, in the past few years, we have seen significant progress in natural language processing (NLP) towards building contextualized language models with self-supervised pretraining objectives [12, 13]. The removal of human annotators from the training loop has enabled training on massive unlabeled datasets, leading to significant advances in NLP performance [14, 15]. These trends have also brought significant advances in vision-language (VL) pretraining tasks [16, 17, 18, 19, 20].
Most existing VL pretraining methods concatenate visual objects and the corresponding sentences as one input and adopt the Transformer [21] as the core module to learn contextualized multi-modal representations in a self-supervised manner via self- and cross-attentions. These models rely heavily on the multi-head attention layers to explore implicit relations, or they directly rely on attention distributions to explain the relations between objects [17, 22]. However, different layers vary in their behaviors [23, 24], and it has been shown that attention alone can be deceiving when used for interpretability and explanation [25]. Thus, existing VL pretraining algorithms suffer from two problems: discovered relationships are not modeled explicitly, but are instead expected to be implicitly represented as transformer weights; and the concatenation of multimodal inputs at training time restricts the model to require multimodal inputs at prediction time as well.

Motivated by textual relation mining work in NLP [26], we propose a novel framework that discovers dependencies between objects from the model's representation space, addressing the problems highlighted above. Our approach is based on two simple observations: (1) when we slightly change an image, the relative visual relationships in that image remain unchanged; (2) relationships mentioned in image descriptions are visually observable in the corresponding image. Our approach relies on three modules, each consisting of a set of layers. In the first module, implicit intra-modal relationships are modeled using transformer encoders. In the second module, cross-modal learning allows implicit relationship information to be leveraged across modalities. In the third module, relationships between visual and linguistic entities are represented explicitly as latent variables via a technique we call a relationship probe. All modules are trained using self-supervision, with a first stage relying on masked language modeling to train the first two modules, and a second stage relying on contrastive learning and linguistic dependency trees as supervisory signals to train the relationship probe network.

Our main contribution is a novel self-supervised relationship probing (SSRP) framework for finding dependencies among visual objects or textual entities that addresses issues with existing visual relationship models: it relies on self-supervision rather than explicit supervision, it explicitly models relationships as latent variables, and it leverages cross-modal learning while allowing a single modality as input at prediction time. We conduct extensive experiments to demonstrate that our method can benefit both vision and VL understanding tasks.

## 2 Background

**Visual relationships.** It has been demonstrated that visual relationships between objects can help improve performance on many CV tasks [8, 27, 28, 29, 30, 31]. Most of these methods assume a known explicit graph structure, and limit the graph to the most frequently occurring predicate categories while ignoring others that do not have enough labeled examples. Relaxing this assumption, some works transfer the object representations learned with predicate functions to rare predicates in few-shot scene graph generation [32, 33, 34]. Other works capture the relations via attention mechanisms [35, 36, 37, 38].
However, unlike object detectors that are trained on unambiguous and objectively defined object class labels, visual relationships are subjective, and it is hard to exhaustively annotate all possible relationships between objects. Thus, we do not explicitly define or label visual relationship classes; instead, we discover the implicit visual relationships via the accompanying captions. We call our method SSRP in the sense that we do not use any explicit predicate labels.

**Pretraining.** Motivated by the huge success of BERT [13] in NLP, there is a growing interest in pretraining generic models to solve a variety of VL problems [39, 40, 22, 18]. These methods generally employ BERT-like objectives to learn cross-modal representations from visual region features and word embeddings. They use self- and cross-attention mechanisms to learn joint representations that are appropriately contextualized in both modalities. However, most of the VL pretraining works rely heavily on massive visual-linguistic corpora [19, 17]. Moreover, although huge multi-modal training datasets enable pretraining methods to learn good representations for downstream multi-modal VL tasks, they usually do not benefit visual tasks that deal with only the visual modality during inference. We overcome this problem with a new approach that enables the generation of implicit visual object relationships even with only visual inputs during inference, while benefiting greatly from the cross-modality learning objectives during training.

We would like to point out that several works focus on investigating the representations learned by transformer-based pretraining models [41, 42]. Their findings suggest that BERT-based network pretraining learns a rich set of intermediate representations of both semantic and syntactic information, which can be used to unearth the representations of dependency grammar relations. An interesting finding in [26] shows that BERT can recover dependency parse trees that have not been encountered during training. Coenen et al. [43] further present empirical descriptions of syntactic representations in BERT. These results in NLP motivate us to exploit BERT to find visual relationships between image regions without explicitly training on relationship annotations.

*Figure 1: Overview of the proposed three types of SSRP frameworks, each of which consists of three types of modules: intra-modality encoder, inter-modality encoder and relationship probe.*

Fig. 1 gives an overview of the three variants of our method: SSRP-Share, SSRP-Visual and SSRP-Cross. Each variant consists of three modules: an intra-modality encoder, an inter-modality encoder and a relationship probe.
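To make this modular design concrete, the following minimal PyTorch-style sketch shows one way the intra-modality encoders, the inter-modality encoder and the relationship probes could be wired together. The class and argument names here (e.g., `SSRPSkeleton`, `intra_v`, `probe_s`) are our own illustrative assumptions, not the authors' released code; the individual modules are described in Sec. 3.2.

```python
import torch.nn as nn

class SSRPSkeleton(nn.Module):
    """Illustrative wiring of the three SSRP modules (names are assumptions)."""

    def __init__(self, intra_v, intra_s, inter, probe_v, probe_s):
        super().__init__()
        self.intra_v = intra_v   # intra-modality encoder for RoI embeddings
        self.intra_s = intra_s   # intra-modality encoder for token embeddings
        self.inter = inter       # inter-modality encoder (Share / Visual / Cross variant)
        self.probe_v = probe_v   # visual relationship probe
        self.probe_s = probe_s   # textual relationship probe

    def forward(self, roi_embeds, token_embeds):
        # 1) model implicit relations within each modality separately
        v = self.intra_v(roi_embeds)      # (B, Nv, d)
        s = self.intra_s(token_embeds)    # (B, Nw, d)
        # 2) exchange information across modalities (variant-dependent)
        v, s = self.inter(v, s)
        # 3) probe explicit pairwise relationship structure in each modality
        R_v = self.probe_v(v)             # (B, Nv, Nv) distance matrix
        R_w = self.probe_s(s)             # (B, Nw, Nw) distance matrix
        return R_v, R_w
```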
The main difference among the three SSRP variants lies in the inter-modality encoding process. The intra-modality and inter-modality encoders are BERT-like encoders that respectively capture implicit single-modality relations and cross-modality relations among the entities (image objects and textual tokens) and output contextual representations. The relationship probe generates relationship graphs for each modality from the encoded contextual representations in a self-supervised way. In the following, we first briefly describe BERT [13], since our approach is based on the BERT architecture, and then we describe the individual modules of our SSRP frameworks as well as the learning process.

### 3.1 Revisiting BERT

BERT uses Masked Language Modeling (MLM), a self-supervised pretraining objective that allows a transformer encoder [21] to encode a sequence from both directions simultaneously. Specifically, for an input sequence $S = \{w_1, \ldots, w_{N_w}\}$ of $N_w$ tokens, BERT first randomly masks out 15% of the tokens and then predicts the masked tokens in the output. The masked tokens in the input sequence are represented by a special symbol [MASK] and fed into a multi-layer transformer encoder. Let $H^l = \{h_1, \ldots, h_{N_w}\}$ be the encoded features at the $l$-th transformer layer, with $H^0$ being the input layer. The features at the $(l+1)$-th layer are obtained by applying a transformer block defined as:

$$H^{l+1} = \mathrm{LN}\Big(\mathrm{LN}\big(H^l + f^l_{\text{Self-Att}}(H^l)\big) + f^l_{\text{FF}}\big(\mathrm{LN}(H^l + f^l_{\text{Self-Att}}(H^l))\big)\Big) \tag{1}$$

where LN stands for layer normalization [44], $f^l_{\text{Self-Att}}(\cdot)$ is a multi-headed self-attention sub-layer, and $f^l_{\text{FF}}(\cdot)$ is a feed-forward sub-layer composed of two fully-connected (FC) layers, wrapped in residual connections [45] with LN as specified in Eq. 1. The token representations in the final layer are used to predict the masked tokens independently.
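For reference, Eq. 1 corresponds to a standard post-LN transformer encoder layer. The sketch below is a minimal PyTorch rendering of such a block, assuming the hidden size and head count reported in Sec. 4.1 and a conventional feed-forward width; it is an illustration of the equation, not the authors' implementation.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-LN encoder layer as in Eq. 1: H_{l+1} = LN(A + FF(A)), with A = LN(H_l + SelfAtt(H_l))."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h, key_padding_mask=None):
        att, _ = self.self_att(h, h, h, key_padding_mask=key_padding_mask)
        a = self.ln1(h + att)            # residual + LN around the self-attention sub-layer
        return self.ln2(a + self.ff(a))  # residual + LN around the feed-forward sub-layer
```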
### 3.2 Model architecture

**Input embeddings.** The input to the three SSRP pretraining models includes both visual and textual elements, where the former are defined as regions-of-interest (RoIs) in an image and the latter are defined as the tokens in a caption. Specifically, given an image $I$, we use Faster R-CNN [46] to detect RoIs $\{v_1, \ldots, v_{N_v}\}$ and take the feature vector prior to the output layer of each RoI as the visual feature embedding. For a caption $S$, we insert the special tokens [CLS] and [SEP] before and after the sentence, and use the WordPiece tokenizer [47] to split it into tokens $\{w_1, \ldots, w_{N_w}\}$. Apart from token and visual feature embeddings, we also add positional encodings to represent tokens. In particular, for token $w_i$, its input representation $\mathbf{w}_i$ is the sum of its trainable token embedding, positional embedding (index in the sequence) and segment (image/text) embedding, followed by an LN layer. Each object $v_i$ is represented by its positional feature (normalized top-left and bottom-right coordinates) and its 2048-dimensional RoI feature, both of which are transformed through FC+LN layers to obtain the position-aware object-level embedding $\mathbf{v}_i$.

**Intra-modality encoding.** The purpose of intra-modality encoding is to model the intra-relations of the encoded representations within one modality via self-attention, as in BERT. Specifically, we randomly mask out $v_i$ and $w_j$ with a fixed probability, and feed the masked object-level embeddings $V = \{\mathbf{v}_1, \ldots, \mathbf{v}_{\backslash i}, \ldots, \mathbf{v}_{N_v}\}$ and word-level embeddings $W = \{\mathbf{w}_1, \ldots, \mathbf{w}_{\backslash j}, \ldots, \mathbf{w}_{N_w}\}$ into two separate intra-modality encoders ($f_{\text{Intra}}^{VV}$ and $f_{\text{Intra}}^{SS}$). Each layer in the intra-modality encoders contains a self-attention sub-layer and an FF sub-layer (Eq. 1).

**Inter-modality encoding.** The inter-modality encoder models the cross-modality relationships between image and textual entities. The three proposed SSRP pretraining models use different inter-modality encoding schemes, as illustrated in Fig. 1. In SSRP-Share, the inter-modality encoding is done with a single encoder $f_{\text{Inter}}^{VS}$ that is shared between the two modalities; $f_{\text{Inter}}^{VS}$ consists of a shared self-attention sub-layer wrapped in a residual connection with an LN. The shared weights connect the two modalities by causing the projections of the two input modalities to align in the query, key, and value spaces. In SSRP-Visual, the textual features attend to the visual features to connect the two modalities. In contrast to SSRP-Share, we keep $f_{\text{Inter}}^{VS}$ for the visual branch, which contains a self-attention sub-layer and an FF sub-layer, while using $f_{\text{Inter}}^{SV}$ for the textual branch, which consists of a self-attention sub-layer, one unidirectional cross-attention sub-layer, and an FF sub-layer. Finally, SSRP-Cross uses an inter-modality bidirectional cross-attention encoder $f_{\text{Inter}}^{V \leftrightarrow S}$, where the textual and visual features attend to each other. Following [17], each layer in $f_{\text{Inter}}^{V \leftrightarrow S}$ consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two FF sub-layers.

**Relationship probing.** The purpose of relationship probing is to model the implicit relations among visual or textual entities. Specifically, we build a latent relationship graph $\mathcal{G}_v$ for the objects in an image and a latent relationship graph $\mathcal{G}_w$ for the tokens in a caption, based on the unmasked contextual object representations $\hat{V} = \{\hat{v}_1, \ldots, \hat{v}_{N_v}\}$ and token representations $\hat{W} = \{\hat{w}_1, \ldots, \hat{w}_{N_w}\}$, which are the output feature vectors of the inter-modality encoders. Inspired by [26], we use a visual probe and a textual probe to compute the distances for each object pair $(v_i, v_j) \in \mathcal{G}_v$ and each token pair $(w_i, w_j) \in \mathcal{G}_w$, respectively. The distance for an object/token pair is defined as:

$$d_{B_u}(u_i, u_j)^2 = \big(B_u(u_i - u_j)\big)^{\mathsf{T}}\big(B_u(u_i - u_j)\big) \tag{2}$$

where $u \in \{v, w\}$, $i$ and $j$ are the object/token indices, and $B_u$ are the parameters of the probe layer. The learning goal of a structural probe (Sec. 3.3) is to determine the edge distances between all pairs of nodes. The outputs of the visual probe and the textual probe layer are respectively the distance matrices $R_v = \big(d_{B_v}(v_i, v_j)^2\big) \in \mathbb{R}^{N_v \times N_v}$ and $R_w = \big(d_{B_w}(w_i, w_j)^2\big) \in \mathbb{R}^{N_w \times N_w}$, which capture implicit relations between visual/textual entities.
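Eq. 2 is the squared L2 distance between two contextual vectors after a learned linear map $B_u$, in the spirit of the structural probe of [26]. A minimal sketch of a probe layer that returns the full pairwise distance matrix is shown below; the probe rank and tensor shapes are assumptions.

```python
import torch.nn as nn

class RelationshipProbe(nn.Module):
    """Probe layer computing R[i, j] = ||B (u_i - u_j)||^2 for all pairs (Eq. 2)."""

    def __init__(self, d_model=768, probe_rank=128):
        super().__init__()
        # B_u: a single linear map shared across all positions of one modality
        self.B = nn.Linear(d_model, probe_rank, bias=False)

    def forward(self, u):
        # u: (batch, N, d_model) contextual features from the inter-modality encoder
        t = self.B(u)                            # (batch, N, probe_rank)
        diff = t.unsqueeze(2) - t.unsqueeze(1)   # (batch, N, N, probe_rank) pairwise differences
        return (diff ** 2).sum(dim=-1)           # (batch, N, N) squared probe distances
```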
### 3.3 Learning

We employ two learning stages in our method. In the first stage, we train the BERT encoders, including the intra-modality encoders and the inter-modality encoders, to obtain the contextual object representations $\hat{V}$ and the token representations $\hat{W}$. In the second stage, with these contextual representations, we freeze the BERT encoders and train the two probe layers to generate the implicit relationship matrices $R_v$ and $R_w$. Fig. 2 shows a schematic diagram of our learning framework.

*Figure 2: Illustration of our proposed learning process for relationship probing. The entire learning process consists of two training stages: training the BERT encoders and training the relationship probes. The notations $i$ and $j$ refer to two different augmented images or sentences.*

#### 3.3.1 Stage 1: Training BERT encoders

**Masked language modeling with RoI feature reconstruction.** We train the BERT encoders with the MLM objective to predict a masked RoI feature $v_i$ and a masked token $w_j$ given their surroundings $I_{\backslash i}$ and $S_{\backslash j}$. We also include a smooth $L_1$ reconstruction loss [48] for the grounding of visual features. We minimize the following loss:

$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}_{(I,S)\sim D}\Big[\log p(v_i \mid I_{\backslash i}, S) + \log p(w_j \mid S_{\backslash j}, I) - \sum_i L_1\big(v_i - g(v_i \mid I_{\backslash i}, S)\big)\Big] \tag{3}$$

where $I$ and $S$ are the image regions and input words with random masking, $g(\cdot)$ outputs the unmasked visual feature, $p(v_i \mid I_{\backslash i}, S)$ and $p(w_j \mid S_{\backslash j}, I)$ are respectively the predicted probabilities for the target object label and word given the masked inputs, and $I$ and $S$ are sampled from the training set $D$. Note that here we reuse the symbols $v$ and $w$ to represent both the visual features and the label/word for simplicity.

**Image-text matching.** An additional loss is added to perform instance-level alignment between an image and its caption. Both positive ($y = 1$) and negative ($y = 0$) image-sentence pairs are sampled, and the model learns to align them with a binary cross-entropy loss:

$$\mathcal{L}_{\text{Match}} = -\mathbb{E}_{(I,S)\sim D}\big[y \log p(f_{\text{align}}) + (1 - y)\log(1 - p(f_{\text{align}}))\big] \tag{4}$$

where $p(f_{\text{align}})$ is the output probability of a binary classifier and $f_{\text{align}}$ is the visual-textual alignment representation. For SSRP-Share and SSRP-Visual, $f_{\text{align}}$ is computed as $g_{\text{align}}([\bar{v}; \hat{w}_{\text{CLS}}])$, where $\bar{v} = \sum_i \hat{v}_i / N_v$ is the visual representation averaged over the contextual features of all the visual elements $\hat{V}$, $\hat{w}_{\text{CLS}}$ is the contextual representation of the special token [CLS], and $g_{\text{align}}(\cdot)$ is a non-linear mapping function (see the supplementary material for details). For SSRP-Cross, we define $f_{\text{align}} = g_{\text{align}}(\hat{w}_{\text{CLS}})$. Essentially, we force $\hat{w}_{\text{CLS}}$ to model either the aggregated textual or visual-textual information. The overall training loss for the first-stage pretraining becomes: $\mathcal{L}_{\text{Stage1}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{Match}}$.

#### 3.3.2 Stage 2: Training relationship probes

In the second stage, the relationship probe layers are learned via a probe loss $\mathcal{L}_{\text{Probe}}^{S}$ and a contrastive loss $\mathcal{L}_{\text{CL-all}}$, where the former ensures that the learned textual relationships $R_w$ are structurally consistent with a dependency tree, and the latter ensures that the learned relationships $R_v$ and $R_w$ remain stable across different data augmentations. In particular, on the language side, we use a pre-parsed dependency tree $\mathcal{G}_w$ for each sentence [49] to guide the textual relationship probe learning, with $\mathcal{L}_{\text{Probe}}^{S}$ defined as:

$$\mathcal{L}_{\text{Probe}}^{S} = \frac{1}{N_w^2}\sum_{i,j}\big|d_{\mathcal{G}_w}(w_i, w_j) - d_{B_w}(w_i, w_j)^2\big| \tag{5}$$

where $d_{\mathcal{G}_w}(w_i, w_j)$ is the distance between tokens $w_i$ and $w_j$ in the dependency tree $\mathcal{G}_w$.
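Given per-sentence dependency-tree distances (e.g., parsed with Stanza [49]) and the squared probe distances of Eq. 2, the probe loss of Eq. 5 reduces to an element-wise absolute error averaged over token pairs. A minimal sketch follows; the padding mask is our own assumption, since the paper does not specify how padded tokens are handled.

```python
def textual_probe_loss(tree_dist, probe_dist_sq, pair_mask=None):
    """Eq. 5: average of |d_Gw(w_i, w_j) - d_Bw(w_i, w_j)^2| over token pairs.

    tree_dist:     (batch, Nw, Nw) path lengths between tokens in the parsed dependency tree
    probe_dist_sq: (batch, Nw, Nw) squared probe distances from Eq. 2
    pair_mask:     optional (batch, Nw, Nw) 0/1 tensor excluding pairs that involve padding
    """
    err = (tree_dist - probe_dist_sq).abs()
    if pair_mask is not None:
        # average only over valid token pairs
        return (err * pair_mask).sum() / pair_mask.sum().clamp(min=1)
    return err.mean()
```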
For the contrastive loss, we adopt stochastic data augmentation methods to transform an original image (or sentence) into semantics-preserving data samples, and treat them as positive pairs; see Fig. 2, where $\tilde{I}_i \sim \mathcal{T}_I$ and $\tilde{S}_i \sim \mathcal{T}_S$ denote image and sentence augmentations, respectively. (Although, for coherence, we describe data augmentation together with contrastive learning in Stage 2, the augmented data can also be used to train the BERT encoders in Stage 1.) For the data augmentation details, please refer to Sec. 4.1. Specifically, we sample a minibatch of $N_c$ image-caption pairs and apply two separate augmentation strategies to each modality, resulting in $2N_c$ image-caption pairs. For every positive pair, we do not sample negative pairs explicitly; instead, we treat the other $2(N_c - 1)$ augmented image-caption pairs within the minibatch as negatives. We adapt the contrastive loss introduced in [50, 51] to our cross-modal scenario. The single-modality contrastive loss $\mathcal{L}_{\text{SCL}}(i, j)$ and the cross-modality contrastive loss $\mathcal{L}_{\text{XCL}}(i, j)$ for a positive image-caption pair $\{\tilde{I}_i, \tilde{I}_j\}, \{\tilde{S}_i, \tilde{S}_j\}$ are defined as:

$$\mathcal{L}_{\text{SCL}}(i, j) = -\log\frac{e^{Z^{v,v}_{i,j}}}{\sum_{k=1}^{2N_c}\mathbb{1}_{[k\neq i]}\, e^{Z^{v,v}_{i,k}}} - \log\frac{e^{Z^{w,w}_{i,j}}}{\sum_{k=1}^{2N_c}\mathbb{1}_{[k\neq i]}\, e^{Z^{w,w}_{i,k}}} \tag{6}$$

$$\mathcal{L}_{\text{XCL}}(i, j) = -\sum_{(m,n)\in\{(i,j),(j,i)\}}\left(\log\frac{e^{Z^{v,w}_{m,n}}}{\sum_{k=1}^{2N_c}\mathbb{1}_{[k\neq m]}\, e^{Z^{v,w}_{m,k}}} + \log\frac{e^{Z^{w,v}_{m,n}}}{\sum_{k=1}^{2N_c}\mathbb{1}_{[k\neq m]}\, e^{Z^{w,v}_{m,k}}}\right) \tag{7}$$

where $\mathbb{1}_{[k\neq i]} \in \{0, 1\}$ is an indicator function, $Z^{x,y}_{i,j} = \big((z^x_i \cdot z^y_j)/(\lVert z^x_i\rVert\,\lVert z^y_j\rVert)\big)/\tau$ denotes the cosine similarity between $z^x_i$ and $z^y_j$ scaled by a temperature hyperparameter $\tau$ [52], and $z^v$ and $z^w$ are the nonlinear projections of the vectorized relationship matrices $R_v$ and $R_w$, obtained with an MLP projection head [50]. The final loss is computed across all positive image-caption pairs in a minibatch: $\mathcal{L}_{\text{CL-all}} = \frac{1}{2N_c}\sum_{(i,j)}\big[\mathcal{L}_{\text{SCL}}(i, j) + \mathcal{L}_{\text{SCL}}(j, i) + \mathcal{L}_{\text{XCL}}(i, j)\big]$. Note that $\mathcal{L}_{\text{XCL}}$ is invariant to the order of the sample indices $(i, j)$ and is thus included just once in $\mathcal{L}_{\text{CL-all}}$. In this stage, the overall training objective is: $\mathcal{L}_{\text{Stage2}} = \mathcal{L}_{\text{Probe}}^{S} + \mathcal{L}_{\text{CL-all}}$.

## 4 Experiments

### 4.1 Datasets and implementation details

**Pretraining corpus.** To enlarge the training data, recent VL pretraining works [17, 16, 53, 18] use combined pretraining corpora such as Conceptual Captions (CC) [54], SBU Captions [55], MSCOCO [56, 57, 58], Flickr30K [59], VQA [1], GQA [2], VG [5], BooksCorpus (BC) [60], and English Wikipedia (EW). In contrast, we only aggregate pretraining data from the train (113k) and validation (5k) splits of MSCOCO [58]. Specifically, with each MSCOCO image associated with five independent caption annotations, MSCOCO provides us an aligned VL dataset of 591K image-and-sentence pairs on 118K distinct images. Table 1 summarizes the corpora used by different pretraining methods.

Table 1: Comparison of the corpora used by different pretraining methods.

| Method | Source | Total |
| --- | --- | --- |
| LXMERT [17] | MSCOCO, GQA, VQA, VGQA, VG-Cap | 9.2M |
| OSCAR [19] | MSCOCO, GQA, VQA, VGQA, CC, SBU, Flickr30K | 6.5M |
| UNITER [40] | MSCOCO, CC, VG, SBU | 5.6M |
| Unicoder-VL [18] | CC, SBU | 3.8M |
| VL-BERT [22] | CC, BC, EW | 3.3M |
| ViLBERT [16] | CC | 3.1M |
| VisualBERT [53] | MSCOCO | 0.6M |
| Ours: SSRP | MSCOCO | 0.6M |

**Data augmentation.** Instead of combining the existing VL datasets, we expand the pretraining corpus with data augmentation on both images and sentences, as shown in Table 2. For data augmentation on images, we employ horizontal flipping (HFlip) at the image level and several augmentations at the RoI feature level, including HFlip, rotations (90°, 180°, and 270°), and bounding-box jittering (with scale factors selected from the range [0.8, 1.2]). We enrich the training sentences through two pretrained back-translators [61]: English→German→English (En-De-En) and English→Russian→English (En-Ru-En). Our augmentation strategies generate significantly more training samples, 1.65M at the RoI level and 1.77M at the sentence level, while largely preserving the semantic information.
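The RoI-level augmentations only need to transform the normalized box coordinates that feed the position-aware object embeddings. Below is a minimal sketch of the horizontal-flip and jitter operations on a single normalized box; the helper names and the clipping behavior are assumptions, since the paper does not detail the implementation.

```python
import random

def hflip_box(box):
    """Horizontally flip a normalized (x1, y1, x2, y2) box, with coordinates in [0, 1]."""
    x1, y1, x2, y2 = box
    return (1.0 - x2, y1, 1.0 - x1, y2)

def jitter_box(box, scale_range=(0.8, 1.2)):
    """Rescale a normalized box around its center by a random factor in [0.8, 1.2]."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    s = random.uniform(*scale_range)
    half_w, half_h = (x2 - x1) * s / 2.0, (y2 - y1) * s / 2.0
    # clip to the image boundaries so the box stays normalized
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(1.0, cx + half_w), min(1.0, cy + half_h))
```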
Table 2: Number of training samples at the image, RoI, and sentence levels.

| Split | Image: Raw | Image: HFlip | RoI (Raw & HFlip images): HFlip | RoI: Rotate (90°, 180°, 270°) | RoI: Jitter [0.8, 1.2] | Sentence: Raw | Sentence: En-De-En | Sentence: En-Ru-En |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Train | 118k | 118k | 118k × 2 | 354k × 2 | 236k × 2 | 591k | 591k | 591k |

**Pretraining setting.** We pretrain the three SSRP variants shown in Fig. 1. We set the numbers of layers for the intra-modality encoders $f_{\text{Intra}}^{SS}$ and $f_{\text{Intra}}^{VV}$ to 9 and 5, respectively, and the number of layers for the inter-modality encoders $f_{\text{Inter}}^{VS}$, $f_{\text{Inter}}^{SV}$, and $f_{\text{Inter}}^{V \leftrightarrow S}$ to 5. For each transformer block, we set the hidden size to 768 and the number of heads to 12. To keep the relationship matrices the same size, the maximum numbers of words and objects are both set to 36. Pretraining is divided into two stages. In stage 1, we train with $\mathcal{L}_{\text{Stage1}}$. At each iteration, we randomly mask input words and RoIs with a probability of 0.15. All models are initialized with pretrained BERT weights, and the corresponding pretraining corpus is listed in Table 2. For cross-modality matching, we replace each sentence with a mismatched one with a probability of 0.5. We use the Adam optimizer [62] with a linear learning-rate schedule [13] and a peak learning rate of 1e-4. Training is carried out on four Tesla V100 GPUs with a batch size of 128 for 10 epochs. After stage 1, we freeze the parameters of the intra-modality and inter-modality encoders and further train the relationship probes with $\mathcal{L}_{\text{Stage2}}$. The syntactic dependency tree for each sentence is built with [49]. All variants of SSRP are trained for 30 epochs with Adam, a batch size of 512, and a learning rate of 5e-5.

**Fine-tuning tasks.** We fine-tune the pretrained models to handle multiple downstream tasks: three VL understanding tasks (NLVR2 [63], VQA [1], and GQA [2]) and a generation task (image captioning), following the standard fine-tuning settings for downstream tasks in [17, 53]. For the VL understanding tasks, we use the linearly-fused probed relationships and the visual-textual alignment prediction $f_{\text{align}}$ in Eq. 4 as features. For image captioning, we utilize the Up-Down [64] framework and incorporate the refined object features learned by SSRP-Visual. The captioning model is first trained with a cross-entropy loss and then with a reinforcement learning loss [65].
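For convenience, the pretraining and probe-training settings described in this subsection can be collected into a single configuration sketch. The field names are illustrative assumptions; the values are the ones stated above.

```python
# Values as reported in Sec. 4.1; the dictionary layout itself is only illustrative.
SSRP_CONFIG = {
    "encoders": {
        "intra_layers_text": 9,      # f_Intra^SS
        "intra_layers_visual": 5,    # f_Intra^VV
        "inter_layers": 5,           # f_Inter^VS / f_Inter^SV / f_Inter^(V<->S)
        "hidden_size": 768,
        "num_heads": 12,
        "max_tokens": 36,
        "max_objects": 36,
    },
    "stage1": {                      # training the BERT encoders
        "mask_prob": 0.15,
        "sentence_mismatch_prob": 0.5,
        "optimizer": "Adam",
        "peak_lr": 1e-4,
        "batch_size": 128,
        "epochs": 10,
    },
    "stage2": {                      # training the relationship probes (encoders frozen)
        "optimizer": "Adam",
        "lr": 5e-5,
        "batch_size": 512,
        "epochs": 30,
    },
}
```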
### 4.2 Experimental results & analysis

We first perform ablation experiments over a few design choices of our method on NLVR2, and then show comparison results on the VQA, GQA and image captioning tasks.

Table 3: Ablation study on NLVR2. The reported results are accuracy numbers on the Dev set.

| Method | f_align (Stage 1): Raw | f_align (Stage 1): Aug. | + Rel. (Stage 2): R_v | + Rel. (Stage 2): R_w | + Rel. (Stage 2): R_v + R_w |
| --- | --- | --- | --- | --- | --- |
| SSRP-Share | 60.53 | 61.67 | 62.52 | 62.66 | 64.25 |
| SSRP-Visual | 69.92 | 70.75 | 71.23 | 71.24 | 72.03 |
| SSRP-Cross | 74.35 | 74.48 | 74.25 | 74.68 | 75.71 |

**Effect of data augmentation.** Table 3 shows the ablation study results. In the Raw setting, we pretrain our models only on the original corpus, while in the Aug. setting, we augment the original corpus with the augmentation techniques listed in Table 2. It is evident that our data augmentation strategy indeed improves the performance of all three models. Note that we employ data augmentation only during pretraining, not during fine-tuning.

**Effect of attention.** Comparing the three variants that use different attention settings in Table 3, we observe that SSRP-Cross performs the best, and SSRP-Visual is better than SSRP-Share. This confirms the benefits of cross-attention structures that enable the features of one modality to attend to the other.

**Effect of relationship probing.** To analyze the effectiveness of the visual and textual relationships learned via pretraining, we concatenate the visual-textual alignment representation $f_{\text{align}}$ and the relationships (Rel.) to form a relationship-aware feature vector for answer prediction. Table 3 shows that using the language relationships $R_w$ leads to better results than using the visual relationships $R_v$. This is due to the dependency tree available for supervision on the language side during training, whereas the visual relationships are learned in a completely self-supervised way. Combining visual and textual relationships achieves the best results. Our method SSRP-Cross (75.71) outperforms LXMERT (74.9) and VisualBERT (67.4) on the NLVR2 dev set, demonstrating that the probed relationships are beneficial for the reasoning task.

**Results on VQA & GQA.** Table 4 shows the performance of SSRP-Cross on VQA and GQA. Our method outperforms ViLBERT and VisualBERT, while being highly competitive with the best method, which is trained on considerably larger training corpora.

Table 4: Online VQA/GQA results on the test-standard splits, where * indicates that the corpus used is larger than those of VisualBERT and ours.

| Method | VQA: Binary | VQA: Number | VQA: Other | VQA: Accu | GQA: Accu |
| --- | --- | --- | --- | --- | --- |
| BUTD* [64] | 86.6 | 48.6 | 61.5 | 70.3 | – |
| LXMERT* [17] | 88.2 | 54.2 | 63.1 | 72.5 | 60.3 |
| ViLBERT* [16] | – | – | – | 70.9 | – |
| VL-BERT* [22] | 87.9 | 54.8 | 62.5 | 72.2 | – |
| OSCAR-B* [19] | – | – | – | 73.4 | 61.2 |
| UNITER-Base* [40] | – | – | – | 72.9 | – |
| VisualBERT [53] | 87.5 | 52.3 | 61.0 | 71.0 | – |
| SSRP-Cross | 87.8 | 54.4 | 62.7 | 72.2 | 60.0 |

**Results on image captioning.** Unlike recent VL pretraining methods, which cannot be applied to single-modality vision tasks such as image captioning due to the cross-attention used in pretraining, our SSRP-Share and SSRP-Visual models have no such limitation. Thus, we apply the stronger model, SSRP-Visual, to image captioning using its refined object features and the learned implicit visual relationships. Table 5 shows the quantitative results, where SSRP-Visual outperforms the baselines, indicating that the learned relationship-aware image representations benefit image captioning. Note that the online results of BUTD are achieved with a model ensemble, while we use a single model.

Table 5: Results of image captioning on the MSCOCO test split and the online test server, where B@n, M, C and S are abbreviations for BLEU-n, METEOR, CIDEr, and SPICE, respectively.

| Method | B@1 | B@4 | M | C | S |
| --- | --- | --- | --- | --- | --- |
| SCST [65] | – | 33.3 | 26.3 | 111.4 | – |
| BUTD [64] | 79.8 | 36.3 | 27.7 | 120.1 | 21.4 |
| Up-Down [64] (our impl.) | 81.2 | 36.9 | 28.3 | 120.8 | 21.6 |
| SSRP-Visual | 82.0 | 38.1 | 28.8 | 126.7 | 22.3 |

Results on the online MSCOCO test server:

| Method | B@1 | B@4 | M | C |
| --- | --- | --- | --- | --- |
| BUTD [64] (c5) | 80.2 | 36.9 | 27.6 | 117.9 |
| SSRP-Visual (c5) | 81.5 | 37.5 | 28.3 | 119.8 |
| BUTD [64] (c40) | 95.2 | 68.5 | 36.7 | 120.5 |
| SSRP-Visual (c40) | 95.3 | 68.6 | 37.2 | 122.4 |

*Figure 3: Examples of generated relationships for different augmented images and sentences. The bottom part shows the dependency trees resulting from SSRP-Cross outputs. Black edges above each sentence form the gold tree provided by Stanza [49], and red edges are produced by our SSRP-Cross.*
*Figure 4: A visualization of the retrieved images on the MSCOCO validation set. The Obj. method averages object features and computes the cosine similarities between images. The Obj. + Rel. method enhances the object features according to the predicted relationships.*

**What do probes learn during training?** To answer this, we visualize in Fig. 3 the heat-maps of a few relationship examples generated by SSRP-Cross, where a darker color indicates a closer relationship. The first row shows the example images and their augmented counterparts, each of which contains objects and their probed visual relationships represented by straight lines with varying color intensities. The second row presents the visual relationship distance graphs for the corresponding images. The bottom rows show the distance graphs and dependency trees for the augmented captions. Fig. 3 shows that the probed dependency trees closely resemble the gold dependency trees. In addition, the distance graphs of the original data samples and their augmented counterparts, for both sentences and images, are close to each other, validating our assumption that the visual/linguistic relationships should be preserved even when data augmentation is applied. Remarkably, the learned implicit relationships between objects are stable across differently augmented images, despite the fact that no gold visual relationships are provided during training.

**Are visual relationships useful for visual tasks?** To further verify the benefits of implicit visual relationships in single-modality visual tasks, we perform image retrieval on MSCOCO with SSRP-Visual. Fig. 4 shows the top-2 image retrieval results. As shown, Obj. + Rel. retrieves better visually-matching images that are consistent with the object relationships in the query images. For example, in the third example, the person in the top-1 retrieved image is next to a pizza, similar to the original image. This suggests that our model can capture the complex underlying visual relationships.

## 5 Conclusion

We have proposed a self-supervised visual relationship probing method that implicitly learns visual relationships without training on ground-truth relationship annotations. Our method transfers the textual relationships from image descriptions to image objects and explores the visual relationships by maximizing the agreement between differently augmented images via contrastive learning. Through our relationship probes, we have demonstrated that relationship structures in images and sentences can be well explored with well-designed distance and contrastive learning objectives. We believe such implicit relationships in images and language can help improve many existing vision-language tasks, especially in scenarios with limited annotations.

## Broader Impact

Current representation learning models such as BERT and its variants follow a similar structure. We think it is important to discover or probe the implicit knowledge that these models capture about language and vision. Our research on self-supervised relationship probing is a push in that direction and can be used for grounding the relationships expressed in language. In this paper, we introduce SSRP, a self-supervised relationship probing method for visual and textual relationship extraction.
Our research could be used to enrich current scene graph generation methods and to complete the missing relationships between objects. The visual relationships generated by our method could be applied to a wide range of vision and vision-language applications, including image captioning, image retrieval, object detection, visual question answering, visual reasoning, and visual-textual cross-modal retrieval. Here, we discuss the broader impact on two important example applications (image retrieval and image captioning) which can benefit greatly from the implicit relationships obtained with our method. By performing image retrieval using the implicit visual relationships discovered with our method, visual search engines can provide higher-quality results that better respect the visual relationships contained in query images. This provides a smoother visual search experience and helps users find their desired images. On the other hand, for image captions and descriptions, the implicit visual relationships generated by our method enable richer and improved descriptions that more accurately describe the depicted scenes. This can help blind or visually-impaired people [66] better perceive their surrounding environments.

In terms of technical impact, our method opens a new direction for better modeling visual object relationships, which is completely different from current visual relation models that rely heavily on human-annotated explicit visual relation labels. Annotating visual relationships is a highly subjective process in which different annotators are likely to annotate quite differently; relations are also very diverse, and there is no clear definition. Our approach bypasses these annotation challenges by discovering rich implicit relations directly from natural images and their textual descriptions in a self-supervised manner, without using any explicit relation annotations. Thus, our method leads to richer and fairer visual relation models. In addition, in terms of data, our method goes beyond current pretraining models that prefer to combine more and more datasets for self-supervised training. Instead, our proposed method is developed specifically to work effectively with augmented data that can be cheaply obtained with the proposed augmentation strategies and can be nicely integrated into the self-supervision objectives. Overall, our method makes VL pretraining and visual relationship modeling more accessible to the masses.

## References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.

[2] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.

[3] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.

[4] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

[5] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.

[6] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. Unpaired image captioning via scene graph alignments. In ICCV, 2019.
[7] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.

[8] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.

[9] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.

[10] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.

[11] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.

[12] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.

[14] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.

[15] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.

[16] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.

[17] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP, 2019.

[18] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.

[19] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.

[20] Amanpreet Singh, Vedanuj Goswami, and Devi Parikh. Are we pretraining it right? Digging deeper into visio-linguistic pretraining. arXiv preprint arXiv:2004.08744, 2020.

[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[22] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.

[23] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019.

[24] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention. In BlackboxNLP@ACL, 2019.

[25] Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. Learning to deceive with attention-based explanations. In ACL, 2019.

[26] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In NAACL, 2019.

[27] Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, 2017.
[28] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 2018.

[29] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.

[30] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.

[31] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In ICCV, 2019.

[32] Vincent S. Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, Christopher Re, and Li Fei-Fei. Scene graph prediction with limited labels. In ICCV, 2019.

[33] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.

[34] Apoorva Dornadula, Austin Narcomey, Ranjay Krishna, Michael Bernstein, and Fei-Fei Li. Visual relationships as functions: Enabling few-shot scene graph prediction. In ICCVW, 2019.

[35] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NeurIPS, 2017.

[36] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.

[37] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In NeurIPS, 2018.

[38] Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, 2019.

[39] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, 2019.

[40] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. In ECCV, 2020.

[41] Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. SenseBERT: Driving some sense into BERT. In ACL, 2020.

[42] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, et al. What do you learn from context? Probing for sentence structure in contextualized word representations. In ICLR, 2019.

[43] Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. Visualizing and measuring the geometry of BERT. In NeurIPS, 2019.

[44] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[46] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[47] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[48] Ross Girshick. Fast R-CNN. In CVPR, 2015.
[49] Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. Stanza: A Python natural language processing toolkit for many human languages. In ACL, 2020.

[50] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.

[51] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[52] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

[53] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.

[54] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.

[55] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, 2011.

[56] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[57] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[58] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

[59] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.

[60] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.

[61] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL, 2019.

[62] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[63] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2019.

[64] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.

[65] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.

[66] Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. Understanding blind people's experiences with computer-generated captions of social media images. In CHI, 2017.