# Cycle-Consistency Learning for Captioning and Grounding

Ning Wang¹, Jiajun Deng², Mingbo Jia¹
¹Huawei Inc. ²University of Adelaide, Australian Institute for Machine Learning
wn6149@mail.ustc.edu.cn, jiajun.deng@adelaide.edu.au, jiamingbo@huawei.com

## Abstract

We show that visual grounding and image captioning, which act as two mutually inverse processes, can be bridged for collaborative training through careful design. Consolidating this idea, we introduce CyCo, a cycle-consistent learning framework that ameliorates the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; and (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one exhibits performance competitive with fully supervised counterparts. Our image captioning model can freely describe image regions while showing impressive performance on prevalent captioning benchmarks.

## Introduction

Recent decades have witnessed great success in vision-language (VL) related fields. Sharing the primary target of bridging the modality gap between vision and language, deep neural networks addressing VL tasks generally share the pre-training objective, model structure, large-scale training corpus, etc. However, at downstream fine-tuning time, these tasks are typically tackled individually or simply combined in a multi-task training paradigm. In this work, we devote our efforts to two VL downstream tasks, image captioning (Vinyals et al. 2015; Anderson et al. 2018) and visual grounding (Mao et al. 2016; Yu et al. 2016), and explore their inherent relationship to enable effective joint training.

To match the granularity of visual parts in visual grounding, we first extend image captioning to a more general scenario in which the model is intended to describe a given region. We define this generalized task as regional image captioning, which is similar to dense captioning (Johnson, Karpathy, and Fei-Fei 2016) but free of the requirement of object detection. In particular, the conventional task definition of image captioning (Vinyals et al. 2015) is a special case of regional image captioning that regards the whole image as a region.

Figure 1: The proposed framework jointly optimizes image captioning and visual grounding models via cycle-consistency learning. In (a), our method computes the region (box) consistency in the captioning-to-grounding optimization cycle. In (b), our framework measures the caption consistency in the grounding-to-captioning cyclic process.

For visual grounding, which is also known as referring expression comprehension (Mao et al. 2016; Yu et al. 2016) and phrase localization (Kazemzadeh et al. 2014; Plummer et al. 2015) in the literature, we maintain the original task target: localizing the corresponding region (generally denoted by a bounding box) described by a given language expression.
Despite inspiring progress, image captioning and visual grounding still suffer from several limitations. Visual grounding simultaneously requires text descriptions and accurate object bounding boxes for model optimization, and these fine-grained image-text-box triplets are laborious to collect. How to optimize the grounding model with limited data annotations has therefore received considerable attention (Liu et al. 2021; Wang et al. 2021a). As for image captioning, existing methods typically focus on describing the whole image; the capabilities of modeling region relationships and properly describing them are largely overlooked. We argue that a robust image captioner should be able to freely describe the image, from an arbitrary region to the whole image. Similar to the grounding task, obtaining this regional captioning ability also requires sufficient image-text-box data, increasing the training cost.

In this paper, we introduce a joint learning framework, namely CyCo, to ameliorate the training pipelines of image captioning and visual grounding via cyclic consistency. Our core motivation is that visual grounding and regional image captioning can be regarded as inverse processes of each other. Specifically, a visual grounding model takes an image and a region-specific description as inputs to predict the corresponding bounding box, while regional image captioning receives the location of a region to produce a region-aware caption. When one's output is taken as the other's input, the two tasks naturally establish a cyclic structure. As shown in Figure 1 (a), the bounding box generated by the grounding model is expected to be consistent with the input box of the regional image captioner; in this process, the framework merely needs an initialized bounding box for training. As depicted in Figure 1 (b), the text description produced by the regional image captioner is expected to be consistent with the input referring text of the grounding model; as a result, the models in this cycle are free of bounding box annotations.

The proposed cycle-consistency learning framework enjoys the following merits. (1) We bridge two independent VL tasks in a unified framework, so our method can potentially absorb the training data from both tasks and share model parameters for collaborative training. (2) Thanks to joint training and online data augmentation (iterative pseudo-label generation) in the cyclic learning process, our framework further improves the performance of the supervised grounding model under the same training data. (3) After collaborative training, our framework yields a strong image captioning model that can describe visual content at different spatial levels, i.e., from a sub-region to the global image. (4) The proposed framework allows semi-weakly supervised training of the grounding model, which needs only limited fully annotated images plus many more images with only bounding box annotations or only language expression labels.
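To make the two cycles concrete, below is a minimal, schematic sketch of one joint training step in PyTorch-style pseudocode. The module interfaces (`captioning_model.generate`, `grounding_model(...)`) and loss functions are illustrative assumptions, not the exact implementation.

```python
def cycle_consistency_step(images, boxes, expressions,
                           grounding_model, captioning_model,
                           box_loss_fn, caption_loss_fn):
    """One schematic joint training step covering both cycles (interfaces assumed)."""
    # Captioning-to-grounding cycle (Figure 1a): start from an initialized box,
    # caption the region, ground the generated caption, and require the
    # recovered box to be consistent with the initial box.
    pseudo_captions = captioning_model.generate(images, boxes)   # pseudo caption labels
    recovered_boxes = grounding_model(images, pseudo_captions)
    box_cycle_loss = box_loss_fn(recovered_boxes, boxes)         # e.g., GIoU + smooth-L1

    # Grounding-to-captioning cycle (Figure 1b): start from a referring expression,
    # ground it to a pseudo box, caption that box, and require the generated
    # caption to be consistent with the input expression.
    # (In practice, gradients through pseudo labels may be stopped.)
    pseudo_boxes = grounding_model(images, expressions)          # pseudo box labels
    caption_logits = captioning_model(images, pseudo_boxes, expressions)
    caption_cycle_loss = caption_loss_fn(caption_logits, expressions)

    return box_cycle_loss + caption_cycle_loss
```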
In summary, we make three-fold contributions:
- We present a novel cycle-consistency learning framework to bridge two independent vision-language tasks, image captioning and visual grounding.
- We design a simple regional image captioner and a Transformer-based grounding model, and organize them in a unified framework with a weight-sharing architecture for efficient end-to-end learning.
- Extensive experiments validate the effectiveness of the proposed cycle-consistency learning framework.

## Related Work

**Vision-language Pre-training.** Vision-language (VL) pre-training algorithms (Radford et al. 2021; Li et al. 2020, 2022) aim to bridge the domain gap between vision and language representations. Recent dual-encoder methods such as CLIP (Radford et al. 2021) align the representations using contrastive learning. Despite their outstanding performance, this light interaction fails to deeply fuse VL representations for generation tasks. In contrast, recent VL pre-training approaches (Zhou et al. 2020; Li et al. 2020, 2021, 2022) adopt a relatively heavy Transformer architecture (Vaswani et al. 2017) to achieve deeper multi-modal interaction. Inspired by these works, we also conduct cross-modal pre-training to prompt the downstream VL tasks.

**Visual Grounding.** Traditional visual grounding methods typically follow a two-stage pipeline, which generates plentiful region proposals in the first stage and selects the best-matched one via the language expression in the second stage (Yang, Li, and Yu 2019; Liu et al. 2019). Recently, one-stage visual grounding approaches have gained increasing attention. They generally embed the linguistic information into a one-stage object detector (Yang et al. 2019) or model multi-modal representations via Transformer (Deng et al. 2021, 2023) for efficient visual grounding. Different from the above supervised algorithms, weakly supervised grounding models learn region-phrase correspondence with only language expressions: they first obtain a set of ROIs (Regions of Interest) using object detectors, and then mine the correspondence between ROIs and query expressions for model training (Wang et al. 2021a; Liu et al. 2021). In this work, we train the grounding model in a semi-weakly supervised manner with the help of a captioning model, and our framework is free of external object detectors for data pre-processing.

**Image Captioning.** Image captioning aims to generate a human-readable sentence describing the image content. Captioning algorithms (Huang et al. 2019; Anderson et al. 2018; Hu et al. 2021b) typically utilize object detectors to extract ROI features, or simply exploit grid features for efficient visual representation modeling (Wang et al. 2021b; Li et al. 2022). After visual feature extraction, captioning models utilize a decoder such as a Transformer to generate the sentence (Wang, Xu, and Sun 2022; Wang et al. 2023a,b). Previous algorithms exploit region information to facilitate image captioning (Chen et al. 2020a; Kinghorn, Zhang, and Shao 2018; Cornia, Baraldi, and Cucchiara 2019), but they still focus on describing the global image content. The dense captioner (Johnson, Karpathy, and Fei-Fei 2016) mainly focuses on detecting and describing local ROI regions. In contrast, our image captioner is designed to freely describe both global and regional contexts. Recently, some methods (Wang et al. 2022; Yu et al. 2017) jointly train the captioning and grounding models in a multi-task fashion.
Mao et al. (2016) utilize the grounding model as a verification module to push the generated referring expressions to be unambiguous. Different from previous arts, our method bridges the captioning and grounding models in a cyclic framework and explores two different cycle-consistency constraints for collaborative training.

**Cycle-Consistency Learning.** To bridge one modality or domain to another, cycle-consistency learning has been explored extensively in visual object tracking (Wang et al. 2019; Wang, Jabri, and Efros 2019), machine translation (He et al. 2016), unpaired image-to-image translation (Zhu et al. 2017), visual question answering (Shah et al. 2019), image captioning (Guo et al. 2019), etc. Different from previous arts that explore consistency within a single modality or a single task, we jointly optimize two VL tasks to form different cycles. To our knowledge, the cyclic consistency between image captioning and visual grounding has rarely been touched in the literature. Further, we explore the potential of our framework in different training scenarios, including semi-weakly supervised and fully supervised training.

Figure 2: In (a), we exhibit the model architecture of our proposed joint training framework of visual grounding and regional image captioning. The two models share a visual encoder (ViT) and leverage different Transformer blocks for individual tasks. In (b), we show the cyclic consistency learning processes of our framework, including a grounding-to-captioning cycle and a captioning-to-grounding cycle.

## Methodology

Our method follows the pretrain-and-finetune paradigm. At the pre-training stage, we leverage the widely adopted training objectives (i.e., image-text contrastive loss, matching loss, and language modeling loss) to align and fuse the visual and linguistic representations. At the fine-tuning stage, both the visual grounding and image captioning models reuse the pre-trained model architecture while capitalizing on task-specific head networks. During fine-tuning, we further develop the cycle-consistency learning framework to jointly optimize the grounding and captioning models. The detailed model architecture and the proposed cycle-consistency learning framework are illustrated in Figure 2.

### Revisiting Model Pre-training

Our pre-trained vision-language model follows the BLIP approach (Li et al. 2022). We briefly review its vision encoder, image-grounded text encoder, and image-grounded text decoder, which are highly related to our downstream tasks.

**Vision Encoder.** We exploit the commonly used ViT-B/16 network (Dosovitskiy et al. 2020) as the vision encoder. ViT-B/16 is composed of a stack of 12 Transformer encoder layers with 12 heads in each multi-head attention layer. Given an image, we first split it into small patches of equal 16×16 size and project them into feature vectors, known as vision tokens. We denote the final encoded vision tokens as $v$.
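For reference, the following is a minimal patch-embedding sketch in the spirit of ViT-B/16 (16×16 patches projected to 768-dim tokens); it is a generic illustration of how vision tokens are formed, not the exact encoder implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and project each to a token (ViT-style)."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to "split into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, 768, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, 768) -> vision tokens v

# Example: a 224x224 image yields 14 * 14 = 196 vision tokens.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
assert tokens.shape == (1, 196, 768)
```

In the full encoder, position embeddings and a class-style token are additionally prepended before the stacked Transformer encoder layers.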
**Image-Grounded Text Encoder.** In this block, we first append a special [ENC] token at the beginning of the text sequence. The text tokens are then converted to embeddings via a word embedding layer. The text encoder leverages bi-directional self-attention to further encode the text embeddings, and aggregates visual information through cross-attention with the vision tokens for multi-modal fusion. The output embedding of [ENC] contains a rich multi-modal representation of the image-text pair, which is exploited to compute the image-text matching (ITM) loss: specifically, we add a binary classification head on top of it to predict whether an image-text pair is matched.

**Image-Grounded Text Decoder.** This text decoder block is similar to the image-grounded text encoder above, but replaces the bi-directional self-attention with causal self-attention to facilitate text generation. A special [DEC] token is used as the beginning signal of a sequence. The image-grounded text decoder is optimized by the language modeling (LM) loss, which maximizes the likelihood of the text in an autoregressive manner.

### Cycle-Consistency Learning Framework

In this section, we first introduce how to transfer the pre-trained model to the downstream visual grounding and regional image captioning tasks. Then, we show how to jointly optimize them by virtue of the cyclic constraints.

**Visual Grounding (VG).** To facilitate the following introduction of cycle-consistency learning, we depict the visual grounding model in a holistic view: it takes the visual features $v$ and description tokens $x$ as inputs, and outputs a predicted bounding box $b_{\mathrm{pred}}$ as follows:

$$b_{\mathrm{pred}} = \mathrm{Model_{VG}}(v, x). \quad (1)$$

In our framework, the visual grounding model $\mathrm{Model_{VG}}(\cdot, \cdot)$ reuses the image-grounded text encoder block, which fuses the visual and textual features to predict the region localization. Following the pre-training setting, we also add a special [ENC] token at the beginning of the tokenized text. The bi-directional self-attention then encodes the text tokens, and cross-attention injects visual information into them. After multi-layer visual-linguistic fusion via the Transformer structure, the output embedding of the [ENC] token contains rich cross-modal context, which is leveraged for bounding box regression via a regression head. The regression head is a simple three-layer multi-layer perceptron (MLP) with ReLU activations between layers, and its output is the 4-dim box coordinates $b_{\mathrm{pred}} = (x, y, w, h)$. In the fine-tuning stage, we normalize the ground-truth region bounding box by the scale of the image to obtain $b_{\mathrm{gt}} = (\hat{x}, \hat{y}, \hat{w}, \hat{h})$, and combine the generalized IoU (GIoU) loss $\mathcal{L}_{\mathrm{GIoU}}(b_{\mathrm{pred}}, b_{\mathrm{gt}})$ (Rezatofighi et al. 2019) and the smooth L1 loss $\mathcal{L}_{\mathrm{smooth\text{-}L1}}(b_{\mathrm{pred}}, b_{\mathrm{gt}})$ to optimize the grounding model.
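As a concrete reference, the sketch below shows a regression head and the combined GIoU + smooth-L1 objective, assuming center-format (cx, cy, w, h) boxes normalized to [0, 1] and equal loss weights; these choices are assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

class BoxRegressionHead(nn.Module):
    """Three-layer MLP with ReLU that maps the [ENC] embedding to a 4-dim box."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),
        )

    def forward(self, enc_embedding):
        # Sigmoid keeps predictions normalized to the image scale, matching b_gt.
        return self.mlp(enc_embedding).sigmoid()

def grounding_loss(b_pred, b_gt, giou_weight=1.0, l1_weight=1.0):
    """GIoU + smooth-L1 loss on normalized boxes; the loss weights are assumptions."""
    pred_xyxy = box_convert(b_pred, in_fmt="cxcywh", out_fmt="xyxy")
    gt_xyxy = box_convert(b_gt, in_fmt="cxcywh", out_fmt="xyxy")
    giou = torch.diag(generalized_box_iou(pred_xyxy, gt_xyxy))   # per-sample GIoU
    loss_giou = (1.0 - giou).mean()
    loss_l1 = F.smooth_l1_loss(b_pred, b_gt)
    return giou_weight * loss_giou + l1_weight * loss_l1
```

Combining the two terms balances scale-invariant overlap (GIoU) with coordinate-level accuracy (smooth L1), as is common in Transformer-based box regression.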
**Regional Image Captioning (IC).** Similar to visual grounding, we formulate the regional image captioning model $\mathrm{Model_{IC}}(\cdot, \cdot)$ as a black box, which takes the image features $v$ and a specific region (denoted by the box coordinate $b$) as inputs to predict the region-related language expression $x_{\mathrm{pred}}$ as follows:

$$x_{\mathrm{pred}} = \mathrm{Model_{IC}}(v, b). \quad (2)$$

This regional image captioner mainly reuses the image-grounded text decoder from the pre-training stage to generate the language expression. After pre-training with the language modeling (LM) loss, the text decoder already has a zero-shot captioning capability to some extent. Nevertheless, different from a classic image captioner, our model is required to be region-aware. To this end, in the fine-tuning stage, we project the box coordinate $b$ to a regional embedding via a fully-connected layer: $e_{\mathrm{box}} = \mathrm{FC_{box}}(b)$. This regional embedding is added to the vision tokens to obtain the region-aware visual representations $v'$:

$$v' = v + e_{\mathrm{box}}. \quad (3)$$

Different from the bi-directional self-attention in visual grounding, the captioning model utilizes causal self-attention to facilitate text generation. As shown in Figure 2 (a), a classification head on top of the Transformer is used to generate tokens over the vocabulary. For a text sequence $x = \{x_1, x_2, \ldots, x_n\}$, the image captioner generates token $x_t$ based on the previous tokens $x_{<t}$.
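As a minimal illustration of the region conditioning in Eq. (3), the sketch below projects the normalized box coordinates through a single fully-connected layer and adds the resulting embedding to every vision token; the dimensions and broadcasting choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RegionConditioner(nn.Module):
    """Computes v' = v + FC_box(b): a box-derived embedding added to the vision tokens."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.fc_box = nn.Linear(4, embed_dim)       # projects (x, y, w, h) to an embedding

    def forward(self, vision_tokens, box):
        # vision_tokens: (B, N, D); box: (B, 4), normalized by the image size.
        e_box = self.fc_box(box)                    # (B, D)
        return vision_tokens + e_box.unsqueeze(1)   # broadcast over all N tokens

# Example: condition 196 vision tokens on one region per image.
v = torch.randn(2, 196, 768)
b = torch.rand(2, 4)
v_prime = RegionConditioner()(v, b)                 # region-aware visual representation
```

The conditioned tokens are then attended to by the text decoder through cross-attention, so the generated caption is steered toward the specified region.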