# LocCa: Visual Pretraining with Location-aware Captioners

Bo Wan¹·³, Michael Tschannen¹, Yongqin Xian², Filip Pavetic¹, Ibrahim Alabdulmohsin¹, Xiao Wang¹, André Susano Pinto¹, Andreas Steiner¹, Lucas Beyer¹, Xiaohua Zhai¹

¹Google DeepMind, Zürich  ²Google, Zürich  ³KU Leuven

Work done during a Google DeepMind internship. Project lead. All authors made significant technical contributions.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

**Abstract.** Image captioning was recently found to be an effective pretraining method, similar to contrastive pretraining. This opens up the largely unexplored potential of using natural language as a flexible and powerful interface for handling diverse pretraining tasks. In this paper, we demonstrate this with a novel visual pretraining paradigm, LocCa, which incorporates location-aware tasks into captioners to teach models to extract rich information from images. Specifically, LocCa employs two tasks, bounding box prediction and location-dependent captioning, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can effortlessly handle multiple tasks during pretraining. LocCa significantly outperforms standard captioners on downstream localization tasks, achieving state-of-the-art results on RefCOCO/+/g, while maintaining comparable performance on holistic tasks. Our work paves the way for further exploration of natural language interfaces in visual pretraining.

## 1 Introduction

Remarkable progress has been made in large-scale visual pretraining [1, 2, 3, 4], where vision models are pretrained on large-scale annotated datasets [5, 6, 4] with a supervised classification loss. Yet, the manual annotation required for such datasets is time-consuming and costly, posing a challenge to scalability. In light of this, modern contrastive pretraining methods [7, 8] extract learning signals from web-crawled image-text pair datasets [9, 10], circumventing the need for extensive manual annotations. Contrastively pretrained models demonstrate remarkable capabilities on zero-shot transfer tasks, especially on downstream applications that require fine-grained visual [11, 12] or textual understanding [13].

More recently, image captioning has been shown to be an alternative visual pretraining task for learning capable vision encoders [14], where an encoder-decoder architecture is pretrained to generate text captions from the image input. Some studies, such as [15, 16], pioneered the joint pretraining of contrastive and generative methods. Typically, encoded image features are fed into two parallel branches: one employs a text encoder to produce sentence embeddings for contrastive learning, while the other utilizes a text decoder to generate image captions. Despite the effectiveness of these works, they typically focus on a holistic understanding of images, often overlooking the region-specific details of the visual content.

The recent success of image captioning [14] and the advancements in multitask learning of decoders [17, 18, 19] open up the largely unexplored potential of using natural language as a
flexible and powerful interface for handling diverse tasks. We demonstrate this with a novel visual pretraining paradigm, LocCa, that enhances the visual representation with location-aware context.

*Figure 1: Overview of LocCa. LocCa consists of a standard vision transformer and a transformer decoder. The vision transformer takes image pixels as input and produces visual tokens that serve as cross-attention inputs to the transformer decoder. The transformer decoder is trained to read out rich information from the visual tokens. We adopt the following three tasks for pretraining: Cap, AREF, and GCAP. (The figure shows an example image of two puffins together with the corresponding Cap, ARef, and GCap outputs, e.g. the phrase "a puffin standing on a cliff edge" paired with its bounding box.)*

Works including [20, 21, 22] investigate the matching of image regions with corresponding text during pretraining. The central concept involves extracting Region of Interest (RoI) features from the image embedding to facilitate contrastive learning with the corresponding textual features. These approaches yield encouraging outcomes on location-sensitive tasks, such as object detection [23, 24, 25, 26] and referring expression [27, 28, 29, 30]. However, they require complex model architectures (e.g., RPN [24] and FPN [25]) for RoI generation. Also, given the presence of multiple object candidates within an image, region-wise matching becomes computationally demanding.

By contrast, LocCa is a simple yet effective location-aware captioning pretraining method, as shown in Figure 1, which uses an autoregressive decoder as an interface to handle additional location-aware pretraining tasks. Concretely, besides the image-to-text captioning task, LocCa also pretrains the model with two location-aware tasks: (i) automatic referring expressions, which amounts to predicting bounding box coordinates from automatically generated captions for specific image regions, and (ii) grounded captioning, which jointly predicts box coordinates and captions from the image. Specifically, LocCa leverages a multi-task decoder [17] for pretraining, where the model output is conditioned on a task-specific prefix for each task. Thanks to the vision transformer being shared across tasks, the additional localization losses are relatively cheap to compute, while the inference speed is identical to that of standard image-captioning pretrained models. Our experimental results show that LocCa performs significantly better on downstream tasks that require localization capabilities, while maintaining the same level of capability on holistic tasks.

We summarize our contributions as follows: (i) For the first time, we explore location-aware tasks as proxies for generative visual pretraining (as opposed to transfer/instruction tuning in prior works), enabling flexibly customized inference (detailed in Sec. 3.3); (ii) without bells and whistles, LocCa achieves state-of-the-art results on localization tasks while preserving competitive performance on holistic tasks; and (iii) when integrated into vision-language models, i.e., PaLI-3 [31], the LocCa vision encoder outperforms strong SigLIP baselines [13].

## 2 Related Works

Contrastive visual pretraining is a prominent direction in training vision and vision-language foundation models. Early works [32, 33, 34, 35] explore image-only contrastive losses by matching different views of the same image in the self-supervised learning setting.
In vision-language model pretraining, CLIP [7] and ALIGN [8] show that a two-tower architecture trained with a contrastive objective on noisy image-text pairs can learn highly transferable image and text representations for various downstream tasks. There have been many follow-up works [10, 36, 13, 37] that further improve zero-shot image classification and retrieval performance. Notably, [20, 21] propose to incorporate location cues by contrastively aligning image regions and text phrases. In contrast, our work focuses on learning a location-aware vision-language model with a generative loss.

A natural alternative to contrastive pretraining is image captioning: rather than matching image and text embeddings, one tries to predict captions from an image embedding. Early works investigate this approach at small scale [38, 39, 40, 41]. Later works augment large-scale contrastive pretraining with a captioning loss [15, 16, 42], or scale up captioning as a stand-alone task without investigating transfer to a broad set of vision tasks [43, 44]. [14] recently showed that image captioning alone leads to visual representations competitive with contrastive pretraining.

Many recent large multimodal models are trained with a mixture of tasks [45, 18, 19, 46, 10, 47, 31, 48, 49]. Among the most popular types of tasks are those which can be formulated by mapping an image to a text string, including captioning, detection, and VQA [50, 45, 18, 19, 10, 47, 31, 48]. Another popular group of tasks are dense prediction tasks such as segmentation and depth prediction [19, 46]. While several studies have enhanced model pretraining by incorporating location information, their methodologies primarily leverage either pretrained language decoders [18, 51, 52] or a pretrained cross-modal encoder-decoder [51] to integrate vision and language features for multitasking purposes, often neglecting the independent significance of visual pretraining from scratch. Furthermore, there is a trend towards co-training on images, video, and audio [53, 54, 55], highlighting the multifaceted nature of current multimodal research. Crucially, essentially all of these works rely on pretrained vision and language backbones, and merely fine-tune those together on the described tasks. Here, by contrast, we use multi-task pretraining to train visual encoders from scratch.

## 3 Location-aware Captioner

In this section, we introduce the location-aware image captioner LocCa for multitask pretraining. LocCa builds on an image captioner but provides a recipe for integrating location-aware information during model pretraining.

### 3.1 Pretraining tasks

The pretraining phase of LocCa draws inspiration from pioneering works that have successfully integrated a unified decoder for multitasking based on pretrained models [18, 10, 19, 46, 52], utilizing a task-specific prefix for each distinct task. This enhances the model's ability to handle multiple tasks concurrently. For conventional image captioning, the process involves taking an image $x$ as input and generating a sequence of text tokens $y = [y_1, y_2, \dots, y_n]$. In the LocCa framework, a task-specific prefix, `Cap:`, is added to the beginning of the caption sequence to designate the task at hand. Moreover, LocCa integrates two additional location-aware tasks during its pretraining phase: automatic referring expression (AREF) and grounded captioning (GCAP).
These tasks are inspired by referring expression comprehension [27, 28, 29, 30] and dense captioning [56, 57, 58, 59, 60], respectively. The key difference is that LocCa predicts both regional captions and box coordinates sequentially with task prefixes, instead of relying on either caption or box conditional inputs (see Fig. 1).

The foundation of LocCa's pretraining is built upon dense, automatically generated region annotations. Each image $x$ is associated with a comprehensive set of annotations $\{(b, c)\}$, where $b \in \mathbb{N}^4$ denotes the bounding box coordinates and $c$ represents the corresponding textual description or label. For every bounding box, two distinct prompts are generated to cater to the aforementioned location-aware tasks: `ARef: {c} : {b}` for automatic referring expression and `GCap: {b} : {c}` for grounded captioning, prefixed with `ARef:` and `GCap:`, respectively. These prompts are then tokenized to produce the sequence $y$ for each task, facilitating pretraining with a text interface. For each image, LocCa utilizes the same visual features extracted by the image encoder and performs the three tasks with the same decoder in parallel. This pretraining scheme aims to make LocCa adept at linking fine-grained regional visual elements with appropriate textual descriptions.

### 3.2 Model details

**Architecture.** LocCa utilizes a conventional encoder-decoder framework, where the encoder comprises a Vision Transformer [2] that transforms the input image $x$ into a sequence of feature embeddings. The decoder, built on the Transformer architecture [61], processes these image features, employing cross-attention in each layer to integrate visual and textual information effectively.

**Autoregressive decoding.** In the decoding stage, LocCa uses causal attention masks [61] to guide the prediction of each token in the sequence, ensuring that each token is generated based on the ones before it, in a step-by-step manner. This setup helps in creating coherent and context-aware captions, drawing from the visual cues encoded earlier and maintaining a logical flow in the generated text.

**Parallel prediction.** Inspired by [14], LocCa also adopts parallel prediction for a fraction of the training examples (concretely, 50%) in the standard image captioning task. This technique requires the model to predict the caption tokens independently and in parallel, focusing exclusively on visual information, which prevents the model from relying on preceding text tokens to predict the following ones. This strategy was shown to improve the learned representation on a range of tasks [14].

**Objective.** The parameters $\theta$ of LocCa are optimized by maximizing the log-likelihood $\sum_{i=1}^{|y|} \log P_\theta(y_i \mid y_{<i}, x)$, where $x$ is the input image and $y$ the target token sequence.
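
To make the text interface concrete, here is a minimal sketch of how the three prefixed target strings could be assembled from one image's automatically generated annotations. The prefix formats follow the paper; the function names (`format_box`, `make_targets`), the box serialization, and the helper structure are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch (assumed helper names, not the authors' code): building
# the Cap, ARef, and GCap target strings for one image from its automatically
# generated region annotations {(b, c)}.

def format_box(box):
    # How box coordinates are serialized as text is an assumption here;
    # the paper specifies only the prefix formats "ARef: {c} : {b}" and
    # "GCap: {b} : {c}".
    return "[" + ", ".join(str(int(v)) for v in box) + "]"


def make_targets(caption, regions):
    """caption: holistic image caption; regions: list of (box, text) pairs."""
    targets = ["Cap: " + caption]  # standard captioning task
    for box, text in regions:
        targets.append(f"ARef: {text} : {format_box(box)}")  # text -> box
        targets.append(f"GCap: {format_box(box)} : {text}")  # box -> text
    return targets


# Example built from the Figure 1 image:
for t in make_targets(
    "A picture with a puffin standing on a cliff edge and another puffin "
    "curled up in the back.",
    [((20, 480, 150, 200), "a puffin standing on a cliff edge")],
):
    print(t)  # each string is tokenized and used as a decoder target
```

Each string would then be tokenized into the sequence $y$ and decoded by the shared decoder, so adding the two location-aware tasks changes only the targets, not the architecture.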
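
The objective and the parallel-prediction variant can be sketched as follows. This is a minimal illustration assuming that parallel prediction is realized by feeding uninformative decoder inputs; the exact mechanism used in [14] and LocCa (e.g., mask-token inputs versus modified attention masks) is not specified in the text above, and the token ids below are placeholders.

```python
import numpy as np


def decoder_inputs(targets, parallel, bos_id=0, mask_id=1):
    """Build decoder input ids from target ids y_1..y_n.

    Autoregressive mode: standard right shift, so position i can condition on
    the preceding tokens y_<i. Parallel mode: every input is an uninformative
    mask token, so each caption token must be predicted from the visual tokens
    (via cross-attention) alone. bos_id / mask_id are placeholder ids.
    """
    targets = np.asarray(targets)
    if parallel:
        return np.full_like(targets, mask_id)
    return np.concatenate([[bos_id], targets[:-1]])


def sequence_log_likelihood(token_log_probs, targets):
    """sum_i log P_theta(y_i | ., x) for one target sequence.

    token_log_probs: [seq_len, vocab] log-probabilities from the decoder.
    targets: [seq_len] integer token ids. Training maximizes this quantity
    (equivalently, minimizes the token-level cross-entropy).
    """
    targets = np.asarray(targets)
    return float(np.sum(token_log_probs[np.arange(len(targets)), targets]))
```

Per the text above, the location-aware targets are always decoded autoregressively, while the standard captioning task uses the parallel variant for roughly half of the training examples.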