# Scaling Open-Vocabulary Object Detection

Matthias Minderer  Alexey Gritsenko  Neil Houlsby
Google DeepMind
{mjlm, agritsenko, neilhoulsby}@google.com

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvements: with a ViT-L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling. Code and checkpoints are available on GitHub.¹

1 Introduction

Object detection is a core computer vision task with many real-world applications. Consequently, there is great interest in improving detection models, especially in the open-vocabulary domain. For image-level tasks, large improvements have been achieved through contrastive pretraining of vision-language models, which is massively scalable because it can use naturally abundant weak supervision in the form of image-text pairs from the Web [30, 12, 29]. Since no such natural supervision data exists for localization tasks, open-vocabulary detection models typically build on pretrained image-level encoders [9, 19, 26, 46, 22, 1, 39, 47]. However, due to the scarcity of detection data and the fragility of pretrained representations, detection-training stages of these models have typically had to be relatively brief, which limits final detection performance and scaling potential.

The scarcity of detection data can be addressed with self-training. In self-training, an existing detector is used to predict bounding boxes on unlabeled images to generate data for training better detectors [31, 48, 35]. By combining open-vocabulary detectors with Web image-text data, such pseudo-labeling can produce practically unlimited amounts of open-vocabulary detection training data that leverages the image-associated text for semantic supervision. While several works have applied various forms of self-training to open-vocabulary object detection [46, 1, 47, 39, 38], they have done so at relatively small scales, comparable to the size of human-annotated detection datasets and much smaller than the datasets used for image-level training.

¹https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
[Figure 1, right panel: LVIS APrare vs. total examples seen (including repetitions) for OWL-ST+FT B/16, L/14, and G/14, the OWL L/14 annotator, and the best previous OWL and F-VLM models.]

Figure 1: Overview of our method. Left: Our method has three steps: (1) Generate pseudo-box annotations on WebLI with OWL-ViT L/14, queried with caption N-grams. (2) Train new models on the pseudo-annotations. (3) Optionally, fine-tune on human annotations. Right: Zero-shot detection performance on LVISrare after fine-tuning on LVISbase. Neither the annotator nor our models have seen any human-generated box annotations for LVISrare classes. Our self-training approach improves over other methods even at moderate amounts of training (e.g. the OWL-L/14 model we use as annotator; black marker), and continues to improve as training is scaled up. Horizontal black lines indicate previous state-of-the-art open-vocabulary detectors (best OWL and best F-VLM), which did not see LVISrare classes during training.

To scale detection self-training further, we take guidance from image-level methods, where the principle has been to leverage weak supervision in the largest possible amount [30, 12, 29, 42]. We identify three key ingredients for optimizing the use of weak supervision for detection: choice of label space, filtering of pseudo-annotations, and training efficiency. Prior methods have typically used human-curated label spaces or complex concept mining [47, 39, 38, 46] and strict filtering, keeping just the single largest [47] or highest-scoring [1] pseudo-box for each image. In contrast, we argue that we should let the data do the work and therefore apply little processing and filtering. We propose to simply use all possible N-grams of the image-associated text as detection prompts for that image, and apply only weak confidence filtering to the resulting pseudo-labels.

We apply this self-training recipe to the OWL-ViT detection architecture [26] and call it OWL-ST. To increase the number of examples seen for a given amount of compute, we also introduce OWLv2, an optimized architecture with improved training efficiency. Combining the OWL-ST recipe with the OWLv2 architecture surpasses prior state-of-the-art methods already at moderate amounts of self-training, comparable to the training amounts of previous methods (Figure 1). Scaling self-training to billions of examples yields further large improvements. For example, our ViT-L/14-based model, trained on 2.3B image-text pairs and fine-tuned on LVISbase, achieves 44.6% zero-shot LVIS mAPrare, which is a 36% relative improvement over the prior state of the art (32.8% mAPrare for F-VLM R50x64 [19]). Our largest model, ViT-G/14, reaches 47.2% mAPrare. We also evaluate our models on a suite of "in the wild" datasets [21] and study the trade-off between fine-tuned and open-vocabulary performance. We find that strong in- and out-of-distribution performance is possible with weight ensembling [37]. Finally, our analysis of the scaling behavior of OWL-ST suggests that self-training has further potential for leveraging abundantly available weak supervision for open-vocabulary object detection.

2 Related Work

2.1 Scaling Vision Models

Vision models have recently seen large advances in model and training scale, leading to improved performance on many image-level tasks. On the architecture side, Vision Transformers have been shown to scale more efficiently than prior architectures [17].
Task performance improves predictably as training data and compute are increased [42], with recent work showing continued improvements for models with up to 22 billion parameters [6]. We apply these findings to object detection. On the data side, contrastive pretraining of vision-language models (VLMs) [30] has unlocked the use of abundantly available image-text pairs from the Web as weak supervision, with improved results if more data is used [12, 28]. VLMs, which embed images and text into a shared space, also enable open-vocabulary applications where prior models were limited to fixed label spaces. Here, we use pretrained CLIP [30] and SigLIP [43] encoders as backbones for our detector.

2.2 Open-Vocabulary Object Detection

Much recent work aims to transfer the open-vocabulary capabilities of VLMs to localization tasks such as object detection. A first wave of VLM-based object detection methods either distilled VLM predictions for cropped image regions (e.g. ViLD [9]), or added detection heads directly to frozen (F-VLM [19]) or fine-tuned (OWL-ViT [26]) VLM encoders. A challenge identified by these works is to protect the VLM from forgetting its open-vocabulary knowledge while training the detection heads on the relatively little available detection data.

2.3 Scaling Open-Vocabulary Detection with Weak Supervision

Given that earlier methods identified detection data as a limiting factor in open-vocabulary detection performance, more recent works focus on using weak supervision directly for detection training, rather than just during VLM pretraining. There are two main approaches:

Some methods use self-training, in which an existing detector is used to predict pseudo-boxes for images where image-level labels or captions, but no human box annotations, are available. Better detectors can then be trained on the pseudo-annotations. For example, RegionCLIP [46] generates pseudo-boxes using nouns parsed from image captions and uses those boxes for localization pretraining. Detic [47] predicts class-agnostic pseudo-boxes on images for which classification labels are available and associates the largest predicted box with the image label. Similar to our approach, 3Ways [1] uses an existing open-vocabulary detector to predict pseudo-boxes on captioned images, but uses the whole caption as a prompt, instead of dividing it into multiple prompts as we do.

Other methods propose grounding losses that directly train a detector on weak supervision such as image-level labels or captions. These methods pretrain models to align class-agnostic pseudo-boxes with words from image-associated text and rely on human-generated detection data for fine-tuning. Major examples of this approach are GLIPv1/v2 [22, 45] and DetCLIPv1/v2 [39, 38].

In principle, these approaches unlock Web-scale training for detection, but prior methods rarely go much beyond 10M examples and instead focus on the model architecture and training loss. Here, we keep architecture and loss simple, and focus on scaling up the training data, since this was successful for image-level models. A similar approach was recently applied with good results to class-agnostic segmentation in the Segment Anything work [16]. Together with our results on text-conditioned localization, this suggests that scaling up self-training is a powerful and general method for improving performance on fine-grained vision tasks.
3 Method

We propose a simple self-training approach with three steps: (1) Use an existing open-vocabulary detector to predict bounding boxes for a large Web image-text dataset. (2) Self-train a new detector on the pseudo-annotations. (3) Optionally, fine-tune the self-trained model briefly on human-annotated detection data (Figure 1, left). Our goal is to optimize the key components of this approach (label space, annotation filtering, and training efficiency) such that it provides strong and scalable open-vocabulary performance with few human annotations.

3.1 Generating Web-Scale Open-Vocabulary Object Annotations

We use the WebLI dataset [4] as the source of weak supervision for self-training. WebLI is a large dataset of images and texts available on the public Web. The dataset consists of approximately 10B images and associated alt-text strings, which can be thought of as noisy image captions. For images whose alt-text is not in English, we use an automatically generated English translation [4].

We use OWL-ViT CLIP L/14 [26] to annotate all 10B WebLI images with bounding-box pseudo-annotations. OWL-ViT is an open-vocabulary object detector. Given an image, the model first detects objects in the image in a class-agnostic way. Then, given a list of free-text queries, the model produces scores indicating the likelihood that each detected object is associated with each text query.

A crucial design choice for open-vocabulary pseudo-labeling is the annotation label space. Methods in the literature vary widely but typically fall somewhere between two extremes: (1) use a fixed, human-curated label space for all images (e.g. [47]), or (2) machine-generate per-image queries from image-associated text (e.g. [1]). We implement both and compare their performance in Section 4.3.

Human-curated label space. We performed one pseudo-annotation run by combining the label sets from the LVIS [10], Objects365 [33], Open Images V4 [20], and Visual Genome [18] datasets and removing duplicates and plural forms. In total, this label space contains 2520 common object categories, e.g. "phone", "goatee", "teakettle", "park", "suit (clothing)". See Appendix A.2 for code to generate the full list. Models trained on this label space may not be considered fully open-vocabulary for evaluation datasets whose classes were included in the pseudo-annotation label space (e.g. LVIS), since the evaluation vocabulary is known at training time in this case. However, LVISrare classes are still unseen for all of our models, in the sense that neither the annotator nor the self-trained models have ever seen human box annotations for LVISrare classes.

Machine-generated label space. In a second pseudo-annotation run, we automatically generated queries from the image-associated text. Prior work using image captions as weak supervision for detection often used grammatical parsing to extract noun phrases or concepts [46, 39, 38]. These approaches may add biases that reduce the diversity of extracted queries. To keep such biases to a minimum, we use no grammatical parsing and simply extract all word N-grams up to length 10 from the text associated with a given image and use them as queries for that image. We apply minimal filtering, only removing generic terms like "image" or "png", and queries consisting entirely of stop-words (details in Appendix A.3). Note that, since OWL-ViT uses late image-text fusion, the quality of box localization (as opposed to classification) is not affected by the chosen label space.
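To make the machine-generated label space concrete, the snippet below is a simplified sketch of the Appendix A.3 code (with an abbreviated stop-word list and no generic-word filtering), applied to the Figure 1 example caption "Monarch on a Zinnia":

import nltk

STOPWORDS_EN = {'a', 'an', 'and', 'of', 'on', 'the'}  # abbreviated; full list in Appendix A.3

def caption_to_queries(caption, max_ngram_len=10):
    words = caption.lower().split()
    queries = []
    for ngram in nltk.everygrams(words, max_len=max_ngram_len):
        if set(ngram).issubset(STOPWORDS_EN):  # skip N-grams made up only of stop-words
            continue
        queries.append(' '.join(ngram))
    return queries

print(caption_to_queries('Monarch on a Zinnia'))
# Includes 'monarch', 'monarch on', 'monarch on a', 'monarch on a zinnia', 'zinnia', ...
# ('a' and 'on a' are dropped because they consist only of stop-words).

Each surviving N-gram is used as a detection query for that image.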
Regardless of label space, we ensemble predictions over seven prompt templates such as "a photo of a {}", as described in [26]. For each predicted box, we keep the query with the highest score as its pseudo-label. For each image, we keep all boxes above a score threshold. We study the choice of threshold in Section 4.4. The pseudo-annotations are used as hard labels for self-training.

3.2 Self-training at Scale

We now describe how we use the pseudo-annotations to train better detectors. We use a variant of the OWL-ViT architecture [26] as described below. The image and text encoders are initialized from contrastively trained image-text models (CLIP, unless noted otherwise); the detection heads are randomly initialized. All models are first trained exclusively on pseudo-annotations (self-training). In an optional separate step, models are fine-tuned briefly on human-annotated detection data.

Self-training proceeds similarly to detection training in [26]. In particular, we use the same losses and also augment queries with pseudo-negatives that are randomly sampled from the queries of other images, similar to batch negatives in [1]. Due to the size of our dataset, in contrast to [26], we use no random prompt templates and fewer image augmentations (details in Appendix A.5).

Prior work on image-level tasks shows that pretraining improves performance on downstream tasks well beyond 1 billion examples seen [44, 12, 28, 42], across model sizes. We hypothesize that similar scaling applies to detection self-training. We therefore optimize training efficiency to maximize the number of images seen for a given amount of training compute as follows.

Token dropping. Vision Transformers represent images as an unordered sequence of tokens. Tokens can therefore be reorganized or dropped without changing the model parameters. Various forms of token dropping or pooling have been proposed to improve efficiency [24, 32, 41, 25, 2]. Here, we drop tokens simply based on the pixel variance of the corresponding image patch. Both natural and Web images contain low-variance areas devoid of useful information, e.g. sky, single-color backgrounds, or padding. We find that the lower half of image patches by mean pixel variance can be dropped without loss in detection performance (Appendix A.6). We therefore drop 50% of patches during training in all of our experiments. No patches are dropped during inference.

Table 1: Open-vocabulary detection performance on LVIS and ODinW. Our models are rows 11–16 and 20–21. None of our models have seen any human box annotations for LVISrare classes at any stage of training, so LVIS APval rare (rightmost column) measures zero-shot performance. Numbers in parentheses after APval rare indicate the difference to the prior state of the art, i.e. F-VLM R50x64 in the open-vocabulary (top) part of the table and DetCLIPv2 Swin-L in the curated-vocabulary (bottom) part. "(O+VG)" indicates that O365+VG were used only indirectly (for training the annotator). ODinW numbers of models trained on Open Images data (which overlaps with ODinW) are not directly comparable. APmini refers to the LVIS minival split introduced by MDETR [14]. "–" indicates that a value was not reported.
# | Method | Backbone | Self-training data | Self-training vocabulary | Human box annotations | ODinW13 | APmini all | APmini rare | APval all | APval rare

Open vocabulary (evaluation vocabulary is not available at training time):
1 | RegionCLIP [46] | R50x4 | CC3M | 6k concepts | LVISbase | – | – | – | 32.3 | 22.0
2 | OWL [26] | CLIP B/16 | – | – | O365+VG | – | – | – | 27.2 | 20.6
3 | OWL [26] | CLIP L/14 | – | – | O365+VG | 48.4 | – | – | 34.6 | 31.2
4 | GLIPv2 [45] | Swin-T | Cap4M | tokens | O365+GoldG | 48.5 | 29.0 | – | – | –
5 | GLIPv2 [45] | Swin-B | CC15M | tokens | Five ODs+GoldG | 54.2 | 48.5 | – | – | –
6 | GLIPv2 [45] | Swin-H | CC15M | tokens | Five ODs+GoldG | 55.5 | 50.1 | – | – | –
7 | F-VLM [19] | R50x4 | – | – | LVISbase | – | – | – | 28.5 | 26.3
8 | F-VLM [19] | R50x64 | – | – | LVISbase | – | – | – | 34.9 | 32.8
9 | 3Ways [1] | NFNet-F0 | TODO | captions | LVISbase | – | – | – | 35.7 | 25.6
10 | 3Ways [1] | NFNet-F6 | TODO | captions | LVISbase | – | – | – | 44.6 | 30.1
11 | OWL-ST | CLIP B/16 | WebLI | N-grams | (O+VG) | 48.8 | 31.8 | 35.4 | 27.0 | 29.6 (−3.2)
12 | OWL-ST | CLIP L/14 | WebLI | N-grams | (O+VG) | 53.0 | 38.1 | 39.0 | 33.5 | 34.9 (+2.1)
13 | OWL-ST | SigLIP G/14 | WebLI | N-grams | (O+VG) | 49.9 | 37.8 | 40.9 | 33.7 | 37.5 (+4.7)
14 | OWL-ST+FT | CLIP B/16 | WebLI | N-grams | (O+VG), LVISbase | 48.6 | 47.2 | 37.8 | 41.8 | 36.2 (+3.4)
15 | OWL-ST+FT | CLIP L/14 | WebLI | N-grams | (O+VG), LVISbase | 50.1 | 54.1 | 46.1 | 49.4 | 44.6 (+11.8)
16 | OWL-ST+FT | SigLIP G/14 | WebLI | N-grams | (O+VG), LVISbase | 50.1 | 51.3 | 50.9 | 47.0 | 47.2 (+14.4)

Human-curated vocabulary (evaluation vocabulary may be accessed at training time):
17 | Detic [47] | R50 | IN-21k | LVIS classes | LVISbase | – | – | – | 32.4 | 24.6
18 | DetCLIPv2 [38] | Swin-T | CC15M | Nouns+curated | O365+GoldG | – | 40.4 | 36.0 | 32.8 | 31.0
19 | DetCLIPv2 [38] | Swin-L | CC15M | Nouns+curated | O365+GoldG | – | 44.7 | 43.1 | 36.6 | 33.3
20 | OWL-ST+FT | CLIP B/16 | WebLI | N-grm+curated | (O+VG), LVISbase | 48.9 | 51.1 | 41.9 | 45.6 | 40.5 (+7.2)
21 | OWL-ST+FT | CLIP L/14 | WebLI | N-grm+curated | (O+VG), LVISbase | 48.7 | 55.8 | 50.0 | 50.4 | 45.9 (+12.6)

Instance selection. OWL-ViT is an encoder-only architecture and predicts one bounding box per encoder token. This is inefficient, since there are typically many more encoder tokens than objects (e.g. 5184 tokens at resolution 1008×1008 with patch size 14×14). Most output tokens therefore do not represent objects. We introduce an objectness head which predicts the likelihood that an output token actually represents an object, and compute boxes, class scores, and losses only for the top k tokens by objectness, similar to Efficient DETR [40]. The objectness head receives an encoder token as input and computes a scalar objectness score. The objectness score predicts the future classification score of a token and is supervised by the actual classification score of those tokens that end up being selected and passed on to the classification head. We select approximately 10% of instances by top objectness during training in all of our experiments. During inference, all instances are used.

Mosaics. During self-training, we combine raw images into grids of up to 6×6 to produce a single training example (i.e. a more extreme version of the mosaics in [26]). This has two main motivations: (1) Using mosaics increases the number of raw images seen for a given fixed model input resolution. An alternative is to train using variable image sizes [38], but this would require resizing image position embeddings for each input size. (2) The average resolution and complexity of Web images is lower than that of images in detection benchmarks and applications. Mosaics reduce the average object size and improve small-object performance, similar to large-scale jittering [8], but with less padding. For all self-training experiments, we use 1×1, 2×2, 3×3, 4×4, and 6×6 grids in equal proportions, resulting in an average of 13.2 raw component images per training example.
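As a quick check of the mosaic statistics above, the following snippet computes the average number of raw component images per training example for the grid sizes used during self-training:

grid_sizes = [1, 2, 3, 4, 6]                      # 1x1, 2x2, 3x3, 4x4, and 6x6 grids, sampled in equal proportions
images_per_grid = [g * g for g in grid_sizes]     # [1, 4, 9, 16, 36]
average = sum(images_per_grid) / len(grid_sizes)  # 66 / 5 = 13.2
print(average)                                    # 13.2 raw component images per training example

This is also why the "examples seen" counts in the figure captions must be multiplied by 13.2 to obtain the number of raw WebLI images seen.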
To further improve training efficiency, we also adopt previously proposed practices for large-scale Transformer training [42] (details in Appendix A.7). Together, our improvements reduce training FLOPs by approximately 50% compared to the original OWL-ViT [26] and increase training throughput by 2× (e.g. for L/14 at 840×840 resolution, measured on TPUv3: 11,945.4 vs. 5,357.9 GFLOPs/example; 1.0 vs. 2.2 examples/s/core). We refer to the improved model as OWLv2.

At inference, no token dropping or instance selection is performed. Inference is therefore identical to the original OWL-ViT, i.e. each image encoder token is decoded into a bounding box and a list of per-query classification scores.

[Figure 2 panels: LVIS APfrequent (fine-tuned classes), LVIS APrare (unseen classes), and ODinW13 mean AP ("in the wild" datasets), each plotted against total examples seen (including repetitions), for the three pseudo-label spaces.]

Figure 2: Comparison of pseudo-label spaces. Self-training on a human-curated list of classes yields good downstream performance on these classes, but generalizes poorly to unseen classes and datasets. Open-vocabulary generalization can be improved by obtaining weak but diverse supervision from image-associated text. WebLI image-text data was pseudo-annotated using OWL-ViT CLIP L/14 with one of three label spaces: Curated vocabulary (the union of label spaces from LVIS, Objects365, Open Images V4, and Visual Genome), N-grams (lightly filtered N-grams from the text associated with each image), or a combination of both (N-grams + curated). OWLv2-B/16 models were then self-trained on the pseudo-annotations and fine-tuned on LVISbase. Each point represents a separate fine-tuning run. "Examples seen" refers to the number of images after creating mosaics; the total number of raw images seen is 13.2× that number (Section 3.2).

3.3 Fine-tuning

Self-training on pseudo-annotations alone already yields strong performance (Section 4.2). However, fine-tuning briefly on human annotations can provide significant further benefits. For fine-tuning, we start with the learning rate and optimizer state of the self-trained checkpoint and then continue training on the target dataset while linearly cooling down the learning rate to zero. Fine-tuning of open-vocabulary models involves a trade-off between improving the performance on the fine-tuned classes and losing open-vocabulary performance [30, 29, 37]. We study this trade-off in Section 4.6.

4 Experiments

4.1 Experimental Setup

Models. We use the publicly available OWL-ViT CLIP L/14 model to generate detection pseudo-annotations for the WebLI dataset (10 billion image-text pairs [4]). For all self-training experiments, we use OWL-ViT models modified as described in Section 3.2. Backbones are initialized with the publicly available CLIP [30] checkpoints (B/16 and L/14) or a SigLIP [43] checkpoint (G/14).

Training. Models are first self-trained on the pseudo-annotations for varying durations as indicated. If indicated, after self-training, models are fine-tuned on LVISbase, i.e. the LVIS dataset [10] with all annotations for rare categories removed. Therefore, neither the annotator nor any of our models have seen human-generated annotations for LVISrare classes. Fine-tuning uses mosaics up to 3×3 and is always done until the model has seen 256,000 mosaics (1.1M individual images, roughly equivalent to 100 LVIS epochs).
The image size is 960×960 for /16 models and 1008×1008 for /14 models. See Appendix A.8 for a complete list of hyperparameters.

Evaluation. We use mean average precision (mAP) on LVIS [10] as our main detection metric, where mAPrare indicates open-vocabulary performance on unseen classes. To measure generalization on diverse real-world tasks, we evaluate zero-shot performance on the Object Detection in the Wild (ODinW) benchmark [21]. ODinW is a suite of datasets covering a wide range of domains. We report the average mAP on the subset of 13 ODinW datasets introduced in [22] and provide performance on individual datasets in Appendix A.9.2. To avoid leakage of evaluation data into the training set, WebLI was filtered to remove images similar to those in the train, validation, and test splits of 68 common computer vision datasets, including COCO/LVIS, Objects365, and Visual Genome, but not the ODinW datasets (see [4] for details).

[Figure 3 panels: LVIS APfrequent (fine-tuned classes), LVIS APrare (unseen classes), and ODinW13 mean AP ("in the wild" datasets), each plotted against total examples seen (including repetitions), for confidence thresholds 0.1, 0.3, 0.5, and 0.7.]

Figure 3: Impact of pseudo-annotation filtering by detection confidence on self-training effectiveness. Pseudo-labels (N-gram label space) were filtered using different confidence thresholds. Number of remaining images for each threshold: 0.1: 5B, 0.3: 2B, 0.5: 782M, 0.7: 224M. OWLv2-B/16 detectors were self-trained on the filtered pseudo-annotations and fine-tuned on LVISbase. Each point represents a different fine-tuning run. "Examples seen" refers to the number of images after creating mosaics; the total number of raw images seen is 13.2× that number (Section 3.2).

4.2 Main Result

We compare our best models to the literature in Table 1. We broadly include state-of-the-art open-vocabulary detectors in the comparison. Our self-training approach, using only machine-generated pseudo-annotation queries, improves over previous methods even without fine-tuning (Table 1, OWL-ST, rows 11–13). Our OWL-ST B/16 model (row 11) achieves 29.6% LVIS mAPrare, 9 points more than the equivalent OWL-ViT model (row 2). Our largest model, G/14 (row 13), reaches 37.5% mAPrare, 4.7 points better than the next-best model from the literature (F-VLM R50x64, row 8).

Interestingly, after self-training, our models perform better on LVIS mAPrare than mAPall (which includes frequent and common classes). We speculate that this may be because weak Web-data supervision may be better for specific terms than general terms: image-text pairs involving unusual objects (such as LVISrare categories) may be more likely to be specifically about these objects, whereas common terms like "person" or "car" may occur often without being related to the image.

Fine-tuning on LVISbase provides additional significant improvements, even on mAPrare (OWL-ST+FT, rows 14–16). Our best model, which has only seen machine-generated queries during self-training, reaches 47.2% LVIS mAPrare after fine-tuning, a 14.4-point improvement over the next best model (F-VLM R50x64, row 8). Including a human-curated list of common object classes as pseudo-annotation queries can further improve the results on LVIS (rows 20–21), but this approach is not fully open-vocabulary since the model sees a curated label space, including the LVIS classes, at training time.
While the benefit of the curated label space is significant for our smallest model, it is minor on mAPrare for the larger L/14 model (compare rows 15 and 21).

To measure more general open-world performance, Table 1 also includes zero-shot results on ODinW13 [21], a suite of "in the wild" datasets. Performance on ODinW is best right after self-training and is reduced by fine-tuning on LVISbase. We discuss this further in Section 4.6. We also fine-tuned on COCO, where our B/16 and L/14 models reach 54.3% and 56.0% COCO mAP, respectively. OWLv2 therefore matches the performance of ViTDet with a Cascade Mask R-CNN head [23], despite using a simpler head architecture. Further results and examples are provided in Appendix A.9.

4.3 Pseudo-Annotation Label Space

Figure 2 takes a closer look at the impact of the pseudo-annotation label space on performance after fine-tuning. Performance on fine-tuned classes (mAPfrequent; left plot) is highest if the pseudo-annotation label space included these classes (blue circles). Therefore, if the target label space is known ahead of time, pseudo-labeling on that space leads to the best results.

However, performance on unseen classes (mAPrare) and "in the wild" datasets is much better if the pseudo-labeling included diverse queries that were machine-generated from the image-associated text (orange squares and green diamonds). A mixture of human-curated and machine-generated label spaces performs well in all settings, but does not significantly outperform the purely machine-generated label space on the "in the wild" datasets. These results suggest that a human-curated label space can help if the target label space is known, but that strong in-the-wild generalization is driven by the weakly supervised machine-generated label space. Our results also show that a simple N-grams approach is sufficient to leverage the weak supervision and outperforms more complex methods (Table 1).

[Figure 4 panels: LVIS APfrequent (fine-tuned classes), LVIS APrare (unseen classes), and ODinW13 mean AP ("in the wild" datasets), each plotted against detection training compute (exaFLOPs), for OWL-ST+FT B/16, OWL-ST+FT L/14, and the OWL L/14 annotator.]

Figure 4: Scaling of detection performance with model size and training compute. Models show classic scaling behavior [42]: performance increases monotonically with training compute, with larger models being necessary to benefit from larger amounts of compute/data. Models were self-trained on N-gram pseudo-annotations and fine-tuned on LVISbase.

4.4 Filtering of Pseudo-Annotations

Besides the label space, a second important decision in self-training is the filtering of pseudo-annotations. We filter based on the detection confidence score of the annotator and vary the score threshold in Figure 3. For confidence-based filtering, a bias-variance trade-off exists between including only high-confidence pseudo-annotations but inheriting the annotator's biases, or lowering the bias but increasing the noise by including lower-confidence pseudo-annotations. Many prior works err on the side of high bias and low variance, applying high confidence thresholds [35] or including only the single highest-confidence detection for an image [47, 1]. In our setting, we find that including all pseudo-annotations that pass a moderate threshold of 0.3 works well, while strict thresholds lead to poor results (Figure 3).
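A minimal sketch of the confidence-based filtering, using a hypothetical list-of-dicts representation of the pseudo-annotations; the two-threshold variant used for our main results is described in the next paragraph.

from typing import Dict, List

def filter_pseudo_annotations(
    boxes: List[Dict],           # each dict holds one pseudo-box, its best query, and a confidence score
    keep_threshold: float = 0.1,
    image_threshold: float = 0.3,
) -> List[Dict]:
    # Discard the image entirely unless at least one box clears the stricter threshold.
    if not any(b['score'] >= image_threshold for b in boxes):
        return []
    # Otherwise keep every box above the weaker threshold.
    return [b for b in boxes if b['score'] >= keep_threshold]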
As training continues for longer than what was possible for Figure 3, we suspect that lower thresholds may scale better. Therefore, for our main results, we chose to include all annotations above 0.1, but only kept images with at least one annotation above 0.3.

4.5 Scaling

The use of abundant Web image-text data with little filtering means that our self-training dataset is large (approximately 2B images). We can therefore study detection training scaling in the same regime as prior work on classification (Figure 4; models see each image at most once for these experiments). We make several noteworthy observations:

1. Self-training is beneficial already at moderate compute budgets, less than that of the annotator.
2. Models show similar scaling behavior for detection as for classification [42]: both overall performance and the size of the Pareto-optimal model increase with compute/data size.
3. As we move further out of distribution, the amount of compute at which L/14 overtakes B/16 increases. In other words, for in-the-wild performance, at most compute budgets, it may be better to train a smaller model for longer than a larger model for shorter.

These results suggest that self-training on Web data can be scaled further as an approach for improving open-vocabulary localization models without the need for additional human annotations.

The large dataset also makes it possible to scale model size. We trained a G/14 model, which has 5.2× the number of parameters and 4.3× the inference FLOPs of our L/14 model. To our knowledge, this is the largest open-vocabulary detection model to date. Since the G/14 model uses a different backbone than our other models (SigLIP [43] instead of CLIP [30]), we do not include it in Figure 4, but show in Table 1 that it is currently the best-performing model on zero-shot LVIS, with 47.2% mAPrare.

[Figure 5 panels: LVIS APfrequent (left) and LVIS APrare (right) plotted against "in the wild" performance (ODinW13 mean AP), for checkpoints from self-training only, after fine-tuning on LVISbase, weight ensembles, and the OWL L/14 annotator.]

Figure 5: Trade-off between fine-tuned and open-world performance. Self-training yields continued improvements on a suite of diverse datasets (ODinW13; x-axis), but performance on any given dataset (e.g. LVIS; y-axis) may saturate (red circles). Fine-tuning on a target dataset improves performance on that dataset, but reduces the open-world generalization ability in proportion to the fine-tuning duration (light blue squares; numbers indicate fine-tuning steps). This trade-off can be improved through weight-space ensembling (averaging) of the pretrained and fine-tuned checkpoints [37] (purple diamonds; numbers indicate the mixing coefficient for the fine-tuned weights). The plot shows B/16 models self-trained on N-gram pseudo-annotations and evaluated either directly after self-training or after fine-tuning on LVISbase. Ensembles were created between the longest-self-trained checkpoint and the weights obtained after fine-tuning that checkpoint for 20k steps. Note that there is significant variability in ODinW13 performance between checkpoints towards the end of self-training.
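The weight-space ensembling referenced in the Figure 5 caption amounts to a per-parameter average of the self-trained and fine-tuned checkpoints [37]. A minimal sketch, assuming both checkpoints are JAX pytrees with identical structure (not necessarily the exact implementation used here):

import jax

def weight_ensemble(params_selftrained, params_finetuned, alpha):
    # alpha is the mixing coefficient for the fine-tuned weights (0.3 to 0.9 in Figure 5).
    return jax.tree_util.tree_map(
        lambda p_st, p_ft: (1.0 - alpha) * p_st + alpha * p_ft,
        params_selftrained,
        params_finetuned,
    )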
4.6 Effect of Fine-Tuning on Open-Vocabulary Performance

For contrastively trained image-text models, fine-tuning improves performance on the target distribution but reduces the (originally very high) robustness to distribution shift [30, 28, 37]. We observe the same effect for detection, using ODinW13 AP as a proxy for out-of-distribution performance: compared to the performance after self-training (red dots in Figure 5), fine-tuning on LVISbase improves performance on the fine-tuned classes (LVIS mAPfrequent), but OOD performance (ODinW13 AP) is simultaneously reduced in proportion to the amount of fine-tuning (light blue line in Figure 5).

A simple approach to improve on this trade-off is to create an ensemble of the model before and after fine-tuning by averaging the model weights [37]. This approach comes at no additional training cost and improves the Pareto frontier for all ensemble mixing ratios (Figure 5, purple line). We also tried co-training on WebLI and LVIS but found it to perform worse than weight ensembling.

Notably, performance on LVISrare behaves similarly to LVISfrequent and improves during fine-tuning, even though no LVISrare classes are seen (Figure 5, right). This may be because LVISrare classes are semantically and visually close to LVISfrequent classes. For example, seeing many annotations for "bird" may improve performance on rare classes such as "heron", "mallard", or "puffin". LVIS mAPrare therefore only measures a narrow concept of open-vocabulary performance, and does not reveal the fact that fine-tuning significantly reduces generalization to broader distribution shifts. Benchmarks such as ODinW therefore provide significant additional insight.

5 Limitations

The main limitation of our method is the amount of compute and data needed for self-training. As we show in Section 4.5, performance improves consistently with training compute and data. This means that further improvements are possible, but also that they will come at increasingly large costs. In fact, cost likely increases faster than resources can realistically be grown in practice. New approaches will therefore eventually be necessary for further improvements.

A second important limitation of our method, which it shares with other open-vocabulary models [30, 28, 37], is the trade-off between fine-tuned and open-vocabulary performance addressed in Section 4.6. For out-of-distribution queries, predictions of fine-tuned models may be poorly calibrated and may depend on the precise wording of the query. These issues can be mitigated with weight ensembling [37], but more research is needed to fully understand the open-vocabulary robustness of these models.

6 Conclusion

In the past, open-vocabulary detection performance has been limited by the availability of human-annotated detection training data. Here, we show that self-training can be scaled up to overcome the dependency on human annotations. Our OWL-ST recipe delivers large improvements in detection performance using weak supervision from abundant Web data, similar to what has been seen for image classification and language modelling.

Acknowledgments and Disclosure of Funding

We would like to thank Xiao Wang for help with the WebLI dataset, Xiaohua Zhai and Lucas Beyer for providing the SigLIP model, and Rich Munoz and Alexey Dosovitskiy for insightful comments.

References

[1] Arandjelović, R., Andonian, A., Mensch, A., Hénaff, O.J., Alayrac, J.B., Zisserman, A.: Three ways to improve feature alignment for open vocabulary detection. arXiv preprint arXiv:2303.13518 (2023)
[2] Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster.
arXiv preprint arXiv:2210.09461 (2022)
[3] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M.J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., Zhang, Q.: JAX: composable transformations of Python+NumPy programs (2018), http://github.com/google/jax
[4] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A.V., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B.K., Ruiz, C.R., Steiner, A.P., Angelova, A., Zhai, X., Houlsby, N., Soricut, R.: PaLI: A jointly-scaled multilingual language-image model. ICLR (2023)
[5] Dave, A., Dollár, P., Ramanan, D., Kirillov, A., Girshick, R.: Evaluating large-vocabulary object detectors: The devil is in the details. arXiv preprint arXiv:2102.01066 (2021)
[6] Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A., Caron, M., Geirhos, R., Alabdulmohsin, I., Jenatton, R., Beyer, L., Tschannen, M., Arnab, A., Wang, X., Riquelme, C., Minderer, M., Puigcerver, J., Evci, U., Kumar, M., van Steenkiste, S., Elsayed, G.F., Mahendran, A., Yu, F., Oliver, A., Huot, F., Bastings, J., Collier, M.P., Gritsenko, A., Birodkar, V., Vasconcelos, C., Tay, Y., Mensink, T., Kolesnikov, A., Pavetić, F., Tran, D., Kipf, T., Lučić, M., Zhai, X., Keysers, D., Harmsen, J., Houlsby, N.: Scaling vision transformers to 22 billion parameters. ICML (2023)
[7] Dehghani, M., Gritsenko, A.A., Arnab, A., Minderer, M., Tay, Y.: SCENIC: A JAX library for computer vision research and beyond. arXiv preprint arXiv:2110.11403 (2021)
[8] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. CVPR (2021)
[9] Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
[10] Gupta, A., Dollár, P., Girshick, R.: LVIS: A dataset for large vocabulary instance segmentation. CVPR (2019)
[11] Heek, J., Levskaya, A., Oliver, A., Ritter, M., Rondepierre, B., Steiner, A., van Zee, M.: Flax: A neural network library and ecosystem for JAX (2020), http://github.com/google/flax
[12] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. ICML (2021)
[13] Jouppi, N.P., Yoon, D.H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., Patterson, D.: A domain-specific supercomputer for training deep neural networks. Communications of the ACM 63(7), 67–78 (2020)
[14] Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. ICCV (2021)
[15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[16] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
[17] Kolesnikov, A., Dosovitskiy, A., Weissenborn, D., Heigold, G., Uszkoreit, J., Beyer, L., Minderer, M., Dehghani, M., Houlsby, N., Gelly, S., Unterthiner, T., Zhai, X.: An image is worth 16x16 words: Transformers for image recognition at scale. ICLR (2021)
[18] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (2017)
[19] Kuo, W., Cui, Y., Gu, X., Piergiovanni, A., Angelova, A.: Open-vocabulary object detection upon frozen vision and language models. ICLR (2023)
[20] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The Open Images Dataset V4. International Journal of Computer Vision 128(7), 1956–1981 (2020)
[21] Li*, C., Liu*, H., Li, L.H., Zhang, P., Aneja, J., Yang, J., Jin, P., Lee, Y.J., Hu, H., Liu, Z., et al.: ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. NeurIPS (2022)
[22] Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021)
[23] Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527 (2022)
[24] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: Expediting vision transformers via token reorganizations. ICLR (2022)
[25] Marin, D., Chang, J.H.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers for image classification. CVPR (2023)
[26] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., Houlsby, N.: Simple open-vocabulary object detection with vision transformers. ECCV (2022)
[27] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T.: Model cards for model reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. pp. 220–229 (2019)
[28] Pham, H., Dai, Z., Ghiasi, G., Kawaguchi, K., Liu, H., Yu, A.W., Yu, J., Chen, Y.T., Luong, M.T., Wu, Y., Tan, M., Le, Q.V.: Combined scaling for zero-shot transfer learning (2021)
[29] Pham, H., Dai, Z., Ghiasi, G., Liu, H., Yu, A.W., Luong, M.T., Tan, M., Le, Q.V.: Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050 (2021)
[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. ICML (2021)
[31] Ramanathan, V., Wang, R., Mahajan, D.: DLWL: Improving detection for low-shot classes with weakly labelled data. CVPR (2020)
[32] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: DynamicViT: Efficient vision transformers with dynamic token sparsification. NeurIPS (2021)
[33] Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. ICCV (2019)
[34] Shazeer, N., Stern, M.: Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235 (2018)
[35] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757 (2020)
[36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. NeurIPS (2017)
[37] Wortsman, M., Ilharco, G., Kim, J.W., Li, M., Kornblith, S., Roelofs, R., Gontijo-Lopes, R., Hajishirzi, H., Farhadi, A., Namkoong, H., Schmidt, L.: Robust fine-tuning of zero-shot models. CVPR (2022)
[38] Yao, L., Han, J., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, H.: DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment. arXiv preprint arXiv:2304.04514 (2023)
[39] Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., Xu, H.: DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. NeurIPS (2022)
[40] Yao, Z., Ai, J., Li, B., Zhang, C.: Efficient DETR: Improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318 (2021)
[41] Yin, H., Vahdat, A., Alvarez, J., Mallya, A., Kautz, J., Molchanov, P.: AdaViT: Adaptive tokens for efficient vision transformer. CVPR (2022)
[42] Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. CVPR (2022)
[43] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343 (2023)
[44] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: LiT: Zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991 (2021)
[45] Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: GLIPv2: Unifying localization and vision-language understanding. NeurIPS (2022)
[46] Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., et al.: RegionCLIP: Region-based language-image pretraining. arXiv preprint arXiv:2112.09106 (2021)
[47] Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. arXiv preprint arXiv:2201.02605 (2021)
[48] Zoph, B., Ghiasi, G., Lin, T.Y., Cui, Y., Liu, H., Cubuk, E.D., Le, Q.V.: Rethinking pre-training and self-training. NeurIPS (2020)

Appendix

The Appendix provides a Model Card [27] for OWLv2 as well as additional methodological details, hyperparameters, and results. At the end of the Appendix, we provide qualitative examples of the self-training data and model predictions. The Appendix is structured as follows:

A.1 Model Card
A.2 Human-Curated Label Space
A.3 Machine-Generated Label Space
A.4 Combined Label Space
A.5 Augmentations for Self-Training
A.6 Token Dropping
A.7 Further Efficiency Improvements
A.8 Model Hyperparameters
A.9 Additional Results
A.9.1 Fixed Average Precision
A.9.2 Per-Dataset ODinW Results
A.9.3 Fine-Tuning Robustness Trade-Off for OWLv2 L/14
A.10 Qualitative Examples

A.1 Model Card

Model Summary

Model Architecture: OWLv2 is an open-vocabulary object detector based on OWL-ViT [26]. It consists of an image encoder with a Vision Transformer [17] architecture, a text encoder with a similar Transformer architecture, and heads that predict bounding boxes and label scores from provided images and text queries.
Input(s): An image and a list of free-text object descriptions (queries).
Output(s): A list of bounding boxes and a score for each box/query pair.
Application: The model is intended for open-vocabulary object detection.
Known Caveats: (1) Confidence scores of predictions are not intended to be compared across text queries. While the training loss encourages cross-query calibration for seen queries, scores for unseen queries are not calibrated. Further, the mean Average Precision (mAP) metric does not measure cross-query calibration, so higher mAP does not imply better cross-query calibration. Also see Section 5. (2) Fine-tuning the model creates a trade-off between the performance on fine-tuned texts and unseen texts. See Section 4.6 for details.

System Type

System Description: This is a standalone model.
Upstream Dependencies: None.
Downstream Dependencies: None.

Implementation Frameworks

Hardware & Software: Hardware: TPU [13] v2 or v3 (for B- and L-sized models) or v4 (for G-sized models). Software: JAX [3], Flax [11], Scenic [7].
Compute Requirements: Reported in Section 4.5.

Model Characteristics

Model Initialization: The model is initialized from pre-trained CLIP [30] or SigLIP [43] checkpoints.
Model Status: This is a static model trained on an offline dataset.
Model Stats: The largest OWLv2 model has 2.3B parameters, of which 2B are used for the image encoder and 300M for the text encoder (the heads have a negligible number of parameters). We also trained models with 430M and 150M parameters.

Data Overview

Training Dataset: The model is self-trained on bounding boxes predicted by the original OWL-ViT L/14 model [26] on the WebLI dataset [4]. Details on the annotation procedure are provided in Section 3.1.
Evaluation & Fine-tuning Dataset: Open-vocabulary object detection performance is evaluated using the LVIS [10] and ODinW13 [21] datasets. As indicated in Table 1, some models are fine-tuned on the base annotations of LVIS, i.e. only annotations for frequent and common object categories as defined in the official annotations [10]. None of our models have seen any human annotations for LVIS rare categories, such that LVIS mAPrare measures zero-shot performance.

Evaluation Results

Evaluation Results: Reported in Table 1.

Model Usage & Limitations

Sensitive Use: The model detects objects matching free-text descriptions. This capability should not be used for unethical use cases such as surveillance.
Known Limitations: Reported in Section 5.
Ethical Considerations: Reported in Section 5.

A.2 Human-Curated Label Space

The human-curated label space was obtained by merging common dataset class lists with the Python code below.

# Dataset class names, as available e.g. from TensorFlow Datasets.
# For Visual Genome, we used the 1600 most common label strings.
LVIS_CLASS_NAMES = [...]
OBJECTS365_CLASS_NAMES = [...]
OPEN_IMAGES_V4_BOXABLE_CLASS_NAMES = [...]
VISUAL_GENOME_CLASS_NAMES = [...]
queries = (
    LVIS_CLASS_NAMES
    + OBJECTS365_CLASS_NAMES
    + OPEN_IMAGES_V4_BOXABLE_CLASS_NAMES
    + VISUAL_GENOME_CLASS_NAMES
)

# Remove duplicates:
queries = set([q.lower() for q in queries])

# Remove plural forms:
remove = set()
for singular in queries:
    plurals = [singular + 's', singular + 'es']
    for plural in plurals:
        if plural in queries:
            remove.add(plural)

# Same queries for all images:
queries = list(queries.difference(remove))

A.3 Machine-Generated Label Space

The machine-generated label space was obtained from the image-associated text, for each image separately, using the Python code below. Figure A3 shows example pseudo-annotations using the N-gram label space.

from typing import Iterable, List

import nltk

# Stopwords from nltk.corpus.stopwords.words('english'):
STOPWORDS_EN = frozenset({
    'a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an',
    'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being',
    'below', 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does',
    'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further',
    'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself',
    'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its',
    'itself', 'just', 'me', 'more', 'most', 'my', 'myself', 'no', 'nor', 'not',
    'now', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours',
    'ourselves', 'out', 'over', 'own', 's', 'same', 'she', 'should', 'so',
    'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them',
    'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through',
    'to', 'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what',
    'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with',
    'you', 'your', 'yours', 'yourself', 'yourselves'
})

# These words were found by manually going through the most common 1000 words
# in a sample of alt-texts and selecting generic words without specific meaning:
COMMON_GENERIC_WORDS = frozenset({
    'alibaba', 'aliexpress', 'amazon', 'available', 'background', 'blog', 'buy',
    'co', 'com', 'description', 'diy', 'download', 'facebook', 'free', 'gif',
    'hd', 'ideas', 'illustration', 'illustrations', 'image', 'images', 'img',
    'instagram', 'jpg', 'online', 'org', 'original', 'page', 'pdf', 'photo',
    'photography', 'photos', 'picclick', 'picture', 'pictures', 'png', 'porn',
    'premium', 'resolution', 'royalty', 'sale', 'sex', 'shutterstock', 'stock',
    'svg', 'thumbnail', 'tumblr', 'tumgir', 'twitter', 'uk', 'uploaded', 'vector',
    'vectors', 'video', 'videos', 'wallpaper', 'wallpapers', 'wholesale', 'www',
    'xxx', 'youtube'
})


def _is_all_stopwords(ngram: Iterable[str]) -> bool:
    return set(ngram).issubset(STOPWORDS_EN)


def _get_ngrams(
    caption: str, max_num_queries: int, max_ngram_len: int
) -> List[str]:
    """Returns image caption ngrams as queries."""

    # Make lower-case:
    caption = caption.lower()

    # Remove common generic words:
    words = [w for w in caption.split() if w not in COMMON_GENERIC_WORDS]

    queries = []
    for ngram in nltk.everygrams(words, max_len=max_ngram_len):
        # Don't use ngram if it only consists of stop words:
        if _is_all_stopwords(ngram):
            continue
        queries.append(' '.join(ngram))
        if len(queries) == max_num_queries:
            break

    return queries


# Example command to get queries for one image:
queries = _get_ngrams(caption, max_num_queries=300, max_ngram_len=10)
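For illustration, a hypothetical call to _get_ngrams on a made-up alt-text (not from WebLI), showing the effect of the generic-word and stop-word filtering:

example_caption = 'Stock photo of a golden retriever puppy free download'
example_queries = _get_ngrams(example_caption, max_num_queries=300, max_ngram_len=10)
# 'stock', 'photo', 'free', and 'download' are removed as generic words before N-grams
# are built, and N-grams consisting only of stop-words (e.g. 'of a') are skipped, so
# example_queries contains e.g. 'golden', 'golden retriever', 'golden retriever puppy',
# 'of a golden retriever puppy', ... but not 'stock photo' or 'of a'.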
A.4 Combined Label Space

When merging pseudo-annotations obtained with human-curated and machine-generated queries, it is important to consider that human-curated queries tend to be closer to the training distribution of the annotator and therefore tend to have higher scores than pseudo-annotations based on machine-generated queries. Simply merging annotations from the two label spaces and filtering them with the same confidence threshold would therefore retain primarily annotations based on human-curated queries. To achieve a more even balance when using the combined label space ("N-grm+curated" in Table 1), we therefore re-scaled the scores of pseudo-annotations obtained with the human-curated queries by a factor of 0.3 before applying the same confidence threshold to all (human-curated and machine-generated) annotations.

A.5 Augmentations for Self-Training

Since Web-scale image-text data differs in important aspects from human-curated detection datasets, we depart from the augmentation strategy of [26] in several ways. As described in Section 3.2, since Web images tend to be smaller and show fewer objects than e.g. LVIS images, we use stronger image mosaics with up to 6×6 tiles (Figure A1). For the same reason, we additionally randomly resize each raw image such that its width is between 0.5 and 1.0 times the width of the full mosaic tile, padding on the bottom and right to preserve the aspect ratio (Figure A4). On the other hand, given the large size of our dataset, some other augmentations can be avoided: we do not use left/right flipping or random cropping during self-training. We also do not add random prompt templates to the pseudo-labels during self-training. During fine-tuning, we use the same augmentations as [26].

[Figure A1 panels: LVIS APall, APfrequent, APrare, APlarge, APmedium, and APsmall as a function of the maximum mosaic size (3×3, 6×6, 8×8, 12×12).]

Figure A1: Sweep over mosaic sizes. OWL-ViT B/16 models were trained on pseudo-box annotations (N-gram label space) for 100,000 steps with different mosaic sizes. At a given max. mosaic size, the model is trained on equal proportions of mosaics up to that size. For example, for max. size = 12×12, the model receives images with 1, 2², 3², 4², 6², 8², or 12² tiles, respectively (only sizes with prime factors 1, 2, and 3 are supported). For this figure, the model input resolution was 768×768. Mosaic sizes up to 12×12 improve overall performance (mAPall) and especially rare- and small-object performance. The benefit may be due to seeing smaller objects on average, or due to seeing more WebLI images per training step (a 12×12 mosaic contains 144 WebLI images).

A.6 Token Dropping

To improve training efficiency, we drop image patches based on their pixel variance (Section 3.2). Table A2 shows how the performance of a standard OWL-ViT model varies for different amounts of token dropping. Dropping up to 50% of tokens is within one standard deviation of the full performance. We therefore drop 50% of tokens during all of our experiments.

Table A2: Performance of standard OWL-ViT (L/14), trained on Objects365 and Visual Genome as in [26], for different token drop rates. For drop rate 0.0, the standard deviation over three runs is given.
Table A2: Performance of standard OWL-ViT (L/14), trained on Objects365 and Visual Genome as in [26], for different token drop rates. For drop rate 0.00, the standard deviation over three runs is given.

    Token drop rate     0.00          0.25   0.33   0.50   0.70
    LVIS APval all      33.3 ± 0.33   33.1   33.6   32.9   30.4
    LVIS APval rare     31.8 ± 1.16   31.0   32.6   30.8   28.2

To inject some stochasticity into the patch selection, we add a small amount of noise to the image before computing patch variance (uniformly distributed between 0.0 and 0.01 for images in the range [0.0, 1.0]). Figure A4 shows an example training image before and after token dropping.

Table A3: Hyperparameters of the models shown in Table 1. Only parameters that vary between models are shown; constant parameters are described in the text (Appendix A.8). For dropout rate and droplayer rate, the first number indicates the value used for the image encoder and the second for the text encoder. Examples seen includes both self-training and fine-tuning.

    #   Method      Backbone     Image size  Learning rate  Dropout  Droplayer  Instance top-k  Batch size (ST)  Batch size (FT)  Examples seen
    Open vocabulary:
    11  OWL-ST      CLIP B/16    960         5 × 10⁻⁵       .0/.0    .2/.1      256             256              –                3.7 × 10⁸
    12  OWL-ST      CLIP L/14    1008        2 × 10⁻⁵       .0/.0    .2/.1      512             256              –                2.3 × 10⁸
    13  OWL-ST      SigLIP G/14  1008        2 × 10⁻⁵       .0/.1    .2/.4      512             128              –                1.6 × 10⁸
    14  OWL-ST+FT   CLIP B/16    960         5 × 10⁻⁵       .0/.0    .2/.1      256             256              256              3.6 × 10⁸
    15  OWL-ST+FT   CLIP L/14    1008        2 × 10⁻⁵       .0/.0    .2/.1      512             256              128              2.3 × 10⁸
    16  OWL-ST+FT   SigLIP G/14  1008        2 × 10⁻⁵       .0/.1    .2/.4      512             128              128              1.6 × 10⁸
    Human-curated vocabulary:
    20  OWL-ST+FT   CLIP B/16    960         5 × 10⁻⁵       .0/.0    .2/.1      256             256              256              8.2 × 10⁸
    21  OWL-ST+FT   CLIP L/14    1008        2 × 10⁻⁵       .0/.0    .2/.1      512             256              128              3.6 × 10⁸

A.7 Further Efficiency Improvements

To further improve training efficiency beyond the methods described in Section 3.2, we also adopt previously proposed methods for large-scale Transformer training: To save memory, we use a variant [42] of the Adafactor optimizer [34] instead of Adam [15]. To avoid having to choose and optimize the total training duration ahead of time, we use the open-ended inverse square-root schedule [36, 42] with a fixed time-scale of 10 000 steps for all experiments and linearly cool down checkpoints along the way for evaluation (see Section 3.3); an illustrative sketch of this schedule is given at the end of Appendix A.8.

A.8 Model Hyperparameters

We use the following hyperparameters for all of our models. Hyperparameters that vary between models are listed in Table A3.

- Optimizer: Adafactor variant as in [42]
- Learning rate schedule: inverse square-root [36] with timescale 10 000 steps
- Learning rate for the text encoder: 2 × 10⁻⁶
- Token dropping rate during training: 0.5
- Pseudo-annotation confidence score threshold: 0.3 (except for Figure 3)
- Augmentations: see Appendix A.5

All remaining hyperparameters are as in [26].

Hyperparameter selection. Most hyperparameters were either taken directly from [26] or technically constrained, e.g. we chose the largest batch size that fit into the memory of the available accelerators. Where hyperparameters were tuned, we ran short B/16-scale trial experiments and selected the parameters with the highest LVIS mAPrare for our main runs.

SigLIP G/14. For the G/14 model, we started self-training with a learning rate of 5 × 10⁻⁵, a droplayer rate of .1/.0, and no dropout. We found that the model overfit during fine-tuning with these settings, and switched to a learning rate of 2 × 10⁻⁵, a droplayer rate of .2/.4, and a dropout rate of .0/.1 after 740 000 self-training steps. To save resources, we did not restart training from the beginning. With the new settings, we observed no overfitting during fine-tuning, but it is possible that these settings are still not optimal.
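The schedule referenced in Appendix A.7 admits several parameterizations; the following is a minimal sketch of one common form, assuming a rate that is held at its base value for the first timescale steps, decays as the inverse square root thereafter, and is linearly cooled down to zero to produce evaluation checkpoints. The cooldown length and the absence of warmup are illustrative assumptions, not taken from the paper; only the 10 000-step timescale is stated in the text.

import math
from typing import Optional


def inverse_sqrt_lr(
    step: int,
    base_lr: float,
    timescale: int = 10_000,
    cooldown_start: Optional[int] = None,
    cooldown_steps: int = 50_000,
) -> float:
    """Illustrative inverse square-root schedule with a fixed timescale and an
    optional linear cooldown to zero for evaluation checkpoints."""
    # Constant at base_lr for the first `timescale` steps, then ~1/sqrt(step):
    lr = base_lr / math.sqrt(max(step, timescale) / timescale)
    if cooldown_start is not None and step >= cooldown_start:
        # Linearly decay to zero over `cooldown_steps` steps:
        remaining = 1.0 - (step - cooldown_start) / cooldown_steps
        lr *= max(remaining, 0.0)
    return lr

For example, inverse_sqrt_lr(step, base_lr=2e-5) uses the peak learning rate of the L/14 models in Table A3; passing cooldown_start yields the cooled-down checkpoints evaluated along the way.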
Table A4: Open-vocabulary detection results on LVIS using the fixed AP metric [5]. Fixed AP is implemented as proposed in [5] by evaluating AP on the top 10 000 predictions per class over the entire validation set. Cells for our CLIP-backbone models give old / fixed AP; rows with a single value per column report whichever AP version is available.

    #   Method            Backbone     APmini all    APmini rare   APval all     APval rare
                                        old / fixed   old / fixed   old / fixed   old / fixed
    Open vocabulary:
    1   RegionCLIP [46]   R50x4        –             –             32.3          22.0
    2   OWL [26]          CLIP B/16    –             –             27.2          20.6
    3   OWL [26]          CLIP L/14    –             –             34.6          31.2
    4   GLIPv2 [45]       Swin-T       29.0          –             –             –
    5   GLIPv2 [45]       Swin-B       48.5          –             –             –
    6   GLIPv2 [45]       Swin-H       50.1          –             –             –
    7   F-VLM [19]        R50x4        –             –             28.5          26.3
    8   F-VLM [19]        R50x64       –             –             34.9          32.8
    9   3Ways [1]         NFNet-F0     –             –             35.7          25.6
    10  3Ways [1]         NFNet-F6     –             –             44.6          30.1
    11  OWL-ST            CLIP B/16    31.8 / 34.4   35.4 / 38.3   27.0 / 28.6   29.6 / 30.3
    12  OWL-ST            CLIP L/14    38.1 / 40.9   39.0 / 41.5   33.5 / 35.2   34.9 / 36.2
    13  OWL-ST            SigLIP G/14  37.8          40.9          33.7          37.5
    14  OWL-ST+FT         CLIP B/16    47.2 / 48.7   37.8 / 42.1   41.8 / 43.2   36.2 / 39.0
    15  OWL-ST+FT         CLIP L/14    54.1 / 56.2   46.1 / 52.3   49.4 / 51.1   44.6 / 47.4
    16  OWL-ST+FT         SigLIP G/14  51.3          50.9          47.0          47.2
    Human-curated vocabulary:
    17  Detic [47]        R50          –             –             32.4          24.6
    18  DetCLIPv2 [38]    Swin-T       40.4          36.0          32.8          31.0
    19  DetCLIPv2 [38]    Swin-L       44.7          43.1          36.6          33.3
    20  OWL-ST+FT         CLIP B/16    51.1 / 52.3   41.9 / 46.5   45.6 / 46.7   40.5 / 42.5
    21  OWL-ST+FT         CLIP L/14    55.8 / 57.2   50.0 / 54.5   50.4 / 52.0   45.9 / 48.5

A.9 Additional Results

A.9.1 Fixed Average Precision

In the standard Average Precision metric (APold), performance on one class depends on the performance on other classes. This dependence makes the metric gameable by re-scaling the scores of certain classes [5]. To avoid this issue, some prior work reports a fixed version of AP proposed in [5]. In Table 1, we report APold for our models. For models from the literature, we report whichever AP version is available. Since APfixed tends to produce higher values than APold, Table 1 tends to underestimate the advantage of our method over prior work that uses APfixed. We provide APfixed for all of our models in Table A4. As proposed in [5], we implement APfixed by evaluating AP on the top 10 000 predictions per class over the entire validation set. This ensures that classes do not compete with each other for inclusion in the evaluated predictions.

A.9.2 Per-Dataset ODinW Results

Table A5 shows un-aggregated results on all 35 ODinW datasets for our main models. In addition, in the last row, we provide results for a weight-space ensemble of a self-trained and fine-tuned OWLv2 L/14 model (the same model is shown in Figure A2).

A.9.3 Fine-Tuning Robustness Trade-Off for OWLv2 L/14

In Figure A2, we provide the same analysis of the robustness trade-off after fine-tuning for an L/14 model that we provided for a B/16 model in Figure 5.

A.10 Qualitative Examples

In Figures A5 to A7, we provide qualitative examples of detection predictions from OWLv2 L/14 models. In each figure, the top image shows predictions obtained directly after self-training, and the bottom image shows predictions after fine-tuning on LVISbase. Example images are from the LVIS validation set and the model was queried with all LVIS classes. All predictions meeting the confidence threshold specified in the caption are shown.

[Figure A2 shows two panels plotting LVIS APfrequent (left) and LVIS APrare (right) against "in the wild" performance (ODinW13 mean AP, %), with curves for self-training only, fine-tuning on LVISbase, the weight ensemble, and the OWL L/14 annotator.]

Figure A2: Trade-off between fine-tuned and open-world performance. Similar to Figure 5, but for OWLv2 L/14.
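The weight ensemble in Figure A2 follows [37]: the checkpoints obtained after self-training and after fine-tuning are linearly interpolated in weight space. A minimal sketch is given below, assuming checkpoints stored as flat dictionaries of NumPy arrays; the storage format is an illustrative assumption, not our actual checkpoint format.

from typing import Dict

import numpy as np


def weight_space_ensemble(
    params_selftrained: Dict[str, np.ndarray],
    params_finetuned: Dict[str, np.ndarray],
    finetuned_weight: float,
) -> Dict[str, np.ndarray]:
    """Linearly interpolates two checkpoints of the same architecture.

    finetuned_weight = 0.0 returns the self-trained checkpoint and
    finetuned_weight = 1.0 returns the fine-tuned checkpoint.
    """
    assert params_selftrained.keys() == params_finetuned.keys()
    return {
        name: (1.0 - finetuned_weight) * params_selftrained[name]
        + finetuned_weight * params_finetuned[name]
        for name in params_selftrained
    }

The ensemble row reported in Table A5 below corresponds to a fine-tuned weight of 0.4 for the L/14 model.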
Table A5: Zero-shot AP of the models in Table 1 on all 35 ODinW datasets [21]. The subset of 13 datasets defined in [22] and used in the main paper is shown in bold in the original table. The last row (OWL-ST/FT ens) shows the weight-space ensemble [37] of the checkpoints after self-training and after fine-tuning of the model in row 21 (the weight of the fine-tuned checkpoint in the ensemble is 0.4; see also Figure A2). This is our best model by ODinW13 performance. For each model we list Mean (13 datasets), Mean (35 datasets), Median (35 datasets), followed by the 35 per-dataset AP values in the table's column order. The dataset columns include Aerial Maritime Drone Large, Aerial Maritime Drone Tiled, American Sign Language, Boggle Boards, Brackish Underwater, Chess Pieces, Cottontail Rabbits, Dice Medium Color, Drone Control, Ego Hands Generic, Ego Hands Specific, Hard Hat Workers, Mask Wearing, Mountain Dew Commercial, North America Mushrooms, Open Poetry Vision, Oxford Pets By Breed, Oxford Pets By Species, Selfdriving Car, Shellfish Open Images, Thermal Cheetah, Thermal Dogs And People, Uno Cards Raw, Vehicles Open Images, Website Screenshots, and Wildfire Smoke.

Open vocabulary:
11 OWL-ST, CLIP B/16: Mean(13) 48.8, Mean(35) 22.1, Median(35) 11.6; per-dataset AP: 11.6, 19.4, 1.1, 33.2, 11.6, 0.3, 4.8, 4.1, 85.5, 0.1, 2.7, 46.9, 5.5, 2.0, 0.4, 22.0, 33.9, 0.4, 2.7, 3.4, 75.9, 52.7, 60.1, 0.1, 4.8, 19.2, 66.6, 5.5, 40.1, 19.1, 51.1, 1.0, 57.2, 1.8, 25.4
12 OWL-ST, CLIP L/14: Mean(13) 53.0, Mean(35) 24.4, Median(35) 16.2; per-dataset AP: 19.9, 21.2, 1.1, 32.3, 16.2, 0.2, 5.9, 7.8, 84.9, 0.1, 4.7, 47.1, 3.5, 1.9, 0.5, 27.3, 76.6, 0.6, 3.1, 2.7, 70.9, 53.9, 62.6, 0.0, 4.4, 27.5, 63.8, 4.9, 35.0, 25.5, 55.6, 1.1, 58.5, 1.8, 31.1
13 OWL-ST, SigLIP G/14: Mean(13) 49.9, Mean(35) 22.9, Median(35) 17.5; per-dataset AP: 22.0, 17.5, 2.0, 36.7, 21.4, 0.2, 3.3, 5.6, 88.1, 0.1, 4.9, 37.8, 4.3, 1.4, 0.2, 22.6, 42.4, 0.5, 3.0, 3.2, 62.8, 53.4, 58.4, 0.1, 6.5, 25.7, 63.9, 5.8, 42.5, 25.0, 56.6, 1.2, 58.1, 2.0, 23.4
14 OWL-ST+FT, CLIP B/16: Mean(13) 48.6, Mean(35) 20.8, Median(35) 6.0; per-dataset AP: 13.7, 16.6, 0.2, 35.8, 3.9, 0.1, 4.2, 3.1, 85.5, 0.1, 0.9, 50.7, 1.3, 2.7, 0.5, 16.0, 37.4, 0.2, 1.9, 2.1, 71.3, 57.4, 59.4, 0.2, 2.7, 7.6, 61.7, 6.0, 42.5, 15.3, 45.6, 1.3, 62.8, 1.2, 15.8
15 OWL-ST+FT, CLIP L/14: Mean(13) 50.1, Mean(35) 22.3, Median(35) 6.3; per-dataset AP: 20.6, 16.3, 0.2, 37.4, 4.0, 0.1, 5.1, 5.6, 83.4, 0.1, 4.8, 58.5, 2.2, 2.1, 0.6, 28.5, 42.2, 0.3, 2.5, 1.9, 65.5, 58.9, 63.7, 0.2, 1.5, 9.1, 57.2, 6.3, 43.0, 24.7, 47.7, 1.3, 64.3, 1.8, 20.3
16 OWL-ST+FT, SigLIP G/14: Mean(13) 50.1, Mean(35) 22.5, Median(35) 9.5; per-dataset AP: 21.3, 16.5, 0.3, 39.8, 9.5, 0.3, 5.6, 5.8, 82.5, 0.0, 3.6, 50.9, 0.5, 1.7, 0.2, 25.5, 44.9, 0.2, 2.8, 2.3, 68.1, 56.4, 58.5, 0.7, 5.3, 17.4, 58.3, 6.1, 42.7, 23.6, 47.9, 1.9, 61.9, 1.9, 23.9

Human-curated vocabulary:
20 OWL-ST+FT, CLIP B/16: Mean(13) 48.9, Mean(35) 21.7, Median(35) 6.8; per-dataset AP: 16.7, 17.2, 0.3, 35.3, 4.5, 0.1, 4.6, 4.4, 85.1, 0.1, 2.4, 51.8, 0.9, 2.9, 0.4, 27.3, 36.9, 0.3, 2.1, 2.5, 71.3, 59.0, 61.3, 0.4, 2.7, 9.6, 58.7, 6.8, 42.0, 20.0, 45.7, 1.2, 62.6, 1.5, 20.6
21 OWL-ST+FT, CLIP L/14: Mean(13) 48.7, Mean(35) 21.9, Median(35) 7.0; per-dataset AP: 18.8, 17.5, 0.2, 36.4, 5.3, 0.1, 5.4, 5.7, 85.1, 0.1, 4.9, 53.9, 2.5, 2.2, 0.3, 28.8, 41.2, 0.3, 2.4, 2.1, 61.1, 59.2, 65.7, 0.1, 1.8, 9.5, 57.9, 7.0, 44.0, 23.8, 36.8, 0.9, 63.2, 1.6, 20.7
OWL-ST/FT ens, CLIP L/14: Mean(13) 56.3, Mean(35) 25.6, Median(35) 10.6; per-dataset AP: 21.7, 20.0, 1.0, 39.1, 10.6, 0.2, 7.6, 7.0, 87.0, 0.0, 6.1, 53.1, 3.2, 2.1, 0.3, 31.3, 80.6, 0.4, 3.1, 2.9, 66.3, 61.8, 66.2, 0.1, 4.0, 26.0, 65.4, 6.2, 45.1, 24.1, 56.7, 1.1, 63.3, 1.9, 30.9
Figure A3: Example pseudo-annotations on WebLI [4]. Image-associated text (from the HTML alt_text tag) is shown above the images. If the text is not in English, an automatically generated translation is used. N-grams are extracted from these texts to generate queries for the annotator model. Pseudo-annotations were filtered as for our main experiments: To be included, boxes must have a score of at least 0.1, and images must have at least one box with a score above 0.3. All images from Wikimedia Commons.

Figure A4: Training inputs after pre-processing. Top: A 4×4 mosaic of randomly resized and padded images as used for self-training. Bottom: The same mosaic after dropping the 50% of patches with the lowest pixel variance (image size: 1008×1008; patch size: 14×14). Most dropped patches belong to padding areas or uniform image backgrounds. All images from Wikimedia Commons.
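For completeness, the pseudo-annotation filtering rule quoted in Figure A3's caption can be sketched as follows; the per-box dictionary format is an illustrative assumption, not the pipeline's actual data format.

from typing import Dict, List


def filter_pseudo_annotations(
    boxes: List[Dict],
    box_threshold: float = 0.1,
    image_threshold: float = 0.3,
) -> List[Dict]:
    """Keeps boxes scoring at least `box_threshold`, and keeps the image only
    if at least one box scores above `image_threshold`.

    Each box is assumed to be a dict with a 'score' key, e.g.
    {'bbox': [x0, y0, x1, y1], 'query': 'wooden chair', 'score': 0.42}.
    """
    if not any(b['score'] > image_threshold for b in boxes):
        return []  # Drop the whole image.
    return [b for b in boxes if b['score'] >= box_threshold]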
Figure A5: Qualitative example for OWLv2 L/14 from the LVIS val set. For the visualization, all LVIS classes were used as prompts. LVISrare classes are labeled in black. Top: OWL-ST self-trained on N-grams, not fine-tuned (Table 1 row 12). Bottom: OWL-ST+FT self-trained on N-grams and fine-tuned on LVISbase (Table 1 row 15). Boxes above score 0.08 (top) or 0.3 (bottom) are shown.
Figure A6: Qualitative example for OWLv2 L/14 from the LVIS val set. For the visualization, all LVIS classes were used as prompts. LVISrare classes are labeled in black. Top: OWL-ST self-trained on N-grams, not fine-tuned (Table 1 row 12). Bottom: OWL-ST+FT self-trained on N-grams and fine-tuned on LVISbase (Table 1 row 15). Boxes above score 0.08 (top) or 0.3 (bottom) are shown.

Figure A7: Qualitative example for OWLv2 L/14 from the LVIS val set. For the visualization, all LVIS classes were used as prompts. LVISrare classes are labeled in black. Top: OWL-ST self-trained on N-grams, not fine-tuned (Table 1 row 12). Bottom: OWL-ST+FT self-trained on N-grams and fine-tuned on LVISbase (Table 1 row 15). Boxes above score 0.08 (top) or 0.3 (bottom) are shown.