# Segment Everything Everywhere All at Once

Xueyan Zou², Jianwei Yang¹, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee
University of Wisconsin-Madison; Microsoft Research, Redmond; HKUST; Microsoft Cloud & AI
Equal Contribution; Equal Advisory Contribution; 1. Project Lead; 2. Main Technical Contribution
{xueyan,yongjaelee}@cs.wisc.edu, {jianwyan,jfgao,linjli}@microsoft.com, {hzhangcx,fliay}@connect.ust.hk
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: SEEM supports generic segmentation tasks including semantic, instance, and panoptic segmentation in an open-set fashion when no prompt is provided. SEEM also enables the use of visual, textual, and referring region prompts in flexible combinations, making it a promptable and interactive segmentation interface.

In this work, we present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image, as shown in Fig. 1. In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs). More specifically, SEEM is designed with four desiderata: i) Versatility. We introduce a new visual prompt to unify different spatial queries including points, boxes, scribbles, and masks, which can further generalize to a different referring image; ii) Compositionality. We learn a joint visual-semantic space between text and visual prompts, which facilitates the dynamic composition of the two prompt types required for various segmentation tasks; iii) Interactivity. We further incorporate learnable memory prompts into the decoder to retain segmentation history through mask-guided cross-attention from the decoder to image features; and iv) Semantic-awareness. We use a text encoder to encode text queries and mask labels into the same semantic space for open-vocabulary segmentation. We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks. Notably, our single SEEM model achieves competitive performance on interactive segmentation, generic segmentation, referring segmentation, and video object segmentation across 9 datasets with a minimum of 1/100 supervision. Furthermore, SEEM showcases a remarkable capacity to generalize to novel prompts or their combinations, rendering it a readily usable universal image segmentation interface.

1 Introduction

Image segmentation is arguably the most important yet challenging problem in computer vision. In the past, we have witnessed significant progress in a wide range of segmentation tasks, including instance, semantic, and panoptic segmentation [1, 2, 3, 4, 5, 6, 7]. Most recently, we are observing a clear trend toward more flexible segmentation models in different aspects: 1) From closed-set to open-vocabulary segmentation. Many recent works propose to either leverage contrastive learning methods or pretrained multi-modal foundation models (e.g., CLIP [8]) to make segmentation models more transferable to unseen concepts [9, 10, 11, 12]; 2) From generic to referring segmentation.
In addition to generic segmentation, which segments an image exhaustively given a predetermined set of concepts, language-based referring segmentation provides a user-friendly way of segmenting a specific region referred to by an arbitrary text phrase [13, 14, 15, 16, 17]; and 3) From one-shot to interactive segmentation. In practice, segmentation models do not necessarily produce satisfactory masks in one round. As such, people are also studying how to progressively refine segmentation results through iterative interactions between humans and models [18, 19, 20, 21].

Despite these efforts to design more powerful and flexible segmentation models, we still lack a universal segmentation interface capable of accommodating the various types of human prompts and segmentation tasks studied in these individual works. In contrast, Large Language Models (LLMs) have already emerged as such a universal interaction interface for language tasks, from early models like GPT-3 [22] and T5 [23] to conversational agents [24] augmented by advanced prompting [25, 26, 27] and chain-of-thought reasoning [28, 29, 30].

In this work, we strive for a universal interface for segmenting everything everywhere all at once in an image. With this interface, we aim to unify all segmentation tasks in a single model in a promptable manner. To achieve this goal, we propose a new prompting scheme in the mask decoder that has four important properties: versatility, compositionality, interactivity, and semantic-awareness. Specifically, we propose to encode points, masks, text, boxes, and even a referred region from another image into prompts in the same joint visual-semantic space. As such, our model can deal with any combination of input prompts, leading to strong compositionality. To enable interactivity, we further introduce memory prompts that condense the previous segmentation information and then communicate with the other prompts. As for semantic-awareness, our model can assign an open-set semantic label to any output segmentation.

With the proposed prompting scheme, we build a segment-everything-everywhere model called SEEM, comprised of a simple Transformer encoder-decoder architecture [31, 6] with an extra text encoder [11, 32]. In SEEM, the decoding process emulates a generative LLM but with a multimodality-in-multimodality-out interface. An image encoder and a text encoder are used as prompt encoders to encode all types of queries, which are then fed into the decoder. Concretely, we encode all spatial queries, namely points, boxes, scribbles, and masks, into visual prompts by pooling their corresponding visual features from the image encoder, and use the text encoder to convert text queries into text prompts. By training on diverse segmentation tasks, our model learns to handle various prompts, align the visual and text prompts, and promote their synergy via cross-attention between them. As a result, our single model after pretraining attains competitive performance across all segmentation tasks. Since the prompts of all five types are mapped to the joint visual-semantic space, we can flexibly combine prompts to resolve ambiguity and obtain better segmentation results, and the model supports zero-shot adaptation to unseen user prompts. Furthermore, our model immediately generalizes to using an exemplar image segment as the prompt and to video object segmentation in a zero-shot fashion.
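To make this interface concrete, the following is a minimal, hypothetical sketch of the decoding flow described above; the module and argument names are ours, not the released code. It reflects the design in which the image encoder extracts image features, the text encoder maps text queries to text prompts, and a lightweight decoder consumes any combination of visual, textual, and memory prompts to produce mask and class embeddings.

```python
import torch
import torch.nn as nn

class PromptableSegmenter(nn.Module):
    """Hypothetical wrapper illustrating the prompt-in, mask-and-class-out interface."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # extracts image features Z
        self.text_encoder = text_encoder    # maps text queries to text prompts P_t
        self.decoder = decoder              # lightweight decoder over queries and prompts

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        # Extract image features Z; these can be reused across prompt combinations.
        return self.image_encoder(image)

    def segment(self, image_feats, visual_prompts=None, text=None, memory_prompts=None):
        # Decode mask and class embeddings from any combination of prompts.
        text_prompts = self.text_encoder(text) if text is not None else None
        mask_emb, class_emb, memory_prompts = self.decoder(
            image_feats, visual_prompts, text_prompts, memory_prompts)
        return mask_emb, class_emb, memory_prompts
```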
In addition to its strong generalization capability, SEEM is also more efficient for interactive segmentation than counterparts like SimpleClick [20]. Since we feed the prompts as inputs to the decoder, when doing multi-round interactions with humans, our model only needs to run the feature extractor once at the beginning and then runs only the lightweight decoder in each round.

To this end, we build a segmentation interface with a single pre-trained model that can segment every object with semantics (everything), cover every pixel in the image (everywhere), and support all possible compositions of prompts (all at once). In summary, our contributions are threefold:

- We design a new prompting scheme that can encode various user intents into prompts in a joint visual-semantic space, enabling strong flexibility for various segmentation tasks and generalization to unseen prompts or their combinations.
- We build SEEM, a universal and interactive segmentation interface that integrates the newly designed prompting mechanism into a lightweight decoder for all segmentation tasks, leading to a model possessing the properties of versatility, compositionality, interactivity, and semantic-awareness.
- We conduct extensive experiments and visualizations to show that our model has strong performance on many segmentation tasks, including open-vocabulary generic segmentation, interactive segmentation, referring segmentation, and segmentation tasks with combined prompts.

Figure 2: Overview of SEEM-Decoder. (a) SEEM encodes image, text, and human inputs into a joint visual-semantic space as queries, features, and prompts, and then decodes the queries into class and mask embeddings. (b) With the SEEM decoder, the machine loop memorizes history mask information, and the human loop provides new corrections for the next round.

2 Related Work

Interactive segmentation. Interactive segmentation is the task of segmenting objects by interactively taking user inputs. It is a longstanding problem on which considerable progress has been made [33, 34, 35, 20, 21, 36]. Generally, the interaction can take various forms, such as clicks, boxes, polygons, and scribbles, among which click-based interaction models are the most prevalent. Concurrent to our work, SAM [36] proposed a promptable segmentation model trained on 11 million images and 1.1 billion masks. It takes user interactions as prompts for general segmentation. Though SAM demonstrates strong zero-shot performance, it produces segmentations without semantic meaning. In addition, its prompt types are limited to points, boxes, and text, whereas our model can also take a referred region from another image as a prompt.

Generic segmentation. Segmentation of visual concepts has been a persistent challenge in computer vision, as evidenced by its extensive literature [37, 38, 39, 40]. Generic segmentation encompasses several subtasks, including instance segmentation, semantic segmentation, and panoptic segmentation [4, 2, 3], each focusing on a different semantic level. For example, semantic segmentation aims to identify and label each pixel within an image based on its corresponding semantic class [41, 6, 42], while instance segmentation groups pixels that belong to the same semantic class into separate object instances [4, 43, 7]. Recently, the Detection Transformer (DETR) [31], a model based on the Transformer [44] architecture, has enabled significant advances in segmentation tasks [45, 6, 7, 46, 47].
However, these approaches cannot recognize objects absent from the training set, which constrains the model to a limited vocabulary.

Unified vision models. Unified vision models [11, 48, 49, 36, 50] have recently drawn a lot of attention because of their flexibility and their ability to generalize to various tasks. These models can deal with multiple vision tasks or data distributions. Among them, some [11, 48, 49] train multiple tasks together with a single model and thus can handle all training tasks without finetuning on each target task. On the other hand, SAM [36] and SegGPT [50] propose training strategies that enable their models to handle new tasks and data distributions in a zero-shot manner. The second approach is more favorable since there is no need to resolve conflicts among tasks during training.

3.1 Model Design

SEEM employs a generic encoder-decoder architecture together with a sophisticated interaction scheme between queries and prompts, as shown in Fig. 2 (a). Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, an image encoder is first used to extract image features $Z$. Then, SEEM-Decoder predicts the masks $M$ and semantic concepts $C$ based on the query outputs $O^m_h$ (mask embeddings) and $O^c_h$ (class embeddings), which interact with the text, visual, and memory prompts $P_t$, $P_v$, $P_m$:

$$O^m_h, O^c_h = \text{Decoder}(Q_h; P_t, P_v, P_m \mid Z) \tag{1}$$
$$M = \text{MaskPredictor}(O^m_h) \tag{2}$$
$$C = \text{ConceptClassifier}(O^c_h) \tag{3}$$

where $Q_h$ denotes the learnable queries, and $P_t$, $P_v$, $P_m$ represent the text, visual, and memory prompts, respectively. During training, $Q_h$ is duplicated for generic, referring, and interactive segmentation, as shown in Fig. 3, and the corresponding prompts interact with their queries through self-attention. At inference time, the learnable queries can freely interact with all prompts, thereby enabling zero-shot composition.

Figure 3: Queries and prompt interaction during training and evaluation. (a) Learnable queries are duplicated as object, grounding, and visual queries with the same set of weights for each task. (b) Attention mask between any two kinds of tokens (denoted as qpm in Algorithm 1). "Tentative" means the interaction is not trained but can be used at inference without any modification.

Our design is inspired by the successful practice in X-Decoder [11]. However, we highlight the differences in Eq. (1), which allow for a universal model for image segmentation with the following properties.

Versatile. In SEEM, we introduce visual prompts $P_v$ to handle all non-textual inputs, such as points, boxes, scribbles, and a referred region from another image. These non-textual queries help disambiguate the user's intent when textual prompts alone fail to identify the correct segment. For interactive segmentation, previous works either convert spatial queries to masks and feed them into the image backbone [20] or use a different prompt encoder for each input type (points, boxes) [36]. The first approach can be too heavy in interactive applications, because each interaction requires the image to pass through the feature extractor again. The second approach is hard to generalize to unseen prompts. To address these limitations, we propose a visual sampler (Fig. 2 (a)) that converts all kinds of non-textual queries into visual prompts lying in the same visual embedding space:

$$P_v = \text{VisualSampler}(s, \hat{Z}) \tag{4}$$

where $\hat{Z}$ is the feature map extracted from either the target image (i.e., $\hat{Z} = Z$) or a referred image, and $s \in \{\text{points, box, scribbles, polygons}\}$ denotes the sampling locations specified by the user.
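To make Eq. (4) concrete, the following is a minimal sketch of one way such a visual sampler could work, assuming the user input (click, box, or scribble) has already been rasterized into a binary region mask: point locations are sampled inside the region and image features are gathered at those locations, so every non-textual prompt becomes a small set of feature vectors in the image-feature space. The fixed point budget, the grid_sample-based pooling, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def visual_sampler(feats: torch.Tensor, region_mask: torch.Tensor, max_points: int = 512):
    """
    feats:       [C, H, W] image feature map (Z, or the referred image's Z-hat)
    region_mask: [H, W] binary mask covering the clicked/boxed/scribbled region
    returns:     [N, C] visual prompts P_v, with N <= max_points
    """
    ys, xs = torch.nonzero(region_mask, as_tuple=True)   # pixel coordinates inside the region
    if ys.numel() == 0:
        return feats.new_zeros(0, feats.shape[0])
    # take a uniform subset of at most max_points locations
    idx = torch.linspace(0, ys.numel() - 1, steps=min(max_points, ys.numel())).long()
    ys, xs = ys[idx], xs[idx]
    H, W = region_mask.shape
    # normalize coordinates to [-1, 1] and bilinearly sample the feature map at those points
    grid = torch.stack([xs / (W - 1) * 2 - 1, ys / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2).to(feats.dtype)
    sampled = F.grid_sample(feats[None], grid, align_corners=True)  # [1, C, 1, N]
    return sampled[0, :, 0].transpose(0, 1)                         # [N, C]
```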
We first pool the corresponding region of the image feature map through point sampling [6]. For all visual prompts, we uniformly interpolate at most 512 point feature vectors from the region specified by the prompt. A notable merit of our proposed method is that the visual prompts are naturally well-aligned with the textual prompts, as our model continuously learns a common visual-semantic space through panoptic and referring segmentation.

Compositional. In practice, a user may express their intent using different or combined prompt types, so a compositional approach to prompting is essential for real-world applications. However, we confront two issues during model training. First, the training data usually covers only a single type of interaction (e.g., none, textual, visual). Second, although we use visual prompts to unify all non-textual prompts and align them with textual prompts, their embedding spaces remain inherently different. To mitigate this problem, we propose to match prompts of different types with different outputs. Considering that visual prompts $P_v$ come from image features while textual prompts $P_t$ come from the text encoder, we select matched output indices for visual and textual prompts by matching them with the mask embeddings $O^m_h$ or the class embeddings $O^c_h$, respectively:

$$ID_v \leftarrow \text{Match}(O^m_h \cdot P_v + \text{IoU}_{\text{mask}}) \tag{5}$$
$$ID_t \leftarrow \text{Match}(O^c_h \cdot P_t + \text{IoU}_{\text{mask}}) \tag{6}$$

where $\text{IoU}_{\text{mask}}$ is the IoU between the ground-truth and predicted masks. The proposed separate matching method outperforms approaches that match all prompts with only $O^m_h$ or only $O^c_h$. After training, our model becomes familiar with all prompt types and supports a variety of compositions, such as no prompt, one prompt type, or both visual and textual prompts, using the same model and weights. In particular, the visual and textual prompts can simply be concatenated and fed to SEEM-Decoder, even though it was never trained in this way.

Algorithm 1: Pseudocode for SEEM.

```python
# Inputs: Image (img) [B,3,H,W]; Pos_Mask (pm), Neg_Mask (nm) [B,1,H,W]; Text (txt) ["abc", ...]
# Variables: learnable queries (Qh); attention masks between queries and prompts (qpm)
# Functions: Img_Encoder(), Text_Encoder(), Visual_Sampler(), feature_attn(), prompt_attn(), output()

def init():
    Qo, Qt, Qv = Qh.copy()                        # initialize object, text, and visual queries
    Fv, Pt = Img_Encoder(img), Text_Encoder(txt)  # Fv: image features, Pt: text prompts
    Pv = Visual_Sampler(Fv, pm, nm)               # sample visual prompts from image features and pos/neg masks
    return Fv, Qo, Qt, Qv, Pv, Pt

def SEEM_Decoder(Fv, Qo, Qt, Qv, Pv, Pt, Pm):
    Qo, Qt, Qv = feature_attn(Fv, Qo, Qt, Qv)              # cross-attend queries with image features
    Qo, Qt, Qv = prompt_attn(qpm, Qo, Qt, Qv, Pv, Pt, Pm)  # self-attend queries and prompts
    Om, Oc, Pm = output(Fv, Qo, Qt, Qv)                    # compute mask/class outputs and memory prompts
    return Om, Oc, Pm

def forward(img, pm, nm, txt):
    Fv, Qo, Qt, Qv, Pv, Pt = init()  # initialize variables
    Pm = None
    for i in range(max_iter):        # one iteration per interaction round
        Om, Oc, Pm = SEEM_Decoder(Fv, Qo, Qt, Qv, Pv, Pt, Pm)
```

Interactive. Interactive segmentation usually cannot be completed in one shot and requires multiple rounds of interaction for refinement, similar to conversational agents such as ChatGPT. In SEEM, we propose a new type of prompt, memory prompts $P_m$, and use them to convey the knowledge of the masks from the previous iteration to the current one. Unlike previous works that use an extra network to encode the previous mask [20, 36], we introduce no extra module but simply a few memory prompts.
These memory prompts encode the history information through a mask-guided cross-attention layer [6]:

$$P^l_m = \text{MaskedCrossAtt}(P^{l-1}_m; M_p \mid Z) \tag{7}$$

where $M_p$ is the previous mask and $Z$ is the image feature map. In this way, cross-attention only takes effect inside the region specified by the previous mask. The updated memory prompts $P^l_m$ then interact with the other prompts via self-attention to convey the historical information to the current round.

Semantic-aware. Different from previous class-agnostic interactive segmentation works such as SimpleClick [20] and the concurrent SAM [36], our model produces semantic labels for masks under all kinds of prompt combinations in a zero-shot manner, since our visual prompt features are aligned with the textual features in a joint visual-semantic space. As shown in Fig. 3, semantic labels are directly computed from $O^c_h$ (the output of the visual queries) and the text embeddings. Although we do not train with any semantic labels for interactive segmentation, the computed logits are well aligned, benefiting from the joint visual-semantic space.

3.2 Model Pipeline and Loss Functions

We summarize the training and evaluation pipeline of the proposed method with PyTorch-style pseudocode in Algorithm 1. SEEM is trained with a linear combination of losses for panoptic segmentation, referring segmentation, and interactive segmentation:

$$L = \alpha L_{c\_CE\_pano} + \beta L_{m\_BCE\_pano} + \gamma L_{m\_DICE\_pano} + a L_{c\_CE\_ref} + b L_{m\_BCE\_ref} + c L_{m\_DICE\_ref} + a L_{c\_CE\_iseg} + b L_{m\_BCE\_iseg} + c L_{m\_DICE\_iseg} \tag{8}$$

where $\alpha = 2$, $\beta = \gamma = 5$, $a = 0.2$, $b = c = 2$, and CE, BCE, and DICE denote the cross-entropy, binary cross-entropy, and Dice losses, respectively (see the sketch below Table 1).

4 Experiments

Datasets and Settings. SEEM is trained on three tasks: panoptic segmentation, referring segmentation, and interactive segmentation. Panoptic and interactive segmentation are trained on COCO2017 [51] with panoptic segmentation annotations. Following [11], we exclude the validation set of Ref-COCOg [52], resulting in 107K segmentation images in total. For referring segmentation, we use the combination of Ref-COCO, Ref-COCOg, and Ref-COCO+ as referring annotations on COCO images. We evaluate generic segmentation (instance/panoptic/semantic), referring segmentation, and interactive segmentation.

Implementation Details and Evaluation Metrics. Our model framework follows X-Decoder [11] except for the decoder. That is, we have a vision backbone, a language backbone, an encoder, and

Table 1: One model for segmentation on a wide range of segmentation tasks. SEEM is the first model to simultaneously support generic segmentation, referring segmentation, and interactive segmentation, as well as prompt compositionality. ('#' denotes concurrent work; '-' indicates the model does not support the task; '*' indicates no reported number.) Generic segmentation is evaluated on COCO (PQ, mAP, mIoU), referring segmentation on Ref-COCOg (cIoU, mIoU, AP50), and interactive segmentation on Pascal VOC (5/10/20-NoC85 and 5/10/20-NoC90).

| Method | Segmentation Data | Type | PQ | mAP | mIoU | cIoU | mIoU | AP50 | 5-NoC85 | 10-NoC85 | 20-NoC85 | 5-NoC90 | 10-NoC90 | 20-NoC90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask2Former (T) [6] | COCO (0.12M) | Segmentation | 53.2 | 43.3 | 63.2 | - | - | - | - | - | - | - | - | - |
| Mask2Former (B) [6] | COCO (0.12M) | | 56.4 | 46.3 | 67.1 | - | - | - | - | - | - | - | - | - |
| Mask2Former (L) [6] | COCO (0.12M) | | 57.8 | 48.6 | 67.4 | - | - | - | - | - | - | - | - | - |
| Pano/SegFormer (B) [45] | COCO (0.12M) | | 55.4 | * | * | - | - | - | - | - | - | - | - | - |
| LAVT (B) [53] | Ref-COCO (0.03M) | | - | - | - | 61.2 | * | * | - | - | - | - | - | - |
| PolyFormer (B) [17] | Ref-COCO+VG+... (0.16M) | | - | - | - | 69.3 | * | * | - | - | - | - | - | - |
| PolyFormer (L) [17] | Ref-COCO+VG+... (0.16M) | | - | - | - | 71.1 | * | * | - | - | - | - | - | - |
| RITM ( | | | | | | | | | | | | | | |
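For concreteness, the sketch below illustrates the weighted combination in Eq. (8). It assumes the per-task class cross-entropy (CE), mask binary cross-entropy (BCE), and mask Dice terms have already been computed and collected in a dictionary; the function name and dictionary keys are illustrative, not from the released code.

```python
def seem_loss(losses: dict):
    """Combine per-task losses as in Eq. (8); `losses` maps keys like 'pano_ce' to scalar tensors."""
    alpha, beta, gamma = 2.0, 5.0, 5.0  # panoptic: class CE, mask BCE, mask Dice
    a, b, c = 0.2, 2.0, 2.0             # referring and interactive: class CE, mask BCE, mask Dice
    return (alpha * losses["pano_ce"] + beta * losses["pano_bce"] + gamma * losses["pano_dice"]
            + a * losses["ref_ce"] + b * losses["ref_bce"] + c * losses["ref_dice"]
            + a * losses["iseg_ce"] + b * losses["iseg_bce"] + c * losses["iseg_dice"])
```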