# Generalized Class Discovery in Instance Segmentation

Cuong Manh Hoang¹, Yeejin Lee¹, Byeongkeun Kang²*

¹Seoul National University of Science and Technology, Republic of Korea; ²Chung-Ang University, Republic of Korea
{cuonghoang, yeejinlee}@seoultech.ac.kr, byeongkeunkang@cau.ac.kr

**Abstract.** This work addresses the task of generalized class discovery (GCD) in instance segmentation. The goal is to discover novel classes and obtain a model capable of segmenting instances of both known and novel categories, given labeled and unlabeled data. Since the real world contains numerous objects with long-tailed distributions, the instance distribution for each class is inherently imbalanced. To address the imbalanced distributions, we propose an instance-wise temperature assignment (ITA) method for contrastive learning and class-wise reliability criteria for pseudo-labels. The ITA method relaxes instance discrimination for samples belonging to head classes to enhance GCD. The reliability criteria avoid excluding most pseudo-labels for tail classes when training an instance segmentation network using pseudo-labels from GCD. Additionally, we propose dynamically adjusting the criteria to leverage diverse samples in the early stages while relying only on reliable pseudo-labels in the later stages. We also introduce an efficient soft attention module to encode object-specific representations for GCD. Finally, we evaluate our proposed method by conducting experiments on two settings: COCOhalf + LVIS and LVIS + Visual Genome. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods.

## 1 Introduction

While supervised instance segmentation methods (He et al. 2017; Cheng et al. 2022; Jain et al. 2023) have achieved impressive performance, they require large-scale datasets with expensive human annotations.
To reduce annotation costs, researchers have investigated semi-supervised learning (SSL) methods (Bellver et al. 2019; Yang et al. 2023; Berrada et al. 2024) that utilize unlabeled images along with small-scale labeled data, as well as weakly-supervised methods (Lan et al. 2021) relying on weak annotations. However, all these methods rely on the closed-world assumption and can recognize only objects belonging to the classes (i.e., known classes) in the labeled dataset. To address this limitation, researchers introduced novel category discovery (NCD) (Han, Vedaldi, and Zisserman 2019). Unlike SSL, where the unlabeled images contain only known classes, NCD assumes that the unlabeled data include novel categories. Recently, generalized (novel) category discovery (GCD) (Vaze et al. 2022; Cao, Brbic, and Leskovec 2022) was introduced, further relaxing the assumptions on the unlabeled data. It assumes that the unlabeled data may contain both known and novel classes, making the problem more challenging and realistic. Given labeled and unlabeled data, GCD aims to train a model capable of recognizing both the known classes (e.g., person and car) in the labeled data and the novel categories (e.g., unknown1 and unknown2) discovered from the unlabeled data. Most previous works have investigated GCD for curated and balanced image classification datasets (Vaze et al. 2022; An et al. 2023; Zhang et al. 2023; Pu, Zhong, and Sebe 2023; Wen, Zhao, and Qi 2023). Recently, researchers have explored GCD for image classification on imbalanced datasets (Bai et al. 2023; Li et al. 2023; Li, Meinel, and Yang 2023), semantic segmentation (Zhao et al. 2022), 3D point cloud semantic segmentation (Riz et al. 2023), and instance segmentation (Fomenko et al. 2022; Weng et al. 2021).

*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Most of these works leverage semi-supervised contrastive learning and pseudo-label generation regardless of the task. In this work, we also investigate GCD in instance segmentation. It is worth noting that because the real world contains numerous objects with long-tailed distributions, the instance distribution for each class in instance segmentation datasets is inherently imbalanced. For example, a car appears more frequently than an ashtray.

(1) To address this imbalanced distribution, we propose an instance-wise temperature assignment method for contrastive learning. While typical contrastive learning losses treat samples from head and tail classes equally, we aim to emphasize group-wise discrimination for head-class samples while focusing on instance-wise discrimination for tail-class samples, inspired by (Kukleva et al. 2023). (2) Although relying on reliable pseudo-labels is important (Yang et al. 2022), applying fixed and global reliability criteria to imbalanced data tends to exclude most pseudo-labels for tail classes. Therefore, we propose class-wise reliability criteria that apply varying thresholds for head and tail classes. Additionally, we dynamically adjust the reliability criteria throughout training to leverage diverse samples in the early stages while focusing on reliable pseudo-labels in later stages. (3) Finally, we introduce an efficient soft attention module based on spatial pooling and depth reduction to effectively encode representations for target instances while suppressing those from background or adjacent objects.

*The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)*

The contributions of this paper are as follows: (1) We propose an instance-wise temperature assignment method for semi-supervised contrastive learning in GCD to enhance the separability of classes in long-tailed distributions.
(2) We introduce a reliability-based dynamic learning method for training with pseudo-labels, which applies different reliability criteria to each class based on its tailness. Additionally, we adjust these criteria during training to rely on strictly reliable pseudo-labels in the later stages while using diverse data in the early stages. (3) We propose an efficient soft attention module for encoding object-specific representations for GCD. (4) We validate the effectiveness of the proposed method by discovering novel classes and training an instance segmentation network using labeled and unlabeled data.

## 2 Related Works

**Generalized Class Discovery in Image Classification.** (Vaze et al. 2022) introduced GCD, which aims to categorize unlabeled images given both labeled and unlabeled data. They trained an embedding network using unsupervised contrastive learning on all the data and supervised contrastive learning on the labeled data. They then applied semi-supervised k-means clustering to assign class or cluster labels to the unlabeled images. Additionally, they proposed a method for estimating the number of novel classes. (An et al. 2023) proposed DPN, which utilizes two sets of category-wise prototypes: one for labeled data and the other for unlabeled images based on k-means clustering. For clustering, they assumed prior knowledge of the total number of categories. They discovered novel classes in the unlabeled data by applying the Hungarian algorithm (Kuhn 1955) to the two sets. (Zhang et al. 2023) presented PromptCAL, which uses an affinity graph to generate pseudo-labels for unlabeled data. (Pu, Zhong, and Sebe 2023) introduced the DCCL framework, which employs a hyperparameter-free clustering algorithm (Rosvall and Bergstrom 2008) to generate pseudo-labels in the absence of ground-truth cluster numbers. (Wen, Zhao, and Qi 2023) proposed SimGCD, a one-stage framework that replaces the separate semi-supervised clustering in (Vaze et al.
2022) with a jointly trainable parametric classifier. They analyzed the problem of using unreliable pseudo-labels for training a parametric classifier and proposed using soft pseudo-labels. Recently, (Bai et al. 2023) introduced GCD for long-tailed datasets to address imbalanced distributions in real-world scenarios. They proposed the BaCon framework, which includes a pseudo-labeling branch and a contrastive learning branch. To handle imbalanced distributions, they estimated the data distribution using k-means clustering and the Hungarian algorithm (Kuhn 1955). Concurrently, (Li et al. 2023) proposed ImbaGCD, which estimates class prior distributions assuming unknown classes are usually tail classes. They then generated pseudo-labels for unlabeled images using the estimated prior distribution and the Sinkhorn-Knopp algorithm (Cuturi 2013). Later, (Li, Meinel, and Yang 2023) replaced the expectation-maximization (EM) and Sinkhorn-Knopp (Cuturi 2013) algorithms in (Li et al. 2023) with cross-entropy-based regularization losses to reduce computational costs.

**Class Discovery in Segmentation and Detection.** (Zhao et al. 2022) introduced NCD in semantic segmentation. They proposed finding novel salient object regions using a saliency model and a segmentation network trained on labeled data. They then applied clustering to these object regions to obtain pseudo-labels. Finally, they trained a segmentation network using the labeled data and the unlabeled images with clean pseudo-labels from clustering and online pseudo-labels. The clean pseudo-labels were dynamically assigned based on entropy ranking. (Riz et al. 2023) extended the method from (Zhao et al. 2022) to 3D point cloud semantic segmentation. Specifically, they utilized online clustering and exploited uncertainty quantification to generate pseudo-labels. In instance segmentation, (Weng et al. 2021) investigated unsupervised long-tail category discovery.
They initially employed a class-agnostic mask proposal network to obtain masks for all objects. They then trained an embedding network using self-supervised triplet losses. Finally, they applied hyperbolic k-means clustering to discover novel categories. (Fomenko et al. 2022) introduced novel class discovery and localization, which can be viewed as GCD in object detection and instance segmentation. They first trained a Faster R-CNN (Ren et al. 2015) or Mask R-CNN (He et al. 2017) using the labeled data and froze the network except for its classification head. Then, they applied the frozen network to unlabeled and labeled images to obtain region proposals. Subsequently, they expanded the classification head to incorporate new classes and trained it using pseudo-labels generated by online clustering based on the Sinkhorn-Knopp algorithm (Cuturi 2013).

## 3 Proposed Method

We first define the GCD problem in instance segmentation, which aims to discover novel classes and learn to segment instances of both known and novel categories, given labeled and unlabeled data. We then present our GCD method in Section 3.2 and the method for training an instance segmentation network in Section 3.3. An overview of the proposed framework during training is illustrated in Figure 1.

### 3.1 Preliminaries

**Problem Formulation.** We are given a labeled dataset $D^l$ and an unlabeled dataset $D^u$. $D^l$ contains images $\{I^l\}$ along with instance-wise class and mask labels $(\{y^l\}, \{M^l\})$ for known classes $C^k$, while $D^u$ comprises only images $\{I^u\}$. Given $D^l$ and $D^u$, GCD in instance segmentation aims to discover novel categories $C^n$ (i.e., $C^k \cap C^n = \emptyset$) and to obtain a model capable of segmenting instances of both the known and novel classes $C = C^k \cup C^n$. Hence, during inference, the network is expected to segment instances of known classes (e.g., person and car) as well as novel categories (e.g., unknown1 and unknown2) given an image $I$. The images in $D^l$ and $D^u$ may contain instances of both the known and novel classes.
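The overall training flow described in this section (class-agnostic mask generation, class discovery, then segmentation training with pseudo-labels) can be sketched with placeholder stage functions; every name below is hypothetical, not the authors' code:

```python
def train_gcd_pipeline(D_l, D_u, train_fo, train_fd, train_fs):
    """Sketch of the three-stage training flow. The stage functions are
    illustrative placeholders: train_fo yields class-agnostic masks M^u,
    train_fd yields pseudo class labels y^u, and train_fs trains the
    final instance segmentation network on D^l plus pseudo-labeled D^u."""
    f_o, masks_u = train_fo(D_l, D_u)            # class-agnostic masks M^u
    pseudo_labels = train_fd(D_l, D_u, masks_u)  # discovered classes / y^u
    f_s = train_fs(D_l, D_u, masks_u, pseudo_labels)
    return f_s
```

The sketch only fixes the ordering and data dependencies between stages; each stage is described in the corresponding subsection.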
**Contrastive Learning for GCD.** (Vaze et al. 2022) introduced a contrastive learning (CL) method for GCD in balanced image classification datasets. They first pre-trained a backbone with DINO (Caron et al. 2021) on the ImageNet dataset (Russakovsky et al. 2015) without labels. Subsequently, they fine-tuned the backbone and a projection head using supervised CL on the labeled data and unsupervised CL on both the labeled and unlabeled data.

Figure 1: Overview of the proposed framework during training. We first train a class-agnostic instance segmentation network $f_o(\cdot)$ and apply it to unlabeled images to generate class-agnostic instance masks. Then, we train the GCD model $f_d(\cdot)$ using unlabeled object images and labeled object images to discover novel classes in the unlabeled data. Finally, we train an instance segmentation network $f_s(\cdot)$ using the labeled data and the unlabeled images with pseudo-labels.

Following (Wen, Zhao, and Qi 2023; Vaze et al. 2022), our GCD model $f_d(\cdot)$ consists of a backbone $b(\cdot)$ and a projection head $g(\cdot)$. We utilize an MLP for $g(\cdot)$ and a ResNet-50 backbone (He et al. 2016) for $b(\cdot)$, which is pre-trained on the unlabeled ImageNet dataset (Russakovsky et al. 2015) using DINO (Caron et al. 2021). Additionally, we employ a momentum encoder $f'_d(\cdot)$ whose parameters are momentum-based moving averages of the parameters of $f_d(\cdot)$ during training, following MoCo (He et al. 2020). We also use a queue to store the embeddings from $f'_d(\cdot)$ for the samples of both the previous and current mini-batches. Formally, given an image, we generate two views (random augmentations) $I_i$ and $I'_i$. We then encode them using $f_d(\cdot)$ and $f'_d(\cdot)$ to obtain $z_i = f_d(I_i) = g(b(I_i))$ and $z'_i = f'_d(I'_i)$, respectively. We store $z'_i$ from the samples in the previous and current mini-batches in the queue $Z'$. In (Vaze et al.
2022), the unsupervised contrastive loss $\mathcal{L}^{u}_{rep}$ and supervised contrastive loss $\mathcal{L}^{s}_{rep}$ are computed as follows:

$$
\mathcal{L}^{u}_{rep} := -\log \frac{\exp(z_i^{\top} z'_i / \tau)}{\sum_{z'_j \in \hat{Z}_i} \exp(z_i^{\top} z'_j / \tau)}, \qquad
\mathcal{L}^{s}_{rep} := -\frac{1}{|Z^{p}_i|} \sum_{z'_k \in Z^{p}_i} \log \frac{\exp(z_i^{\top} z'_k / \tau)}{\sum_{z'_j \in \hat{Z}_i} \exp(z_i^{\top} z'_j / \tau)} \quad (1)
$$

where $\hat{Z}_i$ represents the set $Z'$ excluding $z'_i$ (i.e., $\hat{Z}_i = Z' \setminus \{z'_i\}$); $Z^{p}_i$ denotes the subset of $Z'$ containing the representations that belong to the same class as $I_i$; $\tau$ is a temperature hyperparameter; and $|\cdot|$ denotes the number of samples in a set.

### 3.2 Generalized Class Discovery

Given $D^l$ and $D^u$, we aim to discover novel categories in $D^u$ using the knowledge from $D^l$. To achieve this, we first train an instance segmentation network using $D^l$ and $D^u$ to generate class-agnostic instance masks $M^u$ for all objects in $D^u$. Subsequently, we crop the unlabeled images using $M^u$ and the labeled images using the ground-truth masks $M^l$. Then, we train a GCD model using the cropped unlabeled images $I^u_o$ and the cropped labeled images $I^l_o$ with class labels $y^l$ to generate pseudo-class labels $y^u$ for $I^u_o$.

**Class-Agnostic Instance Mask Generation.** Similar to (Fomenko et al. 2022), we first train an instance segmentation network $f_o(\cdot)$ to obtain class-agnostic instance masks for both known and unseen classes $C = C^k \cup C^n$. We train a class-agnostic instance segmentation network, the Generic Grouping Network (GGN) from (Wang et al. 2022), using both $D^l$ and $D^u$. We experimentally demonstrate that our GCD method is robust when applied to other class-agnostic instance segmentation methods. Once the training terminates, we apply $f_o(\cdot)$ to the images $I^u \in D^u$ to obtain instance masks $M^u$. We then construct an unlabeled object image set $I^u_o$ by cropping the rectangular regions of $I^u$ based on $M^u$. Similarly, a labeled object image set $I^l_o$ is prepared by cropping $I^l$ in $D^l$ based on the mask labels $M^l$.

**Contrastive Learning for GCD in Instance Segmentation.** We propose a contrastive learning method for GCD in instance segmentation by modifying the losses in Eq. (1).
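A minimal NumPy sketch of the two losses in Eq. (1), assuming the queue already holds momentum-encoder embeddings and, per the definition above, already excludes $z'_i$; the function and argument names are illustrative:

```python
import numpy as np

def unsup_contrastive_loss(z_i, z_pos, queue, tau=0.07):
    """L^u_rep: pull z_i toward its second view z'_i (z_pos) and push it
    away from the queued embeddings. `queue` stands in for Z-hat_i."""
    pos = z_pos @ z_i / tau
    denom = np.sum(np.exp(queue @ z_i / tau))
    return -(pos - np.log(denom))

def sup_contrastive_loss(z_i, positives, queue, tau=0.07):
    """L^s_rep: average the same objective over all same-class
    positives Z^p_i (a list of embedding vectors)."""
    denom = np.sum(np.exp(queue @ z_i / tau))
    return -np.mean([z_k @ z_i / tau - np.log(denom) for z_k in positives])
```

With a single positive, the supervised loss reduces to the unsupervised one, which is a quick sanity check on the implementation.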
While these losses are designed for curated and balanced data, instances in typical instance segmentation datasets are naturally imbalanced (i.e., certain objects appear more frequently than others). To address the long-tail distribution of instances for each class, we propose adjusting the temperature parameters in $\mathcal{L}^{u}_{rep}$ and $\mathcal{L}^{s}_{rep}$ for each instance based on its likelihood of belonging to a head class.

Figure 2: t-SNE visualization of two semantically close classes, book (green, single head class) and booklet (blue, single tail class), under (a) $\tau = 0.07$, (b) $\tau = 1$, (c) TS, and (d) Ours (ITA).

Figure 2 visualizes t-SNE projections of two semantically similar classes: book and booklet. Previously, (Kukleva et al. 2023) showed that a low temperature value tends to uniformly discriminate all instances while a high temperature value leads to group-wise discrimination, as shown in Figure 2 (a) and (b). Based on this, (Kukleva et al. 2023) introduced a cosine temperature scheduling (TS) method to alternate between instance-wise and group-wise discrimination. In contrast, we propose to estimate the headness of each instance and assign high/low temperature values to instances belonging to head/tail classes. Figure 2 demonstrates the superiority of our instance-wise temperature assignment (ITA) method compared to static assignments and TS (Kukleva et al. 2023). In detail, we first compute a headness score $\hat{h}_i$ for each instance $I_i$ by estimating the density of the neighborhood of $z_i$ in the embedding space. A higher score $\hat{h}_i$ indicates a higher probability of $I_i$ belonging to a head class. We then apply a momentum update to the headness scores to enhance the robustness of the estimation.
Specifically, $\hat{h}_i$ and the momentum-updated headness score $h_i$ at the $t$-th epoch are computed as follows:

$$
\hat{h}^{t}_i := \frac{\sum_{z'_j \in \hat{Z}^{topK}_i} \exp(z_i^{\top} z'_j)}{\sum_{z'_j \in \hat{Z}_i} \exp(z_i^{\top} z'_j)}, \qquad
h^{t}_i := \rho\, h^{t-1}_i + (1 - \rho)\, \hat{h}^{t}_i \quad (2)
$$

where $\hat{Z}_i = Z' \setminus \{z'_i\}$; $\hat{Z}^{topK}_i$ denotes the set containing the $K\%$ most similar representations to $z_i$ in $\hat{Z}_i$; and $\rho$ denotes a momentum hyperparameter with a value between 0 and 1. Subsequently, we determine a temperature value $\tau_i$ for each instance $I_i$ using $h^{t}_i$. To avoid extreme values, we constrain $h^{t}_i$ to fall within the lowest 10% ($h_{low}$) and highest 10% ($h_{high}$) of the scores. We then apply min-max normalization to adjust the score to fall within the range between $\tau_{min}$ and $\tau_{max}$. Specifically, the temperature value $\tau_i$ for $I_i$ is calculated as follows:

$$
\tau_i := \frac{\bar{h}^{t}_i - \min(H^t)}{\max(H^t) - \min(H^t)}\, (\tau_{max} - \tau_{min}) + \tau_{min} \quad (3)
$$

where $\bar{h}^{t}_i := \min(\max(h^{t}_i, h_{low}), h_{high})$, and $H^t$ represents the set containing the headness scores for all samples. For efficiency, $\tau_i$ and $h^{t}_i$ are updated once per epoch. The two contrastive losses $\mathcal{L}^{u}_{rep}$ and $\mathcal{L}^{s}_{rep}$ in Eq. (1) are modified using the estimated instance-wise temperature value $\tau_i$ as follows:

$$
\mathcal{L}^{u}_{rep} := -\log \frac{\exp(z_i^{\top} z'_i / \tau_i)}{\sum_{z'_j \in \hat{Z}_i} \exp(z_i^{\top} z'_j / \tau_i)}, \qquad
\mathcal{L}^{s}_{rep} := -\frac{1}{|Z^{p}_i|} \sum_{z'_k \in Z^{p}_i} \log \frac{\exp(z_i^{\top} z'_k / \tau_i)}{\sum_{z'_j \in \hat{Z}_i} \exp(z_i^{\top} z'_j / \tau_i)} \quad (4)
$$

**Soft Attention Module.** Since $I^u_o$ and $I^l_o$ contain target objects along with background or adjacent objects, we investigate a soft attention module (SAM) to encode object-specific features. Although we generate pseudo-masks $M^u$ for $D^u$ using $f_o(\cdot)$, directly using these pseudo-masks to encode object-specific features is risky due to noisy boundaries. To address this, we train an efficient attention module using the pseudo-masks $M^u$ for $D^u$ and ground-truth masks $M^l$ for $D^l$. We integrate this attention module into every stage of the CNN backbone. Additionally, we utilize pooled feature maps and embedding functions with depth reduction to reduce computational complexity.
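Before detailing the module, the temperature assignment in Eqs. (2)-(3) can be sketched in NumPy. This is a simplified single-epoch sketch: reading $h_{low}$/$h_{high}$ as the 10th/90th percentiles of the score set is our interpretation, and all names are illustrative:

```python
import numpy as np

def headness(z_i, Z_hat, K=0.01):
    """h-hat_i (Eq. 2, left): fraction of similarity mass carried by the
    K% most similar entries of Z-hat_i (dense neighborhood => head class)."""
    sims = np.exp(Z_hat @ z_i)
    k = max(1, int(round(K * len(sims))))
    return np.sort(sims)[-k:].sum() / sims.sum()

def momentum_headness(h_prev, h_hat, rho=0.9):
    """h_i^t (Eq. 2, right): EMA over the per-epoch estimates."""
    return rho * h_prev + (1.0 - rho) * h_hat

def instance_temperature(h, H, tau_min=0.07, tau_max=1.0):
    """tau_i (Eq. 3): clamp to [h_low, h_high], then min-max rescale
    into [tau_min, tau_max]. H is the array of all headness scores."""
    h_low, h_high = np.percentile(H, 10), np.percentile(H, 90)
    Hc = np.clip(H, h_low, h_high)
    hc = np.clip(h, h_low, h_high)
    return (hc - Hc.min()) / (Hc.max() - Hc.min()) * (tau_max - tau_min) + tau_min
```

Head instances (dense neighborhoods) thus receive a temperature near $\tau_{max}$, relaxing instance discrimination toward group-wise discrimination, while tail instances stay near $\tau_{min}$.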
Given a feature map $F \in \mathbb{R}^{D \times H \times W}$ at the end of each stage in the backbone, we first reduce its dimensions to decrease subsequent computations. Specifically, we apply spatial average pooling to $F$ with varying receptive fields. Then, we use $M$ embedding functions $\eta_i(\cdot)$ to generate $M$ outputs $P_i \in \mathbb{R}^{d \times s_i \times s_i}$. Here, $i$ indexes the $M$ outputs, $d$ represents the reduced depth produced by the embedding functions, and $s_i \times s_i$ denotes the resulting spatial dimension after pooling. Subsequently, we reshape $P_i$ into $\hat{P}_i \in \mathbb{R}^{d \times s_i^2}$ and concatenate them to obtain $P \in \mathbb{R}^{d \times (s_1^2 + s_2^2 + \cdots + s_M^2)}$. Then, we compute a pairwise affinity matrix $A$ by projecting $F$ using an embedding function $\phi(\cdot)$ and multiplying the result by $P$ (i.e., $A := P^{\top} \phi(F)$). The matrix $A \in \mathbb{R}^{(s_1^2 + s_2^2 + \cdots + s_M^2) \times H \times W}$ represents the spatial relations between the pooled feature map $P$ and $F$. Additionally, we project $F$ using a function $\psi(\cdot)$ and apply global average pooling along the channel dimension, generating a map $G$. Finally, we concatenate $A$ with $G$ and apply an embedding function $\nu(\cdot)$ followed by a sigmoid function to obtain an attention map $S \in \mathbb{R}^{1 \times H \times W}$. This map $S$ is then element-wise multiplied with each channel of $F$ to produce the output $O \in \mathbb{R}^{D \times H \times W}$ of the attention module:

$$
O := S \odot F := \sigma(\nu([A, G])) \odot F \quad (5)
$$

where $\odot$ and $\sigma(\cdot)$ denote element-wise multiplication and the sigmoid function, respectively, and $[\cdot, \cdot]$ represents concatenation. Each of the embedding functions ($\eta(\cdot)$, $\psi(\cdot)$, $\phi(\cdot)$) consists of a $1 \times 1$ convolution layer, batch normalization, and ReLU activation. In comparison, $\nu(\cdot)$ contains only a $1 \times 1$ convolution layer and batch normalization.

Figure 3: Visualization of the pairwise affinity between the white-cross-marked location and other pixels. (a) RGA-S (Zhang et al. 2020); (b) Ours (SAM).

To train the soft attention modules, we use object masks $M^u$ from $f_o(\cdot)$ for $D^u$ and ground-truth masks $M^l$ for $D^l$. Because the pseudo-masks for $D^u$ are noisy, especially near object boundaries (Wang, Li, and Wang 2022), we utilize a
weight map $W$ to reduce reliance on these regions. Specifically, the attention loss $\mathcal{L}_{att}$ is computed as follows:

$$
\mathcal{L}_{att} := \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} W_{ij} \left\| S_{ij} - M_{ij} \right\|_2^2 \quad (6)
$$

where $W_{ij}$ is set to $w$ if $d_{ij} \le \hat{d}$ and $M \in \{M^u\}$, and to 1 otherwise. Here, $d_{ij}$ denotes the Euclidean distance from $(i, j)$ to the nearest object boundary; $\hat{d}$ is a hyperparameter that defines the boundary regions; and $w$ is a weighting coefficient ($\le 1$). Since SAM is applied after each stage of the backbone, $\mathcal{L}_{att}$ is obtained by averaging all the corresponding losses. Figure 3 visualizes the pairwise affinity between a marked position and other pixels. The results show that our method, which uses pooled feature maps, is more robust than the previous approach (Zhang et al. 2020).

**Deep Clustering for GCD.** To avoid the separate semi-supervised clustering step in (Vaze et al. 2022), we employ a deep clustering method similar to those in (Fomenko et al. 2022; Wen, Zhao, and Qi 2023). However, unlike (Fomenko et al. 2022), our method does not rely on an experimentally selected target prior distribution. Additionally, it is designed to handle imbalanced data, in contrast to (Wen, Zhao, and Qi 2023). Specifically, we use the method from (Zhang et al. 2021) for clustering $I^u_o$, with a minor modification: replacing L2 distance with cosine similarity. We compute the KL-divergence-based loss $\mathcal{L}^{u}_{cls}$ on $I^u_o$ for unsupervised clustering in (Zhang et al. 2021), as follows:

$$
\mathcal{L}^{u}_{cls} := \sum_{c=1}^{|C|} p_{ic} \log \frac{p_{ic}}{q_{ic}} \quad (7)
$$

where $q_{ic}$ is the probability of $I_i$ belonging to class $c$; $p_{ic}$ is the auxiliary target class probability; and $|C|$ is the total number of clusters/classes. Independent from deep clustering, we additionally compute the typical cross-entropy loss $\mathcal{L}^{s}_{cls}$ on $D^l$ for supervised classification, as follows:

$$
\mathcal{L}^{s}_{cls} := -\sum_{c=1}^{|C|} y_{ic} \log q_{ic} \quad (8)
$$

where $y_i$ denotes the one-hot encoded label for $I_i$.

**Total Loss for GCD.**
The total loss $\mathcal{L}_{gcd}$ for $f_d(\cdot)$ is computed as the weighted sum of the two contrastive losses, the two classification losses, and the attention loss, as follows:

$$
\mathcal{L}_{gcd} := \mathcal{L}_{att} + (1-\lambda)\mathcal{L}^{u}_{rep} + \lambda \mathcal{L}^{s}_{rep} + (1-\lambda)\mathcal{L}^{u}_{cls} + \lambda \mathcal{L}^{s}_{cls} \quad (9)
$$

where $\lambda$ is a hyperparameter used to balance the losses, following (Wen, Zhao, and Qi 2023).

### 3.3 Reliability-Based Dynamic Learning

We generate pseudo-masks $M^u$ and pseudo-class labels $y^u$ for $D^u$ using the method described in Section 3.2. Subsequently, we train an instance segmentation network $f_s(\cdot)$ that can segment instances of both known and novel classes using $D^l$ and $D^u$ with pseudo-labels. To address the issues of inaccurate pseudo-labels and imbalanced instance distributions across classes, we propose a reliability-based dynamic learning (RDL) method. It applies different reliability criteria to each class to avoid excluding all samples from tail classes. Additionally, it adjusts these criteria during training to use diverse data in the early stages while relying only on reliable pseudo-labels in the later stages. Inspired by (Yang et al. 2022), we use holistic stability to measure the reliability of the pseudo-labels. At every fixed number of epochs during training $f_d(\cdot)$, we save the model at that point in time. We then apply these saved models to object images $I_i \in I^u_o$ to compute the probability $q^{t}_{ic}$ of $I_i$ belonging to class $c$, where $t$ denotes the index of the stored models, ranging from 1 to $T$. Subsequently, we compute the stability $s_i$ of the probabilities by comparing $q^{T}_{i}$ from the final model with $q^{t}_{i}$ from the intermediate models, as follows:

$$
s_i := \frac{1}{T-1} \sum_{t=1}^{T-1} \left( 1 - \mathrm{KL}\!\left( q^{T}_i \,\|\, q^{t}_i \right) \right) \quad (10)
$$

where $\mathrm{KL}(\cdot\|\cdot)$ represents the Kullback-Leibler divergence, and $q^{t}_i \in \mathbb{R}^{|C|}$. Since a higher $s_i$ indicates greater stability, (Yang et al. 2022) considered pseudo-labels with the lowest $r\%$ scores unreliable. However, applying the same criteria to all data may result in categorizing most samples of a certain class as unreliable.
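A NumPy sketch of the stability score and the class-wise (rather than global) unreliable-sample selection; the averaged "1 − KL" form follows Eq. (10), and the `labels`/`r` names are illustrative:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete probability vectors."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def stability(q_final, q_checkpoints):
    """s_i (Eq. 10): agreement of the final prediction q_i^T with the
    T-1 intermediate checkpoint predictions q_i^t."""
    return float(np.mean([1.0 - kl_div(q_final, q_t) for q_t in q_checkpoints]))

def classwise_unreliable(scores, labels, r=0.5):
    """Mark the lowest r-fraction of stability scores *within each
    pseudo-class* as unreliable, so that tail classes are never
    filtered out wholesale by a single global threshold."""
    unreliable = np.zeros(len(scores), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        cutoff = np.quantile(scores[idx], r)
        unreliable[idx] = scores[idx] < cutoff
    return unreliable
```

Ranking within each class is the key difference from a global criterion: a tail class whose scores are uniformly low still keeps its most stable samples.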
Specifically, for imbalanced data, instances of tail classes tend to have lower $s_i$ than those of head classes due to the smaller number of training samples. Additionally, because neural networks tend to first memorize easy samples and then gradually learn harder instances during training (Arpit et al. 2017), difficult samples/classes often have lower $s_i$ than easier ones. To address these issues, we propose class-wise reliability criteria, which consider pseudo-labels with the lowest $r\%$ scores per class as unreliable. Additionally, we gradually increase the portion $r\%$ of unreliable samples to initially learn from all data and later optimize using only reliable samples. The idea is that, in the early stages, having a larger number of diverse samples is more important than the accuracy of pseudo-labels, while pseudo-label quality becomes more crucial in later stages. Specifically, we first compute $t_i$ for each instance $i$ by finding the rank of $s_i$ among the lowest values within its class. We assign $\gamma$ to $t_i$ if its proportional rank falls between $\frac{\gamma}{T_{is}}$ and $\frac{\gamma+1}{T_{is}}$. Then, the reliability-based adjustment weight $\kappa^{t}_i$ is computed as follows:

$$
\kappa^{t}_i := \left[ \max\!\left( \frac{t_i - t}{T_{is}},\ 0 \right) \right]^2 \quad (11)
$$

where $t$ and $T_{is}$ denote the current epoch and the total number of epochs for training $f_s(\cdot)$, respectively. Finally, we employ SOLOv2 (Wang et al. 2020) for $f_s(\cdot)$ and use its loss function with modifications for training. First, we replace the focal loss (Lin et al. 2017) with the equalized focal loss (Li et al. 2022), which performs better on imbalanced data. We use this modified loss $\mathcal{L}_{s}$ for $D^l$ and this loss multiplied by $\kappa^{t}_i$ for $D^u$. Therefore, the total instance segmentation loss $\mathcal{L}_{is}$ is computed as follows:

$$
\mathcal{L}_{is} := \sum_{i \in B^{l}} \mathcal{L}_{s}(I^{l}_i, M^{l}_i, y^{l}_i) + \sum_{i \in B^{u}} \kappa^{t}_i\, \mathcal{L}_{s}(I^{u}_i, M^{u}_i, y^{u}_i) \quad (12)
$$

where $B^{l}$ and $B^{u}$ denote the sets containing the indices of the data from $D^l$ and $D^u$, respectively.

## 4 Experiments and Results

### 4.1 Experimental Setting

**Dataset.**
We conducted experiments in two settings: COCOhalf + LVIS and LVIS + VG, following (Fomenko et al. 2022). In the COCOhalf + LVIS setting, we consider 80 COCO classes as known classes and aim to discover 1,123 disjoint LVIS classes from the total of 1,203 LVIS classes. Among the 100K training images in the LVIS dataset (Gupta, Dollár, and Girshick 2019), we use 50K images with labels for the 80 COCO classes for $D^l$ and the entire 100K images without labels for $D^u$. We use the 20K LVIS validation images for evaluation. In the LVIS + VG setting, we utilize 1,203 LVIS classes as known classes and aim to discover 2,726 disjoint classes in the Visual Genome (VG) v1.4 dataset (Krishna et al. 2017). We use the entire 100K LVIS training data for $D^l$ and the combined 158K LVIS and VG training images for $D^u$. For evaluation, we use 8K images that appear in both the LVIS and VG validation sets. Although the VG dataset contains over 7K classes, only 3,367 classes appear in both $D^u$ and the validation data. After excluding the 641 classes that overlap with the known classes, we aim to discover 2,726 classes.

**Implementation Details.** We use ResNet-50 as the backbone for both $f_d(\cdot)$ and $f_s(\cdot)$. For $f_d(\cdot)$, we apply the SAM to all four stages of the backbone. We train $f_d(\cdot)$ for 390 epochs with $K = 1\%$, $\tau_{min} = 0.07$, $\tau_{max} = 1$, and $\lambda = 0.35$. For the SAM, we set $M = 3$, $d = D/8$, $w = 0.25^{stg}$, $\hat{d} = 1$, $s_1 = \lceil 18/2^{stg-1} \rceil$, $s_2 = \lceil 12/2^{stg-1} \rceil$, and $s_3 = \lceil 8/2^{stg-1} \rceil$, where $stg$ is the stage index and $\lceil \cdot \rceil$ denotes rounding. We train $f_s(\cdot)$ for 36 epochs with $T = 3$. The experiments were conducted on a computer with two Nvidia GeForce RTX 3090 GPUs, an Intel Core i9-10940X CPU, and 128 GB RAM. The code will be publicly available on GitHub upon publication to ensure reproducibility.

**Evaluation Metric.** We first apply the Hungarian algorithm (Kuhn 1955) to find a one-to-one mapping between the discovered classes and the ground-truth novel classes, following (Fomenko et al. 2022).
We then calculate mAP.50:.05:.95 for all, known, and novel classes. mAP.50:.05:.95 is computed using mask labels for the COCOhalf + LVIS setting and using bounding box labels for the LVIS + VG setting due to the absence of mask labels in the VG dataset.

### 4.2 Result

Tables 1 and 2 present quantitative comparisons for the COCOhalf + LVIS and LVIS + VG settings, respectively. We compare the proposed method with the k-means (MacQueen et al. 1967) baseline and previous works including ORCA (Cao, Brbic, and Leskovec 2022), UNO (Fini et al. 2021), SimGCD (Wen, Zhao, and Qi 2023), and RNCDL (Fomenko et al. 2022). Additionally, we report results obtained by replacing the GCD model in our framework with recent GCD methods: µGCD (Vaze, Vedaldi, and Zisserman 2023) and NCDLR (Zhang, Xu, and He 2023). The results demonstrate that our method significantly outperforms the previous state-of-the-art method (Fomenko et al. 2022) as well as methods focusing on balanced and curated datasets (Cao, Brbic, and Leskovec 2022; Fini et al. 2021; Wen, Zhao, and Qi 2023), across all metrics and settings. "RNCDL w/ ran. init." refers to the method with random initialization in (Fomenko et al. 2022).

| Method | mAP_all | mAP_known | mAP_novel |
|---|---|---|---|
| k-means | 1.48 | 17.24 | 0.13 |
| RNCDL w/ ran. init. | 1.87 | 22.85 | 0.09 |
| ORCA | 3.19 | 21.61 | 1.63 |
| UNO | 3.42 | 22.34 | 1.86 |
| SimGCD | 4.06 | 23.91 | 2.47 |
| RNCDL | 6.69 | 25.21 | 5.16 |
| µGCD† | 4.93 | 25.36 | 3.58 |
| NCDLR† | 9.42 | 27.81 | 8.06 |
| Ours | 12.85 | 35.57 | 11.24 |

Table 1: Comparison for the COCOhalf + LVIS setting. † indicates that only the GCD model in our method is replaced by the corresponding GCD method while all other components of our method remain unchanged.

| Method | mAP_all | mAP_known | mAP_novel |
|---|---|---|---|
| RNCDL w/ ran. init. | 2.13 | 7.71 | 0.81 |
| SimGCD | 2.37 | 9.13 | 0.92 |
| RNCDL | 4.46 | 12.55 | 2.56 |
| µGCD | 2.64 | 9.74 | 0.98 |
| NCDLR | 4.71 | 12.87 | 2.63 |
| Ours | 5.21 | 13.28 | 3.27 |

Table 2: Quantitative comparison for the LVIS + VG setting.

Figure 4 shows qualitative comparisons with RNCDL (Fomenko et al. 2022) for both settings.
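The one-to-one class matching used before scoring can be illustrated on a toy overlap matrix. The paper uses the Hungarian algorithm; this sketch brute-forces the same objective with `itertools.permutations`, which is feasible only for tiny matrices and serves purely as an illustration:

```python
import itertools
import numpy as np

def best_one_to_one_matching(overlap):
    """Maximize total overlap between discovered classes (rows) and
    ground-truth novel classes (columns). Brute-force stand-in for the
    Hungarian algorithm; usable only for small toy matrices."""
    n = overlap.shape[0]
    best, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        score = sum(overlap[i, perm[i]] for i in range(n))
        if score > best:
            best, best_perm = score, perm
    return dict(enumerate(best_perm)), best
```

In practice one would call an O(n³) Hungarian solver (e.g., SciPy's `linear_sum_assignment`) instead, since the number of discovered classes here exceeds a thousand.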
### 4.3 Analysis

We conducted all the ablation studies and analyses using the COCOhalf + LVIS setting unless otherwise noted. Table 3 presents an ablation study of the proposed components. The baseline model is constructed by using a fixed value for the temperature parameters for all samples in Eq. (4), excluding SAM, and treating all pseudo-labels as equivalent to human annotations. The results demonstrate that each module provides additional improvements across all metrics.

Figure 4: Qualitative results. (a) and (b) are from the COCOhalf + LVIS setting; (c) and (d) are from the LVIS + VG setting. (a), (c): RNCDL; (b), (d): Ours.

| Method | mAP_all | mAP_known | mAP_novel |
|---|---|---|---|
| Baseline | 5.79 | 24.96 | 4.19 |
| + ITA | 7.64 | 26.84 | 6.09 |
| + ITA + SAM | 10.71 | 31.62 | 9.08 |
| + ITA + SAM + RDL | 12.85 | 35.57 | 11.24 |

Table 3: Ablation study on the proposed components.

| τ_min, τ_max | Method | mAP_all | mAP_known | mAP_novel |
|---|---|---|---|---|
| 0.07, 0.5 | TS | 10.86 | 33.72 | 9.19 |
| 0.07, 0.5 | Ours (ITA) | 11.67 | 34.63 | 10.01 |
| 0.07, 1 | TS | 11.43 | 34.25 | 9.79 |
| 0.07, 1 | Ours (ITA) | 12.85 | 35.57 | 11.24 |

Table 4: Comparison of temperature assignment methods.

Table 4 compares the proposed ITA method with the TS method from (Kukleva et al. 2023). Although the TS method is also designed for imbalanced data, our ITA method consistently outperforms it for both the selected hyperparameters (0.07, 1) and an alternative set (0.07, 0.5). Additionally, the table shows that the chosen hyperparameters yield better performance than the alternative set.

Table 5 compares the proposed SAM with previous attention methods, including MGCAM (Song et al. 2018), MGFPM (Wang, Yu, and Gao 2021), CBAM-S (Woo et al. 2018), and RGA-S (Zhang et al. 2020). The results for the other methods are obtained by replacing only the SAM in the proposed method with the corresponding methods. The results demonstrate that our SAM outperforms all other methods across all metrics.

Table 6 compares our RDL method with the fixed and global reliability criteria in ST++ (Yang et al. 2020).
Method      mAP_all  mAP_known  mAP_novel
MGCAM        10.34     29.37       8.79
MGFPM        10.63     31.15       9.01
CBAM-S       11.19     31.83       9.56
RGA-S        11.62     33.48       9.98
Ours (SAM)   12.85     35.57      11.24

Table 5: Comparison with other attention methods.

Method      mAP_all  mAP_known  mAP_novel
r = 25%      11.56     32.45       9.92
r = 50%      11.79     33.36      10.17
r = 75%      11.63     32.79      10.01
Ours (RDL)   12.85     35.57      11.24

Table 6: Comparison of reliability-based dynamic learning with fixed criteria.

The results confirm the importance of using selected pseudo-labels based on reliability criteria that are adjusted for each class throughout the training process. Table 7 presents the results of our method using various class-agnostic instance mask generation models, including Mask R-CNN (He et al. 2017), OLN (Kim et al. 2022), LDET (Saito et al. 2022), UDOS (Kalluri et al. 2024), and GGN (Wang et al. 2022).

Method       mAP_all  mAP_known  mAP_novel
Mask R-CNN    11.82     32.74      10.13
OLN           12.07     33.56      10.52
LDET          12.36     34.63      10.74
UDOS          12.54     35.12      11.05
GGN           12.85     35.57      11.24

Table 7: Comparison of class-agnostic instance mask generation methods.

The results demonstrate that our method consistently outperforms the previous state-of-the-art method (Fomenko et al. 2022), regardless of the choice of mask generation model.

5 Conclusion
Towards open-world instance segmentation, we present a novel GCD method in instance segmentation. To address the imbalanced distribution of instances, we introduce the instance-wise temperature assignment method as well as class-wise and dynamic reliability criteria. The former aims to improve the embedding space for class discovery, while the criteria are designed to effectively utilize pseudo-labels from the GCD model. Additionally, we propose an efficient soft attention module. The experimental results in two settings demonstrate that the proposed method outperforms previous methods by effectively discovering novel classes and segmenting instances of both known and novel categories.
Regarding limitations, this work assumes that the labeled and unlabeled datasets are fully available from the beginning. Thus, it is suboptimal for scenarios where data arrives sequentially, such as in robot navigation. Additionally, following most previous works, we assume prior knowledge of the total number of classes.

Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00252434).

References
An, W.; Tian, F.; Zheng, Q.; Ding, W.; Wang, Q.; and Chen, P. 2023. Generalized Category Discovery with Decoupled Prototypical Network. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11): 12527–12535.
Arpit, D.; Jastrzebski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M. S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; and Lacoste-Julien, S. 2017. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 233–242. PMLR.
Bai, J.; Liu, Z.; Wang, H.; Chen, R.; Mu, L.; Li, X.; Zhou, J. T.; Feng, Y.; Wu, J.; and Hu, H. 2023. Towards Distribution-Agnostic Generalized Category Discovery. In Advances in Neural Information Processing Systems, volume 36, 58625–58647.
Bellver, M.; Salvador, A.; Torres, J.; and Giro-i-Nieto, X. 2019. Budget-aware Semi-Supervised Semantic and Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Berrada, T.; Couprie, C.; Alahari, K.; and Verbeek, J. 2024. Guided Distillation for Semi-Supervised Instance Segmentation. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 464–472.
Cao, K.; Brbic, M.; and Leskovec, J. 2022. Open-World Semi-Supervised Learning. In International Conference on Learning Representations.
Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021.
Emerging Properties in Self-Supervised Vision Transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 9630–9640.
Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention Mask Transformer for Universal Image Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1280–1289.
Cuturi, M. 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems, volume 26.
Fini, E.; Sangineto, E.; Lathuilière, S.; Zhong, Z.; Nabi, M.; and Ricci, E. 2021. A Unified Objective for Novel Class Discovery. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 9264–9272.
Fomenko, V.; Elezi, I.; Ramanan, D.; Leal-Taixé, L.; and Osep, A. 2022. Learning to Discover and Detect Objects. In Advances in Neural Information Processing Systems, volume 35, 8746–8759.
Gupta, A.; Dollár, P.; and Girshick, R. 2019. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5351–5359.
Han, K.; Vedaldi, A.; and Zisserman, A. 2019. Learning to Discover Novel Visual Categories via Deep Transfer Clustering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 8400–8408.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9726–9735.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), 2980–2988.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
Jain, J.; Li, J.; Chiu, M.; Hassani, A.; Orlov, N.; and Shi, H. 2023. OneFormer: One Transformer to Rule Universal Image Segmentation.
In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2989–2998.
Kalluri, T.; Wang, W.; Wang, H.; Chandraker, M.; Torresani, L.; and Tran, D. 2024. Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2693–2703.
Kim, D.; Lin, T.-Y.; Angelova, A.; Kweon, I. S.; and Kuo, W. 2022. Learning Open-World Object Proposals Without Learning to Classify. IEEE Robotics and Automation Letters, 7(2): 5453–5460.
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1): 32–73.
Kuhn, H. W. 1955. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2): 83–97.
Kukleva, A.; Böhle, M.; Schiele, B.; Kuehne, H.; and Rupprecht, C. 2023. Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data. In International Conference on Learning Representations.
Lan, S.; Yu, Z.; Choy, C.; Radhakrishnan, S.; Liu, G.; Zhu, Y.; Davis, L. S.; and Anandkumar, A. 2021. DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 3386–3396.
Li, B.; Yao, Y.; Tan, J.; Zhang, G.; Yu, F.; Lu, J.; and Luo, Y. 2022. Equalized Focal Loss for Dense Long-Tailed Object Detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6980–6989.
Li, Z.; Dai, B.; Simsek, F.; Meinel, C.; and Yang, H. 2023. ImbaGCD: Imbalanced Generalized Category Discovery. arXiv:2401.05353.
Li, Z.; Meinel, C.; and Yang, H. 2023. Generalized Categories Discovery for Long-tailed Recognition. arXiv:2401.05352.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017.
Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV), 2999–3007.
MacQueen, J.; et al. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, 281–297. Oakland, CA, USA.
Pu, N.; Zhong, Z.; and Sebe, N. 2023. Dynamic Conceptional Contrastive Learning for Generalized Category Discovery. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7579–7588.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, volume 28.
Riz, L.; Saltori, C.; Ricci, E.; and Poiesi, F. 2023. Novel Class Discovery for 3D Point Cloud Semantic Segmentation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9393–9402.
Rosvall, M.; and Bergstrom, C. T. 2008. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4): 1118–1123.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3): 211–252.
Saito, K.; Hu, P.; Darrell, T.; and Saenko, K. 2022. Learning to Detect Every Thing in an Open World. In Computer Vision – ECCV 2022, 268–284. Cham: Springer Nature Switzerland. ISBN 978-3-031-20053-3.
Song, C.; Huang, Y.; Ouyang, W.; and Wang, L. 2018. Mask-Guided Contrastive Attention Model for Person Re-identification. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1179–1188.
Vaze, S.; Han, K.; Vedaldi, A.; and Zisserman, A. 2022. Generalized Category Discovery. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7482–7491.
Vaze, S.; Vedaldi, A.; and Zisserman, A. 2023. No Representation Rules Them All in Category Discovery. In Advances in Neural Information Processing Systems, volume 36, 19962–19989.
Wang, J.; Yu, X.; and Gao, Y. 2021. Mask Guided Attention For Fine-Grained Patchy Image Classification. In 2021 IEEE International Conference on Image Processing (ICIP), 1044–1048.
Wang, W.; Feiszli, M.; Wang, H.; Malik, J.; and Tran, D. 2022. Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4412–4422.
Wang, X.; Zhang, R.; Kong, T.; Li, L.; and Shen, C. 2020. SOLOv2: Dynamic and Fast Instance Segmentation. In Advances in Neural Information Processing Systems, volume 33, 17721–17732.
Wang, Z.; Li, Y.; and Wang, S. 2022. Noisy Boundaries: Lemon or Lemonade for Semi-supervised Instance Segmentation? In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16805–16814.
Wen, X.; Zhao, B.; and Qi, X. 2023. Parametric Classification for Generalized Category Discovery: A Baseline Study. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 16544–16554.
Weng, Z.; Ogut, M. G.; Limonchik, S.; and Yeung, S. 2021. Unsupervised Discovery of the Long-Tail in Instance Segmentation Using Hierarchical Self-Supervision. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2603–2612.
Woo, S.; Park, J.; Lee, J.-Y.; and Kweon, I. S. 2018. CBAM: Convolutional Block Attention Module. In Computer Vision – ECCV 2018, 3–19. Cham: Springer International Publishing. ISBN 978-3-030-01234-2.
Yang, F.; Sun, Q.; Jin, H.; and Zhou, Z. 2020. Superpixel Segmentation With Fully Convolutional Networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13961–13970.
Yang, L.; Li, H.; Wu, Q.; Meng, F.; Qiu, H.; and Xu, L. 2023. Bias-Correction Feature Learner for Semi-Supervised Instance Segmentation.
IEEE Transactions on Multimedia, 25: 5852–5863.
Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; and Gao, Y. 2022. ST++: Make Self-training Work Better for Semi-supervised Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4258–4267.
Zhang, C.; Xu, R.; and He, X. 2023. Novel Class Discovery for Long-tailed Recognition. Transactions on Machine Learning Research.
Zhang, D.; Nan, F.; Wei, X.; Li, S.-W.; Zhu, H.; McKeown, K.; Nallapati, R.; Arnold, A. O.; and Xiang, B. 2021. Supporting Clustering with Contrastive Learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5419–5430. Online: Association for Computational Linguistics.
Zhang, S.; Khan, S.; Shen, Z.; Naseer, M.; Chen, G.; and Khan, F. S. 2023. PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3479–3488.
Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; and Chen, Z. 2020. Relation-Aware Global Attention for Person Re-Identification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3183–3192.
Zhao, Y.; Zhong, Z.; Sebe, N.; and Lee, G. H. 2022. Novel Class Discovery in Semantic Segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4330–4339.