K-Net: Towards Unified Image Segmentation

Wenwei Zhang1 Jiangmiao Pang2,4 Kai Chen3,4 Chen Change Loy1
1S-Lab, Nanyang Technological University 2CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong 3SenseTime Research 4Shanghai AI Laboratory
{wenwei001, ccloy}@ntu.edu.sg pangjiangmiao@gmail.com chenkai@sensetime.com

Abstract

Semantic, instance, and panoptic segmentation have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently by a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results of panoptic segmentation on the MS COCO test-dev split and semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO, with 60%-90% faster inference speed. Code and models will be released at https://github.com/ZwwWayne/K-Net/.

1 Introduction

Image segmentation aims at finding groups of coherent pixels [48]. There are different notions of groups, such as semantic categories (e.g., car, dog, cat) or instances (e.g., objects that coexist in the same image). Based on the different segmentation targets, the tasks are termed differently, i.e., semantic and instance segmentation, respectively. There are also pioneering attempts [19,29,50,62] to combine the two segmentation tasks for more comprehensive scene understanding.

Grouping pixels according to semantic categories can be formulated as a dense classification problem. As shown in Fig. 1-(a), recent methods directly learn a set of convolutional kernels (namely semantic kernels in this paper) of pre-defined categories and use them to classify pixels [40] or regions [22]. Such a framework is elegant and straightforward. However, extending this notion to instance segmentation is non-trivial given the varying number of instances across images. Consequently, instance segmentation is tackled by more complicated frameworks with additional steps such as object detection [22] or embedding generation [44]. These methods rely on extra components whose accuracy must be guaranteed to a reasonable extent, or demand complex post-processing such as Non-Maximum Suppression (NMS) and pixel grouping. Recent approaches [34,49,55] generate kernels from dense feature grids and then select kernels for segmentation to simplify the frameworks. Nonetheless, since they build upon dense grids to enumerate and select kernels, these methods still rely on hand-crafted post-processing to eliminate masks or kernels of duplicated instances.
Figure 1: Semantic segmentation (a), instance segmentation (b), and panoptic segmentation (c) are unified by a common framework in this paper. In conventional semantic segmentation methods, each convolutional kernel corresponds to a semantic class. Our framework extends this notion to make each kernel correspond to either a potential instance or a semantic class.

In this paper, we make the first attempt to formulate a unified and effective framework to bridge the seemingly different image segmentation tasks (semantic, instance, and panoptic) through the notion of kernels. Our method is dubbed K-Net ("K" stands for kernels). It begins with a set of convolutional kernels that are randomly initialized, and learns the kernels in accordance with the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities (Fig. 1-(b)). A simple combination of semantic kernels and instance kernels naturally allows panoptic segmentation (Fig. 1-(c)). In the forward pass, the kernels perform convolution on the image features to obtain the corresponding segmentation predictions.

The versatility and simplicity of K-Net are made possible through two designs. First, we formulate K-Net so that it dynamically updates the kernels to make them conditional on their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. By applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy applies universally to kernels for all the segmentation tasks. Second, inspired by recent advances in object detection [4], we adopt the bipartite matching strategy [47] to assign learning targets to each kernel. This training approach is advantageous over conventional training strategies [37,46] as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally NMS-free and box-free, which is appealing to real-time applications.

To show the effectiveness of the proposed unified framework on different segmentation tasks, we conduct extensive experiments on the COCO dataset [38] for panoptic and instance segmentation, and the ADE20K dataset [70] for semantic segmentation. Without bells and whistles, K-Net surpasses all previous state-of-the-art single-model results on panoptic (54.6% PQ) and semantic segmentation benchmarks (54.3% mIoU) and achieves competitive performance compared to the more expensive Cascade Mask R-CNN [3]. We further analyze the learned kernels and find that instance kernels tend to specialize on objects of similar sizes at specific locations.

2 Related Work

Semantic Segmentation. Contemporary semantic segmentation approaches typically build upon a fully convolutional network (FCN) [40] and treat the task as a dense classification problem. Based on this framework, many studies focus on enhancing the feature representation through dilated convolution [7-9], pyramid pooling [57,67], context representations [64,65], and attention mechanisms [32,63,68].
Recently, SETR [69] reformulates the task as a sequence-to-sequence prediction task by using a vision transformer [17]. Despite the different model architectures, the approaches above share the common notion of making predictions via static semantic kernels. Differently, the proposed K-Net makes the kernels dynamic and conditional on their activations in the image.

Instance Segmentation. There are two representative frameworks for instance segmentation: top-down and bottom-up approaches. Top-down approaches [2,15,22,27,33] first detect accurate bounding boxes and generate a mask for each box. Mask R-CNN [22] simplifies this pipeline by directly adding an FCN [40] in Faster R-CNN [46]. Extensions of this framework add a mask scoring branch [24] or adopt a cascade structure [3,5]. Bottom-up methods [1,30,43,44] first perform semantic segmentation and then group pixels into different instances. These methods usually require a grouping process, and their performance often appears inferior to top-down approaches on popular benchmarks [14,38]. Unlike all these works, K-Net performs segmentation and instance separation simultaneously by constraining each kernel to predict one mask at a time for one object. Therefore, K-Net needs neither bounding box detection nor a grouping process. It focuses on refining kernels rather than refining bounding boxes, different from previous cascade methods [3,5]. Recent attempts [10,54,55,58] perform instance segmentation in one stage without involving detection or embedding generation. These methods apply dense mask prediction using dense sliding windows [10] or dense grids [54]. Some studies explore polar representation [58], contours [45], and explicit shape representations [60] of instance masks. These methods all rely on NMS to eliminate duplicated instance masks, which hinders end-to-end training. The heuristic process is also unfavorable for real-time applications. Instance kernels in K-Net are trained in an end-to-end manner with bipartite matching and a set prediction loss; thus, our method does not need NMS.

Panoptic Segmentation. Panoptic segmentation [29] combines instance and semantic segmentation to provide a richer understanding of the scene. Different strategies have been proposed to cope with the instance segmentation task. Mainstream frameworks add a semantic segmentation branch [28,31,55,59] on an instance segmentation framework or adopt different pixel grouping strategies [11,61] based on a semantic segmentation method. Recently, DETR [4] tries to simplify the framework with a transformer [51] but needs to predict boxes around both stuff and thing classes in training for assigning learning targets. These methods either need object detection or embedding generation to separate instances, and thus do not reconcile instance and semantic segmentation in a unified framework. By contrast, K-Net partitions an image into semantic regions by semantic kernels and object instances by instance kernels through a unified perspective of kernels. Concurrent to K-Net, some recent attempts [12,35,52] apply the Transformer [51] to panoptic segmentation. MaskFormer [12] reformulates semantic segmentation as a mask classification task, which is commonly adopted in instance-level segmentation. From an inverse perspective, K-Net tries to simplify instance and panoptic segmentation by letting a kernel predict the mask of only one instance or one semantic category, which is the essential design in semantic segmentation.
In contrast to K-Net, which directly uses learned kernels to predict masks and progressively refines the masks and kernels, MaX-DeepLab [52] and MaskFormer [12] rely on queries and the Transformer [51] to produce dynamic kernels for the final mask prediction.

Dynamic Kernels. Convolution kernels are usually static, i.e., agnostic to the inputs, and thus have limited representation ability. Previous works [16,18,25,26,71] explore different kinds of dynamic kernels to improve the flexibility and performance of models. Some semantic segmentation methods apply dynamic kernels to improve the model representation with enlarged receptive fields [56] or multi-scale contexts [20]. Differently, K-Net uses dynamic kernels to improve the discriminative capability of the segmentation kernels rather than the input features of the kernels. Recent studies apply dynamic kernels to generate instance [49,55] or panoptic [34] segmentation predictions directly. Because these methods generate kernels from dense feature maps, enumerate kernels at each position, and filter out kernels of background regions, they either still rely on NMS [49,55] or need extra kernel fusion [34] to eliminate kernels or masks of duplicated objects. Instead of being generated from dense grids, the kernels in K-Net are a set of learnable parameters updated by their corresponding contents in the image. K-Net does not need to handle duplicated kernels because its kernels learn to focus on different regions of the image during training, constrained by the bipartite matching strategy that builds a one-to-one mapping between the kernels and instances.

3 Methodology

We consider various segmentation tasks through a unified perspective of kernels. The proposed K-Net uses a set of kernels to assign each pixel to either a potential instance or a semantic class (Sec. 3.1). To enhance the discriminative capability of kernels, we contribute a way to update the static kernels by the contents in their partitioned pixel groups (Sec. 3.2). We adopt the bipartite matching strategy to train instance kernels in an end-to-end manner (Sec. 3.3). K-Net can be applied seamlessly to semantic, instance, and panoptic segmentation as described in Sec. 3.4.

3.1 K-Net

Despite the different definitions of a meaningful group, all segmentation tasks essentially assign each pixel to one of the pre-defined meaningful groups [48]. As the number of groups in an image is typically assumed finite, we can set the maximum group number of a segmentation task as N. For example, there are N pre-defined semantic classes for semantic segmentation, or at most N objects in an image for instance segmentation. For panoptic segmentation, N is the total number of stuff classes and objects in an image. Therefore, we can use N kernels to partition an image into N groups, where each kernel is responsible for finding the pixels belonging to its corresponding group. Specifically, given an input feature map $F \in \mathbb{R}^{B \times C \times H \times W}$ of B images, produced by a deep neural network, we only need N kernels $K \in \mathbb{R}^{N \times C}$ to perform convolution with F to obtain the corresponding segmentation prediction $M \in \mathbb{R}^{B \times N \times H \times W}$ as

$M = \sigma(K \ast F)$,   (1)

where C, H, and W are the number of channels, height, and width of the feature map, respectively. The activation function σ can be the softmax function if we want to assign each pixel to only one of the kernels (usually used in semantic segmentation), or the sigmoid function if we allow one pixel to belong to multiple masks, which results in N binary masks after setting a threshold such as 0.5 on the activation map (usually used in instance segmentation).
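To make this formulation concrete, below is a minimal PyTorch sketch of Eq. (1); the tensor shapes follow the notation above, while the variable names and example sizes are illustrative assumptions rather than the released implementation.

```python
import torch

B, C, H, W = 2, 256, 64, 64      # batch size, channels, height, width of F
N = 100                          # number of kernels (groups)

feat = torch.randn(B, C, H, W)   # feature map F from the backbone/neck
kernels = torch.randn(N, C)      # learnable kernels K, one per group

# K * F: each kernel acts as a 1x1 convolution over the feature map.
logits = torch.einsum('nc,bchw->bnhw', kernels, feat)   # (B, N, H, W)

# Semantic segmentation: softmax over kernels assigns each pixel to one group.
sem_masks = logits.softmax(dim=1)

# Instance segmentation: per-kernel sigmoid yields N binary masks after thresholding.
inst_masks = logits.sigmoid() > 0.5
```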
This formulation has already dominated semantic segmentation for years [8,40,67]. In semantic segmentation, each kernel is responsible for finding all pixels of a specific class across images, whereas in instance segmentation, each pixel group corresponds to an object. However, previous methods separate instances by extra steps [22,30,44] instead of by kernels. This paper is the first study that explores whether the notion of kernels in semantic segmentation is equally applicable to instance segmentation, and more generally panoptic segmentation. To separate instances by kernels, each kernel in K-Net segments at most one object in an image (Fig. 1-(b)). In this way, K-Net distinguishes instances and performs segmentation simultaneously, achieving instance segmentation in one pass without extra steps. For simplicity, we call these kernels semantic kernels and instance kernels in this paper for semantic and instance segmentation, respectively. A simple combination of instance kernels and semantic kernels can naturally perform panoptic segmentation, which assigns each pixel either to an instance ID or to a stuff class (Fig. 1-(c)).

3.2 Group-Aware Kernels

Figure 2: Kernel Update Head.

Despite the simplicity of K-Net, separating instances directly by kernels is non-trivial, because instance kernels need to discriminate objects that vary in scale and appearance within and across images. Without a common and explicit characteristic like semantic categories, the instance kernels need stronger discriminative ability than static kernels. To overcome this challenge, we contribute an approach that makes the kernels conditional on their corresponding pixel groups, through a kernel update head, as shown in Fig. 2. The kernel update head $f_i$ contains three key steps: group feature assembling, adaptive kernel update, and kernel interaction. Firstly, the group feature $F^K$ for each pixel group is assembled using the mask prediction $M_{i-1}$. As it is the content of each individual group that distinguishes it from the others, $F^K$ is used to update its corresponding kernel $K_{i-1}$ adaptively. After that, the kernels interact with each other to comprehensively model the image context. Finally, the obtained group-aware kernels $K_i$ perform convolution over the feature map F to obtain a more accurate mask prediction $M_i$. As shown in Fig. 3, this process can be conducted iteratively because a finer partition usually reduces the noise in group features, which results in more discriminative kernels. This process is formulated as

$K_i, M_i = f_i(M_{i-1}, K_{i-1}, F)$.   (2)

Figure 3: K-Net for panoptic segmentation. A set of learned kernels first performs convolution with the feature map F to predict masks $M_0$. Then the kernel update head takes the mask predictions $M_0$, learned kernels $K_0$, and feature map F as input and produces class predictions, group-aware (dynamic) kernels, and mask predictions. The produced mask predictions, dynamic kernels, and feature map F are sent to the next kernel update head. This process is performed iteratively to progressively refine the kernels and the mask predictions.
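The iterative refinement of Eq. (2) can be sketched as a simple loop over kernel update heads. The snippet below is a hedged, self-contained illustration: `dummy_kernel_update_head` is a placeholder standing in for the head detailed in the next subsection, and all names and sizes are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

def dummy_kernel_update_head(masks, kernels, feat):
    """Placeholder for the kernel update head f_i of Sec. 3.2 (sketched later);
    it simply returns its inputs so this loop runs on its own."""
    return kernels, masks

class KNetDecoder(nn.Module):
    """Sketch of K_i, M_i = f_i(M_{i-1}, K_{i-1}, F) applied for several rounds."""

    def __init__(self, num_kernels=100, channels=256, num_stages=3):
        super().__init__()
        # Learned static kernels K_0, randomly initialized.
        self.kernels = nn.Parameter(torch.randn(num_kernels, channels))
        self.num_stages = num_stages

    def forward(self, feat):                                    # feat: (B, C, H, W)
        B = feat.size(0)
        kernels = self.kernels[None].expand(B, -1, -1)          # (B, N, C)
        masks = torch.einsum('bnc,bchw->bnhw', kernels, feat)   # initial masks M_0
        predictions = [masks]
        for _ in range(self.num_stages):
            # K_i, M_i = f_i(M_{i-1}, K_{i-1}, F)
            kernels, masks = dummy_kernel_update_head(masks, kernels, feat)
            predictions.append(masks)
        return predictions

decoder = KNetDecoder()
outs = decoder(torch.randn(2, 256, 64, 64))   # list of mask predictions per stage
```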
Notably, the kernel update head with the iterative refinement is universal as it does not rely on the characteristics of the kernels. Thus, it can enhance not only instance kernels but also semantic kernels. We detail the three steps as follows.

Group Feature Assembling. The kernel update head first assembles the features of each group, which will be adopted later to make the kernels group-aware. As the mask of each kernel in $M_{i-1}$ essentially defines whether or not a pixel belongs to the kernel's related group, we can assemble the group feature $F^K$ for $K_{i-1}$ by multiplying the feature map F with $M_{i-1}$ as

$F^K = \sum_{u}^{H} \sum_{v}^{W} M_{i-1}(u, v) \cdot F(u, v), \quad F^K \in \mathbb{R}^{B \times N \times C}$,   (3)

where B is the batch size, N is the number of kernels, and C is the number of channels.

Adaptive Kernel Update. The kernel update head then updates the kernels using the obtained $F^K$ to improve the representation ability of the kernels. As the mask $M_{i-1}$ may not be accurate, which is commonly the case, the feature of each group may also contain noise introduced by pixels from other groups. To reduce the adverse effect of this noise in the group features, we devise an adaptive kernel update strategy. Specifically, we first conduct element-wise multiplication between $F^K$ and $K_{i-1}$ as

$F^G = \phi_1(F^K) \otimes \phi_2(K_{i-1}), \quad F^G \in \mathbb{R}^{B \times N \times C}$,   (4)

where $\phi_1$ and $\phi_2$ are linear transformations and $\otimes$ denotes element-wise multiplication. Then the head learns two gates, $G^F$ and $G^K$, which adapt the contributions from $F^K$ and $K_{i-1}$ to the updated kernels $\tilde{K}$, respectively. The formulation is

$G^K = \sigma(\psi_1(F^G)), \quad G^F = \sigma(\psi_2(F^G)), \quad \tilde{K} = G^F \otimes \psi_3(F^K) + G^K \otimes \psi_4(K_{i-1})$,   (5)

where $\psi_n, n = 1, \dots, 4$ are different fully connected (FC) layers followed by Layer Norm (LN), and σ is the sigmoid function. $\tilde{K}$ is then used in kernel interaction. The gates learned here play a role like the self-attention mechanism in the Transformer [51], whose output is computed as a weighted summation of the values. In the Transformer, the weight assigned to each value is usually computed by a compatibility function, i.e., a dot-product of the queries and keys. Similarly, adaptive kernel update essentially performs a weighted summation of the kernel features $K_{i-1}$ and group features $F^G$. Their weights $G^K$ and $G^F$ are computed by element-wise multiplication, which can be regarded as another kind of compatibility function.

Kernel Interaction. Interaction among kernels is important to inform each kernel with contextual information from other groups. Such information allows the kernels to implicitly model and exploit the relationships between groups of an image. To this end, we add a kernel interaction process to obtain the new kernels $K_i$ given the updated kernels $\tilde{K}$. Here we simply adopt Multi-Head Attention [51] followed by a Feed-Forward Neural Network, which has been proven effective in previous works [4,51]. The output $K_i$ of kernel interaction is then used to generate a new mask prediction through $M_i = g_i(K_i) \ast F$, where $g_i$ is an FC-LN-ReLU layer followed by an FC layer. $K_i$ will also be used to predict classification scores in instance and panoptic segmentation.
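Putting the three steps together, here is a hedged PyTorch sketch of one kernel update head. The specific layer sizes, the sigmoid applied to the incoming masks, and the use of `nn.MultiheadAttention` for kernel interaction are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class KernelUpdateHead(nn.Module):
    """Sketch of one kernel update head f_i (Eqs. 3-5) under assumed layer choices."""

    def __init__(self, channels=256, num_heads=8, num_classes=80):
        super().__init__()
        self.phi1 = nn.Linear(channels, channels)      # phi_1 in Eq. (4)
        self.phi2 = nn.Linear(channels, channels)      # phi_2 in Eq. (4)
        # psi_1 .. psi_4 in Eq. (5): FC layers followed by LayerNorm.
        self.psi = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels), nn.LayerNorm(channels))
            for _ in range(4)])
        # Kernel interaction: multi-head attention among the N kernels + FFN.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                 nn.Linear(channels, channels))
        # g_i: FC-LN-ReLU followed by an FC layer, mapping kernels to mask kernels.
        self.to_mask = nn.Sequential(nn.Linear(channels, channels),
                                     nn.LayerNorm(channels), nn.ReLU(),
                                     nn.Linear(channels, channels))
        self.to_cls = nn.Linear(channels, num_classes)  # classification scores

    def forward(self, masks, kernels, feat):
        # masks: (B, N, H, W); kernels: (B, N, C); feat: (B, C, H, W)
        # 1) Group feature assembling, Eq. (3); sigmoid on masks is an assumption.
        group_feat = torch.einsum('bnhw,bchw->bnc', masks.sigmoid(), feat)
        # 2) Adaptive kernel update, Eqs. (4)-(5).
        fg = self.phi1(group_feat) * self.phi2(kernels)            # F^G
        gate_k = torch.sigmoid(self.psi[0](fg))                    # G^K
        gate_f = torch.sigmoid(self.psi[1](fg))                    # G^F
        new_k = gate_f * self.psi[2](group_feat) + gate_k * self.psi[3](kernels)
        # 3) Kernel interaction among the N kernels.
        new_k = new_k + self.attn(new_k, new_k, new_k)[0]
        new_k = new_k + self.ffn(new_k)
        # New mask prediction M_i = g_i(K_i) * F, plus class predictions.
        new_masks = torch.einsum('bnc,bchw->bnhw', self.to_mask(new_k), feat)
        return new_k, new_masks, self.to_cls(new_k)
```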
3.3 Training Instance Kernels

While each semantic kernel can be assigned a constant semantic class, there is no explicit rule to assign a varying number of targets to instance kernels. In this work, we adopt the bipartite matching strategy and set prediction loss [4,47] to train instance kernels in an end-to-end manner. Different from previous works [4,47] that rely on boxes, the learning of instance kernels is purely mask-driven because the inference of K-Net is naturally box-free.

Loss Functions. The loss function for instance kernels is written as $L_K = \lambda_{cls} L_{cls} + \lambda_{ce} L_{ce} + \lambda_{dice} L_{dice}$, where $L_{cls}$ is the Focal loss [37] for classification, and $L_{ce}$ and $L_{dice}$ are the Cross-Entropy (CE) loss and Dice loss [42] for segmentation, respectively. Given that each instance only occupies a small region in an image, the CE loss is insufficient to handle the highly imbalanced learning targets of masks. Therefore, we apply the Dice loss [42] to handle this issue following previous works [49,54,55].

Mask-based Hungarian Assignment. We adopt the Hungarian assignment strategy used in [4,47] for target assignment to train K-Net in an end-to-end manner. It builds a one-to-one mapping between the predicted instance masks and the ground-truth (GT) instances based on the matching costs. The matching cost is calculated between the mask and GT pairs in a similar manner as the training loss.

3.4 Applications to Various Segmentation Tasks

Panoptic Segmentation. For panoptic segmentation, the kernels are composed of instance kernels $K^{ins}_0$ and semantic kernels $K^{sem}_0$, as shown in Fig. 3. We adopt semantic FPN [28] for producing a high-resolution feature map F, except that we add the positional encoding used in [4,51,72] to enhance the positional information. Specifically, given the feature maps P2, P3, P4, P5 produced by FPN [36], the positional encoding is computed based on the feature map size of P5 and added to P5. Then semantic FPN [28] is used to produce the final feature map. As semantic segmentation mainly relies on semantic information for per-pixel classification, while instance segmentation prefers accurate localization information to separate instances, we use two separate branches to generate the features $F^{ins}$ and $F^{sem}$, which perform convolution with $K^{ins}_0$ and $K^{sem}_0$ to generate the instance and semantic masks $M^{ins}_0$ and $M^{sem}_0$, respectively. Notably, producing thing and stuff masks from different branches at the beginning is not strictly necessary for reasonable performance, but such a design is consistent with previous practices [28,55] and empirically yields better performance (about 1% PQ). We then construct $M_0$, $K_0$, and F as the inputs of the kernel update head to dynamically update the kernels and refine the panoptic mask prediction. Because things are already separated by instance masks in $M^{ins}_0$, while $M^{sem}_0$ contains the semantic masks of both things and stuff, we select $M^{st}_0$, the masks of stuff categories from $M^{sem}_0$, and directly concatenate it with $M^{ins}_0$ to form the panoptic mask prediction $M_0$. For a similar reason, we only select and concatenate the kernels of stuff classes in $K^{sem}_0$ with $K^{ins}_0$ to form the panoptic kernels $K_0$. To exploit the complementary semantic information in $F^{sem}$ and localization information in $F^{ins}$, we add them together to obtain F as the input feature map of the kernel update head. With $M_0$, $K_0$, and F, the kernel update head $f_1$ can produce group-aware kernels $K_1$ and masks $M_1$. The kernels and masks are then iteratively updated S times, and finally we obtain the mask prediction $M_S$. To produce the final panoptic segmentation results, we paste thing and stuff masks in a mixed order following MaskFormer [12]. We also find it necessary in K-Net to first sort the pasting order of masks by their classification scores and further filter out low-confidence mask predictions. Such a method empirically performs better (about 1% PQ) than the previous strategy of pasting thing and stuff masks separately [28,34].
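As an illustration of how the initial panoptic kernels and masks described above might be assembled, here is a hedged sketch; the function name, the explicit list of stuff-class indices, and the shapes are assumptions made for clarity rather than the released code.

```python
import torch

def build_panoptic_inputs(inst_kernels, sem_kernels, feat_ins, feat_sem, stuff_ids):
    """Assemble the panoptic kernels K_0, masks M_0, and feature map F.

    inst_kernels: (N_ins, C) instance kernels; sem_kernels: (N_cls, C) semantic kernels.
    feat_ins, feat_sem: (B, C, H, W) instance/semantic branch features.
    stuff_ids: indices of the stuff classes within the semantic kernels.
    """
    B = feat_ins.size(0)
    feat = feat_ins + feat_sem                                   # complementary features -> F
    k_ins = inst_kernels[None].expand(B, -1, -1)                 # (B, N_ins, C)
    k_stuff = sem_kernels[stuff_ids][None].expand(B, -1, -1)     # stuff kernels only
    m_ins = torch.einsum('bnc,bchw->bnhw', k_ins, feat_ins)      # instance masks M_0^ins
    m_stuff = torch.einsum('bnc,bchw->bnhw', k_stuff, feat_sem)  # stuff masks M_0^st
    kernels0 = torch.cat([k_ins, k_stuff], dim=1)                # panoptic kernels K_0
    masks0 = torch.cat([m_ins, m_stuff], dim=1)                  # panoptic masks M_0
    return kernels0, masks0, feat
```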
Instance Segmentation. In a similar framework, we simply remove the concatenation process of kernels and masks to perform instance segmentation. We do not remove the semantic segmentation branch, as the semantic information is still complementary for instance segmentation. Note that in this case, the semantic segmentation branch does not use extra annotations; the ground truth of semantic segmentation is built by converting instance masks to their corresponding class labels.

Semantic Segmentation. As K-Net does not rely on a specific architecture of model representation, K-Net can perform semantic segmentation by simply appending its kernel update head to any existing semantic segmentation method [8,40,57,67] that relies on semantic kernels.

Table 1: Comparisons with state-of-the-art panoptic segmentation methods on the COCO dataset
Framework Backbone Box-free NMS-free Epochs PQ PQTh PQSt
val
Panoptic-DeepLab [11] Xception-71 1000 39.7 43.9 33.2
Panoptic FPN [28] R50-FPN 36 41.5 48.5 31.1
SOLOv2 [55] R50-FPN 36 42.1 49.6 30.7
DETR [4] R50 300+25 43.4 48.2 36.3
Unifying [31] R50-FPN 27 43.4 48.6 35.5
Panoptic FCN [34] R50-FPN 36 43.6 49.3 35.0
K-Net R50-FPN 36 47.1 51.7 40.3
K-Net R101-FPN 36 49.6 55.1 41.4
K-Net R101-FPN-DCN 36 48.3 54.0 39.7
K-Net Swin-L [39] 36 54.6 60.2 46.0
test-dev
Panoptic-DeepLab Xception-71 1000 41.4 45.1 35.9
Panoptic FPN R101-FPN 36 43.5 50.8 32.5
Panoptic FCN R101-FPN 36 45.5 51.4 36.4
DETR R101 300+25 46.0 - -
UPSNet [59] R101-FPN-DCN 36 46.6 53.2 36.7
Unifying [31] R101-FPN-DCN 27 47.2 53.5 37.7
K-Net R101-FPN 36 47.0 52.8 38.2
K-Net R101-FPN-DCN 36 48.3 54.0 39.7
MaX-DeepLab-L [52] Max-L 54 51.3 57.2 42.4
MaskFormer [12] Swin-L [39] 300 53.3 59.1 44.5
Panoptic SegFormer [35] PVTv2-B5 [53] 50 54.4 61.1 44.3
K-Net Swin-L 36 55.2 61.2 46.2

4 Experiments

Dataset and Metrics. For panoptic and instance segmentation, we perform experiments on the challenging COCO dataset [38]. All models are trained on the train2017 split and evaluated on the val2017 split. The panoptic segmentation results are evaluated by the PQ metric [29]. We also report the performance on thing and stuff classes, denoted as PQTh and PQSt, respectively, for thorough evaluation. The instance segmentation results are evaluated by mask AP [38]. The AP for small, medium, and large objects are denoted as APs, APm, and APl, respectively. The AP at mask IoU thresholds 0.5 and 0.75 are also reported as AP50 and AP75, respectively. For semantic segmentation, we conduct experiments on the challenging ADE20K dataset [70] and report mIoU to evaluate the segmentation quality. All models are trained on the train split and evaluated on the validation split.

Implementation Details. For panoptic and instance segmentation, we implement K-Net with MMDetection [6]. In the ablation study, the model is trained with a batch size of 16 for 12 epochs. The learning rate is 0.0001, and it is decreased by 0.1 after 8 and 11 epochs, respectively. We use AdamW [41] with a weight decay of 0.05. For data augmentation in training, we adopt horizontal flip augmentation with a single scale. The long edge and short edge of images are resized to 1333 and 800, respectively, without changing the aspect ratio. When comparing with other frameworks, we use multi-scale training with a longer schedule (36 epochs) for fair comparisons [6]. The short edge of images is randomly sampled from [640, 800] [21].
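For reference, the COCO ablation schedule described above corresponds roughly to the following optimizer setup; this is a hedged sketch with a placeholder model, not the exact released configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # placeholder standing in for the full K-Net model

# AdamW with initial lr 1e-4 and weight decay 0.05; batch size 16, 12 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
# Learning rate decreased by 0.1 after epochs 8 and 11.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... iterate over COCO with horizontal flip and 1333x800 resizing ...
    optimizer.step()      # placeholder for the per-iteration updates
    scheduler.step()
```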
For semantic segmentation, we implement K-Net with MMSegmentation [13] and train it for 80,000 iterations. As AdamW [41] empirically works better than SGD, we use AdamW with a weight decay of 0.0005 by default on both the baselines and K-Net for a fair comparison. The initial learning rate is 0.0001, and it is decayed by 0.1 after 60,000 and 72,000 iterations, respectively. More details are provided in the appendix.

Model Hyperparameters. In the ablation study, we adopt a ResNet-50 [23] backbone with FPN [36]. For panoptic and instance segmentation, we use $\lambda_{cls} = 2$ for the Focal loss following previous methods [72], and empirically find $\lambda_{seg} = 1$, $\lambda_{ce} = 1$, $\lambda_{dice} = 4$ work best. For efficiency, the default number of instance kernels is 100. For semantic segmentation, N equals the number of classes of the dataset, which is 150 for ADE20K and 133 for the COCO dataset. The number of rounds of iterative kernel update is set to three by default for all segmentation tasks.

4.1 Benchmark Results

Panoptic Segmentation. We first benchmark K-Net against other panoptic segmentation frameworks in Table 1. K-Net surpasses the previous state-of-the-art box-based method [31] and box/NMS-free method [34] by 1.7 and 1.5 PQ on the val split, respectively. On the test-dev split, K-Net with a ResNet-101-FPN backbone even obtains better results than UPSNet [59], which uses Deformable Convolution Network (DCN) [16]. K-Net equipped with DCN surpasses the previous method [31] by 1.1 PQ. Without bells and whistles, K-Net obtains new state-of-the-art single-model performance with Swin Transformer [39] serving as the backbone. We also compare K-Net with the concurrent works MaX-DeepLab [52], MaskFormer [12], and Panoptic SegFormer [35]. K-Net surpasses these methods with the fewest training epochs (36), taking only about 44 GPU days (roughly 2 days and 18 hours with 16 GPUs). Note that only 100 instance kernels and Swin Transformer with window size 7 are used here for efficiency. K-Net could obtain higher performance with more instance kernels (Sec. 4.2), Swin Transformer with window size 12 (used in MaskFormer [12]), as well as an extended training schedule with the aggressive data augmentation used in previous work [4].

Table 2: Comparisons with state-of-the-art instance segmentation methods on the COCO dataset. P. (M) indicates the number of parameters in the model, and the counting unit is million
Method Backbone Box-free NMS-free Epochs AP AP50 AP75 APs APm APl FPS P. (M)
val2017
SOLO [54] R-50-FPN 36 35.8 56.7 37.9 14.3 39.3 53.2 12.7 36.08
Mask R-CNN [22] R-50-FPN 36 37.1 58.5 39.7 18.7 39.6 53.9 17.5 44.17
SOLOv2 [55] R-50-FPN 36 37.5 58.2 40.0 15.8 41.4 56.6 17.7 33.89
CondInst [49] R-50-FPN 36 37.5 58.5 40.1 18.7 41.0 53.3 14.0 46.37
Cascade Mask R-CNN [3] R-50-FPN 36 38.5 59.7 41.8 19.3 41.1 55.6 10.3 77.10
K-Net R-50-FPN 36 37.8 60.3 39.9 16.9 41.2 57.5 21.2 37.26
K-Net-N256 R-50-FPN 36 38.6 60.9 41.0 19.1 42.0 57.7 19.8 37.30
test-dev
SOLO R-50-FPN 72 36.8 58.6 39.0 15.9 39.5 52.1 12.7 36.08
Mask R-CNN R-50-FPN 36 37.4 59.5 40.1 18.6 39.8 51.6 17.5 44.17
CondInst R-50-FPN 36 37.8 59.2 40.4 18.2 40.3 52.7 14.0 46.37
SOLOv2 R-50-FPN 36 38.2 59.3 40.9 16.0 41.2 55.4 17.7 33.89
Cascade Mask R-CNN R-50-FPN 36 38.8 60.4 42.0 19.4 40.9 53.9 10.3 77.10
K-Net R-50-FPN 36 38.4 61.2 40.9 17.4 40.7 56.2 21.2 37.26
K-Net-N256 R-50-FPN 36 39.1 61.7 41.8 18.2 41.4 56.6 19.8 37.30
SOLO R-101-FPN 72 37.8 59.5 40.4 16.4 40.6 54.2 10.7 55.07
Mask R-CNN R-101-FPN 36 38.8 60.8 41.8 19.1 41.2 54.3 14.3 63.16
CondInst R-101-FPN 36 38.9 60.6 41.8 18.8 41.8 54.4 11.0 52.83
SOLOv2 R-101-FPN 36 39.5 60.8 42.6 16.7 43.0 57.4 14.3 65.36
Cascade Mask R-CNN R-101-FPN 36 39.9 61.6 43.3 19.8 42.1 55.7 9.5 96.09
K-Net R-101-FPN 36 40.1 62.8 43.1 18.7 42.7 58.8 16.2 56.25
K-Net-N256 R-101-FPN 36 40.6 63.3 43.7 18.8 43.3 59.0 15.5 56.29

Table 3: Results of K-Net on the ADE20K semantic segmentation dataset
(a) Improvements of K-Net on different architectures
Method Backbone Val mIoU
FCN [40] R50 36.7
FCN + K-Net R50 43.3 (+6.6)
PSPNet [67] R50 42.6
PSPNet + K-Net R50 43.9 (+1.3)
DLab.v3 [8] R50 43.5
DLab.v3 + K-Net R50 44.6 (+1.1)
UperNet [57] R50 42.4
UperNet + K-Net R50 43.6 (+1.2)
UperNet Swin-L 50.6
UperNet + K-Net Swin-L 52.0 (+1.4)
(b) Comparisons with state-of-the-art methods.
Results marked by use larger image sizes
Method Backbone Val mIoU
OCRNet [64] HRNet-W48 44.9
PSPNet [67] R101 45.4
PSANet [68] R101 45.4
DNL [63] R101 45.8
DLab.v3 [8] R101 46.7
DLab.v3+ [9] S-101 [66] 47.3
SETR [69] ViT-L [17] 48.6
UperNet Swin-L 53.5
UperNet + K-Net Swin-L 53.3
UperNet + K-Net Swin-L 54.3

Instance Segmentation. We compare K-Net with other instance segmentation frameworks [10,22,49] in Table 2. More details are provided in the appendix. As the only box-free and NMS-free method, K-Net achieves better performance and faster inference speed than Mask R-CNN [22], SOLO [54], SOLOv2 [55], and CondInst [49], indicated by the higher AP and frames per second (FPS). We adopt 256 instance kernels (K-Net-N256 in the table) to compare with Cascade Mask R-CNN [3]. The performance of K-Net-N256 is on par with Cascade Mask R-CNN [3] but enjoys a 92.2% faster inference speed (19.8 vs. 10.3 FPS). On the COCO test-dev split, K-Net with a ResNet-101-FPN backbone obtains performance that is 0.9 AP better than Mask R-CNN [22]. It also surpasses the previous kernel-based approaches CondInst [49] and SOLOv2 [55] by 1.2 AP and 0.6 AP, respectively. With the ResNet-101-FPN backbone, K-Net with 100 and 256 instance kernels surpasses Cascade Mask R-CNN in both accuracy and speed, by 0.2 AP and 6.7 FPS, and by 0.7 AP and 6 FPS, respectively.

Table 4: Ablation studies of K-Net on instance segmentation
(a) Adaptive Kernel Update (A.K.U.) and Kernel Interaction (K.I.)
A.K.U. K.I. AP AP50 AP75
10.0 18.2 9.6
22.6 37.3 23.5
31.2 52.0 32.4
34.1 55.3 35.7
(b) Positional Encoding (P.E.) and Coordinate Convolution (Coors.)
Coors. P.E. AP AP50 AP75
30.9 51.7 31.6
34.0 55.4 35.6
34.1 55.3 35.7
34.0 55.1 35.8
(c) Number of rounds of kernel update
Stage Number AP AP50 AP75 FPS
1 21.8 37.3 22.1 24.0
2 32.1 52.3 33.5 22.7
3 34.1 55.3 35.7 21.2
4 34.5 56.5 35.7 20.1
5 34.5 56.5 35.9 18.9
(d) Numbers of instance kernels
N AP AP50 AP75 FPS
50 32.7 53.7 34.1 21.6
64 33.6 54.8 35.1 21.6
100 34.1 55.3 35.7 21.2
128 34.3 55.6 35.8 20.7
256 34.7 56.1 36.3 19.8

We also compare the number of parameters of these models in Table 2. Though K-Net does not have the smallest number of parameters, it is more lightweight than Cascade Mask R-CNN, with approximately half the parameters (37.3 M vs. 77.1 M).

Semantic Segmentation. We apply K-Net to existing frameworks [8,40,57,67] that rely on static semantic kernels in Table 3a. K-Net consistently improves different frameworks. Notably, K-Net significantly improves FCN (+6.6 mIoU).
This combination surpasses PSPNet and UperNet by 0.7 and 0.9 mIoU, respectively, and achieves performance comparable with DeepLabv3. Furthermore, the effectiveness of K-Net does not saturate with strong model representations, as it still brings a significant improvement (+1.4 mIoU) over UperNet with Swin Transformer [39]. The results suggest the versatility and effectiveness of K-Net for semantic segmentation. In Table 3b, we further compare K-Net with other state-of-the-art methods [9,69] with test-time augmentation on the validation set. With an input of 512×512, K-Net already achieves state-of-the-art performance. With a larger input of 640×640 following the previous method [39] during training and testing, K-Net with UperNet and Swin Transformer achieves new state-of-the-art single-model performance, which is 0.8 mIoU higher than the previous one.

4.2 Ablation Study on Instance Segmentation

We conduct an ablation study on the COCO instance segmentation dataset to evaluate the effectiveness of K-Net in discriminating instances. The conclusions are also applicable to other segmentation tasks since the design of K-Net is universal across segmentation tasks.

Head Architecture. We verify the components in the kernel update head in Table 4a. The results without A.K.U. are obtained by updating kernels simply by $\tilde{K} = F^K + K_{i-1}$ followed by an FC-LN-ReLU layer. The results indicate that both adaptive kernel update and kernel interaction are necessary for high performance.

Positional Information. We study the necessity of positional information in Table 4b. The results show that positional information is beneficial, and positional encoding [4,51] works slightly better than coordinate convolution. The combination of the two components does not bring additional improvements. The results justify using only positional encoding in our framework.

Number of Stages. We compare different numbers of kernel update rounds in Table 4c. The results show that FPS decreases as the number of update rounds grows, while the performance saturates beyond three stages.

Table 5: Numbers of semantic kernels
Stage Number 0 1 2 3 4 5 6 7
mIoU 36.7 42.7 43.0 43.3 43.8 44.1 43.1 42.6

Such a conclusion also holds for semantic segmentation, as shown in Table 5. The performance of FCN + K-Net on the ADE20K dataset gradually increases with the number of iterations but also saturates after four iterations.

Number of Kernels. We further study the number of kernels in K-Net. The results in Table 4d reveal that 100 kernels are sufficient to achieve good performance. The observation is expected for the COCO dataset because most images in the dataset do not contain many objects (7.7 objects per image on average [38]). K-Net consistently achieves better performance given more instance kernels, since they improve the model's capacity to cope with complicated images. However, a larger N may lead to small performance gains that eventually saturate (for N = 300, 512, and 768, we all obtain 34.9% mAP). Therefore, we select N = 100 in other experiments for efficiency unless otherwise specified.

Figure 4: Visual analysis of kernels and their masks. (a) Average activation over 5000 images. (b) Mask predictions before kernel update and after the 1st and 3rd rounds. Best viewed in color and by zooming in.

4.3 Visual Analysis

Overall Distribution of Kernels. We carefully analyze the properties of the instance kernels learned in K-Net by examining the average mask activations of the 100 instance kernels over the 5000 images in the val split. All the masks are resized to a common resolution of 200×200 for the analysis.
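A small hedged sketch of this analysis step: per-image mask activations are resized to a common 200×200 resolution and averaged per kernel (the function and variable names are illustrative, not taken from the released code).

```python
import torch
import torch.nn.functional as F

def average_kernel_activation(per_image_masks, size=(200, 200)):
    """per_image_masks: list of (N, H, W) mask activations in [0, 1], one per image."""
    total = torch.zeros(per_image_masks[0].size(0), *size)
    for masks in per_image_masks:
        resized = F.interpolate(masks[None], size=size, mode='bilinear',
                                align_corners=False)[0]
        total += resized
    return total / len(per_image_masks)   # (N, 200, 200): average activation per kernel
```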
As shown in Fig. 4a, the learned kernels are meaningful. Different kernels specialize on different regions of the image and on objects of different sizes, while each kernel attends to objects of similar sizes at close locations across images.

Masks Refined through Kernel Update. We further analyze how the mask predictions of kernels are refined through the kernel update in Fig. 4b. Here we take K-Net for panoptic segmentation to thoroughly analyze both semantic and instance masks. The masks produced by static kernels are incomplete, e.g., the masks of the river and the building are missing. After kernel update, the contents are thoroughly covered by the segmentation masks, though the boundaries of the masks are still unsatisfactory. The boundaries are refined after more kernel updates. The classification confidences of instances also increase after kernel update. More results are given in the appendix.

5 Conclusion

This paper explores instance kernels that learn to separate instances during segmentation. Thus, extra components that previously assisted instance segmentation can be replaced by instance kernels, including bounding boxes, embedding generation, and hand-crafted post-processing like NMS, kernel fusion, and pixel grouping. Such an attempt, for the first time, allows different image segmentation tasks to be tackled through a unified framework. The framework, dubbed K-Net, first partitions an image into different groups by learned static kernels, then iteratively refines these kernels and their partition of the image by the features assembled from their partitioned groups. K-Net obtains new state-of-the-art single-model performance on panoptic and semantic segmentation benchmarks and surpasses the well-developed Cascade Mask R-CNN with the fastest inference speed among the recent instance segmentation frameworks. We hope K-Net and our analysis will pave the way for future research on unified image segmentation frameworks.

Acknowledgements. This study is supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s). It is also partially supported by the NTU NAP grant. Jiangmiao Pang and Kai Chen are also supported by the Shanghai Committee of Science and Technology, China (Grant No. 20DZ1100800). The authors would like to thank Jiaqi Wang, Rui Xu, and Xingxing Zou for their valuable suggestions and comments.

References

[1] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In ICCV, 2019.
[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[5] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[6] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. ar Xiv preprint ar Xiv:1906.07155, 2019. [7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deep Lab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018. [8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. ar Xiv preprint ar Xiv:1706.05587, 2017. [9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018. [10] Xinlei Chen, Ross Girshick, Kaiming He, and Piotr Dollár. Tensormask: A foundation for dense object segmentation. 2019. [11] Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang Chieh Chen. Panoptic-Deep Lab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020. [12] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Co RR, abs/2107.06278, 2021. [13] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. [14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. [15] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. CVPR, 2016. [16] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017. [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [18] Hang Gao, Xizhou Zhu, Stephen Lin, and Jifeng Dai. Deformable Kernels: Adapting effective receptive fields for object deformation. In ICLR, 2020. [19] Stephen Gould, Tianshi Gao, and Daphne Koller. Region-based segmentation and object detection. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, Neur IPS, 2009. [20] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In ICCV, 2019. [21] Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking Image Net pre-training. In ICCV, 2019. [22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017. [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. [24] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask Scoring R-CNN. In CVPR, 2019. [25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Neur IPS, 2015. 
[26] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In Neur IPS, 2016. [27] Lei Ke, Yu-Wing Tai, and Chi-Keung Tang. Deep occlusion-aware instance segmentation with overlapping bilayers. In CVPR, 2021. [28] Alexander Kirillov, Ross B. Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, 2019. [29] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019. [30] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. Instancecut: From edges to instances with multicut. In CVPR, 2017. [31] Qizhu Li, Xiaojuan Qi, and Philip H. S. Torr. Unifying training and inference for panoptic segmentation. In CVPR, 2020. [32] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, 2019. [33] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. CVPR, 2017. [34] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation. In CVPR, 2021. [35] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Tong Lu, and Ping Luo. Panoptic segformer. Co RR, abs/2109.03814, 2021. [36] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017. [37] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017. [38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. [39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. ar Xiv preprint ar Xiv:2103.14030, 2021. [40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. [42] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016. [43] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019. [44] Alejandro Newell, Zhiao Huang, and Jia Deng. Associative Embedding: End-to-end learning for joint detection and grouping. In Neur IPS, 2017. [45] Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and Xiaowei Zhou. Deep snake for real-time instance segmentation. In CVPR, 2020. [46] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neur IPS, 2015. [47] Russell Stewart, Mykhaylo Andriluka, and Andrew Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016. [48] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010. [49] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020. [50] Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song Chun Zhu. 
Image parsing: Unifying segmentation, detection, and recognition. IJCV, 2005. [51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neur IPS, 2017. [52] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan L. Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. Co RR, abs/2012.00759, 2020. [53] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVTv2: Improved baselines with pyramid vision transformer. Co RR, abs/2106.13797, 2021. [54] Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. SOLO: Segmenting objects by locations. In ECCV, 2020. [55] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. Neur IPS, 2020. [56] Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, and Xiangyang Ji. Dynamic filtering with large sampling field for convnets. In ECCV, 2018. [57] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018. [58] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. Polar Mask: Single shot instance segmentation with polar representation. In CVPR, 2020. [59] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019. [60] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In ICCV, 2019. [61] Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. Deeper Lab: Single-shot image parser. Co RR, abs/1902.05093, 2019. [62] Jian Yao, Sanja Fidler, and Raquel Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012. [63] Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, and Han Hu. Disentangled non-local neural networks. In ECCV, 2020. [64] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020. [65] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, June 2018. [66] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Muller, R. Manmatha, Mu Li, and Alexander Smola. Resnest: Split-attention networks. ar Xiv preprint ar Xiv:2004.08955, 2020. [67] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017. [68] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Point-wise spatial attention network for scene parsing. In ECCV, 2018. [69] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-tosequence perspective with transformers. In CVPR, 2021. [70] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019. [71] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable Conv Nets V2: More deformable, better results. 
In CVPR, 2019. [72] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: deformable transformers for end-to-end object detection. Co RR, abs/2010.04159, 2020.