Few-Shot Object Detection via Association and DIscrimination

Yuhang Cao1, Jiaqi Wang1,2B, Ying Jin1, Tong Wu1, Kai Chen2,3, Ziwei Liu4, Dahua Lin1,2
1CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, 2Shanghai AI Laboratory, 3SenseTime Research, 4S-Lab, Nanyang Technological University
{cy020,wj017,jy021,wt020,dhlin}@ie.cuhk.edu.hk, chenkai@sensetime.com, ziwei.liu@ntu.edu.sg

Abstract

Object detection has achieved substantial progress in the last decade. However, detecting novel classes with only a few samples remains challenging, since deep learning under a low-data regime usually leads to a degraded feature space. Existing works employ a holistic fine-tuning paradigm to tackle this problem, where the model is first pre-trained on all base classes with abundant samples, and then it is used to carve the novel class feature space. Nonetheless, this paradigm is still imperfect. During fine-tuning, a novel class may implicitly leverage the knowledge of multiple base classes to construct its feature space, which induces a scattered feature space, hence violating the inter-class separability. To overcome these obstacles, we propose a two-step fine-tuning framework, Few-shot object detection via Association and DIscrimination (FADI), which builds up a discriminative feature space for each novel class with two integral steps. 1) In the association step, in contrast to implicitly leveraging multiple base classes, we construct a compact novel class feature space via explicitly imitating a specific base class feature space. Specifically, we associate each novel class with a base class according to their semantic similarity. After that, the feature space of a novel class can readily imitate the well-trained feature space of the associated base class. 2) In the discrimination step, to ensure the separability between the novel classes and associated base classes, we disentangle the classification branches for base and novel classes. To further enlarge the inter-class separability between all classes, a set-specialized margin loss is imposed. Extensive experiments on the standard Pascal VOC and MS-COCO datasets demonstrate that FADI achieves new state-of-the-art performance, significantly improving the baseline in any shot/split by up to +18.7 mAP. Notably, the advantage of FADI is most pronounced in extremely few-shot scenarios (e.g., 1- and 3-shot). Code is available at: https://github.com/yhcao6/FADI

1 Introduction

Deep learning has achieved impressive performance on object detection [21, 11, 1] in recent years. However, its strong performance heavily relies on a large amount of labeled training data, which limits the scalability and generalizability of the model in data-scarcity scenarios. In contrast, human visual systems can easily generalize to novel classes with only a few supervised examples. Therefore, great interest has arisen in exploring few-shot object detection (FSOD), which aims at training a network from limited annotations of novel classes with the aid of sufficient data of base classes.

B Corresponding author.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: Conceptual visualization of our FADI. (a) The conventional fine-tuning paradigm, e.g., TFA [28], learns good decision boundaries during the pre-training stage to separate the decision space into several subspaces (rectangles) occupied by different base classes.
In the fine-tuning stage, a novel class (cow) may exploit multiple similar base classes (sheep and horse) to construct its own feature space, which induces a scattered intra-class structure (the feature space of cow scatters across two base classes, sheep and horse). FADI divides the fine-tuning stage into two steps. (b) In the association step, to construct a compact intra-class structure, we associate each novel class with a well-learned base class based on their semantic similarity (cow is similar to sheep, motor is similar to bike). The novel class readily learns to align its intra-class distribution to the associated base class. (c) In the discrimination step, to ensure the inter-class separability between novel classes and associated base classes, we disentangle the classification branches for base and novel classes. A set-specialized margin loss is further imposed to enlarge the inter-class separability between all classes.

Various methods have since been proposed to tackle the problem of FSOD, including meta-learning [13, 35, 32], metric learning [14], and fine-tuning [28, 31, 23]. Among them, fine-tuning-based methods are one of the dominating paradigms for few-shot object detection. [28] introduces a simple two-stage fine-tuning approach (TFA). MPSR [31] improves upon TFA [28] by alleviating the problem of scale variation. The recent state-of-the-art method FSCE [23] shows that the classifier is more error-prone than the regressor, and introduces contrastive-aware object proposal encodings to facilitate the classification of detected objects. All these works employ a holistic fine-tuning paradigm, where the model is first trained on all base classes with abundant samples, and then the pre-trained model is fine-tuned on novel classes. Although it exhibits a considerable performance advantage over the earlier meta-learning methods, this fine-tuning paradigm is still imperfect. To be specific, the current design of the fine-tuning stage directly extracts the feature representation of a novel class from the network pre-trained on base classes. Therefore, a novel class may exploit the knowledge of multiple similar base classes to construct its own feature space. As a result, the feature space of a novel class will have an incompact intra-class structure that scatters across the feature spaces of other classes, breaking the inter-class separability and hence leading to classification confusion, as shown in Figure 1a.

To overcome these obstacles, we propose a two-step fine-tuning framework, Few-shot object detection via Association and DIscrimination (FADI), which constructs a discriminable feature space for each novel class with two integral steps, association and discrimination. Specifically, in the association step, as shown in Figure 1b, to construct a compact intra-class structure, we associate each novel class with a well-trained base class based on their underlying semantic similarity. The novel class readily learns to align its feature space to the associated base class, thus naturally becoming separable from the remaining classes. In the discrimination step, as shown in Figure 1c, to ensure the separability between the novel classes and associated base classes, we disentangle the classification branches for base and novel classes to reduce the ambiguity in the feature space induced by the association step. To further enlarge the inter-class separability between all classes, a set-specialized margin loss is applied.
To this end, the fine-tuning stage is divided into two dedicated steps that complement each other. Extensive experimental results validate the effectiveness of our approach. We gain significant performance improvements on the Pascal VOC [7] and COCO [18] benchmarks, especially in the extremely few-shot scenario. Specifically, without bells and whistles, FADI improves the TFA [28] baseline by a significant margin in any split and shot, with up to +18.7 mAP, and pushes the envelope of state-of-the-art performance by 2.5, 4.3, 2.8 and 5.6, 7.8, 1.6 for shot K = 1, 2, 3 on novel split-1 and split-3 of the Pascal VOC dataset, respectively.

2 Related Work

Few-Shot Classification Few-shot classification aims to recognize novel instances with abundant base samples and a few novel samples. Metric-based methods address few-shot learning by learning to compare, where different distance formulations [26, 22, 24] are adopted. Initialization-based methods [9, 15] learn a good weight initialization to promote more effective adaptation to unseen samples. Hallucination-based methods introduce hallucination techniques [10, 29] to alleviate the shortage of novel data. Recently, researchers have found that a simple pre-training and fine-tuning framework [4, 6] can compare favorably against other more complex algorithms.

Few-Shot Object Detection As an emerging task, FSOD is less explored than few-shot classification. Early works mainly explore the line of meta-learning [13, 35, 32, 14, 36, 37, 8], where a meta-learner is introduced to acquire class-agnostic meta knowledge that can be transferred to novel classes. Later, [28] introduces a simple two-stage fine-tuning approach (TFA), which significantly outperforms the earlier meta-learning methods. Following this framework, MPSR [31] enriches object scales by generating multi-scale positive samples to alleviate the inherent scale bias. Recently, FSCE [23] shows that in FSOD the classifier is more error-prone than the regressor and introduces contrastive-aware object proposal encodings to facilitate the classification of detected objects. Similarly, FADI also aims to promote the discrimination capacity of the classifier. But unlike previous methods that directly learn the classifier by implicitly exploiting the base knowledge, motivated by the works of [34, 33], FADI explicitly associates each novel class with a semantically similar base class to learn a compact intra-class distribution.

Margin Loss The loss function plays an important role in recognition tasks. To enhance the discrimination power of the traditional softmax loss, different kinds of margin losses have been proposed. SphereFace [19] introduces a multiplicative margin constraint in a hypersphere manifold. However, the non-monotonicity of the cosine function makes stable optimization difficult, so CosFace [27] proposes to further normalize the feature embedding and impose an additive margin in the cosine space. ArcFace [5] moves the additive cosine margin into the angular space to obtain better discrimination power and more stable training. However, we find these margin losses are not directly applicable under data-scarce settings, as they treat different kinds of samples equally and ignore the inherent bias of the classifier towards base classes. Hence, we propose a set-specialized margin loss that takes the kind of sample into consideration, which yields significantly better performance.
3 Our Approach

In this section, we first review the preliminaries of the few-shot object detection setting and the conventional two-stage fine-tuning framework. Then we introduce our method, which tackles few-shot object detection via association and discrimination (FADI).

3.1 Preliminaries

In few-shot detection, the training set is composed of a base set $D_B = \{x_i^B, y_i^B\}$ with abundant data of classes $C_B$, and a novel set $D_N = \{x_i^N, y_i^N\}$ with few-shot data of classes $C_N$, where $x_i$ and $y_i$ indicate training samples and labels, respectively. The number of objects for each class in $C_N$ is $K$ for K-shot detection. The model is expected to detect objects in the test set with classes in $C_B \cup C_N$.

Fine-tuning-based methods are currently one of the leading paradigms for few-shot object detection, successfully adopting a simple two-stage training pipeline to leverage the knowledge of base classes. TFA [28] is a widely adopted baseline of fine-tuning-based few-shot detectors. In the base training stage, the model is trained on base classes with sufficient data to obtain a robust feature representation. In the novel fine-tuning stage, the model pre-trained on base classes is then fine-tuned on a balanced few-shot set which comprises both base and novel classes ($C_B \cup C_N$). To prevent over-fitting during fine-tuning, only the box predictor, i.e., the classifier and regressor, is updated to fit the few-shot set, while the feature extractor, i.e., the other structures of the network, is frozen [28] to preserve the knowledge pre-trained on the abundant base classes.

Figure 2: Method overview. There are two steps in FADI: association and discrimination. To construct a compact intra-class structure, the association step aligns the feature distribution of each novel class with a well-learned base class based on their semantic similarity. To ensure inter-class separability, the discrimination step disentangles the classification branches for base and novel classes and imposes a set-specialized margin loss.

Although the current design of the fine-tuning stage brings considerable gains on few-shot detection, we observe that it may induce a scattered feature space for novel classes, which violates the inter-class separability and leads to classification confusion. To address this drawback, we propose few-shot object detection via association and discrimination (FADI), which divides the fine-tuning stage of TFA into a two-step association and discrimination pipeline. In the association step (Sec. 3.2), to construct a compact intra-class distribution, we associate each novel class with a base class based on their underlying semantic similarity; the feature representation of the associated base class is explicitly learned by the novel class. In the discrimination step (Sec. 3.3), to ensure the inter-class separability, we disentangle the base and novel branches and impose a set-specialized margin loss to train a more discriminative classifier for each class.
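As a concrete reference for the fine-tuning setup that FADI builds on, the following is a minimal sketch of TFA-style parameter freezing. It assumes a recent torchvision Faster R-CNN purely for brevity; the paper itself uses MMDetection [3] with ResNet-101 and FPN, so the module names here are illustrative, not the authors' code.

```python
# Minimal sketch of TFA-style fine-tuning (not the released implementation):
# load the base-trained detector, then update only the box predictor while
# the feature extractor stays frozen.
import torch
import torchvision

def build_tfa_finetune_model(num_classes: int, base_ckpt: str):
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, num_classes=num_classes)
    # The box predictor is shaped for C_B + C_N + background, so the base
    # checkpoint only partially matches (strict=False re-initialises the rest).
    model.load_state_dict(torch.load(base_ckpt, map_location="cpu"), strict=False)
    for name, param in model.named_parameters():
        # Only the RoI-head classifier and regressor remain trainable.
        param.requires_grad = name.startswith("roi_heads.box_predictor")
    return model
```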
3.2 Association Step

In the base training stage, the base model is trained on the abundant base data $D_B$, and its classifier learns a good decision boundary (see Figure 1) to separate the whole decision space into several subspaces occupied by different base classes. Therefore, if a novel class can align the feature distribution of a base class, it will fall into the intra-class distribution of the associated base class and be naturally separable from the other base classes. And if two novel classes are assigned to different base classes, they will also become separable from each other. To achieve this goal, we introduce a new concept named association, which pairs each novel class with a similar base class by semantic similarity. After that, the feature distribution of the novel class is aligned with the associated base class via feature distribution alignment.

Similarity Measurement To ease the difficulty of feature distribution alignment, given a novel class $C_i^N$ and a set of base classes $C_B$, we want to associate $C_i^N$ with the most similar base class in $C_B$. An intuitive way is to rely on visual similarity between feature embeddings. However, the embedding is not representative for novel classes under data-scarce scenarios. Thus, we adopt WordNet [20] as an auxiliary source to describe the semantic similarity between classes. WordNet is an English vocabulary graph, where nodes represent lemmas or synsets and are linked according to their relations. It incorporates rich lexical knowledge which benefits the association. Lin similarity [16] is used to calculate the class-to-class similarity upon WordNet, which is given by:

$$\mathrm{sim}(C_i^N, C_j^B) = \frac{2 \cdot IC(LCS(C_i^N, C_j^B))}{IC(C_i^N) + IC(C_j^B)}, \quad (1)$$

where $LCS$ denotes the lowest common subsumer of two classes in the lexical structure of WordNet, and $IC$, i.e., information content, is the probability of encountering a word in a specific corpus. The SemCor corpus is adopted to count the word frequency here. We take the maximum among all base classes to obtain the associated base class $C_{j_i}^B$, where $j_i$ means the base class $C_{j_i}^B$ is assigned to the novel class $C_i^N$:

$$C_{j_i}^B \leftarrow \underset{j \leq |C_B|}{\arg\max}\ \mathrm{sim}(C_i^N, C_j^B). \quad (2)$$

To this end, the novel class set $C_N$ is associated with a subset of the base classes, $C_B^N \subseteq C_B$.
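To make the association concrete, below is a hedged sketch of Eqs. (1)-(2) using NLTK's WordNet interface with the SemCor information-content file. The class-name-to-synset mapping (e.g., "cow" to cow.n.01) and the optional duplicate removal (the "top1 w/o dup" policy analysed in Section 4.4) are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of semantic association (Eqs. 1-2). Requires the NLTK data
# packages "wordnet" and "wordnet_ic"; class-to-synset mapping is assumed.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

semcor_ic = wordnet_ic.ic("ic-semcor.dat")  # information content from SemCor

def lin_sim(novel: str, base: str) -> float:
    """Lin similarity between two class names via their first noun synset."""
    return wn.synset(f"{novel}.n.01").lin_similarity(
        wn.synset(f"{base}.n.01"), semcor_ic)

def associate(novel_classes, base_classes, dedup=True):
    """Greedy top1 assignment; with dedup, each base class is used at most once."""
    pairs = sorted(((lin_sim(n, b), n, b)
                    for n in novel_classes for b in base_classes), reverse=True)
    assign, used = {}, set()
    for _, n, b in pairs:
        if n not in assign and (not dedup or b not in used):
            assign[n] = b
            used.add(b)
    return assign

print(associate(["cow", "bird"], ["sheep", "horse", "dog", "car"]))
```

Without dedup, the greedy pass over similarity-sorted pairs reduces to the plain argmax of Eq. (2), since each novel class is first encountered at its most similar base class.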
Feature Distribution Alignment After obtaining the associated base class for each novel class, given a sample $x_i^N$ of novel class $C_i^N$, it is associated with a pseudo label $y_j^B$ of the assigned base class $C_{j_i}^B$. We design a pseudo-label training mechanism to directly align the feature distribution of the novel class with the assigned base class, as follows:

$$\min_{W_{asso}^N} L_{cls}\big(y_j^B, f(z_i^N; \widetilde{W}_{cls}^B)\big), \quad \text{where } z_i^N = g\big(\phi(x_i^N; \widetilde{W}_{pre}^B); W_{asso}^N\big), \quad (3)$$

where $\widetilde{W}$ means the weights are frozen. Thus, $f(\cdot; \widetilde{W}_{cls}^B)$ and $\phi(\cdot; \widetilde{W}_{pre}^B)$ indicate the classifier (one fc layer) and the feature extractor (the main network structures) with frozen weights pre-trained on base classes, and $g(\cdot; W_{asso}^N)$ is an intermediate structure (one or more fc layers) that aligns the feature distribution by updating the weights $W_{asso}^N$. By assigning pseudo labels and freezing the classifier, this intermediate structure learns to align the feature distribution of the novel class to the associated base class. The main network structure $\phi(\cdot; \widetilde{W}_{pre}^B)$ is also fixed to keep the knowledge pre-trained on base classes.

As shown in Figure 2, we use the same RoI head structure as Faster R-CNN [21], but we remove the regressor to reduce the task to a pure classification problem. During training, we freeze all parameters except the second linear layer FC2, which means $g(\cdot; W_{asso}^N)$ is a single fc layer. We then construct a balanced training set with K shots per class. Note that we discard the base classes that are associated with novel classes in this step, and the labels of the novel classes are replaced by their assigned pseudo labels. As a result, the supervision enforces the classifier to identify samples of the novel class $C_i^N$ as the assigned base class $C_{j_i}^B$, which means the feature representation of novel classes before the classifier gradually shifts toward their assigned base classes.

Figure 3: t-SNE visualization of the feature distribution after FC2/FC2′ from 200 randomly selected images on PASCAL VOC; horse and dog are base classes, cow and bird are novel classes. The feature space learned by FADI has a more compact intra-class structure and larger inter-class separability.

As shown in Figure 3b, the t-SNE [25] visualization confirms the effectiveness of our distribution alignment. After the association step, the feature distributions of the two associated pairs ("bird" and "dog"; "cow" and "horse") are well aligned.

3.3 Discrimination Step

As shown in Figure 3b, after the association step, the feature distribution of each novel class is aligned with its associated base class. Therefore, the novel class will have a compact intra-class distribution and be naturally distinguishable from other classes. However, the association step inevitably leads to confusion between a novel class and its assigned base class. To tackle this problem, we introduce a discrimination step that disentangles the classification branches for base and novel classes. A set-specialized margin loss is further applied to enlarge the inter-class separability.

Disentangling Given a training sample $x_i$ with label $y_i$, we disentangle the classification branches for base and novel classes as follows:

$$\min_{W_{cls}^B, W_{cls}^N} L_{cls}\big(y_i, [p^B, p^N]\big), \quad \text{where } p^B = f\big(g(q; \widetilde{W}_{origin}^B); W_{cls}^B\big),\ p^N = f\big(g(q; \widetilde{W}_{asso}^N); W_{cls}^N\big),\ q = \phi(x_i; \widetilde{W}_{pre}^B), \quad (4)$$

where $f(\cdot; W_{cls}^B)$ and $f(\cdot; W_{cls}^N)$ are the classifiers for base and novel classes, respectively, and $g(\cdot; \widetilde{W}_{origin}^B)$ and $g(\cdot; \widetilde{W}_{asso}^N)$ are the last fc layers with frozen weights for base and novel classes, respectively. As shown in Figure 2, we disentangle the classifiers and the last fc layers (FC2 and FC2′) for base and novel classes. FC2 and FC2′ load the original weights $\widetilde{W}_{origin}^B$ pre-trained on base classes and the weights $\widetilde{W}_{asso}^N$ obtained after the association step, respectively. They are frozen in the discrimination step to keep their specific knowledge for base and novel classes. Therefore, FC2 and FC2′ are suitable for dealing with base classes and novel classes, respectively. We attach the base classifier $f(\cdot; W_{cls}^B)$ to FC2, and the novel classifier $f(\cdot; W_{cls}^N)$ to FC2′. The base classifier is a $|C_B|$-way classifier. The novel classifier is a $(|C_N|+1)$-way classifier, since we empirically let the novel classifier also be responsible for recognizing the background class $C_0$. The predictions $p^B$ and $p^N$ from these two branches are concatenated to yield the final $(|C_B| + |C_N| + 1)$-way prediction $[p^B, p^N]$.
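A minimal PyTorch sketch of the disentangled heads in Eq. (4) follows; the layer names (fc2_base for FC2, fc2_novel for FC2′) are illustrative assumptions, not the authors' released code.

```python
# Sketch of the disentangled classification branches (Eq. 4). FC2/FC2' are
# frozen copies loaded from the base-pretrained and association-step weights.
import torch
import torch.nn as nn

class DisentangledClassifier(nn.Module):
    def __init__(self, feat_dim: int, num_base: int, num_novel: int):
        super().__init__()
        self.fc2_base = nn.Linear(feat_dim, feat_dim)   # FC2: base-pretrained, frozen
        self.fc2_novel = nn.Linear(feat_dim, feat_dim)  # FC2': association-step, frozen
        for p in self.fc2_base.parameters():
            p.requires_grad = False
        for p in self.fc2_novel.parameters():
            p.requires_grad = False
        self.cls_base = nn.Linear(feat_dim, num_base)        # |C_B|-way
        self.cls_novel = nn.Linear(feat_dim, num_novel + 1)  # (|C_N|+1)-way incl. background

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: shared RoI feature from the frozen extractor (Eq. 4's q).
        p_base = self.cls_base(torch.relu(self.fc2_base(q)))
        p_novel = self.cls_novel(torch.relu(self.fc2_novel(q)))
        return torch.cat([p_base, p_novel], dim=-1)  # (|C_B|+|C_N|+1)-way logits
```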
Set-Specialized Margin Loss Besides disentangling, we further propose a set-specialized margin loss to alleviate the confusion between different classes. Different from previous margin losses [19, 27, 5] that directly modify the original CE loss, we introduce a margin loss as an auxiliary loss for the classifier. Given the i-th training sample with label $y_i$, we adopt cosine similarity to formulate the logit prediction, following the typical conventions in few-shot classification and face recognition [27]:

$$p_{y_i} = \tau \cdot \frac{x^T W_{y_i}}{\|x\|\,\|W_{y_i}\|}, \quad s_{y_i} = \frac{e^{p_{y_i}}}{\sum_{j=1}^{C} e^{p_j}}, \quad (5)$$

where $W$ is the weight of the classifier, $x$ is the input feature, and $\tau$ is the temperature factor. We try to maximize the margin of the decision boundary between $C_{y_i}$ and any other class $C_j,\ j \neq y_i$, as follows:

$$L_{m_i} = -\sum_{j=1,\ j \neq y_i}^{C} \log\big((s_{y_i} - s_j)_+ + \epsilon\big), \quad (6)$$

where $s_{y_i}$ and $s_j$ are the classification scores on classes $C_{y_i}$ and $C_j$, and $\epsilon$ is a small number ($10^{-7}$) to keep numerical stability.

In the scenario of few-shot learning, there exists an inherent bias whereby the classifier tends to predict higher scores on base classes, which makes the optimization of the margin loss on novel classes more difficult. Moreover, the number of background (negative) samples dominates the training samples, so the margin loss on the background class $C_0$ should be suppressed. Given these issues, it is necessary to introduce set-specialized handling of different sets of classes into the margin loss. Since our margin loss is adopted as an auxiliary loss, our design easily enables such set-specialized handling by simply re-weighting the margin loss value:

$$L_m = \sum_{\{i \mid y_i \in C_B\}} \alpha L_{m_i} + \sum_{\{i \mid y_i \in C_N\}} \beta L_{m_i} + \sum_{\{i \mid y_i = C_0\}} \gamma L_{m_i}, \quad (7)$$

where $\alpha$, $\beta$, $\gamma$ are hyper-parameters controlling the margin of base samples, novel samples, and negative samples, respectively. Intuitively, $\beta$ is larger than $\alpha$ because novel classes are more challenging, and $\gamma$ is a much smaller value to balance the overwhelming negative samples. Finally, the loss function of the discrimination step is given in Eq. (8):

$$L_{ft} = L_{cls} + L_m + 2 \cdot L_{reg}, \quad (8)$$

where $L_{cls}$ is a cross-entropy loss for classification, $L_{reg}$ is a smooth-L1 loss for regression, and $L_m$ is the proposed set-specialized margin loss. Since our margin loss increases the gradients on the classification branch, we scale $L_{reg}$ by a factor of 2 to keep the two tasks balanced. The overall loss takes the form of multi-task learning to jointly optimize the model.
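The following is a minimal PyTorch sketch of Eqs. (5)-(7). It assumes the background is class 0, and the α/β/γ/τ values shown are placeholders rather than the paper's tuned settings (those are in the supplementary material); it is an illustration, not the released implementation.

```python
# Hedged sketch of the set-specialized margin loss (Eqs. 5-7).
import torch
import torch.nn.functional as F

def set_specialized_margin_loss(feat, cls_weight, labels, novel_ids,
                                tau=20.0, alpha=0.1, beta=0.5, gamma=0.01,
                                eps=1e-7, background_id=0):
    # Cosine-similarity logits with temperature (Eq. 5).
    logits = tau * F.normalize(feat, dim=-1) @ F.normalize(cls_weight, dim=-1).t()
    scores = logits.softmax(dim=-1)
    s_gt = scores.gather(1, labels[:, None])            # s_{y_i}
    margin = (s_gt - scores).clamp(min=0) + eps         # (s_{y_i} - s_j)_+ + eps
    # Exclude j = y_i by filling its entry with 1 (log 1 = 0), then sum (Eq. 6).
    gt_mask = F.one_hot(labels, scores.size(1)).bool()
    per_sample = -margin.masked_fill(gt_mask, 1.0).log().sum(dim=1)
    # Set-specialized re-weighting (Eq. 7): alpha for base samples, beta for
    # novel samples, gamma for background samples.
    weights = torch.full_like(per_sample, alpha)
    weights[torch.isin(labels, torch.as_tensor(novel_ids, device=labels.device))] = beta
    weights[labels == background_id] = gamma
    return (weights * per_sample).sum()
```

In training, this term would simply be added to the standard cross-entropy and the (scaled) regression loss as in Eq. (8).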
| Method / Shot | Backbone | Novel Split 1 (1 / 2 / 3 / 5 / 10) | Novel Split 2 (1 / 2 / 3 / 5 / 10) | Novel Split 3 (1 / 2 / 3 / 5 / 10) |
|---|---|---|---|---|
| LSTD [2] | VGG-16 | 8.2 / 1.0 / 12.4 / 29.1 / 38.5 | 11.4 / 3.8 / 5.0 / 15.7 / 31.0 | 12.6 / 8.5 / 15.0 / 27.3 / 36.3 |
| YOLOv2-ft [30] | | 6.6 / 10.7 / 12.5 / 24.8 / 38.6 | 12.5 / 4.2 / 11.6 / 16.1 / 33.9 | 13.0 / 15.9 / 15.0 / 32.2 / 38.4 |
| FSRW* [13] | | 14.8 / 15.5 / 26.7 / 33.9 / 47.2 | 15.7 / 15.3 / 22.7 / 30.1 / 40.5 | 21.3 / 25.6 / 28.4 / 42.8 / 45.9 |
| MetaDet* [30] | | 17.1 / 19.1 / 28.9 / 35.0 / 48.8 | 18.2 / 20.6 / 25.9 / 30.6 / 41.5 | 20.1 / 22.3 / 27.9 / 41.9 / 42.9 |
| RepMet* [14] | InceptionV3 | 26.1 / 32.9 / 34.4 / 38.6 / 41.3 | 17.2 / 22.1 / 23.4 / 28.3 / 35.8 | 27.5 / 31.1 / 31.5 / 34.4 / 37.2 |
| FRCN-ft [30] | | 13.8 / 19.6 / 32.8 / 41.5 / 45.6 | 7.9 / 15.3 / 26.2 / 31.6 / 39.1 | 9.8 / 11.3 / 19.1 / 35.0 / 45.1 |
| FRCN+FPN-ft [28] | | 8.2 / 20.3 / 29.0 / 40.1 / 45.5 | 13.4 / 20.6 / 28.6 / 32.4 / 38.8 | 19.6 / 20.8 / 28.7 / 42.2 / 42.1 |
| MetaDet* [30] | | 18.9 / 20.6 / 30.2 / 36.8 / 49.6 | 21.8 / 23.1 / 27.8 / 31.7 / 43.0 | 20.6 / 23.9 / 29.4 / 43.9 / 44.1 |
| Meta R-CNN* [35] | | 19.9 / 25.5 / 35.0 / 45.7 / 51.5 | 10.4 / 19.4 / 29.6 / 34.8 / 45.4 | 14.3 / 18.2 / 27.5 / 41.2 / 48.1 |
| TFA w/ fc [28] | | 36.8 / 29.1 / 43.6 / 55.7 / 57.0 | 18.2 / 29.0 / 33.4 / 35.5 / 39.0 | 27.7 / 33.6 / 42.5 / 48.7 / 50.2 |
| TFA w/ cos [28] | | 39.8 / 36.1 / 44.7 / 55.7 / 56.0 | 23.5 / 26.9 / 34.1 / 35.1 / 39.1 | 30.8 / 34.8 / 42.8 / 49.5 / 49.8 |
| MPSR [31] | | 41.7 / – / 51.4 / 55.2 / 61.8 | 24.4 / – / 39.2 / 39.9 / 47.8 | 35.6 / – / 42.3 / 48.0 / 49.7 |
| SRR-FSD [38] | | 47.8 / 50.5 / 51.3 / 55.2 / 56.8 | 32.5 / 35.3 / 39.1 / 40.8 / 43.8 | 40.1 / 41.5 / 44.3 / 46.9 / 46.4 |
| FSCE [23] | | 44.2 / 43.8 / 51.4 / 61.9 / 63.4 | 27.3 / 29.5 / 43.5 / 44.2 / 50.2 | 37.2 / 41.9 / 47.5 / 54.6 / 58.5 |
| FADI (Ours) | | 50.3 / 54.8 / 54.2 / 59.3 / 63.2 | 30.6 / 35.0 / 40.3 / 42.8 / 48.0 | 45.7 / 49.7 / 49.1 / 55.0 / 59.6 |

Table 1: Performance (novel AP50) on the PASCAL VOC dataset. * denotes meta-learning-based methods.

| Shot | nAP (TFA / FADI) | nAP50 (TFA / FADI) | nAP75 (TFA / FADI) |
|---|---|---|---|
| 1 | 3.4 / 5.7 | 5.8 / 10.4 | 3.8 / 6.0 |
| 2 | 4.6 / 7.0 | 8.3 / 13.1 | 4.8 / 7.0 |
| 3 | 6.6 / 8.6 | 12.1 / 15.8 | 6.5 / 8.3 |
| 5 | 8.3 / 10.1 | 15.3 / 18.6 | 8.0 / 9.7 |
| 10 | 10.0 / 12.2 | 19.1 / 22.7 | 9.3 / 11.9 |
| 30 | 13.7 / 16.1 | 24.9 / 29.1 | 13.4 / 15.8 |

(a) Comparison with the baseline TFA.

| Method | nAP (10 / 30) | nAP75 (10 / 30) |
|---|---|---|
| FSRW* [13] | 5.6 / 9.1 | 4.6 / 7.6 |
| MetaDet* [30] | 7.1 / 11.3 | 5.9 / 10.3 |
| Meta R-CNN* [35] | 8.7 / 12.4 | 6.6 / 10.8 |
| MPSR [31] | 9.8 / 14.1 | 9.7 / 14.2 |
| SRR-FSD [38] | 11.3 / 14.7 | 9.8 / 13.5 |
| FSCE [23] | 11.9 / 16.4 | 10.5 / 16.2 |
| FADI (Ours) | 12.2 / 16.1 | 11.9 / 15.8 |

(b) Comparison with the latest methods.

Table 2: Performance on the MS COCO dataset. * denotes meta-learning-based methods. nAP means novel AP.

4 Experiments

4.1 Datasets and Evaluation Protocols

We conduct experiments on both the PASCAL VOC (07+12) [7] and MS COCO [18] datasets. To ensure a fair comparison, we strictly follow the data split construction and evaluation protocol used in [13, 28, 23]. PASCAL VOC contains 20 categories; we consider the same 3 base/novel splits as TFA [28] and refer to them as Novel Split 1, 2, 3. Each split contains 15 base categories with abundant data and 5 novel categories with K annotated instances for K = 1, 2, 3, 5, 10. We report AP50 of the novel categories (nAP50) on the VOC07 test set. For MS COCO, the 20 classes that overlap with PASCAL VOC are selected as novel classes, and the remaining 60 classes are set as base ones. Similarly, we evaluate our method for shots 1, 2, 3, 5, 10, 30, and the standard COCO-style AP metric is adopted.

4.2 Implementation Details

We implement our method based on MMDetection [3]. Faster R-CNN [21] with Feature Pyramid Network [17] and ResNet-101 [12] is adopted as the base model. Detailed settings are described in the supplementary material.

4.3 Benchmarking Results

Comparison with Baseline Methods To show the effectiveness of our method, we first make a detailed comparison with TFA, since our method is based on it. As shown in Table 1, FADI outperforms TFA by a large margin in every shot and split on the PASCAL VOC benchmark.
To be specific, FADI improves TFA by 10.5, 18.7, 9.5, 3.6, 7.2; 7.1, 8.1, 6.2, 7.7, 8.9; and 14.9, 14.9, 6.3, 5.5, 9.8 for K = 1, 2, 3, 5, 10 on Novel Splits 1, 2, and 3, respectively. The lower the shot, the more difficult it is to learn a discriminative novel classifier. The significant performance gap reflects that FADI can effectively alleviate this problem even in the low-shot regime, i.e., K ≤ 3. Similar improvements can be observed on the challenging COCO benchmark. As shown in Table 2, we boost TFA by 2.3, 2.4, 2.0, 1.8, 2.2, 2.4 for K = 1, 2, 3, 5, 10, 30. Besides, we also report nAP50 and nAP75; a larger gap is obtained under IoU threshold 0.5, which suggests FADI benefits more under lower IoU thresholds.

| Association | Disentangling | Margin | nAP50 (1 / 3 / 5) |
|---|---|---|---|
| | | | 41.3 / 46.3 / 53.7 |
| ✓ | | | 42.4 / 46.8 / 55.2 |
| | ✓ | | 42.2 / 47.3 / 54.1 |
| ✓ | ✓ | | 44.9 / 50.3 / 56.8 |
| | | ✓ | 46.3 / 48.8 / 56.4 |
| ✓ | ✓ | ✓ | 50.3 / 54.2 / 59.3 |

Table 3: Effectiveness of different components of FADI.

| Margin | nAP50 |
|---|---|
| TFA (none) | 41.3 |
| CosFace [27] | 38.9 |
| ArcFace [5] | 37.9 |
| CosFace (novel) | 44.2 |
| ArcFace (novel) | 44.3 |
| Ours | 46.3 |

Table 4: Comparison of different margin losses on the TFA baseline model.

| base / novel | bird | bus | cow | motorbike | sofa | nAP50 |
|---|---|---|---|---|---|---|
| random | person | boat | horse | aeroplane | sheep | 39.6 |
| human | aeroplane | train | sheep | bicycle | chair | 44.1 |
| visual | dog | car | horse | person | chair | 43.3 |
| top2 | dog | car | sheep | tv | diningtable | 41.2 |
| top1 | horse | train | horse | bicycle | chair | 44.3 |
| top1 w/o dup | dog | train | horse | bicycle | chair | 44.9 |

Table 5: Comparison of different assignment policies. The set-specialized margin loss is not adopted in this table.

Comparison with State-of-the-Art Methods Next, we compare with other recent few-shot methods. As shown in Table 1, our method pushes the envelope of the current SOTA by a large margin in shots 1, 2, 3 for Novel Splits 1 and 3. Specifically, we outperform the current SOTA by 2.5, 4.3, 2.8 and 5.6, 7.8, 1.6 for K = 1, 2, 3 on Novel Splits 1 and 3, respectively. As the shot grows, the performance of FADI falls slightly behind FSCE [23]. We conjecture that by unfreezing more layers in the feature extractor, the model can learn a more compact feature space for novel classes, as it exploits less base knowledge, and such a space can represent the real distribution better than the distribution imitated by our association. However, this is not feasible when the shot is low, as the learned distribution will over-fit the training samples.

4.4 Ablation Study

In this section, we conduct a thorough ablation study of each component of our method. We first analyze the performance contribution of each component, and then we show the effect of each component and why it works. Unless otherwise specified, all ablation results are reported on Novel Split 1 of the Pascal VOC benchmark based on our implementation of TFA [28].

Component Analysis Table 3 shows the effectiveness of each component of our method, i.e., Association, Disentangling, and the Set-Specialized Margin Loss. Note that when we study association without disentangling, we train a modified TFA model by replacing FC2 with FC2′ after the association step. Since the association confuses a novel class and its assigned base class, the performance of applying association alone is not very significant. However, when equipped with disentangling, it significantly boosts the nAP50 by 3.6, 4.0, 3.1 for K = 1, 3, 5, respectively. The set-specialized margin loss is generally effective for both the baseline and the proposed association + disentangling framework: applying the margin loss improves association + disentangling by 5.4, 3.9, 2.5. With all 3 components, our method achieves a total gain of 9.0, 7.9, 5.6.
Semantic-Guided Association The assignment policy is a key component of the association step. To demonstrate the effectiveness of our semantic-guided assignment with WordNet [20], we explore different assignment policies; the results are shown in Table 5. Random means we randomly assign a base class to each novel class. Human denotes manual assignment based on human knowledge. Visual denotes associating base and novel classes by visual similarity. Specifically, we regard the weights of the base classifier as prototype representations of the base classes; as a result, the score predictions of novel instances on the base classifier can be viewed as the visual similarity. Top1 and top2 denote the strategies that assign each novel class to the most or second most similar base class by Eq. (1). In such cases, one base class may be assigned to two different novel classes ("horse" is assigned to both "bird" and "cow"); we remove such duplication by taking the similarity as the priority of assignment. Specifically, the base and novel classes with the highest similarity are associated first, and both are then removed from the list of classes to be associated. We then re-rank the similarities of the remaining classes and choose the new association. We can see that top1 is better than random and top2 by 4.7 and 3.1, which suggests semantic similarity has a strong implication for performance. By removing the duplication, we further obtain a 0.6 gain.

Figure 4: Score confusion matrices of different methods on Pascal VOC Novel Split 1. The element in the i-th row, j-th column represents, for samples of novel class i, the score prediction on class j. Brighter colors indicate higher scores: if classes i and j are the same, this indicates a more accurate score prediction; otherwise, it indicates a heavier confusion. The font colors of the classes represent the association relations, e.g., the associated pair bird and dog share the same blue font color.

Figure 5: Examples of co-occurrence cases.

| Metric | Novel Split 1 (1 / 3 / 5) | Novel Split 2 (1 / 3 / 5) | Novel Split 3 (1 / 3 / 5) |
|---|---|---|---|
| Visual | 43.3 / 49.3 / 56.4 | 22.5 / 37.2 / 39.3 | 31.8 / 43.1 / 50.7 |
| Semantic | 44.9 / 50.3 / 56.8 | 26.1 / 38.5 / 40.1 | 37.1 / 45.0 / 51.5 |

Table 6: Comparison of visual and semantic similarity.

Set-Specialized Margin Loss Table 4 compares our margin loss with ArcFace [5] and CosFace [27]. It shows that directly applying these two margin losses harms the performance. However, the performance degeneration can be remedied by applying them only to samples of novel classes: this rescues ArcFace from 37.9 to 44.3, and CosFace from 38.9 to 44.2. Nevertheless, they are still inferior to our margin loss by 2.0. A detailed hyper-parameter study is provided in the supplementary material.

Complementarity between Association and Discrimination Figure 4 shows the score confusion matrices of different methods. We can see that there exists an inherent confusion between some novel and base classes, e.g., in the top-left figure, "cow" is confused most with "sheep" and then "horse". Our association biases such confusion and enforces "cow" to be confused with its more semantically similar class "horse", which demonstrates that the association step can align the feature distributions of the associated pairs. On the other hand, thanks to the discrimination step, the confusion incurred by association is effectively alleviated, and overall FADI shows less confusion than TFA (the second column of Figure 4). Moreover, FADI yields significantly higher score predictions than TFA, which confirms the effectiveness of disentangling and the set-specialized margin loss.
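For reference, below is a hedged sketch of the "visual" assignment baseline from Table 5: base-classifier weights serve as class prototypes, and the averaged softmax response of novel-instance features over those prototypes is read as the similarity. The function names and the choice of averaging over instances are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the "visual" policy in Table 5: score predictions of novel
# instances on the (frozen) base classifier are read as visual similarity.
import torch
import torch.nn.functional as F

def visual_similarity(novel_feats: torch.Tensor, base_cls_weight: torch.Tensor):
    """novel_feats: (N, D) RoI features of one novel class;
    base_cls_weight: (|C_B|, D) base classifier weights (class prototypes)."""
    logits = F.normalize(novel_feats, dim=-1) @ F.normalize(base_cls_weight, dim=-1).t()
    return logits.softmax(dim=-1).mean(dim=0)  # (|C_B|,) mean score per base class

def visual_associate(novel_feats_per_class, base_cls_weight, base_names):
    return {name: base_names[int(visual_similarity(f, base_cls_weight).argmax())]
            for name, f in novel_feats_per_class.items()}
```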
Superiority of Semantic Similarity over Visual Similarity Table 5 demonstrates that semantic similarity works better than visual similarity. Here, we regard the weights of the base classifier as prototype representations of the base classes and thus take the score predictions of novel instances on the base classifier as the visual similarity. However, we find this can sometimes be misleading, especially when a novel instance co-occurs with a base instance, e.g., a cat sits on a chair, or a person rides a bike, as shown in Figure 5. Such co-occurrence deceives the base classifier into believing that cat is similar to chair and bike is similar to person, which makes the visual similarity unreliable under data-scarcity scenarios. As shown in Table 6, as the shot grows, the performance gap between semantic and visual similarity can be reduced by a more accurate visual similarity measurement.

5 Conclusion

In this paper, we propose Few-shot object detection via Association and DIscrimination (FADI). In the association step, to learn a compact intra-class structure, we selectively associate each novel class with a well-trained base class based on their semantic similarity. The novel class readily learns to align its intra-class distribution to the associated base class. In the discrimination step, to ensure the inter-class separability, we disentangle the classification branches for base and novel classes, respectively. A set-specialized margin loss is further imposed to enlarge the inter-class distance. Experimental results demonstrate that FADI is a concise yet effective solution for FSOD.

Acknowledgements. This research was conducted in collaboration with SenseTime. This work is supported by GRF 14203518, ITS/431/18FX, CUHK Agreement TS1712093, Theme-based Research Scheme 2020/21 (No. T41-603/20R), NTU NAP, RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative, and Shanghai Committee of Science and Technology, China (Grant No. 20DZ1100800).

References

[1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In AAAI Conference on Artificial Intelligence, 2018.
[3] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[4] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[6] Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729, 2019.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[8] Qi Fan, Wei Zhuo, Chi-Keung Tang, and Yu-Wing Tai. Few-shot object detection with attention-RPN and multi-relation detector. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
[10] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In IEEE International Conference on Computer Vision, 2017.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In IEEE International Conference on Computer Vision, 2019.
[14] Leonid Karlinsky, Joseph Shtok, Sivan Harary, Eli Schwartz, Amit Aides, Rogerio Feris, Raja Giryes, and Alex M Bronstein. RepMet: Representative-based metric learning for classification and few-shot object detection. In IEEE International Conference on Computer Vision, 2019.
[15] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[16] Dekang Lin et al. An information-theoretic definition of similarity. In International Conference on Machine Learning, 1998.
[17] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[19] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[20] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 1995.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[22] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
[23] Bo Sun, Banghuai Li, Shengcai Cai, Ye Yuan, and Chi Zhang. FSCE: Few-shot object detection via contrastive proposal encoding. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[24] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[25] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
[26] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
[27] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[28] Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. In International Conference on Machine Learning, 2020.
[29] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[30] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Meta-learning to detect rare objects. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[31] Jiaxi Wu, Songtao Liu, Di Huang, and Yunhong Wang. Multi-scale positive sample refinement for few-shot object detection. In European Conference on Computer Vision, 2020.
[32] Yang Xiao and Renaud Marlet. Few-shot object detection and viewpoint estimation for objects in the wild. In European Conference on Computer Vision, 2020.
[33] Chen Xing, Negar Rostamzadeh, Boris Oreshkin, and Pedro O O Pinheiro. Adaptive cross-modal few-shot learning. In Advances in Neural Information Processing Systems, 2019.
[34] Caixia Yan, Qinghua Zheng, Xiaojun Chang, Minnan Luo, Chung-Hsing Yeh, and Alexander G Hauptman. Semantics-preserving graph propagation for zero-shot object detection. IEEE Transactions on Image Processing, 2020.
[35] Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, and Liang Lin. Meta R-CNN: Towards general solver for instance-level low-shot learning. In IEEE International Conference on Computer Vision, 2019.
[36] Yukuan Yang, Fangyu Wei, Miaojing Shi, and Guoqi Li. Restoring negative information in few-shot object detection. In Advances in Neural Information Processing Systems, 2020.
[37] Ze Yang, Yali Wang, Xianyu Chen, Jianzhuang Liu, and Yu Qiao. Context-transformer: Tackling object confusion for few-shot detection. In AAAI Conference on Artificial Intelligence, 2020.
[38] Chenchen Zhu, Fangyi Chen, Uzair Ahmed, and Marios Savvides. Semantic relation reasoning for shot-stable few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Appendix 1.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will release the code to ensure strict reproducibility upon acceptance of the paper.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4.1, Appendix 2, and Appendix 3.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix 2.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Pascal VOC [7], MS COCO [18], MMDetection [3].
   (b) Did you mention the license of the assets? [No]
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]