# LGD: Label-Guided Self-Distillation for Object Detection

Peizhen Zhang,*1 Zijian Kang,*2 Tong Yang,1 Xiangyu Zhang,1 Nanning Zheng,2 Jian Sun1
1 MEGVII Technology, 2 Xi'an Jiaotong University
{zhangpeizhen, yangtong, zhangxiangyu, sunjian}@megvii.com, kzj123@stu.xjtu.edu.cn, nnzheng@mail.xjtu.edu.cn

*These authors contributed equally. Corresponding author. Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper, we propose the first self-distillation framework for general object detection, termed LGD (Label-Guided self-Distillation). Previous studies rely on a strong pretrained teacher to provide instructive knowledge that could be unavailable in real-world scenarios. Instead, we generate instructive knowledge based only on student representations and regular labels. Our framework includes a sparse label-appearance encoder, an inter-object relation adapter and an intra-object knowledge mapper that jointly form an implicit teacher at training time, dynamically dependent on the labels and the evolving student representations. They are trained end-to-end with the detector and discarded at inference. Experimentally, LGD obtains decent results on various detectors, datasets, and extensive tasks like instance segmentation. For example, on the MS-COCO dataset, LGD improves RetinaNet with ResNet-50 under 2× single-scale training from 36.2% to 39.0% mAP (+2.8%). It boosts much stronger detectors such as FCOS with ResNeXt-101 DCN v2 under 2× multi-scale training from 46.1% to 47.9% (+1.8%). Compared with the classical teacher-based method FGFI, LGD not only performs better without requiring a pretrained teacher but also reduces the training cost beyond inherent student learning by 51%. Codes are available at https://github.com/megvii-research/LGD.

Figure 1: Results on RetinaNet with various backbones. Result trend on RetinaNet 2× ms with backbones R-{50, 101, 101 DCN}, respectively. FGFI-{101, 101 DCN} denote the FGFI method using RetinaNet 2× ms with R-101 and R-101 DCN as teachers, respectively.

Introduction

Knowledge distillation (KD) (Romero et al. 2015; Hinton, Vinyals, and Dean 2015) was initially proposed for image classification and obtained impressive results. Typically, it transfers instructive knowledge from a pretrained model (teacher) to a smaller one (student). Recently, KD applied to the fundamental object detection task has aroused researchers' interest (Li, Jin, and Yan 2017; Wei et al. 2018; Wang et al. 2019; Zhang et al. 2020; Dai et al. 2021; Guo et al. 2021; Zhang and Ma 2021; Yao et al. 2021). Existing works achieve respectable performance, but the choice of teacher is sophisticated and inconsistent among them. One common ground is that they all require a heavy pretrained teacher, as recent works (Zhang and Ma 2021; Yao et al. 2021) discovered that distillation efficacy can be enhanced with stronger teachers. Yet the pursuit of an ideal teacher can scarcely be satisfied in real-world applications, since it might take tons of effort on trial and error (Peng et al. 2020). In contrast, KD for generic detection without a pretrained teacher is barely investigated.
To alleviate the dependence on pretrained teachers, teacher-free schemes have been proposed, such as (a) self-distillation, (b) collaborative learning and (c) label regularization, where the instructive knowledge can be cross-layer features (Zhang et al. 2019), competitive counterparts (Zhang et al. 2018) or a modulated label distribution (Yuan et al. 2020), etc. However, these methods are designed for classification and are inapplicable to detection, since the latter has to handle multiple objects with different locations and categories rather than a single image-level class. Lately, LabelEnc (Hao et al. 2020) extends traditional label regularization by introducing location-category modeling with an isolated network. It produces label representations with which the student features are supervised. Though it obtains impressive results, we find the improvement saturates (Figure 3) as the detector grows stronger, e.g., with larger backbones and multi-scale training. We conjecture this is because labels themselves describe only object-wise categories and locations, without considering the inter-object relationship, which is also important (Hu et al. 2018; Cai et al. 2019). For detectors with limited capacity, LabelEnc provides strong complementary supervision, albeit without relation information. For stronger detectors, which are able to extract abundant object-wise hints from the default supervision, using LabelEnc becomes less beneficial or even detrimental (see the leftmost plot in Figure 3). This might result from the semantic discrepancy caused by heterogeneous input (image vs. label) and isolated modeling.

Figure 2: The proposed framework contains three modules: (1) label-appearance encoder, (2) inter-object relation adapter and (3) intra-object knowledge mapper. For brevity, we omit the pyramid level indications, which will be elaborated in the Method section. L_det^I / L_det^S denote the detection losses upon the instructive / student representations and L_distill is the distillation loss. We denote by (x̃1, ỹ1, x̃2, ỹ2) the ground-truth box location normalized by the image size, where (0., 0., 1., 1.) refers to the entire-image context box.

Motivated by this, we propose Label-Guided self-Distillation (LGD), a new teacher-free method for object detection, as shown in Figure 2. In LGD, we devise an inter-object relation adapter and an intra-object knowledge mapper to collaboratively model the relation in forming instructive knowledge. The relation adapter computes interacted embeddings by a cross-attention interaction. Specifically, the interacted embedding of each object is calculated by first measuring the cross-modal similarity between its appearance embedding and every label embedding, upon which a weighted aggregation is then performed. The knowledge mapper maps the interacted embeddings onto the feature map space as the final instructive knowledge, considering intra-object representation consistency and localization heuristics. Owing to the above relation modeling, the final instructive knowledge is naturally adapted to the student representations, facilitating effective distillation for strong student detectors and mitigating the semantic discrepancy.
Beyond efficacy, our method is also efficient: it does not rely on a strong convolutional network as teacher because we adopt an efficient instance-wise embedding design. This design allows LGD to be trained jointly with the student, simplifying the pipeline and reducing the training cost (Table 7). During inference, only the student detector is kept, bringing no extra cost. In short, our contributions are three-fold:

1. We propose a new self-distillation framework for general object detection. Unlike previous methods that use a convolutional network as teacher, LGD generates instructive knowledge on-the-fly without a pretrained teacher and improves detection quality under limited training cost.
2. We introduce inter- and intra-object relation modeling to form new instructive knowledge, rather than simply extracting existing relations from student and teacher for distillation.
3. The proposed method outperforms the previous teacher-free SOTA with a higher upper limit and is better than the classical teacher-based method FGFI in strong-student settings. Beyond inherent student learning, it saves 51% training time compared with the classical teacher-based distillation.

Related Work

Detection KD with Pretrained Teachers

Unlike classification, knowledge transfer for object detection is more challenging. In detection, models are asked to predict multiple instances with diversified categories distributed at different locations in the image. (Li, Jin, and Yan 2017) proposed Mimic to distill activations within the region proposals predicted by the RPN (Ren et al. 2015). (Chen et al. 2017) introduced a weighted cross-entropy and a bounded regression loss to enhance performance. To further exploit the context information of the distilling regions around the objects, (Wang et al. 2019) extended the ground-truth box regions with anchor-assigned ones. To learn adapted sampling weights for different knowledge, (Zhang et al. 2020) proposed PAD with uncertainty modeling. Besides intermediate feature hints, (Dai et al. 2021) involved prediction map distillation obeying the assignment rules, and relation distillation (Park et al. 2019) upon their defined general instances. Instead of focusing on foreground regions only, (Guo et al. 2021) decoupled the foreground/background knowledge transfer. To facilitate region-agnostic distillation, (Zhang and Ma 2021) proposed feature-based knowledge transfer with spatial-channel-wise attention. To resolve the feature resolution mismatch in cross-layer distillation and mitigate misaligned label assignment, (Yao et al. 2021) introduced G-DetKD. The above methods mainly conduct feature-based distillation, which we follow in this work. However, they are designed for settings with strong pretrained teachers that could be unavailable or unaffordable in real-world scenarios. Recently, (Huang et al. 2020) proposed self-distillation for weakly supervised detection, but that setting differs substantially from generic object detection.

Teacher-free Methods

Beyond traditional KD with a pretrained teacher, there are teacher-free schemes that can be divided into three categories: (1) self-distillation, (2) collaborative learning and (3) label regularization. (1) Self-distillation excavates instructive knowledge from the model itself. For instance, (Yang et al. 2019; Kim et al. 2020) used previously saved snapshots as teachers. In (Zhang et al. 2019), the network was divided into sections such that deeper layers were used to teach the shallower ones. In MetaDistiller (Liu et al. 2020), the knowledge stemmed from one-step predictions.
(2) Collaborative learning involves multiple students that boost each other. (Zhang et al. 2018) proposed deep mutual learning (DML), where student networks with identical architectures learn collaboratively. (Lan, Zhu, and Gong 2018) proposed ONE by considering ensemble learning at branch granularity. In KDCL (Guo et al. 2020), predictions were fused together as instructive knowledge. Likewise, in (Chen et al. 2020a), the ensemble logits of multiple students were aggregated to distill another. (Furlanello et al. 2018) proposed the Born-Again Network (BAN), which leverages information from the last generation to distill the next. (3) For label regularization, (Yuan et al. 2020) proposed tf-KD for a regularized label distribution beyond label smoothing (Szegedy et al. 2016). However, the above methods were designed for classification only. Recently, there have been newly-built label regularization methods (Mostajabi, Maire, and Shakhnarovich 2018; Hao et al. 2020) that use an isolated network to explicitly model labels as features for supervision, for semantic segmentation and detection respectively. They obtained impressive results. In (Hao et al. 2020), dense color maps with category and location information were constructed and fed into an auto-encoder-like network to fetch label representations. However, they considered each object separately, which is suboptimal. Instead, we propose to generate instructive knowledge by inter-object and intra-object relation modeling, forming a self-distillation scheme with a higher upper limit.

Method

As shown in Figure 2, LGD consists of the following modules: (1) an encoder that computes label and appearance embeddings; (2) an inter-object relation adapter that generates interacted embeddings given the label and appearance embeddings of objects; (3) an intra-object knowledge mapper that back-projects the interacted embeddings onto the feature map space to obtain instructive knowledge for distillation.

Label-appearance Encoder

(1) Label Encoding: For each object, we concatenate its normalized ground-truth box (x̃1, ỹ1, x̃2, ỹ2) and one-hot category vector to obtain a descriptor. The object-wise descriptors are passed into a label encoding module to produce refined label embeddings L = {l_i ∈ R^C}_{i=0}^{N}, where i indicates the object index, C = 256 is the intermediate feature dimension, and N is the number of objects; i = 0 indexes the context object. To introduce basic relation modeling among label descriptors and maintain permutation invariance, we adopt the classical PointNet (Qi et al. 2017) as the label encoding module. It processes the descriptors with a multilayer perceptron (Friedman et al. 2001) and local-global modeling by a spatial transformer network (Jaderberg et al. 2015). Also, the label descriptors resemble a point set, which suits PointNet well (bounding boxes can be viewed as points in a 4-dimensional Cartesian space). Empirically, using PointNet as the encoder behaves slightly better than an MLP or a transformer encoder (Vaswani et al. 2017) (Table 4). We further replace BatchNorm (Ioffe and Szegedy 2015) with LayerNorm (Ba, Kiros, and Hinton 2016) to adapt to the small-batch detection setting. Notably, the above 1D object-wise label encoding is more efficient than that of LabelEnc. LabelEnc constructs an ad-hoc color map in R^{H×W×K} to describe the labels, where (H, W) and K are the input resolution and the number of object categories, respectively (HWK ≫ C). The color map is processed by an extra CNN and pyramid network into 2D pixel-wise representations L' = {l'_p ∈ R^{H_p×W_p×C}, 1 ≤ p ≤ P}, where P refers to the number of pyramid scales (Lin et al. 2017a) and (H_p, W_p) denotes the feature map resolution at scale p.
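To make the descriptor construction concrete, below is a minimal PyTorch sketch (helper names are hypothetical, not the released implementation) of building the per-object descriptors and encoding them into C-dimensional label embeddings. A small LayerNorm MLP stands in for the PointNet-style encoder described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_label_descriptors(boxes, classes, img_size, num_classes):
    """boxes: (N, 4) absolute xyxy; classes: (N,) int64; img_size: (H, W).
    Returns (N+1, 4+K) descriptors; index 0 is the virtual context object."""
    H, W = img_size
    norm_boxes = boxes / boxes.new_tensor([W, H, W, H])   # normalized (x1, y1, x2, y2)
    onehot = F.one_hot(classes, num_classes).float()       # (N, K) one-hot category vectors
    desc = torch.cat([norm_boxes, onehot], dim=1)          # (N, 4+K)
    ctx = desc.new_zeros(1, 4 + num_classes)               # context box spans the whole image
    ctx[0, 2:4] = 1.0                                      # (x1, y1, x2, y2) = (0, 0, 1, 1)
    return torch.cat([ctx, desc], dim=0)

class LabelEncoder(nn.Module):
    """Stand-in for the PointNet-style label encoder (LayerNorm MLP only)."""
    def __init__(self, in_dim, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, dim), nn.LayerNorm(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.LayerNorm(dim))

    def forward(self, desc):                               # (N+1, 4+K) -> (N+1, C)
        return self.mlp(desc)
```

For example, with K = 80 MS-COCO categories, `LabelEncoder(4 + 80)` maps the (N+1, 84) descriptors to the (N+1, 256) label embeddings L used below.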
(2) Appearance Encoding: Beyond label encoding, we retrieve compact appearance embeddings from the feature pyramid of the student detector, which contains appearance features of the perceived objects. We adopt a handy mask pooling to extract object-wise embeddings from the feature maps. Specifically, we pre-compute the object-wise masks M = {m_i}_{i=1}^{N} ∪ {m_0} at the input resolution for the N objects and a virtual context object with location (0., 0., 1., 1.) covering the entire image. For each object i (0 ≤ i ≤ N), m_i ∈ R^{H×W} is a binary matrix whose values are 1 inside the ground-truth region and 0 otherwise. The mask pooling is conducted concurrently for all pyramid levels; at each level, the object-wise masks at the input resolution are down-scaled to the corresponding resolution to become scale-specific. At the p-th scale, the appearance embedding a_i ∈ R^C is obtained by the channel-broadcasted Hadamard product between the projected feature map F_proj(X_p) ∈ R^{H_p×W_p×C} and the down-scaled object mask ∈ R^{H_p×W_p}, followed by global sum pooling. F_proj(·) is a single 3×3 conv layer. Thus, we collect the appearance embeddings A_p = {a_i ∈ R^C}_{i=0}^{N} for each object at level p (likewise for the other levels).

Inter-object Relation Adapter

Given the label and appearance embeddings, we formulate the inter-object relation adaptation as a cross-attention process. As shown in Figure 2, this process is executed at every student appearance pyramid scale to retrieve the interacted embeddings. We omit the pyramid scale subscript below for brevity. During the cross attention, sequences of key and query tokens are used to compute a KQ-attention relation that aggregates the values into attention outputs. To achieve label-guided information adaptation, we use the appearance embeddings A at the current scale as queries, and the scale-invariant label embeddings L as keys and values. The attention scheme measures the correlation between lower-level structural appearance information and higher-level label semantics among objects, and then reassembles the informative label embeddings for dynamic adaptation. Before conducting attention, the queries, keys, and values are transformed by linear layers f_Q, f_K and f_V, respectively. We then compute the interacted embedding u_i ∈ R^C for the i-th object by weighting each transformed label embedding f_V(l_j) with the label-appearance correlation factor w_ij:

u_i = Σ_{j=0}^{N} w_ij f_V(l_j)    (1)

w_ij is calculated by a scaled dot-product between the i-th appearance embedding a_i and the j-th label embedding l_j, followed by a softmax operation:

w_ij = exp(⟨f_Q(a_i), f_K(l_j)⟩ / τ) / Σ_{k=0}^{N} exp(⟨f_Q(a_i), f_K(l_k)⟩ / τ)    (2)

where ⟨·, ·⟩ denotes the inner product and τ = √C is the denominator for variance rectification (Vaswani et al. 2017). For more robust attention modeling, the paradigm actually involves T sets of concurrent operations, termed heads, that produce partial interacted embeddings in parallel. By concatenating the partial interacted embeddings from all heads and applying a linear projection f_P, we obtain the interacted embeddings E = {e_i ∈ R^C}_{i=0}^{N} for all objects:

e_i = f_P([u_i^1; u_i^2; ...; u_i^T])    (3)

where [;] denotes the concatenation operator that combines the partial embeddings along the channel dimension. The resulting embeddings are scale-specific, like the appearance embeddings. As mentioned above, we obtain interacted embeddings across scales by iterating over all feature scales. Technically, the above computation is accomplished by means of multi-head self-attention (MHSA) (Vaswani et al. 2017). Note that our framework is decoupled from the specific choice of attention module; as shown in this paper, LGD is effective even with the naive transformer attention. It would likely perform even better with advanced variants such as the focal transformer (Yang et al. 2021), but that is beyond our scope.
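As a concrete reference, here is a minimal single-scale PyTorch sketch of the mask pooling and of the cross-attention in Equations (1)-(3). Module and function names are hypothetical, and the T-head split (T = 8 in our experiments) is collapsed into a single head for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_pool(proj_feat, masks):
    """proj_feat: (C, Hp, Wp) projected student map at one scale (after the 3x3 conv F_proj);
    masks: (N+1, H, W) binary object masks at input resolution (index 0 = context)."""
    m = F.interpolate(masks[None].float(), size=proj_feat.shape[-2:], mode="nearest")[0]
    # Channel-broadcasted Hadamard product, then global sum pooling per object.
    return torch.einsum("chw,nhw->nc", proj_feat, m)       # (N+1, C) appearance embeddings

class RelationAdapter(nn.Module):
    """Single-head form of Eqs. (1)-(3): appearance embeddings query the label embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.f_q = nn.Linear(dim, dim)
        self.f_k = nn.Linear(dim, dim)
        self.f_v = nn.Linear(dim, dim)
        self.f_p = nn.Linear(dim, dim)
        self.tau = math.sqrt(dim)                          # scaled dot-product temperature

    def forward(self, appearance, label):                  # both (N+1, C)
        q, k, v = self.f_q(appearance), self.f_k(label), self.f_v(label)
        w = torch.softmax(q @ k.t() / self.tau, dim=-1)    # Eq. (2): label-appearance correlation
        u = w @ v                                          # Eq. (1): weighted aggregation
        return self.f_p(u)                                 # Eq. (3) with a single head
```

In a full multi-head implementation, `torch.nn.MultiheadAttention` with the appearance embeddings as queries and the label embeddings as keys/values would play the same role.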
Intra-object Knowledge Mapper

To make the 1D interacted embeddings applicable to the widely-used intermediate feature distillation (Li, Jin, and Yan 2017; Wang et al. 2019) for detection, we map them onto the 2D feature map space to fetch the instructive knowledge. Naturally, for each pyramid scale p (1 ≤ p ≤ P), the resolutions of the resulting maps are confined to be identical to those of the corresponding student feature maps. Intuitively, since the spatial topology is not maintained in the compact label encoding (see the Label-appearance Encoder section), it is important to recover the localization information of each object to achieve geometric alignment. Object bounding box regions serve as good heuristics: we fill each object-bound interacted embedding within its corresponding ground-truth box region on a zero-initialized feature map. In practice, for each object i, we acquire its feature map at the p-th scale by matrix multiplication between the vectorized object mask m_i ∈ R^{H_p W_p × 1} and the projected interacted embedding e_i. All these object-wise maps are added up into a unified one, followed by a refinement module F_ref(·), to form the instructive knowledge:

X_I^p = F_ref( m_0 F_ctx(e_0) + G( Σ_{i=1}^{N} m_i F_inst(e_i) ) )    (4)

where F_ctx(e_0) and F_inst(e_i) ∈ R^{1×C} (1 ≤ i ≤ N) are the transposes of the projected context and normal object interacted embeddings, respectively. Both F_ctx(·) and F_inst(·) are single fc layers. G(·) is a single 3×3 conv layer. F_ref(·) starts with a ReLU followed by three 3×3 conv layers. Thus, we collect the instructive knowledge X_I = {X_I^p ∈ R^{H_p×W_p×C}}_{p=1}^{P} at all scales. Beyond the applicability consideration, the above mapping implies a spirit of intra-object regularization (Yun et al. 2020; Law and Deng 2018; Chen et al. 2020b), which enforces activation neurons inside the same foreground region of the student appearance representations to be close (through the subsequent distillation in Equation 5). Moreover, these instructive representations are supervised with the detection loss to ensure their representation capability (Equation 6).

Before distillation, an adaptation head F_adapt(·) is used to adapt the student representations, following FitNet. We conduct knowledge transfer between the instructive representations X_I^p and the adapted student features X_S^p = F_adapt(X_p) at each feature scale. We apply InstanceNorm (Ulyanov, Vedaldi, and Lempitsky 2016) to both feature maps to eliminate appearance and label style information, followed by a mean-square-error (MSE) loss:

L_distill = (1 / N_total) Σ_{p=1}^{P} || X_S^p − X_I^p ||^2    (5)

where P is the total number of pyramid levels and N_total = Σ_{p=1}^{P} H_p W_p C indicates the total size of the feature pyramid tensors. Following the gradient-stopping technique suggested in previous studies (Hao et al. 2020; Hoffman, Gupta, and Darrell 2016), we detach the instructive representations X_I when calculating the distillation loss to avoid model collapse. Besides the distillation loss and the detection loss for optimizing the student detector, we further ensure the quality of the instructive representations and their consistency with the student representations by sharing the detection head for supervision.
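Below is a minimal single-scale PyTorch sketch of the knowledge mapping in Equation (4) and the normalized distillation loss in Equation (5). Layer widths, activation placement inside F_ref, and module names are assumptions for illustration; the per-scale mean reduction stands in for the global 1/N_total normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeMapper(nn.Module):
    """Maps interacted embeddings back to a 2D map at one pyramid scale (Eq. 4)."""
    def __init__(self, dim=256):
        super().__init__()
        self.f_ctx = nn.Linear(dim, dim)                   # context projection (single fc)
        self.f_inst = nn.Linear(dim, dim)                  # instance projection (single fc)
        self.g = nn.Conv2d(dim, dim, 3, padding=1)         # 3x3 conv over the summed instance map
        self.refine = nn.Sequential(                       # F_ref: ReLU + three 3x3 convs (assumed layout)
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, inter_emb, masks_p):
        """inter_emb: (N+1, C) interacted embeddings (index 0 = context);
        masks_p: (N+1, Hp, Wp) object masks down-scaled to this pyramid level."""
        n, c = inter_emb.shape
        hp, wp = masks_p.shape[-2:]
        flat = masks_p.reshape(n, hp * wp, 1).float()                        # vectorized masks m_i
        ctx_map = flat[0] @ self.f_ctx(inter_emb[0:1])                       # (Hp*Wp, C)
        inst_map = (flat[1:] @ self.f_inst(inter_emb[1:]).unsqueeze(1)).sum(0)
        to_chw = lambda x: x.t().reshape(1, c, hp, wp)
        fused = to_chw(ctx_map) + self.g(to_chw(inst_map))                   # inner term of Eq. (4)
        return self.refine(fused)                                            # instructive knowledge X_I^p

def distill_loss(student_feat, instructive_feat):
    """Eq. (5) at one scale: InstanceNorm both maps, detach the instructive side, then MSE."""
    s = F.instance_norm(student_feat)
    t = F.instance_norm(instructive_feat.detach())
    return F.mse_loss(s, t)
```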
The overall detection loss is:

L_det = L_det^S(H(X), Y) + L_det^I(H(X_I), Y)    (6)

where X / X_I denote the student / instructive representations across scales, L_det^S / L_det^I denote the detection losses (classification and regression) upon them, H(·) refers to the shared detection head, and Y stands for the label set (boxes and categories). In summary, the total training objective is:

L_total = L_det + λ L_distill    (7)

where λ is a trade-off weight for the distillation term; we simply adopt λ = 1 throughout all experiments. For stable training, the distillation starts after 30k iterations, since it can be detrimental while the instructive knowledge is still insufficiently optimized (Hao et al. 2020; Liu et al. 2020). The student detector backbone is frozen for the first 10k iterations under the 1× training schedule and 20k under the 2× schedule.
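A minimal sketch of how the pieces combine per training step under Equations (6)-(7). The `extract_pyramid`, `losses`, and `lgd_modules` interfaces are hypothetical stand-ins (the student-side adaptation head F_adapt is omitted for brevity); `distill_loss` refers to the sketch above.

```python
def training_step(student, shared_head, lgd_modules, images, targets, iteration,
                  distill_start=30_000, lam=1.0):
    feats = student.extract_pyramid(images)                 # student representations X (all scales)
    instructive = lgd_modules(feats, targets)               # instructive knowledge X_I (all scales)
    loss = shared_head.losses(feats, targets)               # L_det^S, first term of Eq. (6)
    loss = loss + shared_head.losses(instructive, targets)  # L_det^I, same (shared) detection head
    if iteration >= distill_start:                          # distillation delayed for stability
        loss = loss + lam * sum(distill_loss(s, t) for s, t in zip(feats, instructive))
    return loss                                             # L_total, Eq. (7) with lambda = 1
```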
Experiments

Setup

The proposed framework is built upon Detectron2 (Wu et al. 2019). Experiments are run with batch size 16 on 8 GPUs. Inputs are resized such that the shorter side is no more than 800 pixels. We use the SGD optimizer with 0.9 momentum and 1e-4 weight decay. The multi-head attention in the inter-object relation adapter uses T = 8 heads, following common practice. For brevity, we denote by R-50, R-101 and R-101 DCN the ResNet-50, ResNet-101 and ResNet-101 with deformable convolutions v2 (Zhu et al. 2019) backbones. Main experiments are validated on the MS-COCO (Lin et al. 2014) dataset; we also test on Pascal VOC (Everingham et al. 2010) and CrowdHuman (Shao et al. 2018).

MS-COCO is a challenging object detection dataset with 80 categories. Mean average precision (AP) is used as the major metric. Following common protocol (He, Girshick, and Dollár 2019), we use the trainval-115k and minival-5k subsets for training and evaluation, respectively. We denote by 1× the schedule of 90k training iterations, where the learning rate is divided by 10 at 60k and 80k iterations. By analogy, 2× denotes 180k iterations with milestones at 120k and 160k. We abbreviate single- and multi-scale training as ss and ms.

Pascal VOC is a dataset with 20 classes. The union of the trainval-2007 and trainval-2012 subsets is used for training, leaving test-2007 for validation. We report mAP and AP50/AP75 (AP with overlap threshold 0.5/0.75). Models are trained for 24k iterations with milestones at 18k and 22k.

CrowdHuman is the largest crowded pedestrian detection dataset, containing 23 people per image on average. It includes 15k and 4,370 images for training and validation, respectively. The major metric is the average log miss rate over false positives per image (termed mMR, lower is better). Models are trained for 30 epochs with the learning rate decayed at the 24th and 27th epochs.

Comparison to Teacher-free Methods

Detailed Comparison with State-of-the-Art. As shown in Figure 3 and Table 1, we compare our LGD framework with the baseline and the previous teacher-free SOTA, i.e., the LabelEnc (Hao et al. 2020) regularization method. We verify the efficacy on MS-COCO with three popular detectors: Faster R-CNN (Ren et al. 2015), RetinaNet (Lin et al. 2017b) and FCOS (Tian et al. 2019). Figure 3 shows the result trend as the student detector grows stronger (longer schedules: 1× → 2×, scale augmentation: ss → ms, and larger backbones: R-50 → R-101 → R-101 DCN). Our model compares favorably to, or is slightly better than, LabelEnc in the earlier (weaker) settings. For RetinaNet or FCOS with R-50 under the 2× ss setting, the baseline runs into overfitting while our method tackles that and achieves up to a 2.8% mAP gain.

Table 1: Detailed comparison with previous SOTA.

| Head | Backbone | Setting | Baseline | LabelEnc | Ours |
|---|---|---|---|---|---|
| Faster R-CNN | R-50 | 1× ss | 37.6 | 38.1 | 38.3 |
| Faster R-CNN | R-50 | 1× ms | 37.9 | 38.4 | 38.6 |
| Faster R-CNN | R-50 | 2× ss | 38.0 | 38.9 | 39.2 |
| Faster R-CNN | R-50 | 2× ms | 39.6 | 39.6 | 40.4 |
| Faster R-CNN | R-101 | 2× ms | 41.7 | 41.4 | 42.3 |
| Faster R-CNN | R-101 DCN | 2× ms | 44.1 | 44.0 | 44.9 |
| RetinaNet | R-50 | 1× ss | 36.6 | 37.8 | 38.3 |
| RetinaNet | R-50 | 1× ms | 37.4 | 38.5 | 38.5 |
| RetinaNet | R-50 | 2× ss | 36.2 | 39.0 | 39.0 |
| RetinaNet | R-50 | 2× ms | 38.8 | 39.6 | 40.3 |
| RetinaNet | R-101 | 2× ms | 40.6 | 41.5 | 42.1 |
| RetinaNet | R-101 DCN | 2× ms | 43.1 | 43.5 | 44.4 |
| FCOS | R-50 | 1× ss | 38.8 | 39.6 | 39.7 |
| FCOS | R-50 | 1× ms | 39.4 | 40.0 | 40.1 |
| FCOS | R-50 | 2× ss | 38.1 | 41.0 | 40.9 |
| FCOS | R-50 | 2× ms | 41.0 | 41.8 | 42.3 |
| FCOS | R-101 | 2× ms | 42.9 | 43.6 | 44.1 |
| FCOS | R-101 DCN | 2× ms | 44.9 | 45.6 | 46.3 |

Notably, as the detector setting becomes stronger, the gain of LabelEnc shrinks rapidly while ours still consistently boosts the performance. For Faster R-CNN with R-101 and R-101 DCN, LabelEnc underperforms the baseline (41.4 vs. 41.7 and 44.0 vs. 44.1). In contrast, our method manages to improve upon the baseline and surpasses LabelEnc by around 1% mAP, indicating a higher upper limit. Likewise, for RetinaNet and FCOS with R-101 and R-101 DCN, our method steadily achieves gains of 1.2-1.5%. Note that in traditional distillation schemes, it remains unclear how to find a suitable teacher for such strong students.

Comparison with Typical Methods. As aforementioned, teacher-free schemes other than LabelEnc are not designed for detection. For completeness, we transfer and re-implement typical methods such as DML, tf-KD and BAN to detection by substituting their logits distillation with the intermediate feature distillation used in mainstream detection KD literature (except for tf-KD). As shown in Table 2, these methods obtain slight improvements or are even harmful (tf-KD). BAN performs the best among them: it obtains a 0.6% improvement on RetinaNet 1× ms R-50, at the cost of an effectively 3× training period. However, it fails to generalize to other settings.

Table 2: Comparison with typical teacher-free methods. † denotes our transfer to detection. ‡ denotes the 3rd-generation result reported in the BAN literature, which requires a 3× longer training schedule, far more than the regular 1×. tf-KD is undefined for RetinaNet with focal loss.

| Method | RetinaNet 1× ss | RetinaNet 1× ms | FRCN 1× ss | FRCN 1× ms |
|---|---|---|---|---|
| Baseline | 36.6 | 37.4 | 37.6 | 37.9 |
| DML† | 37.0 | 37.4 | 37.6 | 37.9 |
| tf-KD† | - | - | 37.5 | 37.8 |
| BAN†,‡ | 36.8 | 38.0 | 37.6 | 38.1 |
| Ours | 38.3 | 38.5 | 38.3 | 38.6 |

Figure 3: Result tendency as the detector grows stronger on three typical detectors, for LabelEnc and ours. In each sub-figure, there are six settings from left to right: R-50-{1× ss, 1× ms, 2× ss, 2× ms}, R-101-2× ms, R-101 DCN-2× ms.

Comparison with Classical Teacher-based KD Method

We also compare the proposed teacher-free LGD with the classical teacher-based method FGFI (Wang et al. 2019). Experiments are conducted on RetinaNet 2× ms with backbones R-50, R-101 and R-101 DCN, respectively. As shown in Figure 1 and Table 3, our framework performs better as the student gets stronger. For the strong detector with the R-101 DCN backbone, LGD is 0.9% and 1.4% superior to LabelEnc and FGFI, respectively. The diminishing benefit of FGFI might be attributed to the lack of a much stronger teacher (Zhang and Ma 2021; Yao et al. 2021).

Table 3: Results corresponding to Figure 1 (RetinaNet 2× ms). Our method is effective for stronger students compared with the others.

| Method | Teacher | Student R-50 | Student R-101 | Student R-101 DCN |
|---|---|---|---|---|
| Baseline | N/A | 38.8 | 40.6 | 43.1 |
| LabelEnc | N/A | 39.6 | 41.5 | 43.5 |
| FGFI | R-101 | 39.8 | 40.7 | 42.4 |
| FGFI | R-101 DCN | 40.5 | 41.9 | 43.0 |
| Ours | N/A | 40.3 | 42.1 | 44.4 |
We believe it is possible that FGFI with a larger teacher, or other stronger teacher-based detection KD methods, could outperform ours, but such a teacher-presumed setting is not the design purpose of our framework.

Ablation Studies

Label Encoding. In this work, we adopt PointNet (Qi et al. 2017) as the label encoding module. In fact, other modules are also applicable. We compare three alternatives under the 2× ms schedule on MS-COCO with RetinaNet based on the ResNet-50 backbone. Specifically, we compare PointNet with an MLP-only network and an encoder network composed of 6 scaled dot-product attention heads (Vaswani et al. 2017), abbreviated as TransEnc. As with PointNet, we feed the label descriptors into these networks to obtain label embeddings, which are then passed to the remaining LGD modules for examination. All variants achieve good results, as shown in Table 4, which demonstrates the robustness of our framework. The PointNet we finally adopt is the best of the three, perhaps owing to its local-global relationship modeling among label descriptors.

Table 4: Label encoder ablation.

| Method | AP | APs | APm | APl | ΔAP |
|---|---|---|---|---|---|
| N/A | 36.6 | 21.2 | 40.4 | 48.1 | - |
| MLP | 37.9 | 21.5 | 41.9 | 49.7 | +1.3 |
| TransEnc | 37.9 | 21.7 | 41.6 | 50.2 | +1.3 |
| PointNet | 38.3 | 23.2 | 42.0 | 50.0 | +1.7 |

Inter-object Relation Adapter. As described in the Method section, the proposed method adopts the student appearance embeddings as queries and the label embeddings as keys and values for the guided inter-object relation modeling (abbreviated as "Student"). We also experiment with the reverse option that uses label embeddings as queries (abbreviated as "Label"). As shown in Table 5, for RetinaNet and FRCN 1× ss with R-50 as backbone, the adopted Student mode is 0.7% and 0.5% better than the Label mode, respectively.

Table 5: Inter-object relation adapter ablations with RetinaNet, Faster R-CNN and FCOS (R-50, 1× ss), varying the interaction query.

| Method | Baseline | Query: Label | Query: Student (Ours) |
|---|---|---|---|
| RetinaNet | 36.6 | 37.6 (+1.0) | 38.3 (+1.7) |
| FRCN | 37.6 | 37.8 (+0.2) | 38.3 (+0.7) |
| FCOS | 38.8 | 39.6 (+0.8) | 39.7 (+0.9) |

Intra-object Knowledge Mapper. As specified in Equation 4, the instructive knowledge depends on the interacted embeddings of both the actual objects and the virtual context. We ablate their usage in Table 6a. As expected, the context alone is not helpful, since mere context provides nothing useful for object detection. It does enhance the performance when combined with the object embeddings (+0.3%).

Table 6: Intra-object knowledge mapper ablations. (a) Embedding participation. (b) Head sharing choice.

(a)
| Object | Context | AP |
|---|---|---|
| | | 36.6 |
| | ✓ | 36.6 |
| ✓ | | 38.0 |
| ✓ | ✓ | 38.3 |

(b)
| Method | Mode | AP |
|---|---|---|
| RetinaNet | baseline | 36.6 |
| RetinaNet | unshared | 37.8 |
| RetinaNet | shared | 38.3 |
| FRCN | baseline | 37.6 |
| FRCN | unshared | 37.7 |
| FRCN | shared | 38.3 |

Head Sharing. We also examine the head sharing paradigm, as shown in Table 6b. Sharing the detection head between the student and instructive representations is consistently better.

Training Efficiency

Though none of the distillation and regularization methods affect the inference speed of the student, they can be training-inefficient due to the prerequisite pretraining and distillation processes. This matters in practical applications but is seldom discussed. As shown in Table 7, we benchmark (1) Overall: the overall training cost, and (2) Method-Specific: the overall cost excluding inherent student learning (a part shared by all methods).

Table 7: Comparison of training cost (hours).

| Method | Pre-training | Overall | Method-Specific |
|---|---|---|---|
| Baseline | N/A | 12.1 | N/A |
| FGFI | 17.0 | 35.5 | 23.4 |
| LabelEnc | 14.9 | 24.5 | 12.4 |
| Ours | N/A | 23.5 | 11.4 |
The examination is run on 8 Tesla V100 GPUs with RetinaNet 2× ss R-50. We use the corresponding detector with an R-101 backbone as the teacher for FGFI. Compared with FGFI, we save 34% (23.5 vs. 35.5 hours) overall and 51% (11.4 vs. 23.4 hours) on the method-specific part. There could be stronger teacher choices for FGFI, or other modern teacher-based KDs that outperform ours, but they would bring a heavier training burden and are beyond our discussion scope. Analogous to FGFI, LabelEnc introduces a two-stage training paradigm, albeit without a pretrained teacher. Compared with LabelEnc, our method consumes 1 hour less and is trained in a one-step fashion. In practice, LabelEnc consumes 3.8 GB of extra GPU memory beyond that of the inherent detector, while ours consumes 2.5 GB extra (a 34% relative saving) yet performs better.

Versatility

Extended Datasets. (a) Pascal VOC: We conduct experiments with Faster R-CNN and RetinaNet with R-50 under the 2× ms setting. As shown in Table 8, our method improves the results by 1.7% (Faster R-CNN) and 2.3% (RetinaNet). Notably, the AP75 metric of RetinaNet improves by 3.0%, showing the efficacy.

Table 8: Pascal VOC.

| Method | AP | AP50 | AP75 |
|---|---|---|---|
| FRCN | 55.1 | 81.9 | 61.0 |
| +ours | 56.8 (+1.7) | 82.5 (+0.6) | 63.3 (+2.3) |
| RetinaNet | 56.6 | 81.4 | 61.3 |
| +ours | 58.9 (+2.3) | 82.6 (+1.2) | 64.3 (+3.0) |

(b) CrowdHuman: We also verify our method on the largest crowded detection dataset, CrowdHuman. As shown in Table 9, our method significantly improves the mMR (lower is better) by 2.3% and 1.5% for Faster R-CNN and RetinaNet, respectively. This further demonstrates the generality of the proposed LGD method for real-world applications.

Table 9: CrowdHuman. mMR: the lower, the better.

| Detector | RetinaNet | FRCN |
|---|---|---|
| Baseline | 57.9 | 48.7 |
| Ours | 56.4 (-1.5) | 46.4 (-2.3) |

Instance Segmentation. To further validate the versatility, we conduct experiments on instance segmentation on MS-COCO. In this task, a detector is required to simultaneously localize and segment each object. We experiment with Mask R-CNN (He et al. 2017). To fully utilize the labels, we replace the object-wise box masks (see the Appearance Encoding paragraph) with the segmentation masks as a better spatial prior. As shown in Table 10, our method boosts the box and mask AP by 1.0% for Mask R-CNN R-50 and by 0.8% for R-101. Please refer to the supplementary materials for more details.

Table 10: Comparison on instance segmentation.

| Method | APbox | APmask |
|---|---|---|
| Mask R-CNN (R-50) | 38.8 | 35.2 |
| +ours | 39.8 (+1.0) | 36.2 (+1.0) |
| Mask R-CNN (R-101) | 41.2 | 37.2 |
| +ours | 42.0 (+0.8) | 38.0 (+0.8) |

Conclusion

In this paper, we propose a brand-new self-distillation framework, termed LGD, for knowledge distillation in general object detection. It absorbs the spirit of inter- and intra-object relationship modeling into forming instructive knowledge from regular labels and student representations. LGD runs in an online manner with decent performance and relatively low training cost. It is superior to previous teacher-free methods and to a classical teacher-based KD method, especially for strong student detectors, showing higher potential. We hope LGD can serve as a baseline for future detection KD methods without pretrained teachers.

References

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450.
Cai, Q.; Pan, Y.; Ngo, C.; Tian, X.; Duan, L.; and Yao, T. 2019. Exploring Object Relation in Mean Teacher for Cross-Domain Detection. In CVPR.
Chen, D.; Mei, J.-P.; Wang, C.; Feng, Y.; and Chen, C. 2020a. Online Knowledge Distillation with Diverse Peers. In AAAI.
Chen, G.; Choi, W.; Yu, X.; Han, T. X.; and Chandraker, M. 2017. Learning Efficient Object Detection Models with Knowledge Distillation. In NeurIPS.
Chen, Y.; Zhang, Z.; Cao, Y.; Wang, L.; Lin, S.; and Hu, H. 2020b. RepPoints v2: Verification Meets Regression for Object Detection. In NeurIPS.
Dai, X.; Jiang, Z.; Wu, Z.; Bao, Y.; Wang, Z.; Liu, S.; and Zhou, E. 2021. General Instance Distillation for Object Detection. In CVPR.
Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2010. The Pascal Visual Object Classes (VOC) Challenge. IJCV.
Friedman, J.; Hastie, T.; Tibshirani, R.; et al. 2001. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York.
Furlanello, T.; Lipton, Z. C.; Tschannen, M.; Itti, L.; and Anandkumar, A. 2018. Born-Again Neural Networks. In ICML.
Guo, J.; Han, K.; Wang, Y.; Wu, H.; Chen, X.; Xu, C.; and Xu, C. 2021. Distilling Object Detectors via Decoupled Features. In CVPR.
Guo, Q.; Wang, X.; Wu, Y.; Yu, Z.; Liang, D.; Hu, X.; and Luo, P. 2020. Online Knowledge Distillation via Collaborative Learning. In CVPR.
Hao, M.; Liu, Y.; Zhang, X.; and Sun, J. 2020. LabelEnc: A New Intermediate Supervision Method for Object Detection. In ECCV.
He, K.; Girshick, R. B.; and Dollár, P. 2019. Rethinking ImageNet Pre-Training. In ICCV.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. B. 2017. Mask R-CNN. In ICCV.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
Hoffman, J.; Gupta, S.; and Darrell, T. 2016. Learning with Side Information through Modality Hallucination. In CVPR.
Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; and Wei, Y. 2018. Relation Networks for Object Detection. In CVPR.
Huang, Z.; Zou, Y.; Kumar, B. V. K. V.; and Huang, D. 2020. Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection. In NeurIPS.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.
Jaderberg, M.; Simonyan, K.; Zisserman, A.; and Kavukcuoglu, K. 2015. Spatial Transformer Networks. In NeurIPS.
Kim, K.; Ji, B.; Yoon, D.; and Hwang, S. 2020. Self-Knowledge Distillation: A Simple Way for Better Generalization. arXiv preprint arXiv:2006.12000.
Lan, X.; Zhu, X.; and Gong, S. 2018. Knowledge Distillation by On-the-Fly Native Ensemble. In NeurIPS.
Law, H.; and Deng, J. 2018. CornerNet: Detecting Objects as Paired Keypoints. In ECCV.
Li, Q.; Jin, S.; and Yan, J. 2017. Mimicking Very Efficient Network for Object Detection. In CVPR.
Lin, T.; Dollár, P.; Girshick, R. B.; He, K.; Hariharan, B.; and Belongie, S. J. 2017a. Feature Pyramid Networks for Object Detection. In CVPR.
Lin, T.; Goyal, P.; Girshick, R. B.; He, K.; and Dollár, P. 2017b. Focal Loss for Dense Object Detection. In ICCV.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
Liu, B.; Rao, Y.; Lu, J.; Zhou, J.; and Hsieh, C.-J. 2020. MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation. In ECCV.
Mostajabi, M.; Maire, M.; and Shakhnarovich, G. 2018. Regularizing Deep Networks by Modeling and Predicting Label Structure. In CVPR.
Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational Knowledge Distillation. In CVPR.
Peng, H.; Du, H.; Yu, H.; Li, Q.; Liao, J.; and Fu, J. 2020. Cream of the Crop: Distilling Prioritized Paths for One-Shot Neural Architecture Search. In NeurIPS.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR.
Ren, S.; He, K.; Girshick, R. B.; and Sun, J. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS.
Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2015. FitNets: Hints for Thin Deep Nets. In ICLR.
Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; and Sun, J. 2018. CrowdHuman: A Benchmark for Detecting Human in a Crowd. arXiv preprint arXiv:1805.00123.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR.
Tian, Z.; Shen, C.; Chen, H.; and He, T. 2019. FCOS: Fully Convolutional One-Stage Object Detection. In ICCV.
Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2016. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. In NeurIPS.
Wang, T.; Yuan, L.; Zhang, X.; and Feng, J. 2019. Distilling Object Detectors with Fine-Grained Feature Imitation. In CVPR.
Wei, Y.; Pan, X.; Qin, H.; Ouyang, W.; and Yan, J. 2018. Quantization Mimic: Towards Very Tiny CNN for Object Detection. In ECCV.
Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; and Girshick, R. 2019. Detectron2. https://github.com/facebookresearch/detectron2.
Yang, C.; Xie, L.; Su, C.; and Yuille, A. L. 2019. Snapshot Distillation: Teacher-Student Optimization in One Generation. In CVPR.
Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; and Gao, J. 2021. Focal Self-Attention for Local-Global Interactions in Vision Transformers. arXiv:2107.00641.
Yao, L.; Pi, R.; Xu, H.; Zhang, W.; Li, Z.; and Zhang, T. 2021. G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-Guided Feature Imitation. arXiv:2108.07482.
Yuan, L.; Tay, F. E. H.; Li, G.; Wang, T.; and Feng, J. 2020. Revisiting Knowledge Distillation via Label Smoothing Regularization. In CVPR.
Yun, S.; Park, J.; Lee, K.; and Shin, J. 2020. Regularizing Class-Wise Predictions via Self-Knowledge Distillation. In CVPR.
Zhang, L.; and Ma, K. 2021. Improve Object Detection with Feature-based Knowledge Distillation: Towards Accurate and Efficient Detectors. In ICLR.
Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; and Ma, K. 2019. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In ICCV.
Zhang, Y.; Lan, Z.; Dai, Y.; Zeng, F.; Bai, Y.; Chang, J.; and Wei, Y. 2020. Prime-Aware Adaptive Distillation. In ECCV.
Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep Mutual Learning. In CVPR.
Zhu, X.; Hu, H.; Lin, S.; and Dai, J. 2019. Deformable ConvNets V2: More Deformable, Better Results. In CVPR.