# Hierarchical Self-supervised Augmented Knowledge Distillation

Chuanguang Yang1,2, Zhulin An1, Linhang Cai1,2 and Yongjun Xu1
1Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
{yangchuanguang, anzhulin, cailinhang19g, xyj}@ict.ac.cn

Abstract

Knowledge distillation often involves how to define and transfer knowledge from teacher to student effectively. Although recent self-supervised contrastive knowledge achieves the best performance, forcing the network to learn such knowledge may damage the representation learning of the original class recognition task. We therefore adopt an alternative self-supervised augmented task to guide the network to learn the joint distribution of the original recognition task and the self-supervised auxiliary task. We demonstrate that it provides richer knowledge that improves the representation power without losing the normal classification capability. Moreover, previous methods only transfer probabilistic knowledge between the final layers, which is incomplete. We propose to append several auxiliary classifiers to hierarchical intermediate feature maps to generate diverse self-supervised knowledge and to perform one-to-one transfer to teach the student network thoroughly. Our method significantly surpasses the previous SOTA SSKD with an average improvement of 2.56% on CIFAR-100 and an improvement of 0.77% on ImageNet across widely used network pairs. Codes are available at https://github.com/winycg/HSAKD.

1 Introduction

Orthogonal to efficient network architecture designs [Yang et al., 2019; Zhu et al., 2019; Yang et al., 2020], Knowledge Distillation (KD) [Hinton et al., 2015] aims to transfer knowledge from a pre-trained high-capacity teacher network to a lightweight student network. The student's performance can often be improved significantly, benefiting from the additional guidance compared with independent training. The current pattern of KD can be summarized into two critical aspects: (1) what kind of knowledge encapsulated in the teacher network can be explored for KD; (2) how to effectively transfer knowledge from teacher to student.

Figure 1: Difference of self-supervised knowledge between SSKD and our method. (a) Self-supervised contrastive relationship [Xu et al., 2020]: SSKD applies contrastive learning by pulling an image and its transformed version close together against other negative images in the feature embedding space, and defines the contrastive relationship as knowledge. (b) Our introduced self-supervised augmented distribution: our method unifies the original task and the self-supervised auxiliary task into a joint task and defines the self-supervised augmented distribution as knowledge.

The original KD [Hinton et al., 2015] minimizes the KL-divergence of predictive class probability distributions between the student and teacher networks, which makes intuitive sense: it forces the student to mimic how a superior teacher generates the final predictions. However, such highly abstract dark knowledge ignores much comprehensive information encoded in hidden layers. Later works naturally proposed to transfer feature maps [Romero et al., 2015] or their refined information [Zagoruyko and Komodakis, 2017; Heo et al., 2019; Ahn et al., 2019] between intermediate layers of teacher and student.
A reasonable interpretation of the success of feature-based distillation is that hierarchical feature maps throughout the CNN represent the intermediate learning process with an inductive bias toward the final solution. Beyond knowledge alignment limited to individual samples, more recent works [Peng et al., 2019; Tian et al., 2020] leverage cross-sample correlations or dependencies in the high-layer feature embedding space. Inspired by the recent success of self-supervised visual representation learning [Chen et al., 2020], SSKD [Xu et al., 2020] introduces an auxiliary self-supervised task to extract richer knowledge. As shown in Fig. 1a, SSKD proposes transferring cross-sample self-supervised contrastive relationships, which makes it achieve superior performance in the field of KD.

However, the self-supervised pretext task utilized in SSKD, which forces the network to learn invariant feature representations among images transformed by random rotations from {0°, 90°, 180°, 270°}, may destroy the original visual semantics (e.g. 6 vs. 9). It would increase the difficulty of representation learning for semantic recognition tasks. As validated in Table 1, applying random rotation as an additional data augmentation degrades the classification performance, especially on the more challenging Tiny ImageNet.

| Dataset | Baseline | +DA (Rotation) | +SAL (Rotation) |
|---|---|---|---|
| CIFAR-100 | 78.01 | 77.75 (-0.26) | 79.76 (+1.75) |
| Tiny ImageNet | 63.69 | 62.66 (-1.03) | 65.81 (+2.12) |

Table 1: Top-1 accuracy on ResNet-18 using rotation as a data augmentation (DA) and as a self-supervised augmented label (SAL).

To effectively learn knowledge from self-supervised representation learning without interfering with the original fully-supervised classification task, we use a unified task that combines the label spaces of the original task and the self-supervised task into a joint label space, as shown in Fig. 1b. This task is partly inspired by the previous seminal self-supervised representation learning works [Gidaris et al., 2018; Lee et al., 2020]. We further build on these prior works to explore more powerful knowledge for distillation. To verify the effectiveness of the self-supervised augmented label, we also conduct initial exploratory experiments on standard image classification in Table 1. We find that the performance can be significantly improved by SAL, which can be attributed to better feature representations learned from the extra, well-combined self-supervised task. This good performance further motivates us to define the self-supervised augmented distribution as promising knowledge for KD.

Another valuable problem lies in how to transfer the probabilistic knowledge between teacher and student effectively. Vanilla KD aligns probability distributions only in the final layer but ignores the comprehensive knowledge in hidden layers. Feature-based distillation methods provide one-to-one matching between the same convolutional stages of teacher and student. However, matched feature maps may have different semantic abstractions and result in a negative supervisory effect [Passalis et al., 2020]. Compared with feature information, the probability distribution is indeed more robust knowledge for KD, especially when a large architecture gap exists between teacher and student [Tian et al., 2020]. However, it is difficult to explicitly derive comprehensive probability distributions from hidden layers over the original architecture.
Therefore, a natural idea is to append several auxiliary classifiers to the network at various hidden layers to generate multi-level probability distributions from hierarchical feature maps. This allows us to perform comprehensive one-to-one matching in hidden layers in terms of probabilistic knowledge. Moreover, it is also noteworthy that the gap in abstraction level between any matched distributions can be easily reduced thanks to the delicately designed auxiliary classifiers.

We guide all auxiliary classifiers attached to the original network to learn informative self-supervised augmented distributions. Furthermore, we perform Hierarchical Self-supervised Augmented Knowledge Distillation (HSAKD) between teacher and student across all auxiliary classifiers in a one-to-one manner. By taking full advantage of the richer self-supervised augmented knowledge, the student can be guided to learn better feature representations. Note that all auxiliary classifiers are only used to assist knowledge transfer and are dropped during the inference period. The overall contributions are summarized as follows:

- We introduce a self-supervised augmented distribution that encapsulates the unified knowledge of the original classification task and the auxiliary self-supervised task as richer dark knowledge for the field of KD.
- We propose a one-to-one probabilistic knowledge distillation framework leveraging architectural auxiliary classifiers, facilitating comprehensive knowledge transfer and alleviating the mismatch of abstraction levels when a large architecture gap exists.
- HSAKD significantly refreshes the results achieved by the previous SOTA SSKD on standard image classification benchmarks. It can also learn well-generalized feature representations for downstream semantic recognition tasks.

2 Related Work

Knowledge Distillation. The seminal KD [Hinton et al., 2015] popularized the pattern of knowledge transfer with a soft probability distribution. Later methods further explore feature-based information encapsulated in hidden layers for KD, such as intermediate feature maps [Romero et al., 2015], attention maps [Zagoruyko and Komodakis, 2017], Gram matrices [Yim et al., 2017] and activation boundaries [Heo et al., 2019]. More recent works explore cross-sample relationships using high-level feature embeddings with various definitions of edge weight [Park et al., 2019; Peng et al., 2019; Tung and Mori, 2019]. The latest SSKD [Xu et al., 2020] extracts structured knowledge from a self-supervised auxiliary task. Beyond knowledge exploration, [Ahn et al., 2019; Tian et al., 2020] maximize the mutual information between matched features. [Yang et al., 2021] further extend this idea to online KD in a contrastive-based manner. To bridge the gap between the teacher and student, [Passalis et al., 2020] introduce teacher assistant models for smoother KD. However, extra teacher models increase the complexity of the training pipeline. Therefore, we choose to append several well-designed auxiliary classifiers to alleviate the knowledge gap and facilitate comprehensive knowledge transfer.

Self-supervised Representational Learning (SRL). Seminal SRL methods popularized the pattern of guiding the network to predict which transformation has been applied to a transformed image in order to learn feature representations.
Typical transformations can be rotations [Gidaris et al., 2018], jigsaw puzzles [Noroozi and Favaro, 2016] and colorization [Zhang et al., 2016]. More recently, [Misra and Maaten, 2020; Chen et al., 2020] learn invariant feature representations under self-supervised pretext tasks by maximizing the consistency of representations among various transformed versions of the same image. Both SSKD and our HSAKD are related to SRL. SSKD uses the latter SRL pattern to extract knowledge. In contrast, HSAKD combines the former classification-based pattern of SRL with the fully-supervised classification task to extract richer joint knowledge.

Figure 2: Overview of our proposed HSAKD. Both the teacher and student networks are equipped with several auxiliary classifiers after various convolutional stages to capture diverse self-supervised augmented knowledge from hierarchical feature maps. A mimicry loss is applied from the self-supervised augmented distributions of the student $\{q^{S}_{l}(t_j(x);\tau)\}_{l=1}^{L}$ to the corresponding distributions of the teacher $\{q^{T}_{l}(t_j(x);\tau)\}_{l=1}^{L}$ generated from the same feature hierarchies in a one-to-one manner. Following conventional KD, we also force the mimicry from the class probability distribution of the student $p^{S}(t_j(x);\tau)$ to that of the teacher $p^{T}(t_j(x);\tau)$. During the inference period, we only retain the student backbone $f^{S}(\cdot)$ and drop all auxiliary classifiers $\{c^{S}_{l}(\cdot)\}_{l=1}^{L}$, so there is no extra inference cost compared with the original student network.

3.1 Self-supervised Augmented Distribution

We present the difference between the original class probability distribution and the self-supervised augmented distribution using a conventional CNN classification network. A CNN can be decomposed into a feature extractor $\Phi(\cdot;\mu)$ and a linear classifier $g(\cdot;w)$, where $\mu$ and $w$ are weight tensors. Given an input sample $x \in \mathcal{X}$, where $\mathcal{X}$ is the training set, $z = \Phi(x;\mu) \in \mathbb{R}^{d}$ is the extracted feature embedding vector, where $d$ is the embedding size. We consider a conventional $N$-way object classification task with the label space $\mathcal{N} = \{1, \dots, N\}$. The linear classifier attached with softmax normalization maps the feature embedding $z$ to a predictive class probability distribution $p(x;\tau) = \sigma(g(z;w)/\tau) \in \mathbb{R}^{N}$ over the label space, where $\sigma$ is the softmax function, the weight matrix $w \in \mathbb{R}^{N \times d}$, and $\tau$ is a temperature hyper-parameter that scales the smoothness of the distribution.

We introduce an additional self-supervised task to augment the conventional supervised object class space. Learning such a joint distribution can force the network to generate more informative and meaningful predictions, benefiting from the original and auxiliary self-supervised tasks simultaneously. Assume that we define $M$ image transformations $\{t_j\}_{j=1}^{M}$ with the label space $\mathcal{M} = \{1, \dots, M\}$, where $t_1$ denotes the identity transformation, i.e. $t_1(x) = x$. To effectively learn composite knowledge, we combine the class spaces of the original supervised object recognition task and the self-supervised task into a unified task. The label space of this task is $\mathcal{K} = \mathcal{N} \times \mathcal{M}$, where $\times$ is the Cartesian product, so that $|\mathcal{K}| = N * M$, where $|\cdot|$ is the cardinality of the label collection and $*$ denotes multiplication.
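As a concrete illustration of the joint label space described above, the following sketch (assuming PyTorch) builds the $M = 4$ rotated views of a batch together with their joint labels. The flattening rule $k = y * M + j$ and the helper names are our own assumptions for illustration; the released codes may organize the joint space differently.

```python
import torch
import torch.nn as nn

N, M, d = 100, 4, 512   # e.g. CIFAR-100 classes, 4 rotations, embedding size

def augment_with_rotations(x, y):
    """x: (B, C, H, W) images, y: (B,) original class labels in {0, ..., N-1}.
    Returns the M transformed copies t_j(x) and their joint labels in K = N x M."""
    views, joint_labels = [], []
    for j in range(M):                                   # j = 0 is the identity transform
        views.append(torch.rot90(x, k=j, dims=(2, 3)))   # rotate by j * 90 degrees
        joint_labels.append(y * M + j)                   # flatten the pair (y, j) into one index
    return torch.cat(views, dim=0), torch.cat(joint_labels, dim=0)

# The classifier head over the joint space has N * M outputs instead of N.
g_joint = nn.Linear(d, N * M)   # predicts the self-supervised augmented distribution q
g_orig = nn.Linear(d, N)        # predicts the ordinary class distribution p
```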
Given a transformed sample $\tilde{x} \in \{t_j(x)\}_{j=1}^{M}$, obtained by applying one transformation to $x$, $\tilde{z} = \Phi(\tilde{x};\mu) \in \mathbb{R}^{d}$ is the extracted feature embedding vector and $q(\tilde{x};\tau) = \sigma(g(\tilde{z};w)/\tau) \in \mathbb{R}^{N*M}$ is the predictive distribution over the joint label space $\mathcal{K}$, where the weight tensor $w \in \mathbb{R}^{(N*M) \times d}$. We use $p \in \mathbb{R}^{N}$ to denote the normal class probability distribution and $q \in \mathbb{R}^{N*M}$ to denote the self-supervised augmented distribution.

3.2 Auxiliary Architecture Design

It is widely known that feature maps with various resolutions encode various patterns of representational information. Higher-resolution feature maps often present more fine-grained object details, while lower-resolution ones often contain richer global semantic information. To take full advantage of the hierarchical feature maps encapsulated in a single network, we append several intermediate auxiliary classifiers to hidden layers to learn and distill hierarchical self-supervised augmented knowledge.

For ease of notation, we denote a conventional classification network as $f(\cdot)$, which maps an input sample $t_j(x)$, $j \in \mathcal{M}$, to the vanilla class probability distribution $p(t_j(x);\tau) = \sigma(f(t_j(x))/\tau) \in \mathbb{R}^{N}$ over the original class space $\mathcal{N}$. Modern CNNs typically utilize stage-wise convolutional blocks to gradually extract coarser features as the depth of the network increases. For example, the popular ResNet-50 for ImageNet classification contains four consecutive stages, and the feature maps produced by the various stages have different granularities and patterns. Assuming that a network contains $L$ stages, we choose to append an auxiliary classifier after each stage, resulting in $L$ classifiers $\{c_l(\cdot)\}_{l=1}^{L}$, where $c_l(\cdot)$ is the auxiliary classifier after the $l$-th stage. $c_l(\cdot)$ is composed of stage-wise convolutional blocks, a global average pooling layer and a fully connected layer. Denoting the extracted feature map after the $l$-th stage as $F_l$, we obtain the self-supervised augmented distribution inferred by $c_l(\cdot)$ as $q_l(t_j(x);\tau) = \sigma(c_l(F_l)/\tau) \in \mathbb{R}^{N*M}$ over the joint class space $\mathcal{K}$. The overall design of auxiliary classifiers over a 3-stage network is illustrated in Fig. 2 as an example. The detailed design of the auxiliary classifiers for a specific network can be found in our released codes.

3.3 Training the Teacher Network

We denote the teacher backbone network as $f^{T}(\cdot)$ and its $L$ auxiliary classifiers as $\{c^{T}_{l}(\cdot)\}_{l=1}^{L}$. We conduct an end-to-end training process for preparing the teacher network. On the one hand, we train $f^{T}(\cdot)$ on the normal data $x$ with the conventional Cross-Entropy (CE) loss to fit the ground-truth label $y \in \mathcal{N}$, where $p^{T}(x;\tau) = \sigma(f^{T}(x)/\tau) \in \mathbb{R}^{N}$ is the predictive class probability distribution. On the other hand, we aim to train the $L$ auxiliary classifiers $\{c^{T}_{l}(\cdot)\}_{l=1}^{L}$ to learn hierarchical self-supervised augmented distributions. Given an input sample $t_j(x)$, we feed the feature maps $\{F^{T}_{l,j}\}_{l=1}^{L}$ generated from the backbone $f^{T}(\cdot)$ to $\{c^{T}_{l}(\cdot)\}_{l=1}^{L}$, respectively. The predictive self-supervised augmented distribution inferred by the $l$-th classifier $c^{T}_{l}$ is $q^{T}_{l}(t_j(x);\tau) = \sigma(c^{T}_{l}(F^{T}_{l,j})/\tau) \in \mathbb{R}^{N*M}$. We train all auxiliary classifiers using the CE loss with self-supervised augmented labels across $\{t_j(x)\}_{j=1}^{M}$ as Eq. (1):

$$\mathcal{L}^{T}_{ce\_SAD} = \frac{1}{M}\sum_{j=1}^{M}\sum_{l=1}^{L}\mathcal{L}_{ce}\big(q^{T}_{l}(t_j(x);\tau), k_j\big), \tag{1}$$

where $\tau = 1$ and $\mathcal{L}_{ce}$ denotes the Cross-Entropy loss. With a slight abuse of notation, we use $k_j$ to denote the self-supervised augmented label of $t_j(x)$ in the joint class space $\mathcal{K}$.
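A minimal sketch (assuming PyTorch) of one plausible form of the auxiliary classifiers of Sec. 3.2 and the teacher-side loss of Eq. (1) is given below. The single conv-BN-ReLU stage inside the classifier and the helper names are illustrative assumptions, not the exact stage-wise blocks of the released codes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryClassifier(nn.Module):
    """Maps the stage-l feature map F_l to logits over the N * M joint classes."""
    def __init__(self, in_channels, num_joint_classes, width=256):
        super().__init__()
        # Stand-in for the stage-wise convolutional blocks described in Sec. 3.2.
        self.blocks = nn.Sequential(
            nn.Conv2d(in_channels, width, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.fc = nn.Linear(width, num_joint_classes)

    def forward(self, feat):
        h = self.pool(self.blocks(feat)).flatten(1)
        return self.fc(h)                              # logits over K = N x M

def teacher_sad_loss(aux_classifiers, feature_maps, joint_labels):
    """Eq. (1) with tau = 1: cross-entropy of each auxiliary classifier against the joint
    label k_j. If the M transformed views are concatenated into one batch (as in the earlier
    rotation sketch), the mean reduction of cross_entropy supplies the 1/M averaging."""
    loss = 0.0
    for c_l, feat_l in zip(aux_classifiers, feature_maps):   # sum over the L classifiers
        loss = loss + F.cross_entropy(c_l(feat_l), joint_labels)
    return loss
```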
The overall loss for training a teacher is shown in Eq. (2):

$$\mathcal{L}^{T} = \mathbb{E}_{x \sim \mathcal{X}}\big[\mathcal{L}_{ce}(p^{T}(x;\tau), y) + \mathcal{L}^{T}_{ce\_SAD}\big]. \tag{2}$$

Note that the two losses in Eq. (2) have different roles. The first loss aims to fit the normal data for learning general classification capability. The second loss aims to generate additional self-supervised augmented knowledge from the hierarchical features derived from the backbone network. This facilitates richer knowledge distillation, benefiting from the self-supervised task beyond the conventional fully-supervised task.

3.4 Training the Student Network

We denote the student backbone network as $f^{S}(\cdot)$ and its $L$ auxiliary classifiers as $\{c^{S}_{l}(\cdot)\}_{l=1}^{L}$. We conduct an end-to-end training process under the supervision of the teacher network. The overall loss includes the task loss from ground-truth labels and the mimicry loss from the pre-trained teacher.

Task Loss. We force $f^{S}(\cdot)$ to fit the normal data $x$ as the task loss:

$$\mathcal{L}_{task} = \mathcal{L}_{ce}(p^{S}(x;\tau), y), \tag{3}$$

where $p^{S}(x;\tau) = \sigma(f^{S}(x)/\tau) \in \mathbb{R}^{N}$ is the predictive class probability distribution. We also tried forcing the $L$ auxiliary classifiers $\{c^{S}_{l}(\cdot)\}_{l=1}^{L}$ to learn the self-supervised augmented distributions from the joint hard label of the original and self-supervised tasks by adding $\mathcal{L}^{S}_{ce\_SAD}$ as an additional loss:

$$\mathcal{L}^{S}_{ce\_SAD} = \frac{1}{M}\sum_{j=1}^{M}\sum_{l=1}^{L}\mathcal{L}_{ce}\big(q^{S}_{l}(t_j(x);\tau), k_j\big), \tag{4}$$

where $q^{S}_{l}(t_j(x);\tau) = \sigma(c^{S}_{l}(F^{S}_{l,j})/\tau) \in \mathbb{R}^{N*M}$, and $F^{S}_{l,j}$ is the extracted feature map from the $l$-th stage of $f^{S}(\cdot)$ for the input $t_j(x)$. However, we empirically found that introducing loss (4) alongside the original task loss (3) damages the performance of student networks, as validated in Section 4.2.

Mimicry Loss. On the one hand, we consider transferring the hierarchical self-supervised augmented distributions generated by the $L$ auxiliary classifiers of the teacher network to the corresponding $L$ auxiliary classifiers of the student network, respectively. The transfer is performed in a one-to-one manner by the KL-divergence loss $D_{KL}$. The loss is formulated as Eq. (5), where $\tau^{2}$ is used to keep the gradient contributions unchanged [Hinton et al., 2015]:

$$\mathcal{L}_{kl\_q} = \frac{1}{M}\sum_{j=1}^{M}\sum_{l=1}^{L}\tau^{2} D_{KL}\big(q^{T}_{l}(t_j(x);\tau)\,\|\,q^{S}_{l}(t_j(x);\tau)\big). \tag{5}$$

Benefiting from Eq. (5), one can expect the student network to gain comprehensive guidance from the unified self-supervised knowledge and the original fully-supervised class knowledge. The informative knowledge is derived from the multi-scale intermediate feature maps encapsulated in the hidden layers of the high-capacity teacher network. On the other hand, we transfer the original class probability distributions generated from the final layer between teacher and student. Specifically, we transfer the knowledge derived from both the normal and transformed data $\{t_j(x)\}_{j=1}^{M}$, where $t_1(x) = x$. This loss is formulated as Eq. (6):

$$\mathcal{L}_{kl\_p} = \frac{1}{M}\sum_{j=1}^{M}\tau^{2} D_{KL}\big(p^{T}(t_j(x);\tau)\,\|\,p^{S}(t_j(x);\tau)\big). \tag{6}$$

We do not explicitly force the student backbone $f^{S}(\cdot)$ to fit the transformed data in the task loss, in order to preserve the normal classification capability. However, mimicking the teacher's predictive class probability distributions on these transformed data, as a side product, is also beneficial for the self-supervised representational learning of the student network, as validated in Section 4.2.

Overall loss. We summarize the task loss and mimicry loss as the overall loss $\mathcal{L}^{S}$ for training the student network:

$$\mathcal{L}^{S} = \mathbb{E}_{x \sim \mathcal{X}}\big[\mathcal{L}_{task} + \mathcal{L}_{kl\_q} + \mathcal{L}_{kl\_p}\big]. \tag{7}$$

Following the wide practice, we set the hyper-parameter $\tau = 1$ in the task loss and $\tau = 3$ in the mimicry loss. Besides, we do not introduce other hyper-parameters.
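A minimal sketch of the mimicry losses in Eqs. (5)-(7) is shown below, assuming PyTorch and that the logits for all $M$ transformed views are concatenated along the batch dimension (so the batch-mean reduction supplies the $1/M$ averaging); the helper names are ours, not those of the released codes.

```python
import torch
import torch.nn.functional as F

def kd_kl(teacher_logits, student_logits, tau=3.0):
    """tau^2 * KL(p^T(.;tau) || p^S(.;tau)), averaged over the batch.
    F.kl_div expects log-probabilities as input and probabilities as target."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_s, p_t, reduction='batchmean')

def student_loss(student_logits_normal, y, q_t_list, q_s_list, p_t_views, p_s_views, tau=3.0):
    """Overall loss of Eq. (7):
    - task loss (Eq. 3): cross-entropy of the student backbone on the normal data only (tau = 1);
    - L_kl_q (Eq. 5): one-to-one KL between the L pairs of auxiliary-classifier logits,
      computed on all M transformed views concatenated in the batch;
    - L_kl_p (Eq. 6): KL between the final-layer class logits on the same views."""
    task = F.cross_entropy(student_logits_normal, y)
    kl_q = sum(kd_kl(q_t, q_s, tau) for q_t, q_s in zip(q_t_list, q_s_list))
    kl_p = kd_kl(p_t_views, p_s_views, tau)
    return task + kl_q + kl_p
```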
4 Experiments

4.1 Experimental Settings

We conduct evaluations on the standard CIFAR-100 and ImageNet [Deng et al., 2009] benchmarks across the widely applied network families including ResNet [He et al., 2016], WRN [Zagoruyko and Komodakis, 2016], VGG [Simonyan and Zisserman, 2015], MobileNet [Sandler et al., 2018] and ShuffleNet [Zhang et al., 2018; Ma et al., 2018]. Representative KD methods including KD [Hinton et al., 2015], FitNet [Romero et al., 2015], AT [Zagoruyko and Komodakis, 2017], AB [Heo et al., 2019], VID [Ahn et al., 2019], RKD [Park et al., 2019], SP [Tung and Mori, 2019], CC [Peng et al., 2019], CRD [Tian et al., 2020] and the SOTA SSKD [Xu et al., 2020] are compared. For a fair comparison, all comparative methods are combined with conventional KD by default, and we adopt rotations {0°, 90°, 180°, 270°} as the self-supervised auxiliary task, the same as SSKD. We use the standard training settings following [Xu et al., 2020] and report the mean result with a standard deviation over 3 runs.

| Teacher → Student | WRN-40-2 → WRN-16-2 | WRN-40-2 → WRN-40-1 | ResNet56 → ResNet20 | ResNet32×4 → ResNet8×4 | VGG13 → MobileNetV2 | ResNet50 → MobileNetV2 | WRN-40-2 → ShuffleNetV1 | ResNet32×4 → ShuffleNetV2 |
|---|---|---|---|---|---|---|---|---|
| Teacher | 76.45 | 76.45 | 73.44 | 79.63 | 74.64 | 79.34 | 76.45 | 79.63 |
| Teacher* | 80.70 | 80.70 | 77.20 | 83.73 | 78.48 | 83.85 | 80.70 | 83.73 |
| Student | 73.57(±0.23) | 71.95(±0.59) | 69.62(±0.26) | 72.95(±0.24) | 73.51(±0.26) | 73.51(±0.26) | 71.74(±0.35) | 72.96(±0.33) |
| KD | 75.23(±0.23) | 73.90(±0.44) | 70.91(±0.10) | 73.54(±0.26) | 75.21(±0.24) | 75.80(±0.46) | 75.83(±0.18) | 75.43(±0.33) |
| FitNet | 75.30(±0.42) | 74.30(±0.42) | 71.21(±0.16) | 75.37(±0.12) | 75.42(±0.34) | 75.41(±0.07) | 76.27(±0.18) | 76.91(±0.06) |
| AT | 75.64(±0.31) | 74.32(±0.23) | 71.35(±0.09) | 75.06(±0.19) | 74.08(±0.21) | 76.57(±0.20) | 76.51(±0.44) | 76.32(±0.12) |
| AB | 71.26(±1.32) | 74.55(±0.46) | 71.56(±0.19) | 74.31(±0.09) | 74.98(±0.44) | 75.87(±0.39) | 76.43(±0.09) | 76.40(±0.29) |
| VID | 75.31(±0.22) | 74.23(±0.28) | 71.35(±0.09) | 75.07(±0.35) | 75.67(±0.13) | 75.97(±0.08) | 76.24(±0.44) | 75.98(±0.41) |
| RKD | 75.33(±0.14) | 73.90(±0.26) | 71.67(±0.08) | 74.17(±0.22) | 75.54(±0.36) | 76.20(±0.06) | 75.74(±0.32) | 75.42(±0.25) |
| SP | 74.35(±0.59) | 72.91(±0.24) | 71.45(±0.38) | 75.44(±0.11) | 75.68(±0.35) | 76.35(±0.14) | 76.40(±0.37) | 76.43(±0.21) |
| CC | 75.30(±0.03) | 74.46(±0.05) | 71.44(±0.10) | 74.40(±0.24) | 75.66(±0.33) | 76.05(±0.25) | 75.63(±0.30) | 75.74(±0.18) |
| CRD | 75.81(±0.33) | 74.76(±0.25) | 71.83(±0.42) | 75.77(±0.24) | 76.13(±0.16) | 76.89(±0.27) | 76.37(±0.23) | 76.51(±0.09) |
| SSKD | 76.16(±0.17) | 75.84(±0.04) | 70.80(±0.02) | 75.83(±0.29) | 76.21(±0.16) | 78.21(±0.16) | 76.71(±0.31) | 77.64(±0.24) |
| Ours | 77.20(±0.17) | 77.00(±0.21) | 72.58(±0.33) | 77.26(±0.14) | 77.45(±0.21) | 78.79(±0.11) | 78.51(±0.20) | 79.93(±0.11) |
| Ours* | 78.67(±0.20) | 78.12(±0.25) | 73.73(±0.10) | 77.69(±0.05) | 79.27(±0.12) | 79.43(±0.24) | 80.11(±0.32) | 80.86(±0.15) |

Table 2: Top-1 accuracy (%) comparison of SOTA distillation methods across various teacher-student pairs on CIFAR-100. All results are reproduced by us using author-provided code. The numbers in bold and underline denote the best and the second-best results, respectively. Teacher denotes that we first train the backbone $f^{T}(\cdot)$ and then train the auxiliary classifiers $\{c^{T}_{l}(\cdot)\}_{l=1}^{L}$ on top of the frozen $f^{T}(\cdot)$. For a fair comparison, all compared methods and Ours are supervised by Teacher. Teacher* denotes that we train $f^{T}(\cdot)$ and $\{c^{T}_{l}(\cdot)\}_{l=1}^{L}$ jointly, leading to a more powerful teacher network. Ours* denotes the results supervised by Teacher* for pursuing better performance.
More detailed settings for reproducibility can be found in our released codes.

4.2 Ablation Study

Effect of loss terms. As shown in Fig. 3 (left), applying hierarchical self-supervised augmented knowledge transfer through multiple auxiliary classifiers via the loss $\mathcal{L}_{kl\_q}$ substantially boosts the accuracy over the original task loss $\mathcal{L}_{task}$. We further compare $\mathcal{L}_{kl\_p}$ and $\mathcal{L}_{kd}$ on top of $\mathcal{L}_{task} + \mathcal{L}_{kl\_q}$ to demonstrate the efficacy of transferring class probability distributions from the additional transformed images. We find that $\mathcal{L}_{kl\_p}$ results in better accuracy gains than $\mathcal{L}_{kd}$, which suggests that transferring probabilistic class knowledge from those transformed images is also beneficial to feature representation learning. Finally, beyond the mimicry loss, we also explore whether the self-supervised augmented task loss $\mathcal{L}^{S}_{ce\_SAD}$ can be integrated into the overall task loss to train student networks. After adding $\mathcal{L}^{S}_{ce\_SAD}$ on top of the above losses, the performance of the student network drops slightly. We speculate that mimicking the soft self-supervised augmented distribution from the teacher via $\mathcal{L}_{kl\_q}$ is good enough to learn rich self-supervised knowledge, and that extra learning from the hard one-hot distribution via $\mathcal{L}^{S}_{ce\_SAD}$ may interfere with the process of self-supervised knowledge transfer.

Figure 3: Ablation study of loss terms (left) and auxiliary classifiers (right) on the student networks WRN-16-2 and ShuffleNetV1 under the pre-trained teacher network WRN-40-2 on CIFAR-100.

Effect of auxiliary classifiers. We append several auxiliary classifiers to the network at various depths to learn and transfer diverse self-supervised augmented distributions extracted from hierarchical features. To examine this practice, we first evaluate each auxiliary classifier individually. As shown in Fig. 3 (right), each auxiliary classifier is beneficial to performance improvements. Moreover, an auxiliary classifier attached to a deeper layer often achieves larger accuracy gains than one attached to a shallower layer, which can be attributed to the more informative semantic knowledge encoded in high-level features. Finally, using all auxiliary classifiers maximizes the accuracy gains.

4.3 Comparison with State-of-the-Arts

Results on CIFAR-100 and ImageNet. We compare our HSAKD with SOTA representative distillation methods across various teacher-student pairs with the same and different architectural styles on CIFAR-100 in Table 2 and on ImageNet in Table 3.

| Teacher: ResNet-34, Student: ResNet-18 | Teacher | Teacher* | Student | KD | AT | CC | SP | RKD | CRD | SSKD | Ours | Ours* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Top-1 | 73.31 | 75.48 | 69.75 | 70.66 | 70.70 | 69.96 | 70.62 | 71.34 | 71.38 | 71.62 | 72.16 | 72.39 |
| Top-5 | 91.42 | 92.67 | 89.07 | 89.88 | 90.00 | 89.17 | 89.80 | 90.37 | 90.49 | 90.67 | 90.85 | 91.00 |

Table 3: Top-1 and Top-5 accuracy (%) comparison on ImageNet with ResNet-34 as the teacher and ResNet-18 as the student. The compared results are from [Xu et al., 2020].

| Transferred Dataset | Baseline | KD | FitNet | AT | AB | VID | RKD | SP | CC | CRD | SSKD | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-100 → STL-10 | 67.76 | 67.90 | 69.41 | 67.37 | 67.82 | 69.29 | 69.74 | 68.96 | 69.13 | 70.09 | 71.03 | 74.66 |
| CIFAR-100 → Tiny ImageNet | 34.69 | 34.15 | 36.04 | 34.44 | 34.79 | 36.09 | 37.21 | 35.69 | 36.43 | 38.17 | 39.07 | 42.57 |

Table 4: Linear classification accuracy (%) of transfer learning on the student MobileNetV2 pre-trained using the teacher VGG-13.

| Baseline | KD | CRD | SSKD | Ours |
|---|---|---|---|---|
| 76.18 | 77.06 | 77.36 | 77.60 | 78.45 |

Table 5: Comparison of detection mAP (%) on Pascal VOC using ResNet-18 as the backbone pre-trained by various KD methods.
Interestingly, using $\mathcal{L}^{T}_{ce\_SAD}$ as an auxiliary loss can improve the teacher's accuracy on the final classification. Compared to the original Teacher, Teacher* achieves an average gain of 4.09% across the five teacher networks on CIFAR-100 and a top-1 gain of 2.17% for ResNet-34 on ImageNet. For the downstream student networks, Teacher* leads to an average improvement of 1.15% on CIFAR-100 and an improvement of 0.23% on ImageNet over Teacher. These results indicate that our proposed loss $\mathcal{L}^{T}$ can improve the performance of a given network and produce a more suitable teacher for KD to learn a better student. Moreover, our HSAKD significantly outperforms the best-competing method SSKD across all network pairs, with an average accuracy gain of 2.56% on CIFAR-100 and a top-1 gain of 0.77% on ImageNet. Compared with other SOTA methods, the superiority of HSAKD can be attributed to hierarchical self-supervised augmented knowledge distillation with the assistance of well-designed auxiliary classifiers.

Transferability of Learned Representations. Beyond the accuracy on the upstream dataset, we also expect the student network to produce generalized feature representations that transfer well to other unseen semantic recognition datasets. To this end, we freeze the feature extractor pre-trained on the upstream CIFAR-100, and then train two linear classifiers on the frozen pooled features for the downstream STL-10 and Tiny ImageNet, respectively, following the common linear classification protocol [Tian et al., 2020] (a minimal sketch of this protocol is given at the end of this section). As shown in Table 4, both SSKD and HSAKD achieve better accuracy than the other comparative methods, demonstrating that using self-supervised auxiliary tasks for distillation is conducive to generating better feature representations. Moreover, HSAKD significantly outperforms the best-competing SSKD by 3.63% on STL-10 and 3.50% on Tiny ImageNet. The results verify that encoding the self-supervised auxiliary task as an augmented distribution in our HSAKD provides better supervision quality than the contrastive relationship in SSKD for learning good features.

| Percentage | KD | CRD | SSKD | Ours |
|---|---|---|---|---|
| 25% | 65.15(±0.23) | 65.80(±0.61) | 67.82(±0.30) | 68.50(±0.24) |
| 50% | 68.61(±0.22) | 69.91(±0.20) | 70.08(±0.13) | 72.18(±0.41) |
| 75% | 70.34(±0.09) | 70.98(±0.43) | 70.47(±0.14) | 73.26(±0.11) |

Table 6: Top-1 accuracy (%) comparison on CIFAR-100 under the few-shot scenario with various percentages of training samples. We use ResNet56-ResNet20 as the teacher-student pair for evaluation.

Transferability for Object Detection. We further evaluate the student network ResNet-18, pre-trained with the teacher ResNet-34 on ImageNet, as a backbone for downstream object detection on Pascal VOC. We use the Faster R-CNN [Ren et al., 2016] framework and follow the standard data preprocessing and finetuning strategy. The comparison of detection performance is shown in Table 5. Our method outperforms the original baseline by 2.27% mAP and the best-competing SSKD by 0.85% mAP. These results verify that our method can guide a network to learn better feature representations for semantic recognition tasks.

Efficacy under Few-shot Scenario. We compare our method with conventional KD and the SOTA CRD and SSKD under few-shot scenarios by retaining 25%, 50% and 75% of the training samples. For a fair comparison, we use the same data split strategy for each few-shot setting, while maintaining the original test set. As shown in Table 6, our method can consistently surpass others by large margins under various few-shot settings. Moreover, it is noteworthy that by using only 25% of the training samples, our method achieves comparable accuracy to the baseline trained on the complete set. This is because our method can effectively learn general feature representations from limited data. In contrast, previous methods often focus on mimicking the inductive bias from intermediate feature maps or cross-sample relationships, which may overfit the limited set and generalize worse to the test set.
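For completeness, the sketch below shows one way to implement the linear classification protocol referenced in the transferability experiments above: freeze the distilled student's feature extractor and train only a linear classifier on its pooled features. It assumes PyTorch; the optimizer settings and the assumption that the frozen backbone returns pooled feature vectors are ours, not the exact recipe of [Tian et al., 2020].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(frozen_backbone, train_loader, num_classes, feat_dim,
                 epochs=30, lr=0.1, device='cuda'):
    """Train a linear classifier on top of a frozen feature extractor."""
    frozen_backbone.eval().to(device)
    for p in frozen_backbone.parameters():
        p.requires_grad_(False)                 # the representation stays fixed

    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feat = frozen_backbone(x)       # assumed to return pooled features of shape (B, feat_dim)
            loss = F.cross_entropy(clf(feat), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```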
5 Conclusion

We propose a self-supervised augmented task for KD and further transfer such rich knowledge derived from hierarchical feature maps by leveraging well-designed auxiliary classifiers. Our method achieves SOTA performance on standard image classification benchmarks in the field of KD. It can guide the network to learn well-generalized feature representations for semantic recognition tasks. Moreover, it introduces no extra hyper-parameters to be tuned and is easy to implement.

References

[Ahn et al., 2019] Sungsoo Ahn, Shell Xu Hu, Andreas C. Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, pages 9163–9171, 2019.
[Chen et al., 2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[Gidaris et al., 2018] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[Heo et al., 2019] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, volume 33, pages 3779–3787, 2019.
[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[Lee et al., 2020] Hankook Lee, Sung Ju Hwang, and Jinwoo Shin. Self-supervised label augmentation via input transformations. In ICML, pages 5714–5724, 2020.
[Ma et al., 2018] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, pages 116–131, 2018.
[Misra and Maaten, 2020] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, pages 6707–6717, 2020.
[Noroozi and Favaro, 2016] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84, 2016.
[Park et al., 2019] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.
[Passalis et al., 2020] Nikolaos Passalis, Maria Tzelepi, and Anastasios Tefas. Heterogeneous knowledge distillation using information flow modeling. In CVPR, pages 2339–2348, 2020.
[Peng et al., 2019] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In ICCV, pages 5007–5016, 2019.
[Ren et al., 2016] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 39(6):1137–1149, 2016.
[Romero et al., 2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Tian et al., 2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.
[Tung and Mori, 2019] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In ICCV, pages 1365–1374, 2019.
[Xu et al., 2020] Guodong Xu, Ziwei Liu, Xiaoxiao Li, and Chen Change Loy. Knowledge distillation meets self-supervision. In ECCV, pages 588–604, 2020.
[Yang et al., 2019] Chuanguang Yang, Zhulin An, Chao Li, Boyu Diao, and Yongjun Xu. Multi-objective pruning for CNNs using genetic algorithm. In ICANN, pages 299–305, 2019.
[Yang et al., 2020] Chuanguang Yang, Zhulin An, Hui Zhu, Xiaolong Hu, Kun Zhang, Kaiqiang Xu, Chao Li, and Yongjun Xu. Gated convolutional networks with hybrid connectivity for image classification. In AAAI, pages 12581–12588, 2020.
[Yang et al., 2021] Chuanguang Yang, Zhulin An, and Yongjun Xu. Multi-view contrastive learning for online knowledge distillation. In ICASSP, pages 3750–3754, 2021.
[Yim et al., 2017] Junho Yim, Donggyu Joo, Ji-Hoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 7130–7138, 2017.
[Zagoruyko and Komodakis, 2017] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[Zagoruyko and Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[Zhang et al., 2016] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, pages 649–666, 2016.
[Zhang et al., 2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.
[Zhu et al., 2019] Hui Zhu, Zhulin An, Chuanguang Yang, Kaiqiang Xu, Erhu Zhao, and Yongjun Xu. EENA: Efficient evolution of neural architecture. In ICCV Workshops, 2019.