# Cross-Layer Distillation with Semantic Calibration

Defang Chen,1,2,3 Jian-Ping Mei,4 Yuan Zhang,1,3 Can Wang,1,2,3* Zhe Wang,1,3 Yan Feng,1,3 Chun Chen,1,3
1College of Computer Science, Zhejiang University, China. 2Zhejiang Provincial Key Laboratory of Service Robot. 3Zhejiang University-Lianlian Pay Joint Research Center. 4College of Computer Science, Zhejiang University of Technology, China.
defchern@zju.edu.cn, jpmei@zjut.edu.cn, {yuan zhang, wcan, fengyan, chenc}@zju.edu.cn
*Corresponding author

Abstract

Recently proposed knowledge distillation approaches based on feature-map transfer validate that intermediate layers of a teacher model can serve as effective targets for training a student model to obtain better generalization ability. Existing studies mainly focus on particular representation forms for knowledge transfer between manually specified pairs of teacher-student intermediate layers. However, the semantics of intermediate layers may vary across networks, and manual association of layers can lead to negative regularization caused by semantic mismatch between certain teacher-student layer pairs. To address this problem, we propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model to each student layer with an attention mechanism. With a learned attention distribution, each student layer distills knowledge contained in multiple layers of the teacher model, rather than a single fixed intermediate layer, for appropriate cross-layer supervision during training. Consistent improvements over state-of-the-art approaches are observed in extensive experiments with various network architectures for teacher and student models, demonstrating the effectiveness and flexibility of the proposed attention-based soft layer association mechanism for cross-layer distillation.

Introduction

The generalization ability of a lightweight model can be improved by training it to match the predictions of a powerful model (Bucilua, Caruana, and Niculescu-Mizil 2006; Ba and Caruana 2014). This idea was popularized by knowledge distillation (KD), in which temperature-scaled outputs of the teacher model are exploited to improve the performance of the student model (Hinton, Vinyals, and Dean 2015). Compared to discrete labels, soft targets predicted by the teacher model serve as an effective regularization that prevents the student model from being trapped in over-confident solutions during optimization (Pereyra et al. 2017; Müller, Kornblith, and Hinton 2019; Yuan et al. 2020).

In the vanilla KD framework, the knowledge learned by a classification model is represented only by the prediction of its final layer (Hinton, Vinyals, and Dean 2015). Although the relative probabilities assigned to different classes provide an intuitive understanding of how a model generalizes, knowledge transfer in such a highly abstract form ignores the wealth of information contained in intermediate layers. To further boost the effectiveness of distillation, recent works proposed to align feature maps, or transformations of them, for manually selected teacher-student layer pairs (Romero et al. 2015; Zagoruyko and Komodakis 2017; Ahn et al. 2019; Tung and Mori 2019; Passalis, Tzelepi, and Tefas 2020).
An interpretation for the success of feature-map based distillation is that multi-layer feature representations respect the hierarchical concept-learning process, which may entail a reasonable inductive bias (Bengio, Courville, and Vincent 2013). Intermediate layers of teacher and student models with different capacities tend to have different levels of abstraction (Passalis, Tzelepi, and Tefas 2020). A peculiar challenge is thus to ensure appropriate layer associations in feature-map based distillation so as to achieve the maximum performance improvement. However, existing efforts mainly focus on particular representations of feature maps to capture the enriched knowledge, and enable knowledge transfer based on hand-crafted layer assignments such as random selection or one-to-one association (Romero et al. 2015; Zagoruyko and Komodakis 2017; Ahn et al. 2019; Tung and Mori 2019; Passalis, Tzelepi, and Tefas 2020). A naive allocation strategy may cause semantic mismatch between feature maps of candidate teacher-student layer pairs, leading to a negative regularization effect when training the student model. Since we have no access to prior knowledge of the semantic level of each intermediate layer, layer association becomes a non-trivial problem. Therefore, systematic approaches need to be developed for more effective and flexible knowledge transfer with feature maps.

In this paper, we propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD) to exploit intermediate knowledge by keeping the transfer at a matched semantic level. An attention mechanism is applied in our approach for automatic soft layer association, which effectively binds a student layer to semantically similar target layers in the teacher model. Learning from multiple target layers with an attention allocation, rather than turning to a fixed assignment, can suppress the over-regularization effect in training. To align the spatial dimensions of each layer pair for calculating the total loss, feature maps of each student layer are projected to the same dimensions as those of the target layers. By taking advantage of semantic calibration and feature-map transfer across multiple layers, the student model can be effectively optimized with more appropriate guidance. The overall contributions of this paper are summarized as follows:

- We propose a novel technique to significantly improve the effectiveness of feature-map transfer by semantic calibration via soft layer association. Our approach is readily applicable to heterogeneous settings where different architectures are used for the teacher and student models.
- An attention mechanism is used to achieve soft layer association for cross-layer distillation. Its capability to alleviate the semantic mismatch problem is supported by carefully designed experiments.
- Extensive experiments on the CIFAR-100 and ImageNet datasets, with a large variety of settings based on popular network architectures, demonstrate that SemCKD consistently generalizes better than state-of-the-art approaches.

Related Work

Knowledge Distillation. KD serves as an effective recipe for improving the performance of a given student model by exploiting soft targets from a pre-trained teacher model (Hinton, Vinyals, and Dean 2015). Compared to discrete labels, fine-grained information among different categories provides extra supervision to optimize the student model better (Pereyra et al. 2017; Müller, Kornblith, and Hinton 2019).
A newer interpretation of the improvement is that soft targets act as a learned label smoothing regularization that keeps the student model from producing over-confident predictions (Yuan et al. 2020). To save the expense of pre-training, cost-effective online variants have since been explored (Anil et al. 2018; Chen et al. 2020).

Feature-Map Distillation. Rather than only formalizing knowledge in a highly abstract form such as predictions, recent methods attempt to leverage the information contained in intermediate layers by designing elaborate knowledge representations. A range of techniques have been developed for this purpose, such as aligning hidden-layer responses called hints (Romero et al. 2015), mimicking spatial attention maps (Zagoruyko and Komodakis 2017), or maximizing mutual information through a variational principle (Ahn et al. 2019). The transferred knowledge can also be captured by crude pairwise activation similarities (Tung and Mori 2019) or hybrid kernel formulations built on them (Passalis, Tzelepi, and Tefas 2020). With these pre-defined representations, all of the above methods perform knowledge transfer with certain hand-crafted layer associations, such as random selection or one-to-one matching. Unfortunately, as pointed out in (Passalis, Tzelepi, and Tefas 2020), such hard associations make the student model suffer from negative regularization, which limits the effectiveness of feature-map distillation. Based on a transfer learning framework, a recent solution is to learn association weights with a meta-network given only the feature maps of the source network (Jang et al. 2019), whereas our proposed approach incorporates more information from teacher-student layer pairs.

Feature-Embedding Distillation. Feature embeddings are a good substitute for feature maps, since low-dimensional vectors are more tractable than high-dimensional tensors. Meanwhile, feature embeddings also preserve more structural information than the final predictions. Therefore, a variety of knowledge distillation approaches have been proposed based on feature embeddings, especially on generated relational graphs where each node represents one instance (Passalis and Tefas 2018; Peng et al. 2019; Park et al. 2019; Liu et al. 2019). The main difference among these methods lies in how edge weights are constructed. Typical choices include a cosine kernel (Passalis and Tefas 2018), a truncated Gaussian RBF kernel (Peng et al. 2019), or a combination of distance-wise and angle-wise potential functions (Park et al. 2019). In contrast to pairwise transfer, CRD formulates distillation as contrastive learning to capture higher-order dependencies in the representation space (Tian, Krishnan, and Isola 2020). Although our method mainly focuses on feature-map distillation, it is also compatible with the state-of-the-art feature-embedding distillation approach to further improve performance.

Semantic Calibration for Distillation

Background and Notations

In this section, we briefly recap the basic concepts of classic knowledge distillation and provide the notations used in the following illustration. Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ consisting of $N$ instances from $K$ categories, and a powerful teacher model pre-trained on $\mathcal{D}$, our goal is to reuse the same dataset to train a simpler student model with lower computational and storage cost.
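Throughout the rest of the paper, intermediate representations of the teacher and the student are compared layer by layer. As a minimal, hedged sketch of how such representations could be collected in practice, the snippet below taps the residual stages of two torchvision networks with forward hooks; the choice of networks and hook points is purely illustrative and is not the paper's released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def register_feature_hooks(model: nn.Module, module_names):
    """Collect the outputs of the named sub-modules during a forward pass."""
    feats = {}
    def make_hook(name):
        def hook(_module, _inputs, output):
            feats[name] = output            # tensor of shape (b, c, h, w)
        return hook
    modules = dict(model.named_modules())
    handles = [modules[n].register_forward_hook(make_hook(n)) for n in module_names]
    return feats, handles

# Illustrative teacher/student pair; pre-trained teacher weights are omitted here.
teacher = models.resnet34(weights=None).eval()
student = models.resnet18(weights=None)

stages = ["layer1", "layer2", "layer3", "layer4"]
t_feats, _ = register_feature_hooks(teacher, stages)
s_feats, _ = register_feature_hooks(student, stages)

x = torch.randn(8, 3, 224, 224)             # a mini-batch of size b = 8
with torch.no_grad():
    teacher(x)
student(x)
# t_feats and s_feats now hold intermediate feature maps, playing the roles of
# F^t_{t_l} and F^s_{s_l} in the notation introduced next.
```

Any building blocks of the two networks could be tapped this way; as noted later in the experiments, the paper treats the building blocks of the teacher and student networks as the candidate target and student layers.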
For a mini-batch of size $b$, we denote the output of each target layer $t_l$ and student layer $s_l$ as $F^t_{t_l} \in \mathbb{R}^{b \times c_{t_l} \times h_{t_l} \times w_{t_l}}$ and $F^s_{s_l} \in \mathbb{R}^{b \times c_{s_l} \times h_{s_l} \times w_{s_l}}$, respectively, where $c$ is the number of output channels, $h$ and $w$ are the spatial dimensions, and the superscripts $t$ and $s$ indicate the corresponding models. The candidate layer indices $t_l$ and $s_l$ range from $1$ to $t_L$ and $s_L$, respectively. Note that $t_L$ and $s_L$ may differ, especially when the teacher and student architectures are different. The representations at the penultimate layer of the teacher and student models are denoted as $F^t_{t_L}$ and $F^s_{s_L}$, and are mainly used in feature-embedding distillation. Taking the student model as an example, the outputs of the last fully connected layer $g(\cdot)$ are known as logits, $g^s_i = g(F^s_{s_L}[i]) \in \mathbb{R}^K$, and the predicted probabilities are calculated with a softmax layer built on the logits, i.e., $p^s_i = \sigma(g^s_i / T)$ with $T$ usually equal to $1$. The notation $F^s_{s_l}[i]$ denotes the output of student layer $s_l$ for the $i$-th instance and is a shorthand for $F^s_{s_l}[i,:,:,:]$.

For classification tasks, in addition to the regular cross-entropy (CE) loss between the predicted probabilities $p^s_i$ and the one-hot label $y_i$ of each training sample, classic knowledge distillation (Hinton, Vinyals, and Dean 2015) incorporates another alignment loss that encourages the minimization of the Kullback-Leibler (KL) divergence between $p^s_i$ and the soft targets $p^t_i$ of the teacher model:

$$\mathcal{L}_{KD_i} = \mathcal{L}_{CE}\big(y_i, \sigma(g^s_i)\big) + T^2 \, \mathcal{L}_{KL}\big(\sigma(g^t_i / T), \sigma(g^s_i / T)\big), \quad (1)$$

where $T$ is a hyper-parameter and a higher $T$ leads to a more considerable softening effect. We set $T$ to 4 throughout this paper for fair comparison.

Figure 1: An overview of the proposed Semantic Calibration for Knowledge Distillation (SemCKD). (a) Feature maps for a certain instance from student layer-1 are projected into three individual forms to align with the spatial dimensions of those from each target layer. The learned attention allocation adaptively helps the student model focus on the most semantically related information for effective distillation. (b) Pairwise similarities are first calculated between the stacked feature maps, and the attention weights are then obtained from the proximities among the generated query and key vectors.

Feature-Map Distillation

As mentioned earlier, the feature maps of a teacher model are valuable for helping a student model achieve better performance. Recently proposed feature-map distillation approaches can be summarized as adding the following loss term to Equation (1) for each mini-batch of size $b$:

$$\mathcal{L}_{FMD} = \sum_{(s_l, t_l) \in \mathcal{C}} \mathrm{Dist}\Big(\mathrm{Trans}^t\big(F^t_{t_l}\big), \mathrm{Trans}^s\big(F^s_{s_l}\big)\Big), \quad (2)$$

leading to the overall loss

$$\mathcal{L} = \sum_{i=1}^{b} \mathcal{L}_{KD_i} + \beta \mathcal{L}_{FMD}, \quad (3)$$

where the functions $\mathrm{Trans}^t(\cdot)$ and $\mathrm{Trans}^s(\cdot)$ in each method transform the feature maps of candidate teacher-student layer pairs into a particular hand-designed representation, such as attention maps (Zagoruyko and Komodakis 2017) or pairwise similarity matrices (Tung and Mori 2019). The layer association sets $\mathcal{C}$ of existing methods are generated by random selection or one-to-one matching. However, these simple association strategies may cause a loss of useful information. Taking one-to-one matching as an example, extra layers have to be discarded when the numbers of layers $s_L$ and $t_L$ differ, i.e., $\mathcal{C} = \{(1, 1), \ldots, (\min(s_L, t_L), \min(s_L, t_L))\}$. With the associated layer pairs, the feature-map distillation loss is calculated by the distance function $\mathrm{Dist}(\cdot, \cdot)$. The hyper-parameter $\beta$ in Equation (3) balances the two individual loss terms.
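The classic KD objective in Equation (1) and the combined loss in Equation (3) translate directly into a few lines of code. The following is a minimal sketch assuming PyTorch; it averages over the mini-batch rather than summing per instance, and the default value of 400 for $\beta$ is only an illustrative pick from the search range used later in the sensitivity analysis.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0):
    """Equation (1): cross entropy plus the T^2-scaled KL alignment term
    (averaged over the mini-batch here rather than summed per instance)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    )
    return ce + (T ** 2) * kl

def total_loss(student_logits, teacher_logits, labels, fmd_loss, beta=400.0, T=4.0):
    """Equation (3): KD loss plus a beta-weighted feature-map distillation term,
    where `fmd_loss` is any instantiation of L_FMD from Equation (2)."""
    return kd_loss(student_logits, teacher_logits, labels, T) + beta * fmd_loss
```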
Rather than performing knowledge transfer based on fixed associations between candidate teacher-student layer pairs, our approach aims to learn the associations for semantically calibrated cross-layer distillation.

Semantic Calibration Formulation

In our approach SemCKD, each student layer is automatically associated with semantically related target layers through attention allocation, as illustrated in Figure 1. Training with soft associations encourages the student model to collect and integrate multi-layer information to obtain a more suitable regularization. Moreover, SemCKD is readily applicable to the situation where the numbers of candidate layers in the teacher and student models differ. The learned association set $\mathcal{C}$ in SemCKD is

$$\mathcal{C} = \{(s_l, t_l) \mid s_l \in [1, \ldots, s_L], \; t_l \in [1, \ldots, t_L]\}, \quad (4)$$

with the corresponding weights satisfying $\sum_{t_l=1}^{t_L} \alpha_{(s_l, t_l)} = 1$ for every $s_l \in [1, \ldots, s_L]$. The weight $\alpha_{(s_l, t_l)} \in \mathbb{R}^{b \times 1}$ represents the extent to which target layer $t_l$ is attended to in deriving the semantic-aware guidance for student layer $s_l$. We will elaborate on these attention-based weights later.

The feature maps of each student layer are projected into $t_L$ individual forms to align with the spatial dimensions of each target layer for the subsequent distance calculation:

$$\hat{F}^s_{t_l} = \mathrm{Proj}\big(F^s_{s_l} \in \mathbb{R}^{b \times c_{s_l} \times h_{s_l} \times w_{s_l}}, \; t_l\big), \quad t_l \in [1, \ldots, t_L], \quad (5)$$

with $\hat{F}^s_{t_l} \in \mathbb{R}^{b \times c_{t_l} \times h_{t_l} \times w_{t_l}}$. Each function $\mathrm{Proj}(\cdot, \cdot)$ consists of a stack of three layers with 1×1, 3×3 and 1×1 convolutions to provide enough capacity for an effective transformation. (In practice, we first use a pooling operation to align the height and width dimensions of $F^t_{t_l}$ and $F^s_{s_l}$ before projection, in order to reduce computational consumption.)

Loss function. For a mini-batch of size $b$, the student model produces feature maps across multiple layers, i.e., $F^s_{s_1}, \ldots, F^s_{s_L}$. After semantic layer association and dimensional projection, the $\mathcal{L}_{FMD}$ loss of SemCKD is obtained by simply using the Mean-Square-Error (MSE):

$$\mathcal{L}_{SemCKD} = \sum_{(s_l, t_l) \in \mathcal{C}} \alpha_{(s_l, t_l)} \, \mathrm{Dist}\big(F^t_{t_l}, \mathrm{Proj}(F^s_{s_l}, t_l)\big) = \sum_{(s_l, t_l) \in \mathcal{C}} \sum_{i=1}^{b} \alpha^i_{(s_l, t_l)} \, \mathrm{MSE}\big(F^t_{t_l}[i], \hat{F}^s_{t_l}[i]\big), \quad (6)$$

where the feature maps from each student layer are transformed by the projection function, $\mathrm{Trans}^s(\cdot) = \mathrm{Proj}(\cdot, \cdot)$, while those from the target layers remain unchanged under the identity transformation $\mathrm{Trans}^t(\cdot) = I(\cdot)$. The $i$-th element of the vector $\alpha_{(s_l, t_l)}$ is denoted as $\alpha^i_{(s_l, t_l)}$ for the corresponding instance. Equipped with the learned attention distributions, the total loss is aggregated as a weighted summation of the individual distances between the feature maps of candidate teacher-student layer pairs. Note that FitNet (Romero et al. 2015) is a special case of SemCKD obtained by fixing $\alpha^i_{(s_l, t_l)}$ to 1 for a certain $(s_l, t_l)$ layer pair and 0 for the rest.
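A possible implementation of the projection in Equation (5) is sketched below, assuming PyTorch. The intermediate channel width, the use of batch normalization, and pooling the student maps to the target layer's spatial size are assumptions made for illustration; the released code may organize these details differently.

```python
import torch.nn as nn
import torch.nn.functional as F

class Proj(nn.Module):
    """Sketch of Proj(., t_l): a 1x1 -> 3x3 -> 1x1 convolution stack that maps a
    student feature map to the channel width of target layer t_l (Equation (5))."""
    def __init__(self, c_student, c_target, c_mid=None):
        super().__init__()
        c_mid = c_mid or c_target                      # illustrative width
        self.net = nn.Sequential(
            nn.Conv2d(c_student, c_mid, kernel_size=1),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_target, kernel_size=1),
        )

    def forward(self, f_s, target_hw):
        # Align the spatial size first (cf. the pooling note above), then channels.
        f_s = F.adaptive_avg_pool2d(f_s, target_hw)
        return self.net(f_s)
```

In this sketch, one projector would be needed per (student layer, target layer) pair, i.e., $s_L \times t_L$ of them, since each pair can have its own channel and spatial dimensions.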
Attention Allocation. The feature representations in a trained neural network become progressively more abstract as the layer depth increases, and the semantic level of these intermediate representations can vary between teacher and student architectures with different capacities. To further improve the performance of feature-map distillation, each student layer should therefore be associated with the most semantically related target layers to derive its own regularization. Random selection, or forcing feature maps at the same layer depth to be aligned, may not suffice, due to negative effects from mismatched layers. Layer association based on an attention mechanism provides a potentially feasible solution to this problem. Since feature maps produced by similar instances tend to become clustered at different granularities in different layers, the proximity of pairwise similarity matrices can be regarded as a good measurement of the inherent semantic similarity (Tung and Mori 2019). These similarity matrices are calculated as

$$A^s_{s_l} = R\big(F^s_{s_l}\big) \cdot R\big(F^s_{s_l}\big)^{\mathsf{T}}, \qquad A^t_{t_l} = R\big(F^t_{t_l}\big) \cdot R\big(F^t_{t_l}\big)^{\mathsf{T}}, \quad (7)$$

where $R(\cdot): \mathbb{R}^{b \times c \times h \times w} \mapsto \mathbb{R}^{b \times chw}$ is a reshaping operation, so that $A^s_{s_l}$ and $A^t_{t_l}$ are $b \times b$ matrices.

Based on the self-attention framework (Vaswani et al. 2017), we separately project the pairwise similarity matrices of each student layer and of the target layers into two subspaces with a Multi-Layer Perceptron (MLP) to alleviate the effect of noise and sparseness. For the $i$-th instance,

$$Q_{s_l}[i] = \mathrm{MLP}_Q\big(A^s_{s_l}[i]\big), \qquad K_{t_l}[i] = \mathrm{MLP}_K\big(A^t_{t_l}[i]\big). \quad (8)$$

The parameters of $\mathrm{MLP}_Q(\cdot)$ and $\mathrm{MLP}_K(\cdot)$, which generate the query and key vectors, are learned during training and shared by all instances. Then, $\alpha^i_{(s_l, t_l)}$ is calculated as

$$\alpha^i_{(s_l, t_l)} = \frac{\exp\big(Q_{s_l}[i]^{\mathsf{T}} K_{t_l}[i]\big)}{\sum_{j} \exp\big(Q_{s_l}[i]^{\mathsf{T}} K_{t_j}[i]\big)}. \quad (9)$$

Attention-based allocation provides a possible way to suppress the negative effects caused by layer mismatch and to integrate positive guidance from multiple target layers, which is validated by Figure 2 and Table 3. Although the proposed approach distills only the knowledge contained in intermediate layers, its performance can be further boosted by incorporating additional orthogonal techniques, e.g., feature-embedding transfer, as shown in Table 5. The full training procedure with the proposed semantic calibration formulation is summarized in Algorithm 1.

Algorithm 1: Semantic Calibration for Distillation.
Input: Training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$; a pre-trained teacher model with parameters $\theta_t$; a student model with randomly initialized parameters $\theta_s$.
Output: A well-trained student model.
1: while $\theta_s$ has not converged do
2:   Sample a mini-batch $\mathcal{B}$ of size $b$ from $\mathcal{D}$.
3:   Forward propagate $\mathcal{B}$ through $\theta_t$ and $\theta_s$ to obtain the intermediate representations $F^t_{t_l}$ and $F^s_{s_l}$ across layers.
4:   Construct the pairwise similarity matrices $A^t_{t_l}$ and $A^s_{s_l}$ as in Equation (7).
5:   Perform attention allocation as in Equations (8)-(9).
6:   Align feature maps by projection as in Equation (5).
7:   Update the parameters $\theta_s$ by back-propagating the gradients of the losses in Equations (3) and (6).
8: end while
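Putting Equations (6)-(9) together, the sketch below (assuming PyTorch) computes the similarity matrices, the query/key vectors, the attention weights, and the attention-weighted MSE loss. The hidden width, the MLP depth, and the simplification of a fixed mini-batch size are illustrative assumptions; the released implementation at https://github.com/DefangChen/SemCKD may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAllocation(nn.Module):
    """Equations (7)-(9): per-instance rows of the b x b similarity matrices are
    mapped to query/key vectors, and a softmax over target layers gives alpha."""
    def __init__(self, batch_size, hidden=128):
        super().__init__()
        self.mlp_q = nn.Sequential(nn.Linear(batch_size, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))
        self.mlp_k = nn.Sequential(nn.Linear(batch_size, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden))

    def forward(self, s_feats, t_feats):
        b = s_feats[0].size(0)
        # Equation (7): A = R(F) R(F)^T, one b x b matrix per candidate layer.
        A_s = [f.reshape(b, -1) @ f.reshape(b, -1).t() for f in s_feats]
        A_t = [f.reshape(b, -1) @ f.reshape(b, -1).t() for f in t_feats]
        Q = torch.stack([self.mlp_q(a) for a in A_s], dim=1)     # (b, s_L, hidden)
        K = torch.stack([self.mlp_k(a) for a in A_t], dim=1)     # (b, t_L, hidden)
        # Equation (9): softmax over target layers for every (instance, student layer).
        return F.softmax(torch.einsum("bsh,bth->bst", Q, K), dim=-1)  # (b, s_L, t_L)

def semckd_loss(t_feats, projected_s_feats, alpha):
    """Equation (6): attention-weighted MSE over all candidate layer pairs, where
    projected_s_feats[sl][tl] holds Proj(F^s_{sl}, tl) with the shape of F^t_{tl}."""
    loss = 0.0
    for sl, per_target in enumerate(projected_s_feats):
        for tl, f_hat in enumerate(per_target):
            per_instance = ((t_feats[tl] - f_hat) ** 2).flatten(1).mean(dim=1)
            loss = loss + (alpha[:, sl, tl] * per_instance).sum()
    return loss
```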
Experiments

To demonstrate the effectiveness of the proposed semantic calibration strategy for cross-layer knowledge distillation, we conduct a series of classification tasks on the CIFAR-100 (Krizhevsky and Hinton 2009) and ImageNet (Russakovsky et al. 2015) datasets. A large variety of teacher-student combinations based on popular network architectures are evaluated, including VGG (Simonyan and Zisserman 2015), ResNet (He et al. 2016), WRN (Zagoruyko and Komodakis 2016), MobileNet (Sandler et al. 2018) and ShuffleNet (Ma et al. 2018). In addition to comparing SemCKD with representative feature-map distillation approaches, we provide results that support and explain the success of our semantic calibration strategy in helping student models obtain a proper regularization, through three carefully designed experiments. Ablation studies on the attention mechanism and the dimensional projection are also conducted. Finally, we show that SemCKD is compatible with a feature-embedding distillation technique to achieve better results, and we analyze its sensitivity to the hyper-parameter $\beta$.

All evaluations are made in comparison to state-of-the-art approaches under standard experimental settings and are reported as means and standard deviations. In practice, we regard the building blocks of the teacher and student networks as the target layers and student layers for convenience. Detailed descriptions of the computing infrastructure, network architectures, data processing, and optimization hyper-parameters for reproducibility, as well as additional results, are included in the technical appendix. The code is available at https://github.com/DefangChen/SemCKD.

Comparison of Feature-Map Distillation Approaches

Table 1 gives the Top-1 test accuracy (%) on CIFAR-100 for seven different network combinations, which consist of two homogeneous settings, i.e., the teacher and student models share similar architectures (VGG-8/13, ResNet-8x4/32x4), and five heterogeneous settings. Each column reports the results of student models trained under the supervision of the same teacher model. The results of the vanilla KD are also included for comparison.

Table 1: Top-1 test accuracy (%) of feature-map distillation approaches on CIFAR-100. Columns are labeled Student / Teacher. ARI is the average relative improvement of SemCKD over each compared method across the seven settings (Equation (10)); the bottom-right entry is the average ARI over all compared methods.

| | VGG-8 / ResNet-32x4 | VGG-13 / ResNet-32x4 | ShuffleNetV2 / VGG-13 | ShuffleNetV2 / ResNet-32x4 | MobileNetV2 / WRN-40-2 | VGG-8 / VGG-13 | ResNet-8x4 / ResNet-32x4 | ARI (%) |
|---|---|---|---|---|---|---|---|---|
| Student | 70.46±0.29 | 74.82±0.22 | 72.60±0.12 | 72.60±0.12 | 65.43±0.29 | 70.46±0.29 | 73.09±0.30 | |
| KD | 72.73±0.15 | 77.17±0.11 | 75.60±0.21 | 75.49±0.24 | 68.70±0.22 | 73.38±0.05 | 74.42±0.05 | 72.65 |
| FitNet | 72.91±0.18 | 77.06±0.14 | 75.44±0.11 | 75.82±0.22 | 68.64±0.12 | 73.63±0.11 | 74.32±0.08 | 71.92 |
| AT | 71.90±0.13 | 77.23±0.19 | 75.41±0.10 | 75.91±0.14 | 68.79±0.13 | 73.51±0.08 | 75.07±0.03 | 75.21 |
| SP | 73.12±0.10 | 77.72±0.33 | 75.54±0.18 | 75.77±0.08 | 68.48±0.36 | 73.53±0.23 | 74.29±0.07 | 64.95 |
| VID | 73.19±0.23 | 77.45±0.13 | 75.22±0.07 | 75.55±0.18 | 68.37±0.24 | 73.63±0.07 | 74.55±0.10 | 64.11 |
| HKD | 72.63±0.12 | 76.76±0.13 | 76.24±0.09 | 76.64±0.05 | 69.23±0.16 | 73.06±0.24 | 74.86±0.21 | 61.23 |
| SemCKD | 75.27±0.13 | 79.43±0.02 | 76.39±0.12 | 77.62±0.32 | 69.61±0.05 | 74.43±0.25 | 76.23±0.04 | |
| Teacher | 79.42 | 79.42 | 74.64 | 79.42 | 75.61 | 74.64 | 79.42 | Avg. 68.34 |

According to Table 1, SemCKD consistently achieves higher accuracy than state-of-the-art feature-map distillation approaches. To obtain an intuitive sense of the quantitative improvement, we adopt the Average Relative Improvement (ARI) as in previous work (Tian, Krishnan, and Isola 2020):

$$\mathrm{ARI} = \frac{1}{M} \sum_{i=1}^{M} \frac{\mathrm{Acc}^i_{\mathrm{SemCKD}} - \mathrm{Acc}^i_{\mathrm{FMD}}}{\mathrm{Acc}^i_{\mathrm{FMD}} - \mathrm{Acc}^i_{\mathrm{STU}}} \times 100\%, \quad (10)$$

where $M$ is the number of different architecture combinations and $\mathrm{Acc}^i_{\mathrm{SemCKD}}$, $\mathrm{Acc}^i_{\mathrm{FMD}}$, $\mathrm{Acc}^i_{\mathrm{STU}}$ refer to the accuracies of SemCKD, a certain feature-map distillation approach, and a regularly trained student model in the $i$-th setting, respectively. This metric reflects the extent to which SemCKD further improves on existing approaches, relative to the improvements those approaches make over the baseline student models.
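As a quick illustration of Equation (10), the helper below computes ARI from per-setting accuracies; the function name and list-based interface are only for this sketch.

```python
def average_relative_improvement(acc_semckd, acc_fmd, acc_student):
    """Equation (10): average, over M settings, of SemCKD's gain over a feature-map
    distillation (FMD) method, relative to that method's gain over the baseline."""
    ratios = [(s - f) / (f - b) for s, f, b in zip(acc_semckd, acc_fmd, acc_student)]
    return 100.0 * sum(ratios) / len(ratios)

# Single-setting example from the VGG-8 / ResNet-32x4 column of Table 1 (vs. FitNet):
# (75.27 - 72.91) / (72.91 - 70.46) * 100% ≈ 96.3%.
```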
On average, SemCKD shows a significant relative improvement (68.34%) over all of the compared methods. Specifically, compared with VID, the most recent feature-map distillation approach using a single teacher-student training process, the relative improvements of SemCKD in the individual cases are 80.83%, 58.83%, 28.80%, 58.19%, 37.25% and 29.21%, respectively, leading to an ARI of 64.11%. As for HKD, which relies on a costly teacher-auxiliary-student paradigm, the relative improvement becomes rather small in two settings (3.93% for ShuffleNetV2 & ResNet-32x4, 9.98% for MobileNetV2 & WRN-40-2). In general, however, SemCKD still outperforms HKD by about 61.23% in relative terms, showing that our approach indeed makes better use of intermediate information for effective distillation.

We also find that none of the compared methods consistently beats the vanilla KD on CIFAR-100, which is probably due to semantic mismatch among the associated layer pairs. This problem becomes especially severe for random selection (FitNet fails in 4/7 cases) and for the situation where the number of candidate student layers $s_L$ is larger than $t_L$ (4/5 of the methods fail in the ShuffleNetV2 & VGG-13 setting). Nevertheless, the semantic calibration formulation alleviates semantic mismatch to a great extent, leading to the satisfactory performance of SemCKD. Table 2 shows the results on a large-scale image classification dataset, where similar observations hold.

Table 2: Top-1 test accuracy of feature-map distillation approaches on ImageNet. Columns are labeled Student / Teacher.

| | ResNet-18 / ResNet-34 | ShuffleV2x0.5 / ResNet-34x4 | ResNet-18 / ResNet-34x4 |
|---|---|---|---|
| Student | 69.67 | 53.78 | 69.67 |
| KD | 70.62 | 53.73 | 70.54 |
| FitNet | 70.31 | 51.46 | 70.42 |
| AT | 70.30 | 52.83 | 70.30 |
| SP | 69.99 | 51.73 | 70.12 |
| VID | 70.30 | 53.97 | 70.26 |
| HKD | 68.86 | 51.60 | 68.44 |
| SemCKD | 70.87 | 53.99 | 70.66 |
| Teacher | 73.26 | 73.54 | 73.54 |

Figure 2: Negative regularization effect on CIFAR-100.

Table 3: Semantic mismatch score (log-scale) for VGG-8 & ResNet-32x4 on CIFAR-100.

| FitNet | AT | SP | VID | HKD | SemCKD |
|---|---|---|---|---|---|
| 12.05 | 15.52 | 16.30 | 15.82 | 19.86 | 11.27 |

Semantic Calibration Analysis

In this section, we experimentally study the negative regularization effect caused by manually specified layer associations and provide explanations for the success of SemCKD via the proposed criterion and visual evidence. Negative regularization occurs when feature-map distillation with a certain layer association performs worse than the vanilla KD. To reveal its existence, we conduct experiments in which the student model is trained with only one specified teacher-student layer pair, in the VGG-8 & ResNet-32x4 and MobileNetV2 & WRN-40-2 settings. In both cases, the numbers of candidate target layers and student layers are 3 and 4, respectively. Figure 2 shows the results of student models for these 12 teacher-student layer combinations under the two settings on CIFAR-100. For better comparison, the results of the vanilla KD and SemCKD are plotted as dashed horizontal lines in different colors.

As shown in Figure 2, the performance of a student model becomes extremely poor for some layer associations, which is probably caused by large semantic gaps. Typical results that suffer from negative regularization are Student Layer-4 & Target Layer-3 in Figure 2a and Student Layer-1, 2 & Target Layer-3 in Figure 2b. Another finding is that one-to-one layer matching is suboptimal, since better results can be achieved by exploiting the information in a target layer at a different depth, such as Student Layer-1 & Target Layer-2 in Figure 2b.
Although training with a certain hand-crafted layer association can outperform SemCKD in a few cases, such as Student Layer-3 & Target Layer-3 in Figure 2b, SemCKD still performs reasonably well against a large selection of associations, especially since the best layer association for each network combination is not known in advance. Nevertheless, the cases in which training with SemCKD is inferior to the best layer association indicate that there is room for refinement of our association strategy.

We then evaluate whether SemCKD actually leads to solutions with less semantic mismatch than other approaches. We propose a criterion called the semantic mismatch score, measured by the average Euclidean distance between the similarity matrices generated by the feature maps of each associated teacher-student layer pair, which is intended to represent the degree of difference between the pairwise similarities among instances captured at a certain semantic level. As shown in Table 3, SemCKD achieves a lower semantic mismatch score thanks to our soft layer association mechanism. The detailed formulation and calculation are provided in the technical appendix.

To further provide visual explanations for the advantage of SemCKD, we randomly select several images from ImageNet labeled "Bow tie", "Rain barrel", "Racer", "Bathtub" and "Goose", and use Grad-CAM (Selvaraju et al. 2017) to highlight the regions considered important for predicting the corresponding labels. As shown in Figure 3, the class-discriminative regions are concentrated by SemCKD, similar to the teacher model, while they are scattered around the surroundings by the compared methods. As visualized in the fifth column, another failure mode of the compared methods is that they sometimes treat the correct regions as background while putting their attention on a spatially adjacent object. Moreover, SemCKD can capture more semantically related information, such as highlighting the head and neck to identify a Goose in the image.

Figure 3: Grad-CAM visualization of feature-map distillation approaches on ImageNet. Regions in darker red are more important for the prediction. Best viewed in color.

Ablation Study

Table 4: Ablation study: Top-1 test accuracy for VGG-8 & ResNet-32x4 on CIFAR-100.

| Equal Alloc | w/o Proj | w/o MLP | SemCKD |
|---|---|---|---|
| 72.94±0.87 | 72.51±0.16 | 72.78±0.29 | 75.27±0.13 |

Table 4 presents the evaluation of three SemCKD variants to further show the benefit of each individual component. (1) Equal Allocation. To validate the effectiveness of allocating the attention of each student layer across multiple target layers, an equal weight assignment is applied instead. This lowers the accuracy by 2.33% (from 75.27% to 72.94%) and increases the standard deviation considerably (by 0.74%). (2) w/o Projection. Rather than projecting the feature maps of each student layer to the same dimensions as those of the target layers via Equation (5), we add a new $\mathrm{MLP}_V(\cdot)$ that projects the pairwise similarity matrices of teacher-student layer pairs into another subspace to generate value vectors. The Mean-Square-Error between feature maps in Equation (6) is then replaced by one between these value vectors to calculate the overall loss, which reduces performance by 2.76%. (3) w/o MLP. A simple linear transformation is used to obtain the query and key vectors in Equation (8) instead of the two-layer non-linear transformation $\mathrm{MLP}(\cdot)$. The 2.49% performance drop indicates the usefulness of $\mathrm{MLP}(\cdot)$ in alleviating the effect of noise and sparseness.
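Since the exact formulation of the semantic mismatch score is deferred to the technical appendix, the sketch below only illustrates one plausible reading of the description above: the average Euclidean (Frobenius) distance between the similarity matrices of the associated teacher-student layer pairs, reported on a log scale as in Table 3. Treat it as an assumption-laden illustration rather than the paper's exact metric.

```python
import math
import torch

def semantic_mismatch_score(s_feats, t_feats, pairs):
    """Illustrative reading of the semantic mismatch score: the average Euclidean
    distance between the b x b similarity matrices (Equation (7)) of associated
    (student layer, target layer) pairs, on a log scale as in Table 3."""
    def sim(f):
        flat = f.flatten(1)            # R(F): shape (b, c*h*w)
        return flat @ flat.t()         # b x b similarity matrix
    dists = [torch.norm(sim(s_feats[sl]) - sim(t_feats[tl])) for sl, tl in pairs]
    return math.log(torch.stack(dists).mean().item())
```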
Extension to Feature-Embedding Distillation Approaches

Knowledge transfer based on the feature embedding of the penultimate layer is another alternative for improving the generalization ability of student models. The results in Table 5 confirm that our approach has a very satisfying property: it is highly compatible with the state-of-the-art feature-embedding distillation approach and achieves even better performance when combined with it. We compare the performance of student models trained with several recently proposed methods on three teacher-student network combinations. We observe that simply adding the loss term of CRD (Tian, Krishnan, and Isola 2020) to the original loss function of SemCKD, without tuning any hyper-parameter, already boosts performance further. Specifically, the ARI of SemCKD+CRD over CRD and over SemCKD is about 40.13% and 13.90%, respectively.

Table 5: Top-1 test accuracy of feature-embedding distillation approaches on CIFAR-100. Columns are labeled Student / Teacher.

| | VGG-8 / ResNet-32x4 | MobileNetV2 / WRN-40-2 | ResNet-8x4 / ResNet-32x4 |
|---|---|---|---|
| Student | 70.46±0.29 | 65.43±0.29 | 73.09±0.30 |
| PKT | 73.11±0.21 | 68.68±0.29 | 74.61±0.25 |
| RKD | 72.49±0.08 | 68.71±0.20 | 74.36±0.23 |
| IRG | 72.57±0.20 | 68.83±0.18 | 74.67±0.15 |
| CC | 72.63±0.30 | 68.68±0.14 | 74.50±0.13 |
| CRD | 73.54±0.19 | 69.98±0.27 | 75.59±0.07 |
| SemCKD | 75.27±0.13 | 69.61±0.05 | 76.23±0.04 |
| SemCKD+CRD | 75.52±0.09 | 70.55±0.11 | 76.68±0.19 |
| Teacher | 79.42 | 75.61 | 79.42 |

Sensitivity Analysis

Finally, we evaluate the impact of the hyper-parameter $\beta$ on the performance of knowledge distillation. We compare three representative knowledge distillation approaches: logits transfer (KD), feature-embedding transfer (CRD) and feature-map transfer (SemCKD). The hyper-parameter $\beta$ for SemCKD ranges from 100 to 1100 at equal intervals of 100, while $\beta$ for CRD ranges from 0.5 to 1.5 at equal intervals of 0.1, adopting the same search space as the original paper (Tian, Krishnan, and Isola 2020). Note that $\beta$ always equals 0 for the vanilla KD, leading to a horizontal line in Figure 4. SemCKD achieves the best results in all cases and outperforms CRD by about 1.73 absolute accuracy points under the default hyper-parameter setting. Figure 4 also shows that the performance of SemCKD remains very stable once $\beta$ is greater than 400, which indicates that our proposed method works reasonably well over a wide range of the hyper-parameter $\beta$.

Figure 4: Impact of the hyper-parameter $\beta$ for VGG-8 & ResNet-32x4 on CIFAR-100.

Conclusion

Feature maps produced by multiple intermediate layers of a powerful teacher model are valuable for improving knowledge transfer performance. A peculiar challenge for feature-map distillation is to ensure an appropriate association of teacher-student layer pairs. To alleviate the negative regularization effect caused by semantic mismatch between certain pairs of teacher-student intermediate layers, we propose semantic calibration via attention allocation for effective cross-layer distillation. Each student layer in our approach distills knowledge contained in multiple target layers with an automatically learned attention distribution to obtain proper supervision. Experimental results show that training with SemCKD leads to a relatively low semantic mismatch score and that its generalization ability surpasses the compared approaches. Visualization and detailed analysis provide further insights into the working principle of SemCKD.
Acknowledgments

This work is funded by the National Key R&D Program of China (Grant No. 2018AAA0101505) and the State Grid Corporation of China Science and Technology Project "Fundamental Theory of Human-in-the-loop Hybrid-Augmented Intelligence for Power Grid Dispatch and Control". The authors would like to thank Ziying Guo and the anonymous reviewers for their helpful comments.

References

Ahn, S.; Hu, S. X.; Damianou, A. C.; Lawrence, N. D.; and Dai, Z. 2019. Variational Information Distillation for Knowledge Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9163–9171.
Anil, R.; Pereyra, G.; Passos, A.; Ormándi, R.; Dahl, G. E.; and Hinton, G. E. 2018. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations.
Ba, J.; and Caruana, R. 2014. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems, 2654–2662.
Bengio, Y.; Courville, A. C.; and Vincent, P. 2013. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8): 1798–1828.
Bucilua, C.; Caruana, R.; and Niculescu-Mizil, A. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 535–541.
Chen, D.; Mei, J.-P.; Wang, C.; Feng, Y.; and Chen, C. 2020. Online Knowledge Distillation with Diverse Peers. In Proceedings of the AAAI Conference on Artificial Intelligence, 3430–3437.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
Jang, Y.; Lee, H.; Hwang, S. J.; and Shin, J. 2019. Learning What and Where to Transfer. In International Conference on Machine Learning.
Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report.
Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; and Duan, Y. 2019. Knowledge Distillation via Instance Relationship Graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7096–7104.
Ma, N.; Zhang, X.; Zheng, H.; and Sun, J. 2018. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision, 122–138.
Müller, R.; Kornblith, S.; and Hinton, G. E. 2019. When Does Label Smoothing Help? In Advances in Neural Information Processing Systems.
Park, W.; Kim, D.; Lu, Y.; and Cho, M. 2019. Relational Knowledge Distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3967–3976.
Passalis, N.; and Tefas, A. 2018. Learning Deep Representations with Probabilistic Knowledge Transfer. In European Conference on Computer Vision, 283–299.
Passalis, N.; Tzelepi, M.; and Tefas, A. 2020. Heterogeneous Knowledge Distillation using Information Flow Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Peng, B.; Jin, X.; Li, D.; Zhou, S.; Wu, Y.; Liu, J.; Zhang, Z.; and Liu, Y. 2019. Correlation Congruence for Knowledge Distillation. In International Conference on Computer Vision, 5006–5015.
Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł.; and Hinton, G. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2015. FitNets: Hints for thin deep nets. In International Conference on Learning Representations.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C.; and Li, F. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.
Sandler, M.; Howard, A. G.; Zhu, M.; Zhmoginov, A.; and Chen, L. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In International Conference on Computer Vision, 618–626.
Simonyan, K.; and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Representation Distillation. In International Conference on Learning Representations.
Tung, F.; and Mori, G. 2019. Similarity-Preserving Knowledge Distillation. In International Conference on Computer Vision, 1365–1374.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Yuan, L.; Tay, F. E.; Li, G.; Wang, T.; and Feng, J. 2020. Revisiting Knowledge Distillation via Label Smoothing Regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In Proceedings of the British Machine Vision Conference.
Zagoruyko, S.; and Komodakis, N. 2017. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations.