Published as a conference paper at ICLR 2022

FEATURE KERNEL DISTILLATION

Bobby He1,2 & Mete Ozay2
1Department of Statistics, University of Oxford
2Samsung Research UK
Researched during internship at Samsung Research UK. Correspondence to bobby.he@stats.ox.ac.uk.

ABSTRACT

Trained Neural Networks (NNs) can be viewed as data-dependent kernel machines, with predictions determined by the inner product of last-layer representations across inputs, referred to as the feature kernel. We explore the relevance of the feature kernel for Knowledge Distillation (KD), using a mechanistic understanding of an NN's optimisation process. We extend the theoretical analysis of Allen-Zhu & Li (2020) to show that a trained NN's feature kernel is highly dependent on its parameter initialisation, which biases different initialisations of the same architecture to learn different data attributes in a multi-view data setting. This enables us to prove that KD using only pairwise feature kernel comparisons can improve NN test accuracy in such settings, with both single & ensemble teacher models, whereas standard training without KD fails to generalise. We further use our theory to motivate practical considerations for improving student generalisation when using distillation with feature kernels, which allows us to propose a novel approach: Feature Kernel Distillation (FKD). Finally, we experimentally corroborate our theory in the image classification setting, showing that FKD is amenable to ensemble distillation, can transfer knowledge across datasets, and outperforms both vanilla KD & other feature kernel based KD baselines across a range of standard architectures & datasets.

1 INTRODUCTION & BACKGROUND

A prevailing belief in the Deep Learning community is that feature learning, where data-dependent features are acquired during training, is crucial to explaining the empirical success of Neural Networks (NNs) (Fort et al., 2020; Baratin et al., 2021). A comparison in this regard is often made to kernel methods (Jacot et al., 2018), which can be thought of as feature selection methods over a fixed, data-independent set of features. This separation has been caricaturised as a distinction between feature learning and kernel learning regimes (Chizat et al., 2019; Yang & Hu, 2020; Woodworth et al., 2020) of NN training. Though less amenable to theoretical analysis than kernel regimes, feature learning regimes can capture more of the complex empirical phenomena that one observes in NNs due to parameter-space non-convexity, such as: i) how ensembling trained NNs differing solely in their independent parameter initialisations can improve predictive accuracy & uncertainty (Lakshminarayanan et al., 2017; Allen-Zhu & Li, 2020), or ii) the effectiveness of knowledge distillation with both single & ensemble teacher models (Buciluă et al., 2006; Hinton et al., 2015). This implies that in order to understand ensembling & knowledge distillation (KD) in NNs, we need to understand the mechanisms of NN feature learning regimes.

Ensembling can be loosely summarised as aggregating predictions from multiple models, & is used widely across machine learning (ML) to improve performance (Dietterich, 2000; Breiman, 2001). Conversely, knowledge distillation (KD), the idea of transferring knowledge from a teacher model to a student model, has garnered the most attention with NNs.
Remarkably, via KD it is possible to significantly improve a single student's generalisation with knowledge from a teacher model, or an ensemble of teachers. This means that the single student model has enough flexibility to generalise well (relative to the teacher); thus, one must factor in the optimisation process (such as the parameter initialisation) in order to explain the mechanisms of ensembling and KD in NNs.

To describe KD, suppose we have $N$ input-target data pairs $(x, y) \in \mathbb{R}^d \times \mathbb{R}^{C_S}$, $\hat{D} = \{(x_i, y_i)\}_{i=1}^N$, sampled i.i.d. (independent & identically distributed) from some distribution $D$, and a student NN architecture $f^S(x) = W^S h^S(x, \theta^S)$. Here $h^S(x, \theta^S) \in \mathbb{R}^{m \times 1}$ is a student-specific feature extractor model (e.g. MLP, CNN, ResNet, or Transformer) with parameters $\theta^S$, and $W^S \in \mathbb{R}^{C_S \times m}$ is a parameter matrix for the last layer. Assume also that we have a loss $L(\theta^S, W^S) = \frac{1}{N}\sum_{i=1}^{N} L(f^S(x_i), y_i)$, which we seek to minimise over $\theta^S, W^S$ in the hope that $f^S$ can generalise to unseen $(x, y)$ pairs. $L$ is typically cross-entropy in the classification setting. Vanilla KD (Hinton et al., 2015) distils knowledge from a trained teacher network $f^T(x) = W^T h^T(x, \theta^T) \in \mathbb{R}^{C_T}$ to a student by regularising the student $f^S$ towards the teacher $f^T$:

$$L_{\lambda_{KD}}(\theta^S, W^S) = L(\theta^S, W^S) + \lambda_{KD} \frac{1}{N}\sum_{i=1}^{N} L\!\left(\frac{f^S(x_i)}{\tau}, \frac{f^T(x_i)}{\tau}\right) \qquad (1)$$

for temperature $\tau > 0$ & regularisation $\lambda_{KD} > 0$ hyperparameters. Note that this is only valid if $C_T = C_S$.

Following Hinton et al. (2015), many methods have been proposed using different quirks of NNs to distil knowledge from teacher to student. A relevant line of work involves encouraging the student to match how similar/related the teacher views two inputs $x, x'$ to be (Passalis & Tefas, 2018; Tung & Mori, 2019; Park et al., 2019). These approaches have the benefit of being agnostic to teacher/student architectures & prediction spaces $C_T$ & $C_S$, but as of yet remain heuristically motivated. In this work, we explore such approaches under the more general framework of NN feature kernel learning (the kernel induced by the inner product of last-layer features $h$), allowing us to provide the missing theoretical justification. Moreover, we use our theoretical insights to introduce practical improvements for FKD in Section 4, which we show outperform these previous works in Section 5.

Allen-Zhu & Li (2020) provide the first theoretical exposition of the mechanisms by which vanilla KD and ensembling improve generalisation in NNs. To this end, the authors introduce the notion of multi-view data, which is when a class in a multi-class classification problem has multiple identifying features/attributes. For example, an image of a car can be discerned by i) wheels, ii) windows, or iii) headlights. The key idea is that the NN parameter initialisation, and its random correlations with certain attributes, will bias the NN to learn only a subset of the entire set of attributes pertaining to a given class. When presented with single-view data lacking the class-identifying attribute that the NN has learnt, the NN will not generalise. For example, an NN that has learnt to classify cars based on whether they have headlights will not generalise to a side-on image of a car that occludes headlights. The implication then is that ensembling NNs works in part because independent parameter initialisations learn independent sets of attributes, so more data features will be learnt across the ensemble model.
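To make the vanilla KD objective in Eq. (1) concrete, here is a minimal PyTorch-style sketch, assuming $L$ is cross-entropy and that both student and teacher logits are softened by the same temperature $\tau$ (the standard formulation of Hinton et al. (2015)); the $\tau^2$ rescaling of the distillation term is the usual convention rather than something stated explicitly above, and the model names in the usage comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, targets,
                    tau: float = 4.0, lambda_kd: float = 1.0):
    """Cross-entropy on hard labels plus a soft-label distillation term, as in Eq. (1)."""
    # Standard supervised term L(theta_S, W_S).
    ce = F.cross_entropy(student_logits, targets)
    # Soft teacher targets at temperature tau; KL between softened distributions.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
    return ce + lambda_kd * kd

# Usage (hypothetical models): both heads must share the prediction space C_T = C_S.
# loss = vanilla_kd_loss(student(x), teacher(x).detach(), y)
```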
Moreover, it is argued that vanilla KD in NNs works because the features learnt by the teacher model (or models) are imparted to the student via soft teacher labels that capture ambiguity in a given data input (such as an image of a car whose headlights look like the eyes of a cat). This is fundamentally different to ensembling in strongly convex feature selection problems, such as linear or random features (Rahimi & Recht, 2007) models with $\ell_2$ regularisation. In such cases, different initialisations reach the same unique optimum, and additional noise must be added to ensure predictive diversity in the ensemble (Matthews et al., 2017). These analyses suggest that it is not possible to fully explain KD or ensembling in NNs without feature learning, thus motivating our study of Feature Kernel Distillation, where one performs KD on NN features directly.

Our contributions. Feature learning can be thought of as when the feature kernel, induced by the inner product of last-layer representations in a NN, changes during training (Yang & Hu, 2020), and kernel learning in NNs can be thought of as when this kernel is constant. In this work, we take a feature learning perspective of knowledge distillation (KD). We first highlight the importance of the feature kernel by viewing trained NNs as data-dependent kernel machines, & use this to motivate Feature Kernel Distillation (FKD). In FKD, we aim to ensure that the student's feature kernel is well suited for improved generalisation, using both the teacher's data-dependent feature kernel as well as an understanding of the student NN's optimisation process. In Section 3, we adapt the framework of Allen-Zhu & Li (2020) to show that FKD offers the same generalisation benefits as found in vanilla KD in a multi-view data setting, and is further amenable to ensemble distillation. We then derive practical considerations from our insights in Section 4, to improve FKD through an understanding of the NN's feature learning optimisation process, compared to previous methods which implicitly used the feature kernel for KD. Finally, in Section 5, we provide experimental support that our theoretical claims extend to standard image classification settings, by verifying that FKD is amenable to ensemble distillation; can transfer knowledge across datasets with different prediction spaces (unlike vanilla KD); and outperforms vanilla KD & previous feature kernel based distillation methods over a range of architectures on CIFAR-100 and ImageNet-1K.

2 MOTIVATION FOR FEATURE KERNEL DISTILLATION

[Figure 1: Feature Kernel Distillation (FKD) from the feature extractor of a teacher $h^T$ to that of a student $h^S$.]

One obvious limitation of vanilla KD is that the student $f^S$ and teacher $f^T$ need to share prediction spaces, i.e. $C_S = C_T$. In many situations, we may have a teacher network trained on a dataset with a different number of classes than the student's dataset, and it is not clear how one could apply vanilla knowledge distillation. One possibility could be to regularise directly in feature space by comparing $h^S$ and $h^T$ element-wise, but again this either requires matching teacher-student last-layer sizes or additional projection layers. To eschew such unnecessary complications, we take the perspective of NNs as data-dependent kernel machines. Define an NN's feature kernel to be:

Definition 1 (Feature Kernel). Suppose we have parameters $\theta$ and a last-layer NN feature extractor $h(\cdot, \theta): \mathbb{R}^d \to \mathbb{R}^m$.
For two inputs $x_i, x_j \in \mathbb{R}^d$, the feature kernel $k$ is the kernel induced by the inner product of $h(x_i, \theta)$ and $h(x_j, \theta)$, that is: $k(x_i, x_j) \stackrel{\text{def}}{=} \langle h(x_i, \theta), h(x_j, \theta) \rangle$.

At initialisation, it is well known that in the infinite NN-width limit, with appropriate scaling, the feature kernel $k$ converges almost surely to a deterministic kernel known as the Neural Network Gaussian Process (NNGP) kernel (Neal, 2012; Lee et al., 2018; Matthews et al., 2018; Yang, 2019). Yang & Hu (2020) show that there is a parameterisation-dependent dichotomy between kernel & feature learning regimes for infinite-width NNs, where the feature kernel $k$ is constant or changes during training, respectively. It has been widely demonstrated that a crucial component of the success of finite-width NNs is their ability to flexibly learn features, and indeed the feature kernel, from data during training (Fort et al., 2020; Aitchison, 2020; Chen et al., 2020b; Maddox et al., 2021).

To see the importance of the feature kernel, note that for a fixed $\theta$ with many common loss functions $L$, and some mild assumptions on strong convexity (which could be enforced e.g. with standard $\ell_2$ regularisation), the optimal $W$ is uniquely determined and $k$ determines the entire predictive function $f(\cdot)$. For example, with squared error, $L(f(x), y) = \|f(x) - y\|_2^2$, and $\ell_2$ regularisation strength $\lambda > 0$, a trained NN is precisely kernel ridge regression with the data-dependent feature kernel $k$, whose job is to measure how similar different inputs are. Thus, all teacher knowledge is contained in its feature kernel, $k^T$, so the feature kernel can act as our primary distillation target, as depicted in Fig. 1. We show a corresponding result for cross-entropy loss in App. A.

Fig. 2 corroborates our claims. For a ResNet20v1 (He et al., 2016) reference model trained on CIFAR-10 with cross entropy, we plot test class prediction confusion matrices between said model and: i) a retrained version where all but the last-layer parameters are fixed (hence fully determined by the reference model's feature kernel, as per App. A), & ii) an independent model trained from a different initialisation.

[Figure 2: CIFAR-10 test prediction confusion matrices between a fixed reference model and a model with: (left) retrained last layer, and (right) independent initialisation.]

As expected, there is significantly more disagreement across test predictions for models with different initialisations than for those which share feature kernels. This suggests: a) different initialisations bias the same architecture to learn different features, and b) the feature kernel (largely) determines a model's test predictions. Experimental details and a breakdown of the predictive disagreements can be found in App. F.

Having described the feature kernel as a central object in any NN, we use this to motivate our proposed FKD, where we treat the teacher's feature kernel, $k^T$, as a key distillation target for the student's feature kernel, $k^S$. Encouraging similarity across feature kernels shares useful features that the teacher has learnt with the student, which we theoretically show in Section 3.
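As an illustration of the claim above, the following PyTorch-style sketch computes the feature kernel of Definition 1 from a trained feature extractor and forms the kernel-ridge-regression predictions that the fixed-feature, squared-error-plus-$\ell_2$ problem reduces to. The feature-extractor callable and one-hot targets are assumptions made for the example, not part of the paper's code.

```python
import torch

@torch.no_grad()
def feature_kernel(h, X1, X2):
    """k(x_i, x_j) = <h(x_i), h(x_j)> for all pairs (Definition 1)."""
    H1, H2 = h(X1), h(X2)          # shapes (n1, m), (n2, m)
    return H1 @ H2.T               # Gram matrix of last-layer features

@torch.no_grad()
def kernel_ridge_predict(h, X_train, Y_onehot, X_test, lam=1e-3):
    """With squared error + l2 on the last layer, predictions depend on the
    features only through the feature kernel (Section 2 / App. A)."""
    K = feature_kernel(h, X_train, X_train)        # (n, n)
    K_star = feature_kernel(h, X_test, X_train)    # (n_test, n)
    alpha = torch.linalg.solve(K + lam * torch.eye(K.shape[0]), Y_onehot)
    return K_star @ alpha                          # (n_test, C) predictions
```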
We define the FKD student loss function via an additive regularisation term between feature kernels:1

$$L_{\lambda_{KD}}(\theta, W) \stackrel{\text{def}}{=} L(\theta, W) + \lambda_{KD}\, D(k_\theta, k^T) \qquad (2)$$

where $\lambda_{KD} > 0$ is the regularisation strength, $D$ is some (pseudo-)distance over kernels, and the student feature kernel $k^S = k_\theta$ is written to make explicit the dependence on student feature extractor parameters $\theta$. We stress that Eq. (2) does not require matching prediction (nor feature) spaces between teacher and student, allowing us to apply FKD across tasks, architectures, and datasets. We consider Eq. (2) with $D$ set to:

$$D(k_\theta, k^T) = \mathbb{E}_{x_1, x_2 \overset{\text{i.i.d.}}{\sim} \hat{D}}\left[\, \big| k_\theta(x_1, x_2) - k^T(x_1, x_2) \big|^p \,\right], \qquad (3)$$

with the expectation approximated by an average over a minibatch. In this work we choose $p = 2$, so that $D$ gives the Frobenius norm of the difference in feature kernel Gram matrices over a batch.

3 THEORETICAL ANALYSIS FOR FKD

We now adapt the theoretical framework of Allen-Zhu & Li (2020), which is restricted to vanilla KD, to demonstrate the generalisation benefits of FKD over standard training. Note that FKD distils knowledge by comparing different data points, whereas vanilla KD compares a single data point across classes: this core difference is reflected throughout our analysis relative to Allen-Zhu & Li (2020). We first describe the multi-view data setting & CNN architecture we consider, before recalling that standard training without KD fails to generalise well. We then provide our main theoretical result, Theorem 2, which shows that FKD improves student test performance. Though our theoretical results are limited to a specific scenario, inspired by real-world data (Allen-Zhu & Li, 2020), & NN architecture, we believe the setup we consider is apt: it is simple enough to be tractable, yet rich enough to display the merits of FKD. Moreover, we find in Section 5 that our conclusions generalise to standard architectures & image datasets. In the interest of space & readability, we focus on providing intuition in this section, and fill in the remaining details/proofs in the appendix.

1We will sometimes drop the student S sub/superscript where obvious, for clarity, like in Eq. (2). Any teacher-specific object, e.g. $k^T$, will always have a corresponding T sub/superscript.

Multi-view data. We consider the data classification problem introduced by Allen-Zhu & Li (2020), with $C$ classes and inputs $x$ with $P$ patches, each of dimension $d$, meaning $x \in (\mathbb{R}^d)^P$. For each class $c$, we suppose that there exist two attributes $v_{c,1}, v_{c,2} \in \mathbb{R}^d$. For $x$ belonging to class $c$, the attributes found in patches of $x$ will include $v_{c,1}$ and $v_{c,2}$, as well as a random selection of out-of-class attributes $\{v_{c',l}\}_{c' \neq c,\, l \in [2]}$.2 This denotes the multi-view nature of the data distribution. In the true data-generating distribution $D$, we suppose that a proportion $\mu$ of the data $(x, y)$ is single-view, which means that only one of $v_{c,1}$ or $v_{c,2}$ is present in $x$ when $(x, y)$ is from class $c$. These will be the data for which standard training fails to generalise. A precise definition of multi-view data is presented in App. B.1.1. Allen-Zhu & Li (2020) argue that this multi-view setting provides a compelling proxy for standard image datasets such as CIFAR-10/100 (Krizhevsky, 2009).3
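To fix ideas, below is a minimal sketch of the kind of multi-view/single-view inputs described above; the precise generating mechanism (attribute strengths, feature noise, patch counts) is in App. B.1.1, and all constants here are illustrative placeholders rather than the values used in the theory.

```python
import torch

def sample_multiview_example(C=20, P=30, d=64, s_prob=0.05, single_view=False,
                             noise_std=0.01, generator=None):
    """Toy sample x in (R^d)^P with class attributes v_{c,1}, v_{c,2} (illustrative only)."""
    g = generator or torch.Generator().manual_seed(0)
    # Orthonormal attributes: v_{c,l} taken as standard basis vectors (needs d >= 2C).
    V = torch.eye(d)[: 2 * C].reshape(C, 2, d)
    y = torch.randint(C, (1,), generator=g).item()
    x = noise_std * torch.randn(P, d, generator=g)        # noise-only background patches
    # In-class attributes: both views if multi-view, only one if single-view.
    views = [torch.randint(2, (1,), generator=g).item()] if single_view else [0, 1]
    present = [(y, l, 1.0) for l in views]
    # Sparse, weaker out-of-class attributes (the "ambiguous" features).
    for c in range(C):
        for l in range(2):
            if c != y and torch.rand(1, generator=g).item() < s_prob:
                present.append((c, l, 0.4))
    free = torch.randperm(P, generator=g).tolist()
    for c, l, z in present:                                # place each attribute in a patch
        x[free.pop()] += z * V[c, l]
    return x, y
```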
Intuition: FKD on multi-view data. Suppose we have an image classification task, with cat & car just two out of many classes. For the car class, $v_{c,1}$ could correspond to headlights, whilst $v_{c,2}$ could correspond to wheels. We would then expect $v_{c,1}$ to also appear in patches of an input image, $x_{\text{cat}}$, corresponding to a cat with headlight-like eyes. Allen-Zhu & Li (2020) show that a single trained model is biased to learn exactly one of $v_{c,1}$ or $v_{c,2}$, depending on its parameter initialisation. W.l.o.g., suppose that the student is biased to learn $v_{c,2}$ & not $v_{c,1}$. If the teacher model has learnt $v_{c,1}$, this means that the teacher model knows there is a similarity between $x_{\text{cat}}$ & any car image, $x_{\text{car}}$, that displays headlights. Mathematically, we show that this corresponds to a large value for $k^T(x_{\text{cat}}, x_{\text{car}})$. Our FKD regularisation forces the student to also have a large value for $k^S(x_{\text{cat}}, x_{\text{car}})$, ensuring that attribute $v_{c,1}$ is also learnt by the student network. Without distillation, a student NN which has learnt $v_{c,2}$ & not $v_{c,1}$ will not generalise to front-on images of cars that hide wheels.

Convolutional NN & corresponding feature kernel. Like Allen-Zhu & Li (2020), for our theoretical analysis we consider a single hidden-layer convolutional NN (CNN) with sum-pooling.4 For each class $c \in [C]$, we suppose that the CNN has $m$ channels, giving $Cm$ channels in total. For channel $r$ and class $c$, we suppose that we have weights $\theta_{c,r} \in \mathbb{R}^d$. This gives output for class $c$ by

$$f_c(x) \stackrel{\text{def}}{=} \sum_{r=1}^{m} \sum_{p=1}^{P} \widetilde{\mathrm{ReLU}}\big(\langle \theta_{c,r}, x_p \rangle\big) \qquad (4)$$

where, for ease of analysis, $\widetilde{\mathrm{ReLU}}$ is ReLU-like but with continuous gradient, see App. B.2. Before we consider FKD, we must first define the feature kernel for this CNN $f$. To do so, we recast $f(x) = W h(x, \theta)$, where $h(x, \theta) \in \mathbb{R}^{Cm}$ and $W \in \mathbb{R}^{C \times Cm}$ satisfy, for $r \in [m]$, $c \in [C]$, $c' \in [C]$:

$$h(x, \theta)_{r + (c-1)m} \stackrel{\text{def}}{=} \sum_{p=1}^{P} \widetilde{\mathrm{ReLU}}\big(\langle \theta_{c,r}, x_p \rangle\big), \quad \text{and} \quad W_{c',\, r + (c-1)m} \stackrel{\text{def}}{=} \mathbb{1}\{c = c'\}. \qquad (5)$$

Given that the feature kernel is $k(x, x') \stackrel{\text{def}}{=} \langle h(x, \theta), h(x', \theta) \rangle$, we now have that:5

$$k(x, x') = \sum_{c=1}^{C} \sum_{r=1}^{m} \sum_{p, p' = 1}^{P} \widetilde{\mathrm{ReLU}}\big(\langle \theta_{c,r}, x_p \rangle\big)\, \widetilde{\mathrm{ReLU}}\big(\langle \theta_{c,r}, x'_{p'} \rangle\big). \qquad (6)$$

We first recall that standard training of the model $f$ with gradient descent and cross-entropy loss fails to generalise on half of the $\mu$ proportion of data that is single-view.

2It is straightforward to extend to the case of more than two views per class if need be.
3https://www.microsoft.com/en-us/research/blog/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation/
4It is straightforward to extend our analysis to max-pooling.
5The feature kernel defined in Eq. (6) corresponds to the Global Average Pooling CNN-GP kernel in Novak et al. (2018) in the infinite-channel limit, which captures intra-patch correlations, unlike the vectorised CNN-GP, which corresponds to vectorising the spatial dimensions to give $CmP$ rather than $Cm$ channels.

Theorem 1 (Standard training fails, Theorem 1 of Allen-Zhu & Li (2020)). For sufficiently many classes $C$ and channels $m \in [\mathrm{polylog}(C), C]$, with learning rate $\eta \le \frac{1}{\mathrm{poly}(C)}$, training time $T = \frac{\mathrm{poly}(C)}{\eta}$, and multi-view data distribution (App. B.1.1), the trained model $f^{(T)}$ satisfies, with probability at least $1 - e^{-\Omega(\log^2 C)}$:
- Training accuracy is perfect: for all $(x, y) \in \hat{D}$, $y = \arg\max_c f_c^{(T)}(x)$.
- Test accuracy is bad but consistent: $\mathbb{P}_{(x,y)\sim D}\big[y \neq \arg\max_c f_c^{(T)}(x)\big] \in [0.49\mu,\ 0.51\mu]$.

Now we are ready to show that regularising the student model towards the teacher model's feature kernel, as in FKD, improves test accuracy. We suppose our teacher model is an ensemble of $E \ge 1$ models $\{f_e\}_{e=1}^{E}$, each with corresponding feature kernel $k_e$, trained as standard on the same data with independent initialisations $\theta_0^e$. We average $k_e$ over $e \in [E]$ to obtain our teacher feature kernel:

$$k^T(x, x') = \frac{1}{E} \sum_{e=1}^{E} k_e(x, x'). \qquad (7)$$

This is akin to concatenating all features in $\{h_e\}_{e=1}^{E}$ into an $ECm$-dimensional feature vector, albeit without the additional computational baggage.
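Before stating the main result, here is a small illustrative sketch (not the paper's code) of the objects used in the analysis: the single-hidden-layer patch CNN of Eq. (4)-(5), its feature kernel from Eq. (6), and the ensemble-averaged teacher kernel of Eq. (7). For brevity the smoothed ReLU of App. B.2 is replaced by the ordinary ReLU.

```python
import torch

def features(theta, x):
    """h(x, theta) in Eq. (5): one sum-pooled channel per pair (class c, channel r).
    theta: (C, m, d) weights; x: (P, d) patches; returns a (C*m,) feature vector."""
    pre = torch.einsum("cmd,pd->cmp", theta, x)        # <theta_{c,r}, x_p> for every patch
    return torch.relu(pre).sum(dim=-1).reshape(-1)     # sum-pooling over patches

def logits(theta, x):
    """f_c(x) in Eq. (4): sum the m class-c channels of h."""
    C, m, _ = theta.shape
    return features(theta, x).reshape(C, m).sum(dim=-1)

def feature_kernel(theta, x1, x2):
    """k(x, x') in Eq. (6) = <h(x), h(x')>."""
    return features(theta, x1) @ features(theta, x2)

def teacher_kernel(thetas, x1, x2):
    """Ensemble teacher kernel k^T in Eq. (7): average of per-teacher kernels."""
    return sum(feature_kernel(t, x1, x2) for t in thetas) / len(thetas)
```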
We then have our main theoretical result:

Theorem 2 (FKD improves student generalisation and is better with a larger ensemble). Fix an arbitrary $\epsilon > 0$. For any ensemble size $E$ of teacher NNs trained as in Theorem 1 and sufficiently many classes $C$, for $m = \mathrm{polylog}(C)$, with learning rate $\eta \le \frac{1}{\mathrm{poly}(C)}$ and training time $T = \frac{\mathrm{poly}(C)}{\eta}$, ensemble teacher knowledge can be distilled into a single student model $f^{(T)}$ using only the teacher feature kernel $k^T$, Eq. (7), such that with probability at least $1 - e^{-\Omega(\log^2 C)}$:
- Training accuracy is perfect: for all $(x, y) \in \hat{D}$, $y = \arg\max_c f_c^{(T)}(x)$.
- Test accuracy is good: $\mathbb{P}_{(x,y)\sim D}\big[y \neq \arg\max_c f_c^{(T)}(x)\big] \le \big(\tfrac{1}{2^{E+1}} + \epsilon\big)\mu$.

Proof outline for Theorem 2. We first show, in Lemma 1, that a single trained NN's feature kernel (which we defined in Eq. (6)) can detect whether two inputs share a data attribute that the NN has learnt due to its weight initialisation. We extend this result to an ensemble teacher in App. C.2, showing that the ensemble teacher feature kernel detects the union of all data attributes learnt by the individual $\{k_e\}_{e=1}^{E}$. This simplifies our calculations when showing that our distillation regulariser, Eq. (3), is effective for improved student generalisation. The full proof can be found in App. C.

Intuition: FKD with ensemble of teachers. To parse Theorem 2, suppose we have a single teacher, i.e. $E = 1$. Then the test error in Theorem 2 is essentially $0.25\mu$. The explanation is that the student & teacher networks each independently learn one of $\{v_{c,l}\}_{l=1}^{2}$ for each class $c$. Either vanilla KD (Theorem 4 of Allen-Zhu & Li (2020)) or our feature kernel approach allows the student to learn the union of the attributes learnt independently by the student and the teacher, so only for a quarter of the single-view test data $x_s$ will the student not have learnt the useful class attribute present in $x_s$. For general ensemble size $E$, the story is the same: the student & $E$ teachers each independently learn one of the two useful attributes $\{v_{c,l}\}_{l \le 2}$ for all $c \in [C]$. Distilling allows the student to learn the union of these attributes, which means that the student will fail on only a $\tfrac{1}{2^{E+1}}$ fraction of the single-view data.

4 FKD IN PRACTICE

Next, we highlight practical considerations for implementing FKD derived from our theory. Pseudocode and PyTorch-style code for our FKD implementation are given in Algs. 1 and 2 respectively.

Correlation kernel. We propose $D(\rho_\theta, \rho^T)$ as a regulariser in FKD instead of $D(k_\theta, k^T)$, where:

$$\rho_z(x, x') \stackrel{\text{def}}{=} \frac{k_z(x, x')}{\sqrt{k_z(x, x)\, k_z(x', x')}}, \quad \text{and} \quad \rho^T(x, x') \stackrel{\text{def}}{=} \frac{1}{E} \sum_{e=1}^{E} \rho_e(x, x')$$

defines the feature correlation kernel $\rho_z$, corresponding to feature kernel $k_z$, for $z \in [E] \cup \{\theta\}$. The reason we use correlation kernels is that they normalise the data, so that $\rho(x, x) = 1$ for all $x$, which zeros the diagonal differences in Eq. (3): we hope FKD allows the student to learn from the teacher features shared between different inputs. Non-zero diagonal differences, like in Similarity-Preserving (SP) KD (Tung & Mori, 2019), encourage the student to learn noise, as we show in App. D.1, and we hypothesise that this contributes to the improved performance of FKD over SP that we observe in Section 5. Moreover, this normalisation helps balance individual teachers' influence in an ensemble teacher, and ensures that FKD does not need a temperature hyperparameter $\tau$, which produces soft labels in vanilla KD.
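A minimal PyTorch-style sketch of the correlation kernel and the resulting minibatch FKD regulariser $D(\rho_\theta, \rho^T)$ with $p = 2$ follows; it is an illustration consistent with the definitions above, not the authors' released code. Only off-diagonal entries contribute, since $\rho(x, x) = 1$ by construction, and the symmetric sum over unordered pairs gives the same value as the ordered-pair mean used below.

```python
import torch

def correlation_kernel(H):
    """rho(x_i, x_j) = k(x_i, x_j) / sqrt(k(x_i, x_i) k(x_j, x_j)) for a feature batch H: (B, m)."""
    K = H @ H.T
    d = K.diagonal().clamp_min(1e-12).rsqrt()
    return d[:, None] * K * d[None, :]

def fkd_regulariser(H_student, teacher_feature_batches):
    """Minibatch estimate of D(rho_theta, rho^T), Eqs. (2)-(3) in correlation form."""
    rho_s = correlation_kernel(H_student)
    with torch.no_grad():                      # teacher kernels are fixed targets
        rho_t = torch.stack(
            [correlation_kernel(H) for H in teacher_feature_batches]).mean(dim=0)
    diff = rho_s - rho_t
    B = diff.shape[0]
    off_diag = diff - torch.diag_embed(diff.diagonal())   # diagonal is ~0 anyway
    return off_diag.pow(2).sum() / (B * (B - 1))          # mean over i != j pairs
```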
[Figure 3: Histogram of normalised feature kernel values, $k(x, x) / \max_{x'} k(x', x')$, over the CIFAR-100 test set, with and without Feature Regularisation (FR).]

Feature regularisation. One downside to using the correlation kernel is that our FKD regularisation, $D(\rho_\theta, \rho^T)$, becomes invariant to the scale of $k_\theta$. For example, replacing $k_\theta(x, x')$ with $\sqrt{M(x) M(x')}\, k_\theta(x, x')$, for any $M(x): \mathbb{R}^d \to \mathbb{R}_+$, leaves the student correlation kernel unchanged. This may lead to degeneracies when training $\theta$, & large variations in $k(x, x)$ over $x$ may harm generalisation (as evidenced by the fact that input normalisation is common across ML, from linear models to NNs). Moreover, our proof of Theorem 2 is not quantitative, in that we only show $k_\theta(x, x') = \Theta(1)$ in the number of classes $C$, up to polylogarithmic factors in $C$, when $x \neq x'$ share a data attribute $v_{c,l}$ that has been learned by parameters $\theta$. These insights motivate us to regularise the student feature $h$ during FKD training to control the norm of $k(x, x)$ across inputs $x$. We use an additive $\ell_2$ regularisation $\frac{1}{B}\sum_{b=1}^{B} \|h(x_b, \theta)\|_2^2 = \frac{1}{B}\sum_{b=1}^{B} k(x_b, x_b)$ with regularisation strength $\lambda_{FR} > 0$, for each minibatch $\{(x_b, y_b)\}_{b=1}^{B}$. Fig. 3 shows that using Feature Regularisation (FR) encourages a more even spread of $k(x, x)$ across inputs for a student VGG8 network trained with a VGG13 teacher model on CIFAR-100. Corresponding plots for other architectures can be found in Fig. 7. Similar to Dauphin & Cubuk (2021), we find that FR improves generalisation & provide ablations in Section 5.

Algorithm 1 Feature Kernel Distillation with SGD.
Require: Maximum number of iterations $T$, batch size $B$, learning rate $\eta$, teacher correlation kernel $\rho^T$, FKD regularisation $\lambda_{KD} > 0$, feature regularisation $\lambda_{FR} > 0$.
Initialise student parameters $\theta_0, W_0$.
for iteration $t = 0, \ldots, T$ do
    Sample minibatch $(x^B_i, y^B_i)_{i=1}^{B} \overset{\text{i.i.d.}}{\sim} \hat{D}$.
    Compute loss $L = \frac{1}{B}\sum_{i=1}^{B} L\big(f(x^B_i), y^B_i\big) + \frac{2\lambda_{KD}}{B(B-1)} \sum_{i \neq j}^{B} \big(\rho_{\theta_t}(x^B_i, x^B_j) - \rho^T(x^B_i, x^B_j)\big)^2$.
    Add feature regularisation $L \leftarrow L + \frac{\lambda_{FR}}{B}\sum_{i=1}^{B} \|h(x^B_i, \theta_t)\|_2^2$ (optional).
    Update parameters $\theta_{t+1} \leftarrow \theta_t - \eta \nabla_\theta L$, $W_{t+1} \leftarrow W_t - \eta \nabla_W L$.
end for
return $\{\theta_T, W_T\}$
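The PyTorch-style implementation referred to as Alg. 2 lives in the appendix and is not reproduced here; the following is an illustrative, self-contained sketch of one training step of Algorithm 1 above. The assumption that `student(x)` and each teacher return a (features, logits) pair, and the optimiser/model names, are hypothetical conveniences for the example.

```python
import torch
import torch.nn.functional as F

def correlation(H):
    K = H @ H.T
    d = K.diagonal().clamp_min(1e-12).rsqrt()
    return d[:, None] * K * d[None, :]

def fkd_step(student, teachers, optimiser, x, y, lambda_kd=1.0, lambda_fr=1e-4):
    """One SGD step of Algorithm 1; models are assumed to return (features h, logits f)."""
    h_s, logits_s = student(x)
    B = x.shape[0]
    # Supervised cross-entropy term.
    loss = F.cross_entropy(logits_s, y)
    # FKD term: match off-diagonal correlation-kernel entries to the ensemble teacher.
    rho_s = correlation(h_s)
    with torch.no_grad():
        rho_t = torch.stack([correlation(t(x)[0]) for t in teachers]).mean(dim=0)
    mask = ~torch.eye(B, dtype=torch.bool, device=x.device)
    loss = loss + 2 * lambda_kd / (B * (B - 1)) * ((rho_s - rho_t)[mask] ** 2).sum()
    # Optional feature regularisation on k(x, x) = ||h(x)||^2.
    loss = loss + lambda_fr * h_s.pow(2).sum(dim=1).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```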
5 EXPERIMENTS

Due to space concerns, App. F contains further experiments & any missing experimental details.

Ensemble distillation. We first verify that a larger ensemble teacher size, $E$, further improves FKD student performance, as suggested by Theorem 2. This is confirmed in Fig. 4, using VGG8 for all student & teacher networks on the CIFAR-100 dataset.

[Figure 4: FKD as teacher ensemble size changes (CIFAR-100 test accuracy (%) against teacher ensemble size, for KD, FKD, the teacher ensemble, and the undistilled student). Error bars denote 95% confidence for mean of 10 runs.]

We also plot the test accuracy of the ensemble teacher across sizes $E$, whose predictive probabilities are averaged over individual teachers, as well as the test accuracy of an undistilled student model. We see that FKD consistently outperforms vanilla KD, and both distillation methods outperform the teacher in the self-distillation setting of $E = 1$ (Furlanello et al., 2018; Zhang et al., 2019). Moreover, FKD allows a single student to match teacher performance when $E = 2$, before positive but diminishing returns with larger $E$ relative to the teacher ensemble.

Table 1: Test accuracy (%) of FKD & baselines in a dataset transfer distillation setting. Error bars indicate 95% confidence for mean of 10 runs.

| Dataset (teacher → student) | Student | RKD | SP | FKD w.o. FR | FKD |
|---|---|---|---|---|---|
| C-100 → C-10 | 91.56 | 91.74 ± 0.09 | 92.21 ± 0.07 | 92.33 ± 0.07 | 92.61 ± 0.07 |
| C-100 → STL-10 | 72.69 | 72.86 ± 0.27 | 73.88 ± 0.17 | 75.17 ± 0.30 | 75.44 ± 0.31 |
| C-100 → Tiny-I | 48.53 | 48.74 ± 0.17 | 48.73 ± 0.14 | 48.85 ± 0.09 | 50.67 ± 0.12 |

Dataset Transfer. We next show FKD can transfer knowledge across similar datasets. From a fixed VGG13 teacher network trained on CIFAR-100, we distil to student VGG8 NNs on CIFAR-10, STL-10 & Tiny-ImageNet. As no student dataset has 100 classes, unlike CIFAR-100, it is not clear how one can use vanilla KD (Hinton et al., 2015) in this case. We thus compare FKD to other feature kernel based KD methods: Relational KD (RKD) (Park et al., 2019) & Similarity-Preserving (SP) KD (Tung & Mori, 2019). In Table 1, we see that FKD without feature regularisation outperforms both baselines across all datasets, and that feature regularisation (FR) further improves FKD performance, highlighting the benefit of our practical considerations in Section 4. The improved performance is particularly stark on STL-10 (which we downsize to 32x32 resolution), where FKD improves student performance by 2.75%. STL-10 is well suited for FKD as it has only 5K labeled inputs but 100K unlabeled datapoints, which can be used in our feature kernel regulariser, $D(\rho_\theta, \rho^T)$.

Table 2: CIFAR-100 and ImageNet-1K accuracies (%) comparing FKD with KD baselines. * denotes result from Tian et al. (2020); FKD uses the same teacher checkpoints provided by the authors,6 with error bars denoting 95% confidence for the mean over 5 students.

| | CIFAR-100 | CIFAR-100 | CIFAR-100 | CIFAR-100 | ImageNet-1K |
|---|---|---|---|---|---|
| Teacher | ResNet32x4 | VGG13 | ResNet32x4 | ResNet50 | ResNet34 |
| Student | ResNet8x4 | VGG8 | ShuffleNetV1 | VGG8 | ResNet18 |
| Teacher* | 79.42 | 74.64 | 79.42 | 79.34 | 73.26 |
| Student* | 72.50 | 70.36 | 70.50 | 70.36 | 69.97 |
| KD* (Hinton et al., 2015) | 73.33 ± 0.22 | 72.98 ± 0.17 | 74.07 ± 0.17 | 73.81 ± 0.11 | 70.66 |
| RKD* (Park et al., 2019) | 71.90 ± 0.10 | 71.48 ± 0.04 | 72.28 ± 0.34 | 71.50 ± 0.06 | N/A |
| SP* (Tung & Mori, 2019) | 72.94 ± 0.20 | 72.68 ± 0.17 | 73.48 ± 0.37 | 73.34 ± 0.30 | 70.62 |
| CRD* (Tian et al., 2020) | 75.51 ± 0.16 | 73.94 ± 0.19 | 75.11 ± 0.28 | 74.30 ± 0.12 | 71.17 |
| FKD w.o. FR | 74.89 ± 0.24 | 73.08 ± 0.16 | 74.66 ± 0.23 | 73.99 ± 0.15 | 70.84 |
| FKD | 75.57 ± 0.22 | 73.78 ± 0.17 | 75.00 ± 0.30 | 74.61 ± 0.28 | 71.23 |

Comparison on CIFAR-100 and ImageNet. Finally, we compare FKD to various knowledge distillation baselines on CIFAR-100 and ImageNet, across a selection of teacher/student architectures. We see in Table 2 that FKD consistently outperforms: vanilla KD (Hinton et al., 2015), RKD (Park et al., 2019), and SP (Tung & Mori, 2019). Moreover, FKD either matches or outperforms the high-performing Contrastive Representation Distillation (CRD) (Tian et al., 2020). We use the exact same teacher checkpoints used by Tian et al. (2020) and Chen et al. (2021) for CIFAR-100 and ImageNet respectively, to ensure fair comparison. We find, like in Table 1, that feature regularisation consistently improves FKD performance, and that even without feature regularisation, FKD outperforms all feature kernel based KD methods. This implies that using the correlation kernel to zero out diagonal differences, as described in Section 4, indeed helps improve student performance.

6Apart from ImageNet-1K, which used the pretrained ResNet34 from torchvision, like Chen et al. (2021).

6 RELATED WORK

NN Knowledge Distillation. Following Hinton et al. (2015), there has been much interest in expanding KD in NNs (Romero et al., 2014; Zagoruyko & Komodakis, 2016; Passalis & Tefas, 2018; Zhang et al., 2018; Yu et al., 2019; Chen et al., 2020a; Tian et al., 2020). Most similar to FKD are Park et al.
(2019); Tung & Mori (2019) who also use relations between inputs to distil knowledge (albeit not from the feature kernel learning perspective and without our theoretical justification), as well as Qian et al. (2020) who focus on reducing computational costs of full-batch kernel matrix operations. App. D highlights in more detail the differences of FKD compared to previous pairwise feature kernel based KD methods. Allen-Zhu & Li (2020) made the first theoretical connection using the mechanisms of ensembling in NNs to explain the success of vanilla KD in NNs, which we extend for feature kernel based KD. Ensembling NNs. Ensembling NNs has long been studied for improving predictive accuracy (Hansen & Salamon, 1990; Krogh et al., 1995) with particular recent focus towards uncertainty quantification & Bayesian inference (Lakshminarayanan et al., 2017; Ovadia et al., 2019; Zaidi et al., 2020; Pearce et al., 2020; He et al., 2020; Wilson & Izmailov, 2020; Wenzel et al., 2020; D Angelo & Fortuin, 2021; Schut et al., 2021) and predictive diversity (Fort et al., 2019; D Amour et al., 2020). On the topic of Bayesian inference, the feature kernel has also been studied under the name of Neural Linear Model (Riquelme et al., 2018; Ober & Rasmussen, 2019), and extensions treating features h as inputs to standard Gaussian Process kernels are known by the name of Deep Kernel Learning (Wilson et al., 2016; Ober et al., 2021; van Amersfoort et al., 2021). NN Feature learning. A recent flurry of work has focussed on characterising & understanding the importance of feature learning in NNs (Chizat et al., 2019; Fort et al., 2020; Baratin et al., 2021; Lee et al., 2020; Aitchison, 2020; Ghorbani et al., 2020), fuelled in part by the development that wide NNs become (Neural Tangent) Kernel machines in certain regimes (Jacot et al., 2018; Lee et al., 2019; Yang & Littwin, 2021), thus forgoing feature learning. The consensus in these works is that there are gaps between NTK theory & practical NNs that cannot be explained without feature learning. However, Yang & Hu (2020) proved that feature-learning is still possible with infinitewidth NNs, and also that feature learning is equivalent to feature kernel learning in infinite-width NNs. This motivates our study of the feature kernel as a key object for distillation. Regularising the feature kernel to the true target covariance kernel was suggested by Yoo et al. (2021). 7 CONCLUSION We have theoretically shown that the feature kernel is a valid object for Knowledge Distillation (KD) in Neural Networks (NNs) by extending the analysis of Allen-Zhu & Li (2020), which focused on vanilla KD (Hinton et al., 2015). Further, we used our theoretical insights to motivate practical considerations when using feature kernels for distillation, such as using the feature correlation kernel & using feature regularisation, to improve on previous feature based KD methods. We term our approach Feature Kernel Distillation (FKD), and note that FKD is more widely applicable than vanilla KD, as it benefits from being agnostic to teacher and student prediction spaces. Experimentally, we have demonstrated that FKD is amenable to ensemble distillation as suggested by our theory, is able to transfer knowledge across similar datasets and that FKD outperforms vanilla KD & previous feature kernel based KD methods across a variety of architectures on CIFAR-100, and Image Net-1K. Limitations & future work. 
Though feature learning is central to our results, we stress that there are still gaps between our theory & practice to understanding NN ensembling & KD, demonstrated by the divergence between ensemble teacher & FKD in Fig. 4 for larger ensemble size. This could be due to: the multi-view data setting not being able to capture the full complexity of real-world data, the role of hierarchical feature learning between layers in a deep NN, or the importance of minibatching in stochastic gradient descent. Other future work could apply the multi-view data setting to analyse uncertainty quantification in NN ensembles, assess the impact of different FKD regularisation metrics in Eq. (3), or improve FKD further to compete with state-of-the-art KD methods. Published as a conference paper at ICLR 2022 Acknowledgements We thank Emilien Dupont, Yee Whye Teh, and Sheheryar Zaidi for helpful feedback on this work. Laurence Aitchison. Why bigger is not always better: on finite and infinite neural networks. In International Conference on Machine Learning, pp. 156 164. PMLR, 2020. Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. ar Xiv preprint ar Xiv:2012.09816, 2020. Aristide Baratin, Thomas George, C esar Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. In International Conference on Artificial Intelligence and Statistics, pp. 2269 2277. PMLR, 2021. Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001. Cristian Buciluˇa, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535 541, 2006. Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. Online knowledge distillation with diverse peers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3430 3437, 2020a. Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Crosslayer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7028 7036, 2021. Shuxiao Chen, Hangfeng He, and Weijie Su. Label-aware neural tangent kernel: Toward better generalization and local elasticity. Advances in Neural Information Processing Systems, 33, 2020b. L ena ıc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937 2947, 2019. Alexander D Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. ar Xiv preprint ar Xiv:2011.03395, 2020. Francesco D Angelo and Vincent Fortuin. Repulsive deep ensembles are bayesian. ar Xiv preprint ar Xiv:2106.11642, 2021. Yann Dauphin and Ekin Dogus Cubuk. Deconstructing the regularization of batchnorm. In International Conference on Learning Representations, 2021. URL https://openreview.net/ forum?id=d-Xz F81Wg1. Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1 15. Springer, 2000. Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. ar Xiv preprint ar Xiv:1912.02757, 2019. 
Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems, 33, 2020. Published as a conference paper at ICLR 2022 Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pp. 1607 1616. PMLR, 2018. Yan Gao, Titouan Parcollet, and Nicholas D. Lane. Distilling knowledge from ensembles of acoustic models for joint ctc-attention end-to-end speech recognition. volume abs/2005.09310, 2020. Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? ar Xiv preprint ar Xiv:2006.13409, 2020. Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10):993 1001, 1990. Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1010 1022. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 0b1ec366924b26fc98fa7b71a9c249cf-Paper.pdf. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8580 8589, 2018. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009. Anders Krogh, Jesper Vedelsby, et al. Neural network ensembles, cross validation, and active learning. Advances in neural information processing systems, 7:231 238, 1995. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017. Jaehoon Lee, Jascha Sohl-dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep Neural Networks as Gaussian Processes. In International Conference on Learning Representations, 2018. Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32:8572 8583, 2019. Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33, 2020. Wesley Maddox, Shuai Tang, Pablo Moreno, Andrew Gordon Wilson, and Andreas Damianou. 
Fast adaptation with linearized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 2737 2745. PMLR, 2021. Published as a conference paper at ICLR 2022 Alexander G de G Matthews, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Sample-thenoptimize posterior sampling for bayesian linear models. 2017. Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian Process Behaviour in Wide Deep Neural Networks. In International Conference on Learning Representations, volume 4, 2018. Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012. Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-dickstein. Bayesian deep convolutional networks with many channels are gaussian processes. In International Conference on Learning Representations, 2018. Sebastian W Ober and Carl Edward Rasmussen. Benchmarking the neural linear model for regression. ar Xiv preprint ar Xiv:1912.08416, 2019. Sebastian W Ober, Carl E Rasmussen, and Mark van der Wilk. The promises and pitfalls of deep kernel learning. ar Xiv preprint ar Xiv:2102.12108, 2021. Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32:13991 14002, 2019. Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3967 3976, 2019. Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 268 284, 2018. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019a. URL http://papers.neurips.cc/paper/9015-pytorchan-imperative-style-high-performance-deep-learning-library.pdf. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019b. Tim Pearce, Felix Leibfried, and Alexandra Brintrup. Uncertainty in neural networks: Approximately bayesian ensembling. In International conference on artificial intelligence and statistics, pp. 234 244. PMLR, 2020. Qi Qian, Hao Li, and Juhua Hu. Efficient kernel transfer in knowledge distillation. ar Xiv preprint ar Xiv:2009.14416, 2020. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 1177 1184, 2007. 
Published as a conference paper at ICLR 2022 Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, Ju-Chieh Chou, Sung Lin Yeh, Szu-Wei Fu, Chien-Feng Liao, Elena Rastorgueva, Franc ois Grondin, William Aris, Hwidong Na, Yan Gao, Renato De Mori, and Yoshua Bengio. Speech Brain: A general-purpose speech toolkit. 2021. Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. ar Xiv preprint ar Xiv:1802.09127, 2018. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. ar Xiv preprint ar Xiv:1412.6550, 2014. Lisa Schut, Edward Hu, Greg Yang, and Yarin Gal. Deep Ensemble Uncertainty Fails as Network Width Increases: Why, and How to Fix It. In Workshop on Uncertainty & Robustness in Deep Learning, 2021. Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations, 2019. Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations, 2020. URL https://openreview.net/ forum?id=Skgp BJrtv S. Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365 1374, 2019. Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, and Yarin Gal. On feature collapse and deep kernel learning for single forward pass uncertainty. ar Xiv preprint ar Xiv:2102.11409, 2021. Kees Van Den Doel, Uri M Ascher, and Eldad Haber. The lost honor of ℓ2-based regularization. In Large scale inverse problems, pp. 181 203. De Gruyter, 2013. Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. ar Xiv preprint ar Xiv:2006.13570, 2020. Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. ar Xiv preprint ar Xiv:2002.08791, 2020. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial intelligence and statistics, pp. 370 378. PMLR, 2016. Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pp. 3635 3673. PMLR, 2020. Greg Yang. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. ar Xiv preprint ar Xiv:1910.12478, 2019. Greg Yang and Edward J Hu. Feature learning in infinite-width neural networks. ar Xiv preprint ar Xiv:2011.14522, 2020. Greg Yang and Etai Littwin. Tensor programs iib: Architectural universality of neural tangent kernel training dynamics. ar Xiv preprint ar Xiv:2105.03703, 2021. Published as a conference paper at ICLR 2022 Boseon Yoo, Jiwoo Lee, Janghoon Ju, Seijun Chung, Soyeon Kim, and Jaesik Choi. Conditional temporal neural processes with covariance loss. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 12051 12061. PMLR, 18 24 Jul 2021. URL http://proceedings.mlr.press/v139/yoo21b.html. 
Lu Yu, Vacit Oguz Yazici, Xialei Liu, Joost van de Weijer, Yongmei Cheng, and Arnau Ramisa. Learning metrics from teachers: Compact networks for image embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2907 2916, 2019. Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. ar Xiv preprint ar Xiv:1612.03928, 2016. Sheheryar Zaidi, Arber Zela, Thomas Elsken, Chris Holmes, Frank Hutter, and Yee Whye Teh. Neural ensemble search for uncertainty estimation and dataset shift. ar Xiv preprint ar Xiv:2006.08573, 2020. Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3713 3722, 2019. Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4320 4328, 2018. Published as a conference paper at ICLR 2022 APPENDIX: FEATURE KERNEL DISTILLATION A FEATURE KERNEL DEPENDENCE WITH CROSS-ENTROPY LOSS Suppose, we have n data points with fixed feature extractor h = h(X) Rn p, trainable last layer weights W Rp C and targets Y Rn C. In Section 2, we described how with Mean Squared Error loss the trained NN is precisely kernel ridge regression using the feature kernel, when there is ℓ2 regularisation on the last layer weights. Hence, the feature kernel exactly determines the predictions. We now prove the same for cross entropy loss. That is to say, we wish to show that given the feature kernel k, the predictive probabilities/logits at the optimal W are independent of both W and h. We define the loss, for some regularisation λ > 0 (purely to enforce strong convexity) by i=1 CE(yi, hi W) + λ where hi R1 p is the ith row of h and CE(y, f) = log efy P c [C] efc , (8) is cross entropy loss. Proposition 1. Let feature extractor h( ), training inputs X, training targets Y , ℓ2 regularisation λ > 0 all be fixed. Suppose we are given a test point x with features h , then the test prediction logits h W at optimal W can be expressed solely in terms of feature kernel evaluations k( , )= h( ), h( ) . Proof. If we differentiate L(W) and set to zero we get: i=1 hi,j 1{yi = l} pi,l (9) where p Rn C is the result of applying softmax to each row of the logits h W . Let hi denote the extracted features for training input xi. Now recall that hi, hj = k(xi, xj) is the feature kernel evaluated at xi, xj. We multiply Eq. (9) by the test-data feature vector h to give: i=1 k(x , xi) 1{yi = l} pi,l (10) but h W are precisely the logits for test point x , and likewise pi,l is the vector of probability predictions at training point i. Hence, we could solve Eq. (10) numerically for logits/predictive probabilities given only the feature kernel (without h( ) or W ), and we see that the feature kernel once again determines logit/prediction probabilities at the optimal last layer parameters like for squared error loss, albeit this time implicitly for cross entropy loss. Published as a conference paper at ICLR 2022 B SETUP FOR TRAINING ON MULTI-VIEW DATA Before we prove Theorem 2, which demonstrates the generalisation benefits of FKD theoretically, we need to recall the data and training setup of Allen-Zhu & Li (2020) for completeness and selfcontainedness. 
Unless otherwise stated, everything in this section (App. B) is a simplified version of the setup in Allen-Zhu & Li (2020). Our results hold in the more general version too, but we present the setup in a simplified setting here for readability and convenience, without sacrificing the key messages and intuitions: our focus is for our theoretical and practical contributions to guide each other and align as much as possible, as opposed to e.g. maximising the generality of our theory. B.1 MULTI-VIEW DATA DISTRIBUTION Recall that we consider a C-class classification problem over P-patch inputs, where each patch has dimension d, so our inputs are described by x = (x1, . . . , x P ) (Rd)P . For simplicity, we take P = C2 and d = poly(C) for a large polynomial. Like Allen-Zhu & Li (2020), we use O, Θ, Ωto hide polylogarithmic factors in the number of classes, C, which we take to be sufficiently large. We assume that each class c [C] has exactly two attributes vc,1, vc,2 Rd, which are orthonormal for simplicity,7 such that: V = {vc,1, vc,2}c [C] is the set of all attributes. B.1.1 DATA GENERATING MECHANISM Let our data distribution for a data pair (x, y) D be defined as D = µDs + (1 µ)Dm, for multiview & single-view distributions Dm & Ds respectively. (x, y) D are generated as follows: 1. Sample y [C] uniformly at random. 2. Sample a set V (x) of attributes uniformly at random from {vc ,1, vc ,2}c =y each with probability s C and denote V(X) = V (x) {vy,1, vy,2} as the set of attribute vectors used in data x. We take s = C0.2. V (x) correspond to the ambiguous attributes present in x, such as the cat whose eyes look like car headlights. These V are crucial in our proofs as the FKD regularisation between ambiguous images ensures that the student learns attributes it would have otherwise missed. For each v V(x), pick Cp (where Cp is a global constant) many disjoint patches in [P] uniformly at random and denote this set as Pv(x). Denote P(x) = v V(x)Pv(x): all other patches p P(x) will contain noise only. 3.4. If x is single view, pick a value ˆl = ˆl(x) {1, 2} uniformly at random. ˆl corresponds to the attribute vy,ˆl that is present in x, with vy,3 ˆl missing. 5. For each p Pv(x) for some v V(x): xp = zpv + X v V αp,v v + ξp 7As d = poly(C) for a large polynomial, this isn t too far-fetched an assumption. Published as a conference paper at ICLR 2022 where αp,v [0, 1 C1.5 ] represents feature noise and ξp N(0, σ2 p Id) is independent random noise, with σp= 1 dpolylog(C). The coefficients zp 0 satisfy: (a) If x is multi view, When v {vy,1, vy,2}, (P p Pv(x) zp [1, 2) P p Pv(x) z4 p = 1 (11) Intuition: These conditions ensure that both attributes for class y are equally likely to be learnt, averaged over different random initialisations of parameters. When v V (x), (P p Pv(x) zp = 0.4 P p Pv(x) z4 p = Θ(1) (12) (b) If x is single view, When v=vy,ˆl, P p Pv(x) zp = 1 When v=vy,3 ˆl, P p Pv(x) zp = C 0.2 When v V (x), P p Pv(x) zp = Γ where Γ = 1 polylog(C) Intuition: This is where the single view name comes from, as C 0.2 1 we see that vy,3 ˆl is barely present in x. 6. For each p [P]\Pv(x): v V αp,v v + ξp for feature noise αp,v [0, 1 C1.5 ] and ξp N(0, 1 Cd Id) is independent random noise. One can think of the zero feature noise setting αp,v = 0 p, v for simplicity. 
But the general formulation above renders the problem unlearnable by linear classifier, as the maximum permissible feature noise across patches dominates the minimum possible signal: P C1.5 = C0.5 1 Remark It is possible to allow more relaxed assumptions, e.g. on zp, as in Allen-Zhu & Li (2020). Training data Recall we have D = µDs + (1 µ)Dm, so that a proportion (1 µ) of the data is multi-view. Our training data, ˆD, is N independent samples from D. Letting ˆD = ˆDm ˆDs denote the split into multi and single view training data. We let µ = 1 poly(C) and we suppose |N| = C1.2 that each label c appears at least Ω(1) in ˆDs. Published as a conference paper at ICLR 2022 0.4 0.2 0.0 0.2 0.4 x Figure 5: Comparison between Re LU & Re LU, for ϱ = 0.2. B.2 SMOOTHED RELU CNN Our theoretical analysis considers a single hidden layer CNN with sum-pooling, f, such that for class c: p=1 Re LU( θc,r, xp ), c [C]. For a threshold ϱ = 1 polylog(C), we define the smoothed Re LU function as: 0 if z 0 z4 4ϱ3 if z [0, ϱ] z 3 which has continuous monotonic gradient denoted as Re LU . B.3 HYPERPARAMETER VALUES IN SETUP Hyperparameter values used in the theoretical setup are given in Table 3. B.4 STANDARD TRAINING We define our empirical loss by: i [N] L(f(xi), yi) where L is the cross entropy loss defined in Eq. (8). We randomly initialise parameters θ0 c,r i.i.d. N(0, σ2 0) for σ2 0 = 1 Standard training in the theoretical analysis comprises of full-batch gradient descent on the empirical loss L with learning rate η 1 poly(C) and for T = poly(C) η iterations. C PROOF OF THEOREM 2 We restate Theorem 2, where recall E denotes the teacher ensemble size and C is the number of classes: Published as a conference paper at ICLR 2022 Table 3: Hyperparameter values used in our theoretical analysis, corresponding to the setup of Allen-Zhu & Li (2020). (*) denotes undefined in our presentation, but appearing in Allen-Zhu & Li (2020), because the hyperparameter takes only one value in this work. We note that m is restricted to be polylog(C) in the setting of Theorem 2. Hyperparameter Description Value(s) N Training set size C1.2 µ µ Proportion of single-view data 1 poly(C) d Input patch dimension poly(C) P Number of patches per input C2 m Number of channels per class [polylog(C), C] s Out-of-class attribute sparsity C0.2 σ0 Parameter initialisation standard deviation (std) C 0.5 σp Input patch additive noise std 1 dpolylog(C) Γ Out-of-class attribute strength in single-view data. 1 polylog(C) ϱ Re LU threshold 1 dpolylog(C) q (*) Re LU mid-section exponent 4 ρ (*) In-class weaker attribute strength in single-view data C 0.2 Theorem 2 (FKD improves student generalisation and is better with larger ensemble). Given an arbitrary ϵ > 0. For any ensemble size E of teacher NNs trained as in Theorem 1 and sufficiently many classes C, for m = polylog(C), with learning rate η 1 poly(C), and training time T = poly(C) ensemble teacher knowledge can be distilled into a single student model f (T ) using only teacher feature kernel k T , Eq. (7), such that with probability at least 1 e Ω(log2(C)): Training accuracy is perfect: For all (x, y) ˆD, y = argmaxc f (T ) c (x). Test accuracy is good: P(x,y) D y = argmaxc f (T ) c (x) ( 1 2E+1 + ϵ)µ. Outline of proof of Theorem 2 1. We first analyse the feature kernel k for a single trained model, in App. C.1. 2. We then extend our results to an ensembled teacher model feature kernel in App. C.2. 3. 
We next outline our kernel distillation training scheme and provide a key result concerning how the student s parameters become increasingly correlated with the teacher s learnt views in Apps. C.3 and C.4 4. We combine all threads to prove the final result in App. C.5. C.1 FEATURE KERNEL FOR A SINGLE TRAINED MODEL To prove Theorem 2, we first calculate the feature kernel k for a single trained model with trained parameters θ from initialisation θ0. The point of this exercise is to show that the feature kernel k is able to detect whether two inputs x, x share a common attribute which is in the subset of attributes that was learnt by the trained parameters θ , due to random correlation with initialised parameters θ0. This is formally shown in Lemma 1. Published as a conference paper at ICLR 2022 Recall the definition of the feature kernel k for our CNN architecture: p,p =1 Re LU( θc,r, xp ) Re LU( θc,r, x p ) Let us first make a few more useful definitions. For l = 1, 2, we have: Definition 2. Zc,l(x) def = 1{vc,l V(x)} X p Pvc,l(x) zp. Zc,l(x) is a scalar constant describing the strength, in x, of the presence of attribute vc,l (where c may either correspond to true class y or to a different class; the latter has the effect of producing soft labels when using vanilla knowledge distillation). Definition 3. Φ c,l def = X r [m] [ θt c,r, vc,l ]+. Definition 4. Υ c,l def = X r [m] [ θ c,r, vc,l ]+2. Intuition: Υ c,l, Φ c,l are both parameter-dependent, data-independent scalars that describe the amount that attribute vc,l has been learnt by the network parameters. Υ c,l appears in the feature kernel whereas Φ c,l appears in the function predictions fc directly, for our particular architecture in Eq. (4) (Allen-Zhu & Li, 2020). Lemma 1. We have, for x, x ˆDm, k (x, x ) = X (c,l):(c,3 l)/ M Υ c,l Zc,l(x)Zc,l(x ) O s M(x, x ) polylog(C) + O( 1 C0.8 ) where s M(x, x ) = |{(c, l) : vc,l V(x) V(x ) and (c, 3 l) / M}| is the number of shared attributes of x and x , that are also members of the set of attributes learnt by the single trained model with parameters θ , and M is defined in Fact A.e below. Proof. We have k (x, x ) = p=1 Re LU( θ c,r, xp ) p =1 Re LU( θ c,r, x p ) . First, define the hidden-layer activations for class c and channel r to be: Definition 5. Ψc,r(x) def = p=1 Re LU( θ c,r, xp ). Published as a conference paper at ICLR 2022 We know from Theorem C.2. of Allen-Zhu & Li (2020) that, for multi-view x ˆDm and some c [C]: Fact A.a For every p Pvc,l(x) and for l = 1, 2, we have: θ c,r, xp = θ c,r, vc,l zp o(σ0) Fact A.b For every p P(x)\(Pvc,1(x) Pvc,2(x)), we have | θ c,r, xp | O(σ0) Fact A.c For every p [P]\P(x), we have | θ c,r, xp | O( σ0 Fact A.d For every r [m]\M0 c, every l [2], it holds that θ c,r, vc,l O(σ0), where: M0 c def = r [m] l [2] : θ0 c,r, vc,l (1 O( 1 log(C))) max r [m][ θ0 c,r, vc,l ]+ . Note from Proposition B.1 of Allen-Zhu & Li (2020), that m0 def = |M0 c| = O(log5(C)) with probability at least 1 e Ω(log5(C)). M0 c denotes the key channels in [m] which have won the lottery and are relevant for class c, in that in the C limit the predictions for fc are the same as if one forgets the other channels, as shown in Allen-Zhu & Li (2020). 
Fact A.e For every p Pvc,l(x) and r [m], if (c, 3 l) M, then we have: | θ c,r, xp | O(σ0), where: M def = (c, l) [C] [2] max r [m][ θ0 c,r, vc,l ]+ 1 + 1 log2(m) max r [m][ θ0 c,r, vc,3 l ]+ Intuition: M denotes the data attributes vc,l which are more likely to be learnt by the NN parameters (compared to their fellow class attributes vc,3 l) because of correlations of the initial parameters with such attributes. So, Fact A.e is saying that if (c, 3 l) M, then the attribute (c, l) is not learnt at all during standard single model training. l=1 1{vc,l V(x)} X p Pvc,l(x) Re LU θ c,r, xp (13) l{Pvc,l(x)} Re LU O(σ0) (14) p [P ]\P(x) Re LU O( σ0 First, note that |Pvc ,l (x)| = Cp is constant c , l . From Fact A.b, Eq. (14) can be easily seen to be O(σ4 0s) = O(C 1.8) as σ2 0 = 1 C and s = C0.2, and likewise Eq. (15) can be seen to be O(( σ0 C )4P) = O(C 2) as P = C2, by Fact A.c. We note here that summing these equations over m and C will be bounded above by O( 1 polylog(C)). Published as a conference paper at ICLR 2022 Now, let s consider Eq. (13). Let notation vc,1, vc,2 S denote either vc,1 or vc,2 S for some set S, and vc,1, vc,2 / S denote neither vc,1 nor vc,2 S. There are three cases to consider: 1. If vc,1, vc,2 / V(x), then Eq. (13) is zero. 2. Else if l [2] such that vc,l V(x) and (c, 3 l) M, then by Fact A.e, we have Eq. (13) is O(σ4 0)= O(C 2). This setting is when only attribute (c, l) appears in x, but attribute (c, 3 l) was dominant at initialisation so (c, l) has not been learnt by θ . 3. Else, we have that: l=1 1{vc,l V(x)} X p Pvc,l(x) Re LU θ c,r, xp . Putting this all together with Fact A.a, we see that l=1 1{(c, l) s M(x, x)} X p Pvc,l(x) Re LU θ c,r, vc,l zp + o(σ0) + O( 1 C1.8 ). Now, if r / M0 c (i.e. neuron r is not dominant at initialisation), and (c, 1), (c, 2) s M(x, x), the by Fact A.d, we have that Ψc,r(x)= O(C 1.8). On the other hand, recall ϱ = 1 polylog(C) and m0 = O(log5(C)). And also, recall that if z > ϱ, then Re LU(z) = z + O(ϱ). Moreover, by Claim C.11 of Allen-Zhu & Li (2020) we know that r , l such that θ c,r , vc,l = Ω( 1 m0 ). Hence, for any r M0 c, we know that either: 1. Ψc,r(x) = P2 l=1 1{(c, l) s M(x, x)} θ c,r, vc,l Zc,l(x) O(ϱ) + O( 1 C1.8 ), or 2. Ψc,r(x) = 1{(c, 1), (c, 2) s M(x, x)}O(ϱ) + O( 1 C1.8 ). Moreover, Ψc,r(x) = O(1) r, by e.g. Lemma C.21 of Allen-Zhu & Li (2020). And so it can be seen, for ϱ small enough we have: r=1 Ψc,r(x)Ψc,r(x ) l=1 1{(c, 3 l) / M}Zc,l(x)Zc,l(x )[ θ c,r, vc,l ]+2 O( 1 polylog(C)) + O( 1 C1.8 ) l=1 1{(c, 3 l) / M}Zc,l(x)Zc,l(x )[Υ c,l O( 1 polylog(C)) + O( 1 C1.8 ) Published as a conference paper at ICLR 2022 where we also use Fact A.e above, such that it is not possible for both θ c,r, vc,l + and θ c,r, vc,3 l + to be large.8 Note also from e.g. the proof of Allen-Zhu & Li (2020) Theorem 1, that if (c, 3 l) / M, then maxr θ c,r, vc,l = Θ(1). Thus, we see that the contribution to k from the m class-c channels/neurons is: kc(x, x ) def = r=1 Ψc,r(x)Ψc,r(x ) (16) = Θ(1) if (c, 1), (c, 2) s M(x, x ) O( 1 C1.8 ) else (17) Summing kc over [C] completes the proof of the lemma. We now use Lemma 1 to analyse k and ρ (x, x ) = k (x,x ) k (x,x)k (x ,x ). First we look at Υ c,l. Recall that θ c,r, vc,l = O(1) for all c [C], l [2], r [m], and likewise so is M0 c. More specifically, If (c, l) M, then we have from the proof of Theorem 1 in Allen-Zhu & Li (2020) that P r θ c,r, vc,l + Ω(log(C)), and so maxr θ c,r, vc,l + = Θ(1). 
If neither (c, 1) nor (c, 2) are in M, then Claim C.10 of Allen-Zhu & Li (2020) shows us that both maxr θ c,r, vc,1 +, maxr θ c,r, vc,2 + are Θ(1). Combining these facts, we have that: Υ c,l = Θ(1) if (c, 3 l) / M and so from Lemma 1, we have: k (x, x ) = X (c,l):(c,3 l)/ M Υ c,l Zc,l(x)Zc,l(x ) O s M(x, x ) polylog(C) + O(C 0.8) (18) (c,l):(c,3 l)/ M Zc,l(x)Zc,l(x ) O s M(x, x ) polylog(C) + O(C 0.8) (19) = Θ(s M(x, x )) O s M(x, x ) polylog(C) + O(C 0.8). (20) Finally, we arrive at an expression for the correlation kernel ρ of a trained single model by ρ (x, x ) = Θ s M(x, x ) p s M(x)s M(x ) 1 + O( 1 polylog(C)) + O( 1 C0.8 ) (21) where we define that s M(x) = s M(x, x) is the number of attributes in x that have been learnt by the trained network θ , and is s(1 o(1)) with high probability. 8Unless neither (c, 1) nor (c, 2) M for a given c, but that only occurs in o(C) classes, and does not change the order of e.g. s M(x, x ) which is what we really care about. Published as a conference paper at ICLR 2022 Compare ρ in Eq. (21) to the soft probability labels pτ RC of Allen-Zhu & Li (2020) (Claim F.4) with temperature τ = 1 log2(C): ( 1 s(x) if vc,1 or vc,2 is in V(x) 0 else where s(x) is the number of indices c [C] such that vc,1 or vc,2 is in V(x). Note, the setting of Allen-Zhu & Li (2020) is with a large Ω(1) ensemble, so every attribute is learnt (akin to M being empty for us). In the case that M = {} being empty, if x = x and they share at least one feature, then from App. B.1.1 with high probability they will share exactly one feature, so that s M(x, x ) = 1. Moreover, s(x) = s M(x) s M(x ), hence we see that Eq. (21) matches roughly with pτ c(x), but without the need for a temperature hyperparameter. We see that vanilla KD learns new attributes by comparing a single data point x between classes, and giving larger target labels to the classes where ambiguous attributes learnt by the teacher are present in x. On the other hand, ρ gives higher values to data pairs x, x that share attributes that have been learnt by the trained model, and as we will see later this is how FKD learns new attributes in the student. C.2 ENSEMBLED TEACHER To summarise what we have done so far in App. C.1, we have seen in Eqs. (20) and (21) that it is possible, for a single trained model θ , to simplify both the feature kernel k (x, x ) and correlation kernel ρ (x, x ) in terms of the number of shared attributes between x, x which are also learnt by the trained model. The set of attributes learnt by the single trained model is captured by the set M = (c, l) [C] [2] max r [m][ θ0 c,r, vc,l ]+ 1 + 1 log2(m) max r [m][ θ0 c,r, vc,3 l ]+ where θ0 was the random parameter initialisation for θ . From Fact A.e, we know that if (c, 3 l) M, then the attribute vc,l has not been learnt by the network. Consider now an ensemble of E = Θ(1) independently trained networks, {θ e}E e=1, with an averaged feature kernel: k T (x, x ) = 1 e=1 k e(x, x ). Suppose {θe,0}E e=1 denotes the corresponding independent parameter initialisations. Then, for each e [E], let us define: Me def = (c, l) [C] [2] max r [m][ θe,0 c,r, vc,l ]+ 1 + 1 log2(m) max r [m][ θe,0 c,r, vc,3 l ]+ . Note that these Me are completely independent sets due to the independent initialisations, and also by Proposition B.2 of Allen-Zhu & Li (2020), we know that P (c, 1) or (c, 2) Me 1 o(1) c [C], e [E]. Therefore, C |Me| C(1 op(1)) e [E]. Published as a conference paper at ICLR 2022 Moreover, Eq. 
(11) & Proposition B.2 of Allen-Zhu & Li (2020) tell us that each of the two attributes vc,1, vc,2 are equally likely to be in Me (and so learnt in the multi-view setup), This means that: e=1 Me| = 1 2E 1 C(1 op(1)). Define MT = TE e=1 Me. From Eq. (19) and the definition of k T , we see that: k T (x, x ) = Θ(1) X (c,l) 1{(c, 3 l) / MT }Zc,l(x)Zc,l(x ) O s MT (x, x ) polylog(C) + O(C 0.8) = Θ(s MT (x, x )) O s MT (x, x ) polylog(C) + O(C 0.8) (23) where for the reader s convenience, we redefine: s MT (x, x ) = {(c, l) : vc,l V(x) V(x ) and (c, 3 l) / MT }. We see that only for those attributes (c, l) such that (c, 3 l) MT does the ensembled teacher k T miss the fact that we should have a strong Θ(1) kernel value between x, x . This is when |V(x) V(x )| = {vc,l} is non-empty (or in other words, when s MT (x, x ) =|V(x) V(x )|), and hence there should be a large kernel value k T (x, x ). So we see that only |MT | of the attributes are not learnt by the teacher, which is a fraction |MT | 2C = 1 2E (1 o(1)) of all the attributes. These missed attributes are where the 1 2E+1 test error in Theorem 2 comes from (teacher ensemble of size E, and plus 1 for the attributes learnt from the student s initialisation too). What s more, we can decompose the teacher s feature kernel k T = P c k T c into contributions k T c from each class c, like in Eq. (16). From Eq. (17), we see that the contribution to the teacher s feature kernel from class , for x = x ; k T c (x, x ) = Θ(1) if (c, 1), (c, 2) s MT (x, x ) O( 1 C1.8 ) else, (24) is able to decipher between whether or not x, x share an attribute from class c for all attributes apart from those (c, l) such that (c, 3 l) MT . C.3 TRAINING SCHEME FOR FKD We note at this point that we are morally done in terms of proving Theorem 2, with Eqs. (23) and (24), our key results telling us that the (ensemble) teacher kernel k T can identify when two inputs share common attributes that have been learnt by the (ensemble) teacher, and more specifically that k T c can do so when said common attribute is from class c. What remains is a repackaging of the proof techniques of Allen-Zhu & Li (2020) (particularly for their Theorem 4 regarding self-distillation), that knowledge distillation (this time only using feature kernels instead of temperature-scaled logits, and with explicit dependence on teacher ensemble size) can improve generalisation performance of a student. Published as a conference paper at ICLR 2022 For convenience, the theoretical analysis of Allen-Zhu & Li (2020) introduces some slight discrepancies between the actual practical weight updates of vanilla KD Hinton et al. (2015), i.e. the gradients of: L = L + λ 1 τ , f T (xi) and the weight updates in their theoretical exposition. Namely, 1. The authors assume that a temperature-dependent threshold caps the logits to give soft labels: pτ c(x) = emin{τ 2fc(x),1}/τ P j [C] emin{τ 2fj(x),1}/τ . 2. The authors truncate the negative part of the gradient of the KD regularisation to only encourage logits to increase not decrease, with weight updates for θc,r on input x: θt c,r def = θt c,r θt+1 c,r θc,r L + η 1 pτ c(x) pτ,T c (x) θc,rfc(x) where pτ,T are the temperature-scaled teacher labels. 3. The authors scale the output of both student and teacher models by a (polylogarithmic) factor, in order to ensure that both reach the threshold to give soft labels in Item 1. above. 4. 
Self-distillation (Furlanello et al., 2018; Zhang et al., 2019) distils a single teacher and a single student of same architecture into the student, like an ensemble of size 2 (student+teacher). Allen-Zhu & Li (2020) modify the training scheme for their theoretical analysis of self-distillation so that the student is first trained on its own in order separate learning its own attributes/features from those of the teacher. Our analysis covers a similar scheme. These modifications are justified in that they make the theoretical analysis more convenient, whilst illustrating the main mechanisms by which KD works, which is to share dark knowledge that is held in the teacher (in the form of the multi-view attributes that have been acquired by the teacher due to its parameter initialisation), with the student. In the same vein, we now introduce some modifications to the practical implementation of FKD we propose in Alg. 1 to aid our theoretical analysis, and describe the main mechanisms by which FKD works, corroborating our initial analyses in Section 2 and App. A about how the feature kernel is a crucial object in any NN and captures all the dark knowledge that a teacher network could possess in the multi-view data setting. It is likely possible to extend our proof of Theorem 2 with different modifications/training schemes, but given that the focus of this work is to introduce FKD as a principled alternative to vanilla KD with certain advantages such as prediction-space independence, and that the multi-view setting we consider is a plausible simplification of real world data (as demonstrated in Allen-Zhu & Li (2020)), we leave this to future work. We stress that any simplifications to the update rule in Alg. 1 for this section can be efficiently computed, only requiring access to pairwise evaluations of the student and teacher feature kernels, if need be. Modified training regime for FKD 1. We first suppose that the student is trained as standard (as in App. B.4) for T1 = poly(C) η steps, and learns its own subset of attributes MS, dependent on its initialisation θ0 s, before being trained with the FKD objective: Published as a conference paper at ICLR 2022 Intuition: This mirrors the self-distillation setup of Allen-Zhu & Li (2020) Theorem 4. The idea being that the student first learns MS before picking up the other attributes that the teacher has access to. 2. For a given feature kernel k, we threshold the feature kernel k based on value, to define a modification k such that k(x, x ) = 1 if k(x, x ) 1 m2 0 else This condition delineates between the setting where x, x share common attributes learnt by student parameters θT1 in the initial phase of training (i.e. delineates between whether s Ms(x, x ) nonempty or empty). To see this: note that if vc,l V(x) V(x ) and (c, 3 l) / MS then we know from Allen-Zhu & Li (2020) that ΦT1 c,l Ω(log(C)). Hence maxr θ c,r, vc,l Ω(log 4(C)) as the number of active neurons m0 = |M0 c| = O(log5C), and so it s easy to see that for large enough m we have k(x, x ) 1 m2 via Lemma 1. On the other hand, if {vc,l} = V(x) V(x ) and (c, 3 l) MS then from Lemma 1 we know that k(x, x ) = O(C 0.8) 1 m2 . 3. Similar to Allen-Zhu & Li (2020), we also truncate our FKD regularisation to only encourage kernel values to increase, and not decrease. 
For any input pair x1, x2, we have parameter update: θc,r(x1, x2) kc(x1, x2) k T c (x1, x2) X j {1,2} Ψc,r(xj) θc,rΨc,r(x3 j) where recall Ψc,r(x) def = p=1 Re LU( θc,r, xp ) so that θc,rΨc,r(x) = p=1 Re LU ( θc,r, xp )xp kc(x, x ) def = r=1 Ψc,r(x)Ψc,r(x ) were defined in Definition 5 and Eq. (16). Intuition: If the loss was (kc(x1, x2) k T c (x1, x2))2 then the gradient with respect to θc,r would be: kc(x1, x2) k T c (x1, x2) X j {1,2} Ψc,r(xj) θc,rΨc,r(x3 j) so the only differences with Eq. (25) are truncating kc(x1, x2) k T c (x1, x2) and also the thresholding to obtain k. Published as a conference paper at ICLR 2022 To summarise, after training the student on its own for T1 steps (such that we are in the setting of Theorem 1) to reach parameters θT1 s , we update for T2 = poly(C) η steps as (hiding S subscript): θt c,r= ηEx1,x2 ˆ D2 kc(x1, x2) k T c (x1, x2) X j {1,2} Ψc,r(xj) θc,rΨc,r(x3 j) (26) C.4 FEATURE CORRELATION GROWTHS We now seek to analyse to what extent the attributes {vc,l}c,l are learnt during our T2 FKD training steps. The central objects describing how much vc,l has been learnt by parameters θ are: Φt c,l def = X r [m] [ θt c,r, vc,l ]+ and Φt c def = X l [2] Φt c,l as well as Ψc,r(x) as defined above. Intuition: Φc,l is a data-independent quantity that reflects the strength of correlation with feature vc,l by parameters θ. On the other hand, Ψc,r(x) is a data-dependent quantity that reflects the activation of channel r for class c with input x. Also recall that Zc,l(x) def = 1{vc,l V(x)} X p Pvc,l(x) zp and define Vc,r,l(x) (which is convenient for calculating the size of gradient updates for θc,r): Definition 6. Vc,r,l(x) def = 1{vc,l V(x)} X Re LU ( θc,r, xp )zp Like how Lemma 1 simplified the feature kernel in terms of data-dependent Zc,l(x) and dataindependent Υc,l, we have a result from Allen-Zhu & Li (2020) to simplify function predictions fc in terms of Zc,l(x) and Φc,l: Claim 1 (Claim F.7 from Allen-Zhu & Li (2020)). For every t T1 + T2, every c [C], every (x, y) ˆD (or every test sample (x, y) D with probability 1 e Ω(log2(C))): f t c(x) = X Φt c,l Zt c,l(x) O( 1 polylog(C)) We also have the following facts from Allen-Zhu & Li (2020) regarding the correlation of gradient θc,rΨt c,r(x) with vc,l for (x, y) ˆD, l [2] and r [m]: Claim 2 (c.f. Claim F.6 of Allen-Zhu & Li (2020)). For every t T1 + T2, for every (x, y) ˆD, every c [C], r [m] and l [2]: If vc,1, vc,2 V(x), then θc,rΨt c,r(x), vc,l Vc,r,l(x) O(σp P) θc,rΨt c,r, vc,l 1{(vc,l V(x)}Vc,r,l(x) + O(C 2) For every i = c, | θc,rΨt c,r(x), vi,l | O(C 1.5) Published as a conference paper at ICLR 2022 Disclaimer Technically, Allen-Zhu & Li (2020) only show Claims 1 and 2 for t T1 and one would need to use similar proof techniques (such as their inductive hypothesis F.1) to show the case for T1 t T2, which we skip for conciseness. We now study the growth of the student s Φc,l, for those (c, l) which have been learnt by the teacher but not the student: Lemma 2 (Correlation Growth for attributes learnt by teacher). For every c [C], l [2], T2 t T1, such that (c, 3 l) / MT , suppose Φt c,l 1 2m, then we have: Φt+1 c,l Φt c,l + Ω(ηs2 C2 ) Φt c,l 4 Re LU (Φt c,l) Proof. For any c [C], r [m], l [2], we have from Claim 2 θt c,r, vc,l =ηEx1,x2 ˆ D2 k T c (x1, x2) kc(x1, x2) + X j {1,2} Ψc,r(xj) Vc,r,l(x3 j) O(σp P) Note that as µ 1 poly(C), we can suppose that both x1, x2 are multi-view data. 
Using Claim 2, we have that θt c,r, vc,l ηEx1,x2 ˆ D2 k T c (x1, x2) kc(x1, x2) + X j {1,2} Ψc,r(xj) Vc,r,l(x3 j) O(σp P) Let r = argmaxr [m]{ θt c,r , vc,l }, such that definitely θt c,r, vc,l Ω(Φt c,l) because m = polylog(C). But if x is multi-view and vc,l V(x) such that P p Pvc,l z4 p = Θ(1), and also by Fact A.a we have that: Vc,r,l(x) Ω(1) Re LU θt c,r, vc,l Ω Re LU (Φt c,l) Re LU( θt c,r, vc,l zp o(σ0)) Ω(Φt c,l 4) Now, we have assumed that (c, 3 l) / MT , such that the teacher model has learnt attribute (c, l) and satisfies k T c (x, x ) = 1 when vc,l V(x) V(x ) (for large enough polylogarithmic m). Also, it is simple to see that when vc,l V(x) V(x ) & vc,3 l / V(x) V(x ), for large enough m, that Φt c,l 1 2m implies that kc(x, x ) 1 m2 by Lemma 1, and so kc(x, x ) = 0, i.e. the student has not (yet) learnt vc,l . So we see there are two more conditions that must be satisfied in order for k T c (x1, x2) kc(x1, x2) +=1 > 0: 1. vc,l V(x1) V(x2) so that k T c (x1, x2) = 1 2. vc,3 l / V(x1) V(x2) so that kc(x1, x2) = 0. Going back to App. B.1.1, we know that these conditions occur with probability s2 C2 (1 o(1)) for independently sampled x1, x2 ˆD. Finally, putting everything together we have that: θt+1 c,r , vc,l + θt c,r, vc,l + Ω(ηs2 C2 )Φt c,l 4 Re LU (Φt c,l) Published as a conference paper at ICLR 2022 summing over r [m] and noting θt c,r , vc,l 0 r , up to small error (as σp P = 1 poly(C) for a large polynomial), gives us our result. Lemma 2 immediately gives us the following corollaries, because ΦT1 c,l Ω(σ0) from Allen-Zhu & Li (2020) Induction Hypothesis F.1.g and Re LU is increasing. Corollary 1. Define iteration threshold T2 = Θ( C2 ηs2σ7 0 ) = Θ( C5.1 η ), then for every (c, l) such that (c, 3 l) / MT we have: ΦT1+T2 c,l 1 4m But likewise, we can also bound the growth of Φc,l Lemma 3. If (c, 3 l) MS\MT , once Φt c,l q m , it no longer gets updated (for large C): Proof. Recall Definitions 3 and 4 that Φ c,l= P r [m][ θt c,r, vc,l ]+ and Υt c,l= P r [m][ θt c,r, vc,l ]+2 Hence by Cauchy-Schwarz we have: Φt c,l 2 mΥt c,l = Υt c,l loglog(C) m2 2 0.42m2 for large enough C. Thus, if x1, x2 are both multi-view and vc,l V(x1) V(x2), by Lemma 1 we must have kc(x1, x2) 1 m2 It s also not difficult to check that any other possible setting for x1, x2, and V(x1) V(x2) will lead to k T c (x1, x2) kc(x1, x2) + = 0, and hence k T c (x1, x2) kc(x1, x2) + = 0 x1, x2 C.5 WRAPPING UP PROOF OF THEOREM 2 Proof. We are now ready to wrap up our proof. Recall from the proof of Theorem 1 in Allen-Zhu & Li (2020), that after the initial phase of T1 steps of student training on its own: ΦT1 c Ω(log(C)) c [C], and more specifically: If (c, 3 l) / MS, then ΦT1 c,l Ω(log(C)) This gives us perfect test accuracy on the multi-view data, and 50% accuracy on the singleview data, so 0.5µ test accuracy overall without distillation, as per Theorem 1. Published as a conference paper at ICLR 2022 Moreover, we have that if (c, 3 l) / MT and (c, 3 l) MS, then by Corollary 1 and Lemma 3: r m ΦT1+T2 c,l 1 4m loglog(C) ΦT1+T2 c,l 1 4m We see that this change in ΦT1+T2 c,l after FKD training is much smaller than Ω(log(C)) and so the student after FKD training still has perfect multi-view accuracy, as well as correct predictions on any single-view data that possess the attributes learnt in the initial phase of training. On the other hand, for single-view data, we know that if we have data point x, y and attribute vc,l V(x), such that c = y, then P p Pvc,l(x) zp = Γ = O( 1 polylog(C)), as defined in App. B.1.1. 
Hence for small enough Γ ( 1 m) we have that if (c, 3 l) / MT and the single view data x is of class c with ˆl(x) = l then we have correct prediction, as per Claim 1. Combining these means that we have correct prediction for any single-view data x of class c, and ˆl(x) = l such that (c, 3 l) / MT Ms. By the independence of these sets we have that |MT Ms| = (2 E)k(1 o(1)) this means we have test error less than (2 E 1 + ϵ)µ for any ϵ > 0, for large enough C as required. D DIFFERENCES BETWEEN FKD & OTHER FEATURE KERNEL BASED KD METHODS In this section, we highlight how our FKD approach overcomes some of the shortcomings of previous feature kernel based KD methods which only use pairwise evaluations of the feature kernel: SP (Tung & Mori, 2019) and RKD (Park et al., 2019). One advantage of FKD relative to these previous works is that we have shown FKD is amenable to ensemble distillation. Moreover, it goes without saying that Feature Regularisation, which arises naturally thanks to our feature kernel learning perspective in Section 4, is already a significant departure that improves FKD relative to SP & RKD. However, even without FR we observe in Section 5 that FKD outperforms both RKD & SP across different datasets and architectures, which warrants explanation. D.1 IMPORTANCE OF ZERO DIAGONAL DIFFERENCES: SP (TUNG & MORI, 2019) First, we consider diagonal kernel differences, k S(x, x) k T (x, x) for fixed x, and motivate using zero diagonal differences, which is not present in SP (Tung & Mori, 2019) but is in FKD thanks to our use of the correlation kernel. Fig. 6 displays this comparison between FKD & SP graphically. Intuition: Downside of non-zero diagonal differences The key intuition, which we detail below using our theoretical setup, is that non-zero diagonal differences k(x, x) k T (x, x) = 0 encourage the student to learn noise in input x, compared to when we have zero diagonal differences k(x, x) k T (x, x) = 0. In the latter case, we only have non-zero differences for k(x, x ) k T (x, x ) where x = x . Published as a conference paper at ICLR 2022 (kx, x kx, x )2 for SP (kx, x kx, x )2 for FKD Figure 6: Comparison of (normalised) squared differences in kx,x = k(x, x ) between student S & teacher T , across a minibatch of size 64 of CIFAR-100 training data, for SP (left) and FKD (right). We see that whereas FKD has zero diagonal differences, SP is largely dominated by non-zero diagonal differences. Note there is a slight abuse of notation here, in that we plot squared differences in normalised kernels, so that FKD uses the correlation kernel and SP uses row-normalisation (Tung & Mori, 2019). Diagonal updates Consider our parameter update Eq. (25) when x1 = x2 = x: If we didn t have zero diagonal differences and instead kc(x, x) k T c (x, x) = 1, then: θc,r(x, x) = 2ηΨc,r(x) p=1 Re LU ( θc,r, xp )xp, Now suppose v = vc,1 V(x). For each p Pv(x), recall (from App. B.1.1) that: xp = zpv + ξp, where we assume zero feature noise for simplicity. We then see that (c.f. Claim C.13 of Allen-Zhu & Li (2020)): θc,r(x, x), ξp = 2 Θ(η)Ψc,r(x) Re LU ( θc,r, xp ) + O( 1 as v, ξp = O( 1 d) with high probability. But at the same time (by e.g. Claim F.6 of Allen-Zhu & Li (2020)): θc,r(x, x), v = 2ηΨc,r(x)Vc,r,1(x)(1 o(1)) and note that that Vc,r,1(x) Θ(1) by definition. Moreover, in order for the student network to learn attribute vc,1, then eventually it must satisfy maxr θc,r , v = O(1) from Allen-Zhu & Li (2020). Thus, for ϱ small enough, if r = argmaxr θc,r , v , we have Re LU ( θc,r, xp ) = 1. 
Non-diagonal updates On the other hand, if x1 = x2 and we have v V(x1) V(x2): θc,r(x1, x2) = η X j=1,2 Ψc,r(xj) p=1 Re LU ( θc,r, x3 j,p )x3 j,p, and so if x1 = x and ξp denotes the random noise in x1,p, then: θc,r(x1, x2), ξp = Θ(η)Ψc,r(x2) Re LU ( θc,r, x1,p ) + O( 1 Published as a conference paper at ICLR 2022 as x2,p, ξp = O( 1 d). But this time we have θc,r(x1, x2), v = η Ψc,r(x1)Vc,r,1(x2) + Ψc,r(x2)Vc,r,1(x1) (1 o(1)), We see that whereas in diagonal updates θc,r(x, x) the increments for θc,r(x, x), ξp and θc,r(x, x), v are 1:1, for non-diagonal updates θc,r(x1, x2) they are 1:2 respectively. Thus, the parameter updates for θc,r with non-zero diagonal differences in feature kernels are more likely to learn noise, ξp, compared to our zero diagonal updates which rely only on non-diagonal θc,r(x1, x2) for x1 = x2. This is why we zero out diagonal differences for FKD, using the feature correlation matrix in practice. D.2 PROBLEM OF HOMOGENEOUS NNS IN RKD (PARK ET AL., 2019) The distance-wise version of RKD (Park et al., 2019) is as follows: for x, x , we calculate ψT (x, x ) = h T (x, θT ) h T (x , θT ) 2, where recall h T is the last-layer teacher feature extractor. Likewise, we also calculate ψS(x, x ) = h S(x, θS) h S(x , θS) 2. The RKD loss adds λKDEx,x [(ψT (x, x ) ψS(x, x ))2] to the student s training loss.9 While RKD (Park et al., 2019) does ensure zero diagonal differences, i.e. that ψT (x, x) ψS(x, x) = 0, it suffers from a related issue, due to the homogeneity of NNs that use Re LU nonlinearity, which is ubiquitous in image classification tasks. Suppose we take x and define x = Mx for some M > 0. For example, think of taking a cat image and multiplying all the pixel values by M. For Re LU (C)NNs without bias parameters, we have that h(x, θ) is 1-homogeneous: h(x , θ) = Mh(x, θ). This means that it is likely (depending on the norms of the features h S and h T ) that we will have ψT (x, Mx) ψS(x, Mx) = 0. But a cat image multiplied by some scalar M is still a cat image, hence RKD runs into the same problems as in App. D.1 of learning noise in x. On the other hand for FKD: correlation kernel ρS(x, Mx)=ρT (x, Mx)=1, hence ρS(x, Mx) ρT (x, Mx)=0 x Rd, M > 0. E PYTORCH-STYLE PSEUDOCODE FOR FKD In Alg. 2, we provide Py Torch-style Paszke et al. (2019b) pseudocode for the distillation and feature regularisation losses in FKD. We note that FKD only requires pairwise computations of feature (correlations) kernels. This alleviates the need for matrix multiplication/inversion operations with batch-by-batch size matrices, which is beneficial for scalability. F EXPERIMENTAL DETAILS AND FURTHER RESULTS F.1 FIG. 2: PREDICTIVE DISAGREEMENT ACROSS INDEPENDENT INITIALISATIONS VS RETRAINED LAST LAYER All models are Res Net20v1 trained with standard hyperparameters: 160 epochs training time with batch size 128 and learning rate 0.1 which is decayed by a factor of 10 after epochs 80 and 120. SGD optimiser with momentum 0.9 and weight decay of 0.0001. 9We do not consider the angle-wise RKD loss here, but there will be similar issues due to homogeneity. Published as a conference paper at ICLR 2022 Algorithm 2 Py Torch-style pseudocode for Feature Kernel Distillation (FKD). # B: Batch size. # L_FKD: FKD regularisation strength. # L_FR: Feature regularisation strength. # D_s: Student feature dimension. # D_t: Teacher feature dimension. 
# f_s: Student features B x D_s # f_t: Teacher features B x D_t # mm: matrix-matrix multiplication # Compute student feature correlation kernel matrix s_c s_k = mm(f_s, f_s.T) # B x B s_k_diag_inv_sqrt = torch.diag(s_k).pow(-1/2) s_k_diag_inv_sqrt = s_k_diag_inv_sqrt.reshape(-1, 1) # B x 1 s_c = s_k_diag_inv_sqrt * s_k * s_k_diag_inv_sqrt.T # B x B # Compute teacher feature correlation kernel matrix t_c with torch.no_grad(): t_k = mm(f_t, f_t.T) # B x B t_k_diag_inv_sqrt = torch.diag(t_k).pow(-1/2) t_k_diag_inv_sqrt = t_k_diag_inv_sqrt.reshape(-1, 1) # B x 1 t_c = t_k_diag_inv_sqrt * t_k * t_k_diag_inv_sqrt.T # B x B distil_loss = ((t_c - s_c).pow(2)).mean() feat_reg_loss = (f_s.pow(2)).mean() # FKD loss to be added to supervised loss loss_fkd = L_FKD * distil_loss + L_FR * feat_reg_loss CIFAR10 data is normalised in each channel such that the training data is zero mean and unit standard deviation. Random crops and horizontal flips used as data augmentation. All models are initialised with Kaiming initialisation He et al. (2015). Out of 10000 test points, the predictive disagreements between a reference model and either: independent initialisations (top row) or retrained last layers (bottom row) are depicted in Table 4. We see two clear trends. First, the retrained last layer has much fewer disagreements with the reference model than an independent initialisation model, highlighting the importance of the feature kernel. Secondly, the vast majority of disagreements between independent initialisations are where one of the models is correct. This reinforces our intuition/theoretical analysis that ensembling NN works because different initialisations bias the models to capture different useful features, and hence ensemble distillation (via feature kernels) can improve student performance. Table 4: Breakdown of predictive disagreements between reference and alternate models over 10000 CIFAR10 test points, in terms of which model (if any) was correct. Mean standard deviations over 3 independent initialisations for top row, and over 3 independent reference models for bottom row. All models achieved between 8.0%-8.5% test error. Alternate model Reference correct Alternate correct Neither correct Total disagreement Independent Init 350 14.6 383 15.9 124 3.7 857 29.8 Retrained LL 30 2.9 35 1.7 15 5.4 80 4.1 F.2 FIG. 3: ADDITIONAL FEATURE KERNEL HISTOGRAMS In Fig. 7, we provide additional plots to Fig. 3 that depict the difference in distribution (over x) of feature kernel values k(x, x), for FKD with and without Feature Regularisation (FR). Published as a conference paper at ICLR 2022 0.2 0.4 0.6 0.8 1.0 Normalised k(x, x) value Histogram Density Res Net32x4 Shuffle Net V1 0.0 0.2 0.4 0.6 0.8 1.0 Normalised k(x, x) value Res Net32x4 Res Net8x4 0.0 0.2 0.4 0.6 0.8 1.0 Normalised k(x, x) value Res Net50 VGG8 FKD with FR FKD w.o. FR Figure 7: Comparison of normalised k(x, x) values between FKD with & without Feature Regularisation (FR), across different Teacher Student architectures, on CIFAR-100 test set. We see, like in Fig. 3 that FR encourages a more even distribution of k(x, x) across x, for all architectures. Negative hypothesis. We originally hypothesised that FR could benefit FKD, in addition to balancing the distribution of k(x, x), as it could reduce the sparsity in the NN last-layer representation activations, which is consistent with the proofs of Theorems 1 and 2. 
Indeed, the NN predictions are dominated by a select few neurons who have won the lottery (Allen-Zhu & Li, 2020) on account of being most correlated with one of the attributes {vc;l}l [2],c [C] at random initialisation. In our analysis, as few as O(log5(m) out of m neurons could be inactive. From our feature kernel learning perspective, this seems like a highly undesirable phenomenon, because if k(x, x ) = h(x; θ), h(x ; θ) = PCm r=1 hr(x, θ)hr(x , θ) only has useful contributions from a dominant minority of r [Cm], then we are not utilising the full capacity of the model. FR seemed appropriate to reduce this sparsity as ℓ2 regularisation is known to promote non-sparse solutions (Van Den Doel et al., 2013). However, we experimentally found the opposite to our hypothesis: that FR trained FKD student had more inactive neurons relative to FKD students trained without FR. This again highlights that, while (we believe) our results in this work provide compelling evidence to highlight the validity of feature kernel based distillation, there are still gaps between our theory and practice, and further questions to be answered in future work. F.3 FIG. 4: ENSEMBLE DISTILLATION All individual VGG8 networks that made up the teacher ensemble were trained using the default training regime from Tian et al. (2020), with independent parameter initialisations. Indeed, all of our experiments in Section 5 used Tian et al. (2020) s excellent open-source Py Torch codebase (Paszke et al., 2019a).10 For student networks, we used the training regime for vanilla KD from Tian et al. (2020) for all ensemble sizes. For FKD, we used the hyperparameters from our Res Net50 VGG8 experiment in Table 2 for all ensemble sizes. F.4 TABLE 1: DATASET TRANSFER The VGG13 teacher checkpoint is provided by Tian et al. (2020). For both CIFAR-10 and STL10, all student networks are trained for 160 epochs with batch size 64 using SGD with momentum, with learning rate decays at epochs 80, 120, 150. The student trained without KD used default hyperparameters from Tian et al. (2020), which are indeed strong hyperparameters for standard training. For FKD, RKD (Park et al., 2019) and SP (Tung & Mori, 2019), we tuned the learning 10https://github.com/Hobbit Long/Rep Distiller Published as a conference paper at ICLR 2022 rate, learning rate decay, and KD regularisation strength λKD on a labeled validation set of size 5000 for CIFAR-10 and 1000 for STL-10, before retraining using best hyperparameters on the full training(+unlabeled) dataset. We also tuned the FR regularisation strength, λFR for FKD when FR was used. All RKD and SP hyperparameters were tuned in a large window around their default values from Tian et al. (2020), which were all author recommended. For FKD, we allowed λKD to range in [1,1000], and λFR to range in [0,20]. All hyperparameters sweeps were conducted using Bayes search. For STL-10, we used a batch size of 512 for all KD methods regularisation terms, compared to 64 for the standard cross-entropy loss. This was due to the fact that STL-10 has only 5K labeled datapoints, and we wanted to ensure that the student used as much of the unlabeled data as possible for each feature-kernel based KD method s additional regularisation term during 160 epochs of training. 512 batch size was the maximum power of 2 before we ran into memory issues on a 11GB VRAM GPU, which occured for the RKD method. Both CIFAR-10 and STL-10 data are normalised in each channel such that the training data is zero mean and unit standard deviation. 
Random crops and horizontal flips used as data augmentation. STL-10 images are downsized from 96x96 to 32x32 resolution. F.5 TABLE 2: CIFAR-100 AND IMAGENET COMPARISON CIFAR-100. All networks were trained for 240 epochs with batch size 64, with learning rate decay at epochs 150, 180, 210 using SGD with momentum. All teacher networks use the exact same checkpoints as provided by Tian et al. (2020). Learning rate, learning rate decay, λKD, and λFR (when used) were tuned as in App. F.4 on a validation of size 5000. All other hyperparameters were set to the default values used by Tian et al. (2020). The CIFAR-100 data is normalised in each channel such that the training data is zero mean and unit standard deviation, with random crops and horizontal flips used for data augmentation. All results provided denote the test set accuracy at the end of the 240 epochs of training. Image Net. The Image Net dataset (ILSVRC-2012) consists of about 1.3 million training images and 50,000 validation images from 1,000 classes. Each training image is extracted as a randomly sampled 224x224 crop or its horizontal flip without any padding operation. All teacher networks use the exact same checkpoints as provided by Chen et al. (2021). The initial learning rate is 0.1 and divided by 10 at 30 and 60 of the total 90 training epochs. We set the mini-batch size to 256 and the weight decay to 10 4. λKD, and λFR (when used) were tuned as in App. F.4 on a validation of size 5000 except using Bayes search. All results are reported in a single trial. All other hyperparameters were set to the default values used by Chen et al. (2021). All results provided denote the Top-1 test accuracy (%). Accuracy of baselines were reported in Tian et al. (2020). F.6 SENSITIVITY TO λKD In Fig. 8, we plot the sensitivity of FKD to the strength of the distillation regularisation λKD in Eq. (2) for the VGG13 VGG8 experiment on CIFAR-100. We see that a well tuned λKD( 300 here) is important for best student generalisation. Feature regularisation λFR = 20 in Fig. 8. F.7 TABLE 5: ANALYSES IN NEURAL MACHINE TRANSLATION In this section, We performed analyses for a neural machine translation (NMT) task proposed by Tan et al. (2019). In the analyses, we could only obtain data for En-De (from English to German) translation since links to the datasets for other languages are broken. Therefore, we employed a self-distillation method on a pre-trained English model for En-De translation as follows: Published as a conference paper at ICLR 2022 101 102 103 104 FKD Regularisation Strength Test accuracy (%) VGG13 VGG8 on CIFAR-100 Figure 8: Comparison of normalised k(x, x) values between FKD with & without Feature Regularisation (FR), across different Teacher Student architectures, on CIFAR-100 test set. We see, like in Fig. 3 that FR encourages a more even distribution of k(x, x) across x, for all architectures. We train a single teacher transformer model on the IWSLT dataset for English (Tan et al., 2019). We perform self-distillation on the teacher model for En-De translation (Tan et al., 2019). We did not search for optimal hyperparameters, and used default parameters of the code provided by the authors of Tan et al. (2019). The results are given in Table 5. 
Table 5: BLEU of the teacher model of (Tan et al., 2019) (Teacher), self-distillation of (Tan et al., 2019) (SD), SD with KD of (Hinton et al., 2015), SD with FKD, and SD with FKD loss obtained by replacing distillation loss (2) of (Tan et al., 2019) with FKD, in En - De neural machine translation tasks. Teacher (Tan et al., 2019) SD (Tan et al., 2019) SD with KD SD with FKD SD with FKD loss 27.32 27.49 27.51 27.64 27.79 We first note that, we adapted vanilla KD and our FKD for sequential data in the KD loss (2) of (Tan et al., 2019) in this task. More precisely, we first computed vanilla KD and FKD on token probabilities, and added these loss functions to the KD loss (eq 2 of (Tan et al., 2019)) in KD and FKD. In the results, aggregating vanilla KD with the KD loss (eq 2 of (Tan et al., 2019)) improved accuracy from 27.49 to 27.51. However, FKD further boosted BLEU to 27.64. We then replaced KD loss (Eq. 2 of (Tan et al., 2019)) with FKD for training. Remarkably, FKD further boosted the BLEU to 27.79. These results suggest that the proposed FKD can be applied in NMT tasks, successfully. We hope that these results will motivate researchers to employ FKD in various different NLP tasks including but not limited to multilingual NMT, named entity recognition and question answering. F.8 TABLE 6: ANALYSES IN AUTOMATIC SPEECH RECOGNITION In this section, we used a CRDNN model (VGG + LSTM,GRU,Li GRU+ DNN) on the TIMIT dataset. In this experiment, we used a distillation approach proposed by Gao et al. (2020) for ASR tasks as follows: We train a single teacher model on the TIMIT dataset Ravanelli et al. (2021). Published as a conference paper at ICLR 2022 We perform self-distillation on the teacher Gao et al. (2020). We did not search for optimal hyperparameters, and used default parameters of the Speech Brain Library. In this task, replacing CTC/NLL distillation losses with KD (Hinton et al., 2015) did not converge. Additional investigation with hyperparameter search is needed. We used phoneme error rate (PER) to measure accuracy of models. Table 6: Phoneme error rate (PER) of methods in automatic speech recognition tasks. Teacher Distilled Teacher (Gao et al., 2020) KD (Hinton et al., 2015) FKD 13.26 12.80 12.86 12.59 The results are given in Table 6. Similar to the NMT task, we adapted vanilla KD and our FKD for sequential data as follows: We first computed vanilla KD and FKD loss functions on token probabilities, and then added to the total loss (eq 7 of Gao et al. (2020)). In the analyses, Vanilla KD (Hinton et al., 2015) increased the PER from 12.80 to 12.86. However, FKD further improved the PER from 12.80 to 12.59. In this task, training models by replacing CTC/NLL distillation losses (eq 4 or 5 of Gao et al. (2020)) with KD (Hinton et al., 2015) and FKD did not converge. In conclusion, these results propound that FKD can be applied for different tasks, i.e., image classification, NMT and ASR, boosting accuracy of baseline distillation methods. We hope that these initial results will motivate researchers in different communities (computer vision, NLP, and ASR) to further expound and apply FKD in additional sub-tasks.