Self-Distillation as Instance-Specific Label Smoothing

Zhilu Zhang
Cornell University
zz452@cornell.edu

Mert R. Sabuncu
Cornell University
msabuncu@cornell.edu

Abstract

It has been recently demonstrated that multi-generational self-distillation can improve generalization [11]. Despite this intriguing observation, the reasons for the enhancement remain poorly understood. In this paper, we first demonstrate experimentally that the improved performance of multi-generational self-distillation is in part associated with increasing diversity in teacher predictions. With this in mind, we offer a new interpretation of teacher-student training as amortized MAP estimation, in which teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty. We present experimental results using multiple datasets and neural network architectures that, overall, demonstrate the utility of predictive diversity. Finally, we propose a novel instance-specific label smoothing technique that promotes predictive diversity without the need for a separately trained teacher model. We provide an empirical evaluation of the proposed method, which, we find, often outperforms classical label smoothing.

1 Introduction

First introduced as a simple method to compress high-capacity neural networks into low-capacity counterparts for computational efficiency, knowledge distillation [15] has since gained much popularity across application domains ranging from computer vision to natural language processing [19, 22, 28, 39, 41] as an effective method to transfer knowledge or features learned by a teacher network to a student network. This empirical success is often justified with the intuition that deeper teacher networks learn better representations thanks to their greater model complexity, and that the "dark knowledge" teacher networks provide helps student networks learn better representations and hence achieve better generalization. Nevertheless, it remains an open question how exactly student networks benefit from this dark knowledge. The problem is made more puzzling by the recent observation that even self-distillation, a special case of the teacher-student training framework in which the teacher and student networks have identical architectures, can lead to better generalization performance [11]. It was also demonstrated that repeating the self-distillation process over multiple generations can further improve classification accuracy.

In this work, we aim to shed some light on self-distillation. We start off by revisiting the multi-generational self-distillation strategy, and experimentally demonstrate that the performance improvement observed in multi-generational self-distillation is correlated with increasing diversity in teacher predictions. Inspired by this, we view self-distillation as instance-specific regularization of the neural network softmax outputs, and cast the teacher-student training procedure as performing amortized maximum a posteriori (MAP) estimation of the softmax probability outputs. The proposed framework provides a new interpretation of teacher predictions as instance-specific priors conditioned on the inputs.
This interpretation allows us to theoretically relate distillation to label smoothing, a commonly used technique to regularize the predictive uncertainty of NNs, and suggests that regularization on the softmax probability simplex, in addition to regularization of predictive uncertainty, can be the key to better generalization. To verify this claim, we systematically design experiments to compare teacher-student training against label smoothing. Lastly, to further demonstrate the potential gain from regularization on the probability simplex, we also design a new regularization procedure based on label smoothing that we term Beta smoothing.

Our contributions can be summarized as follows:
1. We provide a plausible explanation for recent findings on multi-generational self-distillation.
2. We offer an amortized MAP interpretation of the teacher-student training strategy.
3. We attribute the success of distillation to regularization on both the label space and the softmax probability simplex, and verify the importance of the latter with systematically designed experiments on several benchmark datasets.
4. We propose a new regularization technique, termed Beta smoothing, that improves upon classical label smoothing at little extra cost.
5. We demonstrate that self-distillation can improve calibration.

2 Related Works

Knowledge distillation was first proposed as a way to perform model compression [1, 5, 15]. In addition to the standard approach in which the student model is trained to match the teacher predictions, numerous other objectives have been explored for enhanced distillation performance. For instance, distilling knowledge from intermediate hidden layers was found to be beneficial [14, 17, 18, 32, 34, 39]. Recently, data-free distillation, a novel scenario in which the original data used to train the teacher is unavailable to students, has also been extensively studied [6, 7, 25, 40].

The original knowledge distillation technique for neural networks [15] has stimulated a flurry of interest in the topic, with a large number of published improvements and applications. For instance, prior works [2, 33] have proposed Bayesian techniques in which distributions are distilled with Monte Carlo samples into more compact models such as a neural network. More recently, there has also been work on the importance of distillation from an ensemble of models [24], which provides a complementary view on the role of predictive diversity. Lopez-Paz et al. [23] combined distillation with the theory of privileged information, and offered a generalized framework for distillation. To simplify distillation, Zhu et al. [45] proposed a method for one-stage online distillation. There have also been successful applications of distillation for adversarial robustness [28].

Several papers have attempted to study the effect of distillation training on student models. Furlanello et al. [11] examined the effect of distillation by comparing the gradients of the distillation loss against those of the standard cross-entropy loss with ground truth labels. Phuong et al. [31] considered a special case of distillation using linear and deep linear classifiers, and theoretically analyzed the effect of distillation on student models. Cho and Hariharan [8] conducted a thorough experimental analysis of knowledge distillation, and observed that larger models may not be better teachers.
Another experimentally driven work aimed at understanding the effect of distillation was done in the context of natural language processing [44]. Most similar to our work is [42], in which the authors also established a connection between label smoothing and distillation. However, our argument comes from a different theoretical perspective and offers complementary insights. Specifically, [42] does not highlight the importance of instance-specific regularization. We also provide a general MAP framework and a careful empirical comparison of label smoothing and self-distillation.

3 Preliminaries

We consider the problem of k-class classification. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space and $\mathcal{Y} = \{1, \dots, k\}$ be the label space. Given a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$, where each feature-label pair $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, we are interested in finding a function $f: \mathcal{X} \to \mathbb{R}^k$ that maps input features to corresponding labels. In this work, we restrict the function class to the set of neural networks $f_w(x)$, where $w = \{W_i\}_{i=1}^L$ are the parameters of a neural network with L layers. We define a likelihood model $p(y|x; w) = \mathrm{Cat}(\mathrm{softmax}(f_w(x)))$, a categorical distribution with parameters $\mathrm{softmax}(f_w(x)) \in \Delta^{k-1}$. Here $\Delta^{k-1}$ denotes the probability simplex over the k classes. Typically, maximum likelihood estimation (MLE) is performed. This leads to the cross-entropy loss

$$\mathcal{L}_{ce}(w) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \log p(y = j \mid x_i; w), \qquad (1)$$

where $y_{ij}$ corresponds to the j-th element of the one-hot encoded label $y_i$.

3.1 Teacher-Student Training Objective

Given a pre-trained model (teacher) $f_{w_t}$, the distillation loss can be defined as:

$$\mathcal{L}_{dist}(w) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \left[\mathrm{softmax}\left(f_{w_t}(x_i)/T\right)\right]_j \log p(y = j \mid x_i; w), \qquad (2)$$

where $[\cdot]_j$ denotes the j-th element of a vector. A second network (student) $f_w$ can then be trained with the following total loss:

$$\mathcal{L}(w) = \lambda \mathcal{L}_{ce}(w) + (1 - \lambda) \mathcal{L}_{dist}(w), \qquad (3)$$

where $\lambda \in [0, 1]$ is a hyper-parameter, and T corresponds to the temperature scaling hyper-parameter that flattens teacher predictions. In self-distillation, both teacher and student models have the same network architecture. In the original self-distillation experiments conducted by Furlanello et al. [11], $\lambda$ and T are set to 0 and 1, respectively, throughout the entire training process.

Note that temperature scaling has been applied differently compared to previous literature on distillation [15]. As addressed in Section 5, we only apply temperature scaling to teacher predictions in computing the distillation loss. We empirically observe that this yields results consistent with previous reports. Moreover, as we show in Appendix A.2, performing temperature scaling only on the teacher but not the student models can lead to significantly more calibrated predictions.
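To make the objective concrete, the following is a minimal PyTorch-style sketch of Eqs. 1-3, with temperature scaling applied only to the teacher logits as described above. The function name, reduction choices, and the toy tensors are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, lam=0.0, T=1.0):
    """Sketch of Eq. 3: lam * cross-entropy + (1 - lam) * distillation term.

    Temperature scaling is applied to the teacher predictions only,
    following the convention described in Section 3.1.
    """
    # Standard cross-entropy with ground-truth labels (Eq. 1).
    ce = F.cross_entropy(student_logits, targets)

    # Soft teacher targets, flattened by temperature T (Eq. 2).
    with torch.no_grad():
        soft_targets = F.softmax(teacher_logits / T, dim=1)

    # Cross-entropy between teacher soft targets and student predictions.
    log_probs = F.log_softmax(student_logits, dim=1)
    dist = -(soft_targets * log_probs).sum(dim=1).mean()

    return lam * ce + (1.0 - lam) * dist

if __name__ == "__main__":
    # Random tensors standing in for one mini-batch of logits and labels.
    student_logits = torch.randn(8, 100)
    teacher_logits = torch.randn(8, 100)
    targets = torch.randint(0, 100, (8,))
    print(distillation_loss(student_logits, teacher_logits, targets, lam=0.0, T=2.5).item())
```

Setting lam to 0 and T to 1 recovers the configuration used by Furlanello et al. [11].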
4 Multi-Generation Self-Distillation: A Close Look

Self-distillation can be repeated iteratively such that during training of the i-th generation, the model obtained at the (i-1)-th generation is used as the teacher model. This approach is referred to as multi-generational self-distillation, or Born-Again Networks (BAN). Empirically, it has been observed that student predictions can consistently improve with each generation. However, the mechanism behind this improvement has remained elusive. In this work, we argue that the main attribute that leads to better performance is the increasing uncertainty and diversity in teacher predictions. Similar observations that more tolerant teacher predictions lead to better students were also made by Yang et al. [38]. Indeed, due to the monotonicity and convexity of the negative log-likelihood function, and since the element of the softmax output corresponding to the true label class, $p(y = y_i \mid x_i; w)$, is often much greater than that of the other classes, each subsequent model, trained with early stopping, will likely have increasingly unconfident softmax outputs for the true label class.

4.1 Predictive Uncertainty

We use Shannon entropy to quantify the uncertainty in instance-specific teacher predictions $p(y|x; w_i)$, averaged over the training set, which we call the Average Predictive Uncertainty and define as:

$$\mathbb{E}_x\left[H\left(p(\cdot \mid x; w_i)\right)\right] \approx \frac{1}{n} \sum_{j=1}^{n} H\left(p(\cdot \mid x_j; w_i)\right) = -\frac{1}{n} \sum_{j=1}^{n} \sum_{c=1}^{k} p(y_c \mid x_j; w_i) \log p(y_c \mid x_j; w_i). \qquad (4)$$

Note that previous literature [10, 29] has also proposed to use the above measure as a regularizer to prevent over-confident predictions. Label smoothing [29, 35] is a closely related technique that also penalizes over-confident predictions by explicitly smoothing out ground-truth labels. A detailed discussion on the relationship between the two can be found in Appendix A.1.

Figure 1: Results for sequential self-distillation over 10 generations. The model obtained at the (i-1)-th generation is used as the teacher model for training at the i-th generation. Accuracy and NLL are obtained on the test set using the student model, whereas the predictive uncertainty and confidence diversity are evaluated on the training set with teacher predictions.

4.2 Confidence Diversity

Average Predictive Uncertainty is insufficient to fully capture the variability associated with teacher predictions. In this paper, we argue it is also important to consider how spread out teacher predictions are over the probability simplex across different (training) samples. For instance, two teachers can have very similar Average Predictive Uncertainty values, but drastically different amounts of spread on the probability simplex if the softmax predictions of one teacher are much more diverse across samples than those of the other. We coin this population spread in predictive probabilities "Confidence Diversity." As we show below, characterizing Confidence Diversity can be important for understanding teacher-student training.

The differential entropy over the entire probability simplex is a natural measure to quantify confidence diversity (this is distinct from the Average Predictive Uncertainty discussed in the previous section, which measures the average Shannon entropy of probability vectors). However, accurate entropy estimation can be challenging, and its computation is severely hampered by the curse of dimensionality, particularly in applications with a large number of classes. To alleviate this problem, we propose to measure only the entropy of the softmax element corresponding to the true label class, thereby reducing the measure to a one-dimensional entropy estimation task. Mathematically, if we denote $c = \phi(x, y) := [\mathrm{softmax}(f_{w_t}(x))]_y$ and let $p_C$ be the probability density function of the random variable $C := \phi(X, Y)$, where $(X, Y) \sim p(x, y)$, then we quantify Confidence Diversity via the differential entropy of C:

$$h(C) = -\int p_C(c) \log p_C(c) \, dc. \qquad (5)$$

We use a KNN-based entropy estimator to compute h(C) over the training set [4]. In essence, the above measure quantifies the amount of spread in the teacher predictions on the true label class. The smaller the value, the more similar the softmax values are across different samples.
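The two metrics can be computed from teacher softmax outputs alone, as in the sketch below. The paper only states that a KNN-based estimator [4] is used for Eq. 5, so the particular Kozachenko-Leonenko-style variant, the choice of k, and the numerical clipping constants here are our own assumptions.

```python
import numpy as np
from scipy.special import digamma

def average_predictive_uncertainty(probs):
    """Eq. 4: mean Shannon entropy of the softmax vectors (rows of `probs`)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-np.sum(probs * np.log(probs), axis=1)))

def confidence_diversity(probs, labels, k=3):
    """Eq. 5: differential entropy of the true-class softmax probability,
    estimated with a 1-D k-nearest-neighbour (Kozachenko-Leonenko type) estimator.
    Estimator variant and k are assumptions; the paper only cites [4]."""
    c = probs[np.arange(len(labels)), labels]   # true-class confidences, one per sample
    n = len(c)
    dists = np.abs(c[:, None] - c[None, :])     # all pairwise 1-D distances
    dists.sort(axis=1)                          # column 0 is the self-distance (zero)
    rho = np.maximum(dists[:, k], 1e-12)        # distance to the k-th nearest neighbour
    # psi(n) - psi(k) + log(V_1) + mean(log rho), with V_1 = 2 for the 1-D unit ball.
    return float(digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(rho)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(500, 10))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    labels = rng.integers(0, 10, size=500)
    print(average_predictive_uncertainty(probs))
    print(confidence_diversity(probs, labels))
```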
4.3 Sequential Self-Distillation Experiment

We perform sequential self-distillation with ResNet-34 on the CIFAR-100 dataset for 10 generations. At each generation, we train the neural networks for 150 epochs using the identical optimization procedure as in the original ResNet paper [13]. Following Furlanello et al. [11], $\lambda$ and T are set to 0 and 1, respectively, throughout the entire training process. Additional experiments with different values of T can be found in Appendix A.3.

Fig. 1 summarizes the results. As indicated by the generally increasing trend in test accuracy, sequential distillation indeed leads to improvements. The entropy plots also support the hypothesis that subsequent generations exhibit increasing diversity and uncertainty in predictions. Despite the same increasing trend, the two entropy metrics quantify different things. The increase in average predictive uncertainty suggests an overall drop in the confidence of the categorical distribution, while the growth in confidence diversity suggests increasing variability in teacher predictions. Interestingly, we also see clear improvements in terms of NLL, suggesting in addition that BAN can improve the calibration of predictions [12].

To further study the apparent correlation between student performance and the entropy of teacher predictions over generations, we conduct a new experiment in which we instead train a single teacher. This teacher is then used to train a single generation of students while varying the temperature hyper-parameter T in Eq. 3, which explicitly adjusts the uncertainty and diversity of teacher predictions. For consistency, we keep $\lambda = 0$. Results are illustrated in Fig. 2.

Figure 2: Results with teacher predictions scaled by varying temperature T. The flat lines in the plots correspond to the largest/smallest values achieved over 10 generations of sequential distillation with T = 1 in the previous experiments for accuracy, predictive uncertainty and confidence diversity/NLL.

As expected, increasing T leads to greater predictive uncertainty and diversity in teacher predictions. Importantly, we see that this increase leads to drastic improvements in the test accuracy of students. In fact, the gain is much greater than the best achieved with 10 generations of BAN with T = 1 (indicated with the flat line in the plot). The identified correlation is consistent with the recent finding that early-stopped models, which typically have much larger entropy than fully trained ones, serve as better teachers [8]. Lastly, we also see improvements in NLL with increasing entropy of teacher predictions. However, too high a T leads to a subsequent increase in NLL, likely due to teacher predictions that lack confidence.

A closer look at the entropy metrics of the above experiment reveals an important insight. While the average predictive uncertainty is strictly increasing in T, the confidence diversity plateaus after T = 2.5. The plateau of confidence diversity coincides closely with the stagnation of student test accuracy, hinting at the importance of confidence diversity in teacher predictions. The apparent correlation between accuracy and confidence diversity can also be seen in the additional sequential self-distillation experiments found in Appendix A.3.
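For reference, the sequential procedure examined above can be sketched as the following toy training loop, where each generation is trained from scratch against the previous one. The MLP, synthetic data, optimizer, and epoch count are placeholders for readability, not the paper's ResNet-34 / CIFAR-100 setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(in_dim=20, num_classes=5):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

def train_generation(x, y, teacher=None, lam=0.0, T=1.0, epochs=50):
    """Train one generation; if `teacher` is given, use the loss of Eq. 3."""
    model = make_model(x.shape[1], 5)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(x)
        ce = F.cross_entropy(logits, y)
        if teacher is None:
            loss = ce                          # generation 0: plain cross-entropy (MLE)
        else:
            with torch.no_grad():
                soft = F.softmax(teacher(x) / T, dim=1)
            dist = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
            loss = lam * ce + (1.0 - lam) * dist
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(512, 20)
    y = torch.randint(0, 5, (512,))
    teacher = None
    for gen in range(3):                       # the paper runs 10 generations
        student = train_generation(x, y, teacher=teacher, lam=0.0, T=1.0)
        acc = (student(x).argmax(1) == y).float().mean().item()
        print(f"generation {gen}: train accuracy {acc:.3f}")
        teacher = student.eval()               # the i-th student becomes the next teacher
```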
The importance of confidence diversity makes intuitive sense. Given a training set, we would expect some samples to be much more typical representatives of their label class than others. Ideally, we would hope to classify the typical examples with much greater confidence than an ambiguous example of the same class. Previous results show that training with such instance-specific uncertainty can indeed lead to better performance [30]. Our view is that in self-distillation, the teacher provides the means for instance-specific regularization.

5 An Amortized MAP Perspective of Self-Distillation

The instance-specific regularization perspective on self-distillation motivates us to recast the training procedure as performing maximum a posteriori (MAP) estimation of the softmax probability vector. Specifically, suppose now that the likelihood $p(y|x, z) = \mathrm{Cat}(z)$ is a categorical distribution with parameter $z \in \Delta^{k-1}$, and the conditional prior $p(z|x) = \mathrm{Dir}(\alpha_x)$ is a Dirichlet distribution with instance-specific parameter $\alpha_x$. Due to the conjugacy of the Dirichlet prior, a closed-form solution

$$[\hat{z}]_i = \frac{c_i + [\alpha_x]_i - 1}{\sum_j \left(c_j + [\alpha_x]_j - 1\right)},$$

where $c_i$ corresponds to the number of occurrences of the i-th category, can be easily obtained.

The above framework is not useful for classification when given a new sample x without any observations of y. Moreover, in the common supervised learning setup, only one observation of the label y is available for each sample x. The MAP solution shown above relies only on the provided label y for each sample x, without exploiting the potential similarities among the different samples $x_i$ in the dataset for more accurate estimation. For example, we could have samples that are almost duplicates (cf. [3]) but have different $y_i$'s, which could inform us about other labels that could be drawn from $z_i$. Thus, instead of relying on the instance-level closed-form solution, we can train a (student) network to amortize the MAP estimation, $\hat{z}_i \approx \mathrm{softmax}(f_w(x_i))$, over a given training set, resulting in the optimization problem

$$\max_w \sum_i \log p(z \mid x_i, y_i; w, \alpha_{x_i}) = \max_w \sum_i \Big[ \log p(y = y_i \mid z, x_i; w) + \log p(z \mid x_i; \alpha_{x_i}) \Big]$$

$$= \max_w \sum_i \Big[ \underbrace{\log\big[\mathrm{softmax}(f_w(x_i))\big]_{y_i}}_{\text{cross entropy}} + \underbrace{\sum_{c=1}^{k} \big([\alpha_{x_i}]_c - 1\big) \log [z]_c}_{\text{instance-specific regularization}} \Big], \qquad (6)$$

where $z = \mathrm{softmax}(f_w(x_i))$ and additive constants are dropped. Eq. 6 is an objective that provides us with a function to obtain a MAP solution of z given an input sample x. Note that we do not make any assumptions about the availability or number of label observations y for each sample x. This enables us to find an approximate MAP solution of z for a new sample x at test time, when label observations are unavailable. The resulting framework is generally applicable to various scenarios such as semi-supervised learning or learning from multiple labels per sample. Nevertheless, in the following we restrict our attention to supervised learning with a single label per training sample.

5.1 Label Smoothing as MAP

The difficulty now lies in obtaining the instance-specific prior $\mathrm{Dir}(\alpha_x)$. A naive independence assumption, $p(z|x) = p(z)$, can be made. Under such an assumption, a sensible choice of prior would be one that treats all possible labels uniformly. Choosing $[\alpha_x]_c = [\alpha]_c = \frac{\beta}{k} + 1$ for all $c \in \{1, \dots, k\}$, for some hyper-parameter $\beta$, the MAP objective becomes

$$\max_w \sum_i \Big( \log[z]_{y_i} + \frac{\beta}{k} \sum_{c=1}^{k} \log[z]_c \Big). \qquad (7)$$

As noted in prior work, this loss function is equivalent to the commonly used label smoothing (LS) regularization [29, 35] (derivations can be found in Appendix A.1). Observe also that the training objective in essence promotes predictions with larger predictive uncertainty, but not confidence diversity.
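As a concrete illustration, here is a minimal PyTorch-style sketch of the (negated, for minimization) MAP objective of Eq. 6 with a generic instance-specific Dirichlet parameter, together with the uniform choice of Section 5.1 that recovers label smoothing (Eq. 7). The function names and the value of beta are our own placeholders.

```python
import torch
import torch.nn.functional as F

def map_loss(logits, targets, alpha):
    """Negative of the per-sample MAP objective in Eq. 6.

    `alpha` holds the instance-specific Dirichlet parameters, one row per sample.
    The first term is the cross-entropy from the single observed label, the
    second is the instance-specific regularizer sum_c (alpha_c - 1) log z_c.
    """
    log_z = F.log_softmax(logits, dim=1)
    ce = -log_z.gather(1, targets.unsqueeze(1)).squeeze(1)     # -log z_{y_i}
    reg = -((alpha - 1.0) * log_z).sum(dim=1)                  # -sum_c (alpha_c - 1) log z_c
    return (ce + reg).mean()

def uniform_alpha(batch_size, num_classes, beta):
    """Prior of Section 5.1: [alpha]_c = beta / k + 1, which recovers Eq. 7."""
    return torch.full((batch_size, num_classes), beta / num_classes + 1.0)

if __name__ == "__main__":
    logits = torch.randn(4, 10, requires_grad=True)
    targets = torch.randint(0, 10, (4,))
    loss = map_loss(logits, targets, uniform_alpha(4, 10, beta=1.5))
    loss.backward()
    print(loss.item())
```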
5.2 Self-Distillation as MAP

A better instance-specific prior distribution can be obtained using a pre-trained (teacher) neural network. Let us consider a network $f_{w_t}$ trained with the regular MLE objective, i.e., by maximizing $p(y|x; w_t) = \mathrm{Cat}(\mathrm{softmax}(f_{w_t}(x)))$, where

$$\left[\mathrm{softmax}\left(f_{w_t}(x)\right)\right]_i = \frac{\left[\exp\left(f_{w_t}(x)\right)\right]_i}{\sum_j \left[\exp\left(f_{w_t}(x)\right)\right]_j}.$$

Now, due to the conjugacy of the Dirichlet prior, the marginal likelihood $p(y|x; \alpha_x)$ is a Dirichlet-multinomial distribution [26]. In the case of a single label observation considered here, the marginal likelihood reduces to a categorical distribution. As such, we have $p(y|x; \alpha_x) = \mathrm{Cat}(\bar{\alpha}_x)$, where $\bar{\alpha}_x$ is normalized such that $[\bar{\alpha}_x]_i = \frac{[\alpha_x]_i}{\sum_j [\alpha_x]_j}$. We can thus interpret $\exp(f_{w_t}(x))$ as the parameters of a Dirichlet distribution that provides a useful instance-specific prior on z. However, we observe that there is a scale ambiguity that needs resolving, since, with T = 1 and $\gamma = 0$, any value of the hyper-parameter $\beta$ in

$$\alpha_x = \beta \exp\left(f_{w_t}(x)/T\right) + \gamma \qquad (8)$$

yields the same $\bar{\alpha}_x$. Using T > 1 and $\gamma > 0$ corresponds to flattening the prior distribution, which we found to be useful in practice, an observation consistent with prior work. Note that in the limit of $T \to \infty$, the instance-specific prior reduces to a uniform prior corresponding to classical label smoothing. Setting $\gamma = 1$ (we also experimentally explore the effect of varying $\gamma$; see Appendix A.10 for details), we obtain

$$\alpha_x = \beta \exp\left(f_{w_t}(x)/T\right) + 1 = \beta \Big( \sum_j \left[\exp\left(f_{w_t}(x)/T\right)\right]_j \Big)\, \mathrm{softmax}\left(f_{w_t}(x)/T\right) + 1. \qquad (9)$$

Plugging this into Eq. 6 yields

$$\max_w \sum_i \Big( \log[z]_{y_i} + \beta\, \omega_{x_i} \sum_{c=1}^{k} \left[\mathrm{softmax}\left(f_{w_t}(x_i)/T\right)\right]_c \log[z]_c \Big), \qquad (10)$$

which is very similar to the distillation loss of Eq. 3, but with an additional sample-specific weighting term $\omega_{x_i} = \sum_j [\exp(f_{w_t}(x_i)/T)]_j$. Despite this interesting result, we empirically observe that, with temperature values T found to be useful in practice, the relative weightings of samples are too close to yield a significant difference from the regular distillation loss. Hence, for all of our experiments, we still adopt the distillation loss of Eq. 3. However, we believe that, with teacher models trained with an objective more appropriate than MLE, the difference might be bigger. We hope to explore alternative ways of obtaining teacher models that effectively utilize the sample-reweighted distillation objective as future work.

The MAP interpretation, together with the empirical experiments conducted in Section 4, suggests that multi-generational self-distillation can in fact be seen as an inefficient approach to implicitly flattening and diversifying the instance-specific prior distribution. Our experiments suggest that we can instead tune the hyper-parameters T and $\gamma$ more effectively to achieve similar, if not better, results. Moreover, from this perspective, distillation in general can be understood as a regularization strategy. Some empirical evidence for this can be found in Appendix A.6 and A.7.
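The teacher-based prior of Eq. 9 and the re-weighted objective of Eq. 10 can be sketched as follows. Since the paper ultimately adopts the standard loss of Eq. 3 in its experiments, this block is illustrative only; the values of beta and T, the function names, and the mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def teacher_alpha(teacher_logits, beta=1.0, T=2.5, gamma=1.0):
    """Instance-specific Dirichlet parameters of Eq. 9:
    alpha_x = beta * exp(f_t(x) / T) + gamma  (gamma = 1 in the derivation above)."""
    return beta * torch.exp(teacher_logits / T) + gamma

def reweighted_distillation_loss(student_logits, teacher_logits, targets, beta=1.0, T=2.5):
    """Negative of the objective in Eq. 10: cross-entropy plus a distillation term
    carrying the sample-specific weight omega_x = sum_j exp(f_t(x)/T)_j."""
    log_z = F.log_softmax(student_logits, dim=1)
    ce = -log_z.gather(1, targets.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        soft = F.softmax(teacher_logits / T, dim=1)
        omega = torch.exp(teacher_logits / T).sum(dim=1)   # sample-specific weight
    dist = -(soft * log_z).sum(dim=1)
    return (ce + beta * omega * dist).mean()

if __name__ == "__main__":
    s = torch.randn(4, 10, requires_grad=True)
    t = torch.randn(4, 10)
    y = torch.randint(0, 10, (4,))
    print(reweighted_distillation_loss(s, t, y).item())
    print(teacher_alpha(t).shape)
```

Dropping the per-sample weight omega recovers the usual distillation loss of Eq. 3.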
5.3 On the Relationship between Label Smoothing and Self-Distillation

The MAP perspective reveals an intimate relationship between self-distillation and label smoothing. Label smoothing increases the uncertainty of predictive probabilities. However, as discussed in Section 4, this may not be enough to prevent overfitting, as evidenced by the stagnant test accuracy despite increasing uncertainty in Fig. 2. Indeed, the MAP perspective suggests that, ideally, each sample should have a distinct probabilistic label. Instance-specific regularization can encourage confidence diversity, in addition to predictive uncertainty.

While predictive uncertainty can be explicitly used for regularization, as previously discussed in Section 4.1, we observe empirically that promoting confidence diversity directly through the measure proposed in Section 4.2 can be hard in practice, yielding unsatisfactory results. This could be caused by the difficulty of estimating confidence diversity accurately from mini-batch samples. Naively promoting confidence diversity during the early stages of training could also harm learning. As such, we can view distillation as an indirect way of achieving this objective. We leave it as future work to further explore alternative techniques that enable direct regularization of confidence diversity.

6 Beta Smoothing Labels

Self-distillation requires training a separate teacher model. In this paper, we propose an efficient enhancement to the label smoothing strategy in which the amount of smoothing is proportional to the uncertainty of predictions. Specifically, we make use of the exponential moving average (EMA) predictions, as implemented by Tarvainen and Valpola [36], of the model being trained, and obtain a ranking based on the confidence (the magnitude of the largest element of the softmax) of predictions within each mini-batch, on the fly, from smallest to largest. Instead of assigning the uniform prior $[\alpha_x]_c = \frac{\beta}{k} + 1$ for all $c \in \{1, \dots, k\}$ to all samples, during each iteration we sample and sort a set of i.i.d. random variables $\{b_1 \le \dots \le b_m\}$ from Beta(a, 1), where m corresponds to the mini-batch size and a corresponds to the hyper-parameter of the Beta distribution. Then, based on the ranking obtained, we assign $[\alpha_{x_i}]_{y_i} = \beta b_i + 1$ and $[\alpha_{x_i}]_c = \beta \frac{1 - b_i}{k - 1} + 1$ for all $c \ne y_i$ as the prior for each sample $x_i$. In this way, samples with larger confidence under the EMA predictions receive less label smoothing, and vice versa. Thus, the amount of label smoothing applied to a sample is inversely proportional to the confidence the model has in that sample's prediction. Instances that are more challenging to classify will therefore have more smoothing applied to their labels.

In practice, for consistency with distillation, Eq. 3 is used for training. Beta-smoothed labels with $b_i$ on the ground-truth class and $\frac{1 - b_i}{k - 1}$ on all other classes are used in lieu of teacher predictions for each $x_i$. Lastly, note that the EMA predictions are used in order to stabilize the ranking obtained at each iteration of training. We empirically observe a significant performance boost with the EMA predictions. We term this method Beta smoothing.
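One way to sketch the per-mini-batch construction of Beta-smoothed labels is shown below. The maintenance of the EMA model (following [36]) is omitted, and the function names and the value of the Beta parameter a are our own placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def beta_smoothed_targets(ema_logits, targets, num_classes, a=8.0):
    """One mini-batch of Beta-smoothed labels (Section 6).

    Samples that the EMA model predicts with higher confidence receive larger
    ground-truth mass b_i, i.e. less smoothing. `a` is the Beta(a, 1)
    hyper-parameter; its value here is an arbitrary placeholder.
    """
    m = targets.shape[0]
    with torch.no_grad():
        confidence = F.softmax(ema_logits, dim=1).max(dim=1).values
    # i.i.d. draws from Beta(a, 1), sorted in ascending order.
    b = torch.distributions.Beta(a, 1.0).sample((m,)).sort().values
    # Rank samples by EMA confidence (ascending); the i-th least confident
    # sample gets the i-th smallest b.
    order = confidence.argsort()
    b_per_sample = torch.empty(m)
    b_per_sample[order] = b
    # Soft labels: b_i on the true class, (1 - b_i) / (k - 1) elsewhere.
    soft = ((1.0 - b_per_sample) / (num_classes - 1)).unsqueeze(1).repeat(1, num_classes)
    soft[torch.arange(m), targets] = b_per_sample
    return soft

def beta_smoothing_loss(student_logits, ema_logits, targets, lam=0.4, a=8.0):
    """Eq. 3 with Beta-smoothed labels used in place of teacher predictions."""
    soft = beta_smoothed_targets(ema_logits, targets, student_logits.shape[1], a)
    ce = F.cross_entropy(student_logits, targets)
    dist = -(soft * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()
    return lam * ce + (1.0 - lam) * dist

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(8, 100, requires_grad=True)
    ema_logits = torch.randn(8, 100)
    targets = torch.randint(0, 100, (8,))
    print(beta_smoothing_loss(logits, ema_logits, targets).item())
```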
To better examine the role that EMA predictions play in Beta smoothing, we conduct two ablation studies. Firstly, since the EMA predictions are used for the Beta-smoothed labels, we compare the effectiveness of Beta smoothing against self-training that explicitly uses the EMA predictions (see Appendix A.5 for details). Moreover, to test the importance of the ranking obtained from EMA predictions, we include in Appendix A.8 an additional experiment in which random Beta smoothing is applied to each sample.

Beta smoothing implements an instance-specific prior that encourages confidence diversity, and yet does not require the expensive step of training a separate teacher model. We note that, due to the constantly changing prior used at every iteration of training, Beta smoothing does not, strictly speaking, correspond to the MAP estimation in Eq. 6. Nevertheless, it is a simple and effective way to implement the instance-specific prior strategy. As we demonstrate in the following section, it can lead to much better performance than label smoothing. Moreover, unlike teacher predictions, which have unique softmax values for all classes, the difference between Beta smoothing and label smoothing comes only from the ground-truth softmax element. This enables us to conduct more systematic experiments to illustrate the additional gain from promoting confidence diversity.

7 Empirical Comparison of Distillation and Label Smoothing

To further demonstrate the benefits of the additional regularization on the softmax probability vector space, we design a systematic experiment to compare self-distillation against label smoothing. In addition, experiments on Beta smoothing are also conducted to further verify the importance of confidence diversity, and to promote Beta smoothing as a simple alternative that can lead to better performance than label smoothing at little extra cost. We note that, while previous works have highlighted the similarity between distillation and label smoothing from another perspective [42], we provide a detailed empirical analysis that uncovers additional benefits of instance-specific regularization.

7.1 Experimental Setup

We conduct experiments on CIFAR-100 [20], CUB-200 [37] and Tiny-ImageNet [9] using ResNet [13] and DenseNet [16]. We follow the original optimization configurations, and train the ResNet models for 150 epochs and the DenseNet models for 200 epochs. 10% of the training data is split off as the validation set. All experiments are repeated 5 times with random initialization. For simplicity, label smoothing is implemented with explicit soft labels instead of the objective in Eq. 7. We fix the smoothing parameter to 0.15 for all our experiments (additional experiments with smoothing values of 0.1 and 0.3 can be found in Appendix A.4). The hyper-parameter $\lambda$ of Eq. 3 is set to 0.6 for self-distillation. Only one generation of distillation is performed for all experiments. To systematically decompose the effect of the two regularizations in self-distillation, given a pre-trained teacher and $\lambda$, we manually search for a temperature T such that the average effective label of the ground-truth class, $\lambda + (1 - \lambda)\left[\mathrm{softmax}\left(f_{w_t}(x_i)/T\right)\right]_{y_i}$, is approximately equal to 0.85, to match the hyper-parameter chosen for label smoothing. Eq. 3 is also used for Beta smoothing, with $\lambda = 0.4$. The parameter a of the Beta distribution is set such that the expected effective ground-truth label, $\mathbb{E}\left[\lambda + (1 - \lambda) b_i\right]$, matches that of label smoothing, making the average probability of the ground-truth class the same across methods.

We emphasize that the goal of the experiment is to methodically decompose the gain from the two aforementioned regularizations of distillation. Note that both $\lambda$ and T can simultaneously influence the amount of predictive uncertainty and confidence diversity in teacher predictions. This coupled effect can make hyper-parameter tuning hard. Due to limited computational resources, hyper-parameter tuning is not performed, and the results for all methods could potentially be further enhanced. Lastly, we also incorporate an additional distillation experiment in which the deeper DenseNet model is used as the teacher model, for comparison against self-distillation. Results can be found in Appendix A.9.
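The matching of the average effective ground-truth label could be automated as in the sketch below. The paper states that T is searched manually, so the bisection here, its monotonicity assumption, and the toy "teacher" are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def average_effective_label(teacher_logits, targets, lam, T):
    """Average of lam + (1 - lam) * [softmax(f_t(x)/T)]_{y} over a (training) set."""
    p_true = F.softmax(teacher_logits / T, dim=1).gather(1, targets.unsqueeze(1))
    return (lam + (1.0 - lam) * p_true).mean().item()

def match_temperature(teacher_logits, targets, lam=0.6, target=0.85,
                      lo=1.0, hi=20.0, iters=40):
    """Bisection search for T so the average effective ground-truth label is
    approximately `target` (0.85 here, matching the label-smoothing setting).
    Assumes the average decreases in T, which holds when the teacher assigns
    the true class more mass than 1/k on average, and that [lo, hi] brackets it."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average_effective_label(teacher_logits, targets, lam, mid) > target:
            lo = mid            # still too confident: increase T further
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = 10.0 * torch.randn(1000, 100)     # toy, highly confident "teacher"
    targets = logits.argmax(dim=1)
    T = match_temperature(logits, targets)
    print(T, average_effective_label(logits, targets, 0.6, T))
```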
7.2 Results

Test accuracies are summarized in the top row of each experiment in Fig. 3. Firstly, all regularization techniques lead to improved accuracy compared to the baseline model trained with the cross-entropy loss. In agreement with previous results, self-distillation performs better than label smoothing in all of the experiments with our setup, in which the effective degree of label smoothing in distillation is, on average, the same as that of regular label smoothing. These results suggest the importance of confidence diversity in addition to predictive uncertainty.

It is worth noting that we obtain encouraging results with Beta smoothing. It outperforms label smoothing in all but the CIFAR-100 ResNet experiment, and can even achieve performance comparable to that of self-distillation on the CUB-200 dataset, with no separate teacher model required. The improvements of Beta smoothing over label smoothing also serve as direct evidence of the importance of confidence diversity, as the only difference between the two is the additional spreading of the ground-truth class confidences. We hypothesize that the gap in accuracy between Beta smoothing and self-distillation is mainly due to the better instance-specific priors set by a pre-trained teacher network. The differences in the non-ground-truth classes between the two methods could also account for part of the small gap in accuracy.

Figure 3: Experimental results on the CIFAR-100, CUB-200 and Tiny-ImageNet datasets. "CE", "LS", "B" and "SD" refer to "Cross Entropy", "Label Smoothing", "Beta Smoothing" and "Self-Distillation", respectively. The top row of each experiment shows bar charts of test-set accuracy, while the bottom row shows bar charts of expected calibration error.

Results on calibration are shown in the bottom rows of Fig. 3, where we report the expected calibration error (ECE) [12]. As anticipated, all regularization techniques lead to enhanced calibration. Nevertheless, we see that the errors obtained with self-distillation are in general much smaller than those of label smoothing. As such, instance-specific priors can also lead to more calibrated models. Beta smoothing again not only produces models with much better-calibrated predictions than label smoothing, but also compares favorably to self-distillation in a majority of the experiments.

8 Discussion and Future Directions

In this paper, we provide empirical evidence that diversity in teacher predictions is correlated with student performance in self-distillation. Inspired by this observation, we offer an amortized MAP interpretation of the popular teacher-student training strategy. This viewpoint provides insights into self-distillation and suggests ways to improve it. For example, encouraged by the results obtained with Beta smoothing, there may be better and/or more efficient ways to obtain priors for instance-specific regularization.

Recent literature shows that label smoothing leads to better calibration performance [27]. In this paper, we demonstrate that distillation can also yield more calibrated models. We believe this is a direct consequence of not performing temperature scaling on student models during training. Indeed, with temperature scaling also applied to the student models, the student logits are likely pushed to larger values during training, leading to over-confident predictions.

More generally, we have only discussed the teacher-student training strategy as MAP estimation. There have been other recently proposed techniques involving training with soft labels, which we can interpret as encouraging confidence diversity or implementing instance-specific regularization.
For instance, the mixup regularization technique [43] creates label diversity by taking random convex combinations of the training data, including the labels. Recently proposed consistency-based semi-supervised learning methods such as [21, 36], on the other hand, utilize predictions on unlabeled training samples as an instance-specific prior. We believe this unifying view of regularization with soft labels can stimulate further ideas on instance-specific regularization.

Acknowledgments

This work was supported by NIH R01 grants (R01LM012719 and R01AG053949), NSF NeuroNex grant 1707312, and NSF CAREER grant 1748377.

Statement of the Potential Broader Impact

In this paper, we offer a new interpretation of the self-distillation training framework, a technique commonly used by practitioners in the deep learning community to improve accuracy, which allows us to gain a deeper understanding of the reasons for its success. With the ubiquity of deep learning in our society today and its countless potential future applications, we believe our work can bring positive impacts in several ways.

Firstly, despite the empirical utility of distillation and its numerous successful applications in tasks ranging from computer vision to natural language processing, we still lack a thorough understanding of why it works. In our opinion, blindly applying methods and algorithms without a good grasp of the underlying mechanisms can be dangerous. Our perspective offers a theoretically grounded explanation for its success that allows the technique to be applied to real-world problems with greater confidence. In addition, the proposed interpretation of distillation as a regularizer of neural networks can potentially allow us to obtain models that are more generalizable and reliable. This is an extremely important aspect of applying deep learning to sensitive domains like healthcare and autonomous driving, in which wrong predictions made by machines can lead to catastrophic consequences. Moreover, we provide new experimental evidence that the distillation process can lead to better-calibrated models, which can facilitate safer and more interpretable applications of neural networks. Indeed, for real-world classification tasks like disease diagnosis, in addition to accurate predictions we need reliable estimates of the confidence of those predictions, which is something neural networks currently lack, as pointed out by recent research. Better-calibrated models, in our opinion, enhance the explainability and transparency of neural network models.

Lastly, we believe the introduced framework can stimulate further research on the regularization of deep learning models for better generalization and thus safer applications. It was recently demonstrated that deep neural networks do not seem to suffer from overfitting. Our finding suggests that overfitting can still occur, though in a different way than conventional wisdom suggests, and that deep learning can still benefit from regularization. As such, we encourage research into more efficient and principled forms of regularization to improve upon the distillation strategy.

We acknowledge the risks associated with our work. To be more specific, our finding advocates for the use of priors for the regularization of neural networks.
Despite the potentially better generalization performance of the trained models, depending on the choice of priors used for training, unwanted bias can inevitably be introduced into the deep learning system, potentially causing issues of fairness and privacy.

References

[1] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654-2662, 2014.
[2] Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438-3446, 2015.
[3] Björn Barz and Joachim Denzler. Do we train on test data? Purging CIFAR of near-duplicates. arXiv preprint arXiv:1902.00423, 2019.
[4] Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17-39, 1997.
[5] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535-541, 2006.
[6] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. ZeroQ: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169-13178, 2020.
[7] Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3514-3522, 2019.
[8] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794-4802, 2019.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[10] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine grained classification. In Advances in Neural Information Processing Systems, pages 637-647, 2018.
[11] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607-1616, 2018.
[12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321-1330. JMLR.org, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[14] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1921-1930, 2019.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[17] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219, 2017.
[18] Jangho Kim, Seong Uk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pages 2760-2769, 2018.
[19] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317-1327, 2016.
[20] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[21] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[22] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935-2947, 2017.
[23] David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, and Vladimir Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.
[24] Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In International Conference on Learning Representations, 2019.
[25] Paul Micaelli and Amos J Storkey. Zero-shot knowledge transfer via adversarial belief matching. In Advances in Neural Information Processing Systems, pages 9551-9561, 2019.
[26] Thomas Minka. Estimating a Dirichlet distribution, 2000.
[27] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696-4705, 2019.
[28] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582-597. IEEE, 2016.
[29] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
[30] Joshua C Peterson, Ruairidh M Battleday, Thomas L Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE International Conference on Computer Vision, pages 9617-9626, 2019.
[31] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In International Conference on Machine Learning, pages 5142-5151, 2019.
[32] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[33] Yichen Shen, Zhilu Zhang, Mert R Sabuncu, and Lin Sun. Learning the distribution: A unified distillation paradigm for fast uncertainty estimation in computer vision. arXiv preprint arXiv:2007.15857, 2020.
[34] Suraj Srinivas and Francois Fleuret. Knowledge transfer with Jacobian matching. In International Conference on Machine Learning, pages 4723-4731, 2018.
[35] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[36] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195-1204, 2017.
[37] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[38] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5628-5635, 2019.
[39] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4133-4141, 2017.
[40] Jaemin Yoo, Minyong Cho, Taebum Kim, and U Kang. Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pages 2705-2714, 2019.
[41] Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1974-1982, 2017.
[42] Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723, 2019.
[43] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[44] Chunting Zhou, Graham Neubig, and Jiatao Gu. Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727, 2019.
[45] Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems, pages 7517-7527, 2018.