A Statistical Perspective on Distillation

Aditya Krishna Menon¹  Ankit Singh Rawat¹  Sashank J. Reddi¹  Seungyeon Kim¹  Sanjiv Kumar¹

¹Google Research, New York. Correspondence to: Aditya Krishna Menon.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Knowledge distillation is a technique for improving a student model by replacing its one-hot training labels with a label distribution obtained from a teacher model. Despite its broad success, several basic questions (e.g., why does distillation help? why do more accurate teachers not necessarily distill better?) have received limited formal study. In this paper, we present a statistical perspective on distillation which sheds light on these questions. Our core observation is that a "Bayes teacher" providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the utility of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a "good" teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.

1. Introduction

Distillation is the process of using a "teacher" model to improve the performance of a "student" model (Bucilǎ et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015). In its simplest form, one trains the student to fit the teacher's soft distribution over labels, rather than one-hot training labels. While originally devised as a means of model compression, distillation has proven successful in improving students with the same architecture as the teacher (Rusu et al., 2016; Furlanello et al., 2018), and has found several broader uses (Papernot et al., 2016; Yim et al., 2017; Liu et al., 2019).

Given its empirical success, it is natural to ask: why does distillation help? Answering this requires confronting a number of puzzling empirical observations, e.g., that improving teacher accuracy can harm distillation performance (Müller et al., 2019). One commonly accepted intuition from Hinton et al. (2015) is that the teacher's soft labels provide "dark knowledge" via weights on the wrong labels $y' \neq y$ for an example $(x, y)$. But what are the formal statistical benefits of using soft over one-hot labels, and what is the ideal set of soft labels? While several works have studied various aspects of distillation (Lopez-Paz et al., 2016; Phuong & Lampert, 2019; Mobahi et al., 2020; Ji & Zhu, 2020; Zhang & Sabuncu, 2020; Zhou et al., 2021; Hsu et al., 2021) (cf. §2.2), precise answers to these questions remain elusive.

In this paper, we present a statistical perspective on distillation which sheds light on why it can aid performance, explicates the statistical value of "dark knowledge", and provides a formal criterion as to what constitutes a "good" teacher. Our key observation is that a teacher providing the true (Bayes) class-probabilities can lower the variance of the student objective, and thus improve performance. Further, teachers that reliably approximate these probabilities (for which merely being accurate does not suffice) possess a bias-variance tradeoff quantifying how they may improve generalisation.
Beyond providing conceptual insight, this perspective facilitates novel applications of distillation to problems such as bipartite ranking and multiclass retrieval. In sum, our contributions are:

(i) We present a statistical view of distillation, by establishing that the student's expected loss inherently smooths labels with the Bayes class-probabilities (§3). We then quantify the benefit of using these Bayes probabilities in place of one-hot labels. This gives a statistical perspective on "dark knowledge": for an example $(x, y)$, the logits on wrong labels $y' \neq y$ encode information about the underlying class distribution. This helps the student minimise a better approximation to the true generalisation error, which can improve performance.

(ii) We quantify a bias-variance tradeoff for teachers providing an approximation to the Bayes probabilities (§4). This gives a concrete criterion for assessing if a teacher is "good", i.e., the quality of its probability estimates as estimated, e.g., by the log-loss. This does not necessarily correspond to a more accurate teacher; see Figure 1.

(iii) We illustrate how our statistical perspective facilitates novel practical applications of distillation, e.g., to bipartite ranking and multiclass retrieval problems (§5).

[Figure 1: panels (a) top-1 accuracy, (b) log-loss, and (c) expected calibration error, for teacher and student, plotted against teacher depth in {8, 14, 20, 32, 44, 56}.]

Figure 1. Illustration of how teachers with better test set class-probability estimates generally yield better students, in keeping with our bias-variance bound (Proposition 3). On CIFAR-100, we train teachers that are ResNets of varying depths, and distill these to a student ResNet of fixed depth 8. Figure 1(a) reveals that as its depth increases, the teacher gets increasingly more accurate on the test set. However, the teacher's probability estimates on the test set become progressively poorer approximations of the Bayes class-probabilities $p^*(x)$ beyond depth 14: the teacher's test set log-loss and calibration error (Guo et al., 2017) (both measures of the quality of the probability estimates) increase beyond this depth (Figures 1(b) and 1(c)). Intuitively, depth 14 provides an optimal balance between the bias and variance in the teacher's predictions. In line with Proposition 3, the depth 14 teacher with the best probability estimates produces the most accurate student (Figure 1(a)). See §4.3 for more details, and §4.1 for the formal bias-variance tradeoff.

Our findings are verified for linear models, neural networks, and decision trees, on both controlled synthetic and real-world datasets. This illustrates a broader point of our statistical perspective: distillation can be understood as a basic tool whose utility is not limited to neural networks. More broadly, our simple measure of a teacher's utility for distillation (the quality of its probability estimates, which may be estimated by the test set log-loss, square-loss, or calibration error) gives one means of reasoning about various empirical findings; e.g., temperature scaling can be seen as improving the teacher's probability calibration (cf. §4.2), while teachers that are merely accurate but provide poor probability estimates may distill poorly (Figures 1 and 3).
2. Background and Notation

2.1. Multiclass Classification

In multiclass classification, we are given a training sample $S := \{(x_n, y_n)\}_{n=1}^N \sim \mathbb{P}^N$, for a distribution $\mathbb{P}$ over instances $\mathcal{X}$ and labels $\mathcal{Y} = [L] := \{1, 2, \ldots, L\}$. Our goal is to learn a predictor $f : \mathcal{X} \to \mathbb{R}^L$ with minimal risk:

$$R(f) := \mathbb{E}_{(x, y) \sim \mathbb{P}}\big[ \ell(y, f(x)) \big]. \qquad (1)$$

Here, $\ell(y, f(x))$ is the loss of predicting $f(x) \in \mathbb{R}^L$ when the true label is $y \in [L]$. A canonical example is the softmax cross-entropy, or the log-loss with softmax activation:

$$\ell(y, f(x)) = -f_y(x) + \log\Big[ \sum_{y' \in [L]} e^{f_{y'}(x)} \Big]. \qquad (2)$$

We may approximate the risk $R(f)$ via the empirical risk

$$\hat{R}(f; S) = \frac{1}{N} \sum_{n \in [N]} e_{y_n}^\top \ell(f(x_n)), \qquad (3)$$

for one-hot encoding $e_y \in \{0, 1\}^L$ of a given label $y \in [L]$, and vector of losses for each possible label $\ell(f(x)) := \big( \ell(1, f(x)), \ldots, \ell(L, f(x)) \big)$.

2.2. Knowledge Distillation

Distillation involves using a "teacher" model to improve a "student" model. Classically, this may be done via matching the two models' logits (Bucilǎ et al., 2006). We follow the generalisation in Hinton et al. (2015), wherein one computes teacher class-probability estimates $p^t(x) := [p^t(y \mid x)]_{y \in [L]}$, where $p^t(y \mid x)$ estimates how likely $x$ is to be classified as $y$. These are used by a student model which replaces the empirical risk (3) with the distilled risk

$$\tilde{R}(f; S) = \frac{1}{N} \sum_{n \in [N]} p^t(x_n)^\top \ell(f(x_n)), \qquad (4)$$

so that the one-hot encoding of labels is replaced with the teacher's distribution over labels. To construct (4), we simply require $p^t$ to be a valid label distribution. For a neural network teacher trained to minimise the softmax cross-entropy, $p^t$ can be the softmax of the temperature-scaled teacher logits (Hinton et al., 2015, Equation 2). This recovers logit matching as a special case (Hinton et al., 2015). Extensions of this setup, such as matching intermediate layers, have also been considered (Romero et al., 2015; Zagoruyko & Komodakis, 2017; Kim et al., 2018; Jain et al., 2020).

2.3. Why Does Distillation Help?

While it is well-accepted that distillation is empirically valuable, there is less consensus as to why this is the case. Hinton et al. (2015) attributed its success to the encoding of "dark knowledge" in the probabilities the teacher assigns to the wrong labels $y' \neq y$ for an example $(x, y)$. This plausibly aids the student by weighting samples differently (Furlanello et al., 2018; Tang et al., 2020). However, formalisations of this intuition are limited. One elegant exception is Lopez-Paz et al. (2016), who showed that distillation can be helpful assuming soft labels speed up learning; however, this does not establish when and why this can be the case. The impact of distillation on the optimisation of the student model has been explored for specific model classes (Phuong & Lampert, 2019; Rahbar et al., 2020; Ji & Zhu, 2020). Distillation may also be seen as a data-dependent regulariser on the student model (Dong et al., 2019; Yuan et al., 2020), as explicated by Mobahi et al. (2020); Zhang & Sabuncu (2020) for self-distillation (wherein the student and teacher have the same model architecture). Foster et al. (2019) provided a generalisation bound for students constrained to learn a model close to the teacher. Such works do not explicate what constitutes an ideal teacher, nor quantify how an approximation to this ideal teacher affects generalisation.

Recently, Zhou et al. (2021) provided a bias-variance perspective on distillation, which is similar in spirit to the analysis of §4. However, they did not quantify the variance-reduction benefits of employing the Bayes probabilities (cf. Lemma 1), formalise how these benefits translate to approximate Bayes probabilities (cf. Proposition 4), nor provide broader examples of the value of distillation (cf. §5 and Appendix B) that exploit this view.
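To make (3) and (4) concrete, the following is a minimal NumPy sketch (not the authors' code) of the two objectives under the softmax cross-entropy (2); the array shapes and the random data at the end are illustrative assumptions.

```python
import numpy as np

def loss_vector(logits):
    """Per-label softmax cross-entropy losses l(f(x)), eq. (2). logits: [N, L] student scores."""
    shift = logits.max(axis=1, keepdims=True)                      # for numerical stability
    log_z = shift + np.log(np.exp(logits - shift).sum(axis=1, keepdims=True))
    return -logits + log_z                                         # entry (n, y) is l(y, f(x_n))

def empirical_risk(logits, labels):
    """One-hot empirical risk (3): average loss on the observed labels y_n."""
    losses = loss_vector(logits)
    return losses[np.arange(len(labels)), labels].mean()

def distilled_risk(logits, teacher_probs):
    """Distilled risk (4): the teacher distribution p^t(x_n) replaces the one-hot e_{y_n}."""
    losses = loss_vector(logits)
    return (teacher_probs * losses).sum(axis=1).mean()

# Illustrative usage with random data (L = 5 classes, N = 4 examples).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 5))
teacher_logits = rng.normal(size=(4, 5))
teacher_probs = np.exp(teacher_logits) / np.exp(teacher_logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 5, size=4)
print(empirical_risk(student_logits, labels), distilled_risk(student_logits, teacher_probs))
```

Setting `teacher_probs` to the one-hot rows recovers `empirical_risk`, which is the sense in which (4) strictly generalises (3).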
3. Distillation: a Class-Probability View

We now present a statistical perspective on distillation, which gives insight into why it can aid generalisation. Central to our perspective are two observations:

(i) the risk in (1) we seek to minimise inherently smooths labels by the class-probabilities $\mathbb{P}(y \mid x)$; and,

(ii) such smoothing yields a lower variance objective compared to using one-hot labels $e_y$.

3.1. Bayes Knows Best: Distilling Class-Probabilities

We begin with the following elementary observation: the population risk $R(f)$ for $f : \mathcal{X} \to \mathbb{R}^L$ is

$$R(f) = \mathbb{E}_{x}\big[ p^*(x)^\top \ell(f(x)) \big], \qquad (5)$$

where $p^*(x) := [\mathbb{P}(y \mid x)]_{y \in [L]}$ is the Bayes class-probability distribution over the labels. Intuitively, $\mathbb{P}(y \mid x)$ is the suitability of $y$ for $x$: when $p^*(x)$ is not concentrated on a single label, there is an inherent confusion amongst the labels. The risk involves drawing $x \sim \mathbb{P}(x)$, and averaging the loss of $f(x)$ over all $y \in [L]$, weighted by $\mathbb{P}(y \mid x)$.

Given an $(x_n, y_n) \sim \mathbb{P}$, the empirical risk (3) approximates $p^*(x_n)$ with the one-hot $e_{y_n}$, which is only supported on one label. While $e_{y_n}$ is an unbiased estimate of $p^*(x_n)$, it is a significant reduction in granularity. By contrast, consider the following Bayes-distilled risk on a sample $S \sim \mathbb{P}^N$:

$$\hat{R}_*(f; S) = \frac{1}{N} \sum_{n=1}^{N} p^*(x_n)^\top \ell(f(x_n)). \qquad (6)$$

This is a distillation objective (cf. (4)) using a Bayes teacher, which provides the student with the true class-probabilities. Rather than fitting to a single label realisation $y_n \sim \mathrm{Discrete}(p^*(x_n))$, a student minimising (6) considers all alternate label realisations, weighted by their likelihood. When $\ell$ is the log-loss of (2), (6) is the KL divergence between $p^*$ and $\mathrm{softmax}(f(x_n))$ plus a constant.

Both the standard empirical risk $\hat{R}(f; S)$ in (3) and the Bayes-distilled risk $\hat{R}_*(f; S)$ in (6) are unbiased estimates of the population risk $R(f)$ in (1). Intuitively, however, we expect that a student minimising (6) ought to generalise better. We can make this intuition precise in the following.

Lemma 1. Let $\mathbb{V}$ denote the variance of a random variable. For any fixed predictor $f : \mathcal{X} \to \mathbb{R}^L$,

$$\mathbb{V}_{S \sim \mathbb{P}^N}\big[ \hat{R}_*(f; S) \big] \le \mathbb{V}_{S \sim \mathbb{P}^N}\big[ \hat{R}(f; S) \big],$$

where equality holds iff for all $x \in \mathcal{X}$, the loss values $\ell(f(x))$ are constant on the support of $p^*(x)$, i.e., $(\forall x \in \mathcal{X})\,(\forall y, y' \in \mathrm{supp}(p^*(x)))\ \ell(y, f(x)) = \ell(y', f(x))$.

Proof of Lemma 1. By definition,

$$\mathbb{V}\big[ \hat{R}_*(f; S) \big] = \frac{1}{N}\, \mathbb{V}_x\big[ p^*(x)^\top \ell(f(x)) \big], \qquad \mathbb{V}\big[ \hat{R}(f; S) \big] = \frac{1}{N}\, \mathbb{V}_{x, y}\big[ \ell(y, f(x)) \big].$$

In both cases, the second term in the variance (the squared mean) simply equals $R(f)^2$, since both estimates are unbiased. For fixed $x \in \mathcal{X}$, the result follows by Jensen's inequality applied to the random variable $Z(x) := \ell(y, f(x))$. Equality occurs iff each $Z(x)$ is constant, which requires $\ell$ to be constant on $\mathrm{supp}(p^*(x))$.

The condition on equality of variance is intuitive: the two risks trivially agree when, for all $x$, $f$ is non-discriminative (attaining equal loss on all labels), or when a label is inherently deterministic ($p^*(x)$ is concentrated on one label). For discriminative predictors and non-deterministic labels, however, the Bayes-distilled risk can have much lower variance.
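As a quick numerical check of Lemma 1 (not from the paper), the sketch below estimates the variance of the one-hot empirical risk (3) and of the Bayes-distilled risk (6) over repeated samples, for a fixed linear predictor on a synthetic binary task with known $p^*$; the particular distribution and predictor are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, trials = 10, 50, 2000
theta = np.full(d, 0.5)            # assumed data model: P(y=1|x) = sigmoid(theta . x)
w = np.full(d, 0.3)                # an arbitrary fixed linear scorer f(x) = w . x

def log_loss_vec(score):
    """Per-label log-losses (l(y=0, f(x)), l(y=1, f(x))) for a binary scorer."""
    return np.stack([np.log1p(np.exp(score)), np.log1p(np.exp(-score))], axis=1)

one_hot_risks, bayes_risks = [], []
for _ in range(trials):
    z = rng.integers(0, 2, size=N)                               # latent mixture component
    x = rng.normal(size=(N, d)) + np.where(z[:, None] == 1, 0.25, -0.25)
    p_star = 1.0 / (1.0 + np.exp(-x @ theta))                    # Bayes P(y=1|x)
    y = (rng.random(N) < p_star).astype(int)                     # sampled one-hot labels
    losses = log_loss_vec(x @ w)                                  # [N, 2]
    one_hot_risks.append(losses[np.arange(N), y].mean())          # empirical risk (3)
    p_mat = np.stack([1 - p_star, p_star], axis=1)
    bayes_risks.append((p_mat * losses).sum(axis=1).mean())       # Bayes-distilled risk (6)

print("Var of one-hot risk:        ", np.var(one_hot_risks))
print("Var of Bayes-distilled risk:", np.var(bayes_risks))        # no larger, per Lemma 1
```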
The reward of reducing variance is better generalisation: a student minimising (6) better minimises the population risk (1) compared to using one-hot labels. Leveraging Maurer & Pontil (2009), we may quantify how the empirical variance of the Bayes-distilled loss influences generalisation.

Proposition 2. Pick any bounded loss $\ell$. Fix a hypothesis class $\mathcal{F}$ of predictors $f : \mathcal{X} \to \mathbb{R}^L$, with induced class $\mathcal{H} \subset [0, 1]^{\mathcal{X}}$ of functions $h(x) := p^*(x)^\top \ell(f(x))$. Suppose $\mathcal{H}$ has uniform covering number $\mathcal{N}_\infty$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over $S \sim \mathbb{P}^N$,

$$R(f) \le \hat{R}_*(f; S) + O\left( \sqrt{\frac{\mathbb{V}_N^*(f) \cdot \log \tfrac{\mathcal{M}_N}{\delta}}{N}} + \frac{\log \tfrac{\mathcal{M}_N}{\delta}}{N} \right),$$

where $\mathcal{M}_N := \mathcal{N}_\infty(\tfrac{1}{N}, \mathcal{H}, 2N)$ and $\mathbb{V}_N^*(f)$ is the empirical variance of the loss values $\{p^*(x_n)^\top \ell(f(x_n))\}_{n=1}^N$. (The boundedness assumption on the loss is standard (Boucheron et al., 2005, Theorem 4.1), and may be enforced in practice with some form of regularisation, e.g., weight decay.)

Proof of Proposition 2. This is a simple consequence of Maurer & Pontil (2009, Theorem 6), which is a uniform convergence version of Bennett's inequality (Bennett, 1962).

By contrast, the bound achievable for the standard empirical risk using one-hot labels will depend on the variance of the one-hot loss values (Maurer & Pontil, 2009, Theorem 6). Combined with Lemma 1, the Bayes-distilled empirical risk results in a lower variance penalty. Thus, the Bayes-distilled risk yields a generalisation bound with a more favourable rate of convergence with increased sample size N. For a more refined generalisation analysis accounting for teacher under- and over-fitting, see Dao et al. (2021), which builds on an earlier version of this work.

To summarise the above, a student should ideally have access to the underlying class-probabilities $p^*(x)$, rather than a single realisation $y \sim \mathrm{Discrete}(p^*(x))$: these probabilities result in a lower-variance student objective, which aids generalisation. This provides a statistical perspective on the value of "dark knowledge": for the Bayes teacher $p^*$, the logits on alternate labels $y' \neq y$ for an example $(x, y)$ provide information about the Bayes class-probabilities.

3.2. Illustration: the Value of a Bayes Teacher

We illustrate our statistical perspective in a controlled setting where the Bayes $p^*(x)$ is known, and show that distilling with such a Bayes teacher benefits learning. We generate a (binary) labelled training sample $S = \{(x_n, y_n)\}_{n=1}^N$ from a distribution $\mathbb{P}$ comprising 10-dimensional isotropic Gaussian class-conditionals with means $\pm(0.25, 0.25, \ldots, 0.25)$. We may explicitly compute the Bayes $p^*(x) = (\mathbb{P}(y = 0 \mid x), \mathbb{P}(y = 1 \mid x))$ as $\mathbb{P}(y = 1 \mid x) = \sigma(\theta^\top x)$, for $\theta := (0.5, 0.5, \ldots, 0.5)$ and sigmoid function $\sigma(\cdot)$. We compare standard logistic regression on $S$ and Bayes-distilled logistic regression using $p^*(x_n)$ per (6).

Figure 2 compares these approaches for varying training set sizes $N$, where for each $N$ we perform 100 independent trials and measure accuracy on a test set of $10^4$ samples. For small $N$, Bayes-distillation offers a noticeable gain over one-hot training, in line with our theory of the former ensuring lower variance (Lemma 1). Both methods see improved performance with larger $N$, but one-hot encoding has greater gains: thus, the variance reduction offered by Bayes-distillation can compensate for having only a few student samples. For additional experiments, see Appendix C.3.

Figure 2. Illustration that distilling from a Bayes teacher aids generalisation. We consider learning a linear model student on a synthetic problem with known Bayes probabilities $p^*$. Distilling using these probabilities significantly improves accuracy over training with one-hot labels, particularly in the small-sample regime.
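A minimal sketch of this experiment (not the authors' code; the optimiser settings, sample sizes, and single-trial evaluation are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, theta = 10, np.full(10, 0.5)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def sample(n):
    """Two isotropic Gaussians with means +/- (0.25,...,0.25); Bayes P(y=1|x) = sigmoid(theta.x)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + np.where(y[:, None] == 1, 0.25, -0.25)
    return x, y, sigmoid(x @ theta)

def fit_logistic(x, targets, lr=0.1, epochs=500):
    """Gradient descent on the log-loss; `targets` may be hard labels (0/1) or soft p*(x)."""
    w = np.zeros(d)
    for _ in range(epochs):
        grad = x.T @ (sigmoid(x @ w) - targets) / len(x)
        w -= lr * grad
    return w

x_tr, y_tr, p_tr = sample(100)           # small training set
x_te, y_te, _ = sample(10_000)           # large test set

w_onehot = fit_logistic(x_tr, y_tr)      # standard logistic regression
w_bayes = fit_logistic(x_tr, p_tr)       # Bayes-distilled: fit the true class-probabilities, per (6)

acc = lambda w: np.mean((x_te @ w > 0) == (y_te == 1))
print("one-hot accuracy:        ", acc(w_onehot))
print("Bayes-distilled accuracy:", acc(w_bayes))
```

The paper averages such comparisons over 100 trials and a range of $N$; the single run above only illustrates how the Bayes-distilled objective differs, namely by fitting $p^*(x_n)$ in place of the sampled labels.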
4. Distilling from an Imperfect Teacher

The previous section explicates how an idealised Bayes teacher can benefit a student. We now study how this translates to the more realistic setting of using an imperfect, approximate Bayes teacher learned from data.

4.1. A Bias-Variance Bound for Distillation

Our first observation is that a teacher's predictor $p^t$ is typically an imperfect estimate of the true $p^*$. For example, a teacher minimising the softmax cross-entropy effectively minimises $\mathbb{E}_x[\mathrm{KL}(p^*(x) \,\|\, p^t(x))]$. Of course, in practice a teacher is unlikely to learn $p^t = p^*$, as its model class may not be rich enough to capture the true $p^*$. Further, even if the teacher can represent $p^*$, it may not be able to learn this perfectly given a finite sample, owing to both statistical (e.g., overfitting) and optimisation (e.g., non-convexity) issues.

Will such an imperfect estimate of $p^*$ still improve generalisation? To answer this, we establish a bias-variance tradeoff for distillation. Specifically, we show the difference between the distilled risk $\tilde{R}(f; S)$ (cf. (4)) and population risk $R(f)$ (cf. (5)) depends on how well the teacher estimates the Bayes class-probability distribution $p^*$ in a mean squared-error (MSE) sense. This, in turn, admits a classic bias-variance decomposition for the teacher estimates. (In the following, for simplicity we assume that the student and teacher are trained on separate samples.)

Proposition 3. Pick any bounded loss $\ell$. Suppose we have a teacher model $p^t$ with corresponding distilled empirical risk $\tilde{R}(f; S)$ in (4). For any predictor $f : \mathcal{X} \to \mathbb{R}^L$,

$$\mathbb{E}_S\big[ (\tilde{R}(f; S) - R(f))^2 \big] \le \frac{1}{N}\, \mathbb{V}_x\big[ p^t(x)^\top \ell(f(x)) \big] + O\Big( \mathbb{E}_x\big[ \| p^t(x) - p^*(x) \|_2 \big]^2 \Big). \qquad (7)$$

Proof of Proposition 3. Let $\Delta := \tilde{R}(f; S) - R(f)$. Then,

$$\mathbb{E}_S\big[ (\tilde{R}(f; S) - R(f))^2 \big] = \mathbb{V}[\Delta] + (\mathbb{E}[\Delta])^2.$$

Observe that

$$\mathbb{E}[\Delta] = \mathbb{E}_x\big[ (p^t(x) - p^*(x))^\top \ell(f(x)) \big] \le \mathbb{E}_x\big[ \| p^t(x) - p^*(x) \|_2 \cdot \| \ell(f(x)) \|_2 \big] \le c \cdot \mathbb{E}_x\big[ \| \ell(f(x)) \|_\infty \cdot \| p^t(x) - p^*(x) \|_2 \big],$$

where the second step is by the Cauchy-Schwartz inequality, and the third by the equivalence of norms with a constant $c$. Now, (7) follows since $R(f)$ is a constant, implying $\mathbb{V}[\Delta] = \mathbb{V}[\tilde{R}(f; S)] = \frac{1}{N}\, \mathbb{V}_x\big[ p^t(x)^\top \ell(f(x)) \big]$.

Unpacking Proposition 3, the fidelity of the distilled risk depends on two factors: how variable the expected loss is for a random instance; and how well the teacher estimate $p^t$ approximates the true $p^*$ in an MSE sense. The latter dominates in the large-$N$ regime, and admits a classic bias-variance tradeoff: per Appendix A.2, we may write (7) as

$$\mathbb{E}\big[ (\tilde{R}(f; S) - R(f))^2 \big] \le \frac{1}{N}\, \mathbb{V}_x\big[ p^t(x)^\top \ell(f(x)) \big] + O\Big( \mathbb{E}_x\big[ \| \bar{p}^t(x) - p^*(x) \|_2^2 \big] + \mathbb{E}_x\big[ \mathbb{E}\,\| p^t(x) - \bar{p}^t(x) \|_2^2 \big] \Big), \qquad (8)$$

where $\bar{p}^t(x)$ denotes the teacher's expected prediction over draws of its own training sample. The final two terms reflect a classic phenomenon: increasing the teacher complexity can lower bias (yield better approximations of $p^*$), but this is traded off with higher variance (more unstable predictions). Balancing the delicate tradeoff between these quantities yields a teacher with low MSE against $p^*$.

Proposition 3 establishes that such a teacher yields improved bounds on the student's generalisation error. Indeed, per the previous section, we may deduce that

$$R(f) \le \tilde{R}(f; S) + C(\mathcal{F}, N) + O\Big( \mathbb{E}_x\big[ \| p^t(x) - p^*(x) \|_2 \big] \Big), \qquad (9)$$

where $C(\mathcal{F}, N)$ is the penalty term for the hypothesis class $\mathcal{F}$ from Proposition 2. As is intuitive, using an imperfect teacher invokes an additional penalty depending on how far the teacher's predictions are from the Bayes probabilities, in a squared-error sense. See Proposition 4 (Appendix A) for a formal statement of (9), and a comparison to existing bounds.
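Since $p^*$ is unknown in practice, the MSE appearing in (7) and (8) cannot be computed directly; as discussed below, one instead gauges the teacher's probability quality on a holdout set. A minimal sketch of such holdout metrics (the 15-bin equal-width ECE estimator is a common convention, assumed here rather than taken from the paper):

```python
import numpy as np

def log_loss(probs, labels):
    """Average negative log-likelihood of the one-hot holdout labels. probs: [N, L]."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def brier_score(probs, labels):
    """Square-loss between the predicted distribution and the one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return ((probs - onehot) ** 2).sum(axis=1).mean()

def expected_calibration_error(probs, labels, num_bins=15):
    """ECE (Guo et al., 2017): |confidence - accuracy| averaged over confidence bins."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    bins = np.minimum((conf * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            acc_b = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(conf[mask].mean() - acc_b)
    return ece
```

The log-loss and calibration error here correspond to the quantities plotted for the ResNet teachers in Figures 1(b) and 1(c).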
4.2. Discussion and Implications

We have provided a statistical perspective on distillation, resting on the observation that a Bayes teacher can reduce variance in the student objective, and that a teacher approximating the Bayes probabilities offers a bias-variance tradeoff. Our results follow readily from this perspective, but the resulting implications and conceptual insights into distillation are subtle, and merit discussion.

What makes a teacher "good"? The above specifies a concrete means of assessing the quality of a teacher's soft labels: it is beneficial for them to be good probability estimates, in the sense of having low squared error against the Bayes probabilities $p^*$. Indeed, Proposition 3 establishes that in such cases, the resulting student can generalise better. We make three qualifying remarks. First, in practice, exactly assessing the squared error to $p^*$ is infeasible (since $p^*$ is often unknown), but one may estimate the quality of the teacher probabilities on a holdout set, e.g., by computing the log-loss, square-loss, or calibration error (Guo et al., 2017) on the one-hot labels. While such estimates are necessarily imperfect, they can detect poor teacher probabilities. Second, Proposition 3 may be loose, and further is not a lower bound. A comprehensive theory of distillation would require specifying such necessary conditions. Nonetheless, the qualitative trend of our bounds can still hold in practice. For example, Figure 1 (see also §4.3) illustrates how increasing the depth of a ResNet model may increase accuracy, but degrade the quality of probability estimates. Third, the focus on approximating the Bayes probabilities $p^*$ suggests that the teacher's predictions ought to primarily reflect aleatoric, rather than epistemic, uncertainty (Senge et al., 2014; Hüllermeier & Waegeman, 2021).

Why can more accurate teachers distill worse? A curious empirical observation is that more accurate teachers may lead to worse students (Müller et al., 2019). Our statistical view offers one possible insight into this behaviour: recall that we show that good probability modelling can aid student generalisation. However, while models producing good probabilities will also be accurate, models that are accurate do not necessarily offer good probabilities (Devroye et al., 1996, Section 6.7). Indeed, despite being accurate, deep networks often make over-confident, poorly calibrated predictions (Guo et al., 2017; Rothfuss et al., 2019).

[Figure 3: teacher MSE and student accuracy plotted against the teacher distortion $\alpha \in \{0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0\}$.]

Figure 3. Illustration that teacher accuracy is insufficient to predict distillation performance. On a synthetic problem, we construct a family of teachers (parameterised by $\alpha$) with the same accuracy (red dashed line), but differing probability estimate quality (blue line), as measured by MSE against the Bayes $p^*$. Distilling each teacher to linear student models yields notably different accuracies (red solid line), with best performance for the teacher with the best probability estimates (i.e., $\alpha = 1.0$).

[Figure 4: teacher MSE and student accuracy plotted against the temperature $T$.]

Figure 4. Illustration that suitable temperature scaling can improve the quality of teacher probabilities. Distilling from a ResNet-32 to a ResNet-14 on CIFAR-100, when the temperature $T$ is too high or too low, the teacher distribution is too flat or spiky respectively, and thus a poor reflection of the true $p^*$. Tuning $T$ improves the teacher probability quality, as measured by the square-loss on the test set. This also tracks the student accuracy, particularly at the extremes, consistent with our perspective.

Why does temperature scaling help? Temperature scaling (Hinton et al., 2015) is a common trick wherein one smooths overly confident teacher logits $f^t(x)$ via $p^t(y \mid x) \propto \exp(f^t_y(x) / T)$ for $T > 0$. In line with our statistical view, such scaling can help the student target be closer to the true label distribution $p^*$ (and thus aid generalisation), rather than just conveying the most accurate label.
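A minimal sketch of the temperature-scaled teacher distribution (the array names are illustrative assumptions):

```python
import numpy as np

def teacher_probabilities(teacher_logits, temperature=1.0):
    """Temperature-scaled softmax: p^t(y|x) proportional to exp(f^t_y(x) / T).
    teacher_logits: [N, L]. Larger T flattens the distribution; smaller T sharpens it."""
    scaled = teacher_logits / temperature
    scaled -= scaled.max(axis=1, keepdims=True)   # shift for numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum(axis=1, keepdims=True)
```

Per Figure 4, $T$ may then be tuned by the holdout square-loss of the resulting probabilities, which tracks student accuracy.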
Trading off bias for variance: model complexity. Following (8), to obtain a tighter bound on student generalisation, one may favour a higher-bias teacher (i.e., $p^t$ is a poorer approximation to $p^*$) if it has lower variance (i.e., $p^t$ varies less across samples). Such a bias-variance tradeoff can be obtained, e.g., by tuning the teacher complexity. In the context of neural networks, recent works have challenged the conventional wisdom (Geman et al., 1992) of increased model complexity (e.g., increased network width and/or depth) implying a higher variance (Neal et al., 2018; Yang et al., 2020). Note that Proposition 3 does not assume or impose a particular bias-variance tradeoff; it specifies how a given tradeoff affects student generalisation. In particular, it suggests that models simultaneously achieving low bias and variance ought to distill well.

Trading off bias for variance: label smoothing. Label smoothing (Szegedy et al., 2016) mixes the student labels with uniform predictions, i.e., uses $p^t(x) = (1 - \alpha)\, e_y + \frac{\alpha}{L} \mathbf{1}$ for $\alpha \in (0, 1]$. From the perspective of modelling $p^*(x)$, this introduces a bias over using the observed labels, but lowers variance owing to the $(1 - \alpha)$ scaling. Provided the bias is not too large, smoothing can thus aid generalisation. Recent work has also studied how smoothing induces variance-reduction in optimisation (Xu et al., 2020).

On training sample re-use. Our analysis requires that the teacher and student employ distinct training samples. This holds in the common setting where the teacher labels a large unlabelled pool (Radosavovic et al., 2018; Yalniz et al., 2019). Recently, Dao et al. (2021) showed how the teacher and student can use the same training sample with a more careful cross-fitting procedure. This reduces the risk of overfitting in the teacher's probability estimates, at the expense of a slightly increased computational cost.

4.3. Illustration: the Value of Teacher Probabilities

We illustrate the above points with empirical results on both controlled synthetic as well as real-world problems, using linear models, neural networks, and decision trees. These highlight that statistical reasoning about distillation is not necessarily limited to its use in deep neural networks.

Teacher accuracy does not suffice (Figure 3). We first illustrate that teacher accuracy does not always translate to good student performance. For the Gaussian data of §3.2 with Bayes probabilities $p^*$, we construct teacher probabilities $\tilde{p}(x) = \phi_\alpha(p^*(x))$. Here, $\phi_\alpha$ is such that changing $\alpha$ does not affect accuracy, but makes $\tilde{p}$ more (or less) concentrated, and thus degrades its probabilistic quality (e.g., MSE). Concretely, we let

$$\phi_\alpha(u) = \frac{1}{2} + \frac{1}{2}\, |2u - 1|^\alpha\, \mathrm{sgn}(2u - 1),$$

for $\alpha \in \{2^{-3}, 2^{-2}, \ldots, 2^{3}\}$. (See Appendix C.2 for a plot.) Figure 3 confirms that despite the teacher accuracy being the same for all $\alpha$, the student accuracy is systematically harmed when $\alpha \neq 1$. Further, the student's accuracy closely tracks the quality of the teacher's probability estimates, as measured by the MSE $\|\tilde{p} - p^*\|_2^2$. This is in keeping with our analysis on the value of good probability estimates.
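A small numerical illustration (not from the paper) of this construction, under the reconstruction of $\phi_\alpha$ above: every distortion preserves which side of 1/2 each probability lies on, and hence teacher accuracy, while the MSE against $p^*$ is minimised only at $\alpha = 1$. The uniform stand-in for $p^*$ and the $\alpha$ grid are assumptions for illustration.

```python
import numpy as np

def distort(p, alpha):
    """phi_alpha applied elementwise to binary probabilities p = P(y=1|x): preserves which
    side of 1/2 each probability lies on, but sharpens (alpha < 1) or flattens (alpha > 1) it."""
    return 0.5 + 0.5 * np.abs(2 * p - 1) ** alpha * np.sign(2 * p - 1)

rng = np.random.default_rng(0)
p_star = rng.uniform(size=100_000)          # stand-in for Bayes probabilities on test points
for alpha in [0.125, 0.5, 1.0, 2.0, 8.0]:
    p_teacher = distort(p_star, alpha)
    same_decision = np.mean((p_teacher > 0.5) == (p_star > 0.5))   # accuracy is unchanged
    mse = np.mean((p_teacher - p_star) ** 2)                        # probability quality is not
    print(f"alpha={alpha:5.3f}  same argmax: {same_decision:.3f}  MSE to p*: {mse:.4f}")
```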
Temperature scaling and probability quality (Figure 4). We verify that temperature scaling improves the quality of the teacher probabilities, and that this tracks the distilled student performance. We perform distillation from a ResNet-32 to a ResNet-14 on CIFAR-100, and apply temperature scaling with $T \in \{1.0, 1.5, 2.0, \ldots, 5.0\}$. For each $T$, we measure the test set MSE of the teacher probabilities (against the one-hot labels), as well as the accuracy of the student model. Note that, unlike the synthetic setting where we had access to $p^*$, the former is necessarily an estimate of the actual quality of the teacher probabilities. Figure 4 shows that as $T$ is varied, the MSE of the teacher probabilities varies smoothly, with optimal temperature $T = 2$. The optimal student accuracy is achieved with a slightly higher temperature $T = 3$ (reflecting that we have provided upper bounds on the student risk), but as the probability quality degrades beyond this point, student accuracy rapidly declines, consistent with our perspective.

Trading off bias for variance: decision trees (Figure 5). We illustrate that in choosing amongst different teachers, one may favour a higher-bias teacher (i.e., one with worse probability estimates) if it has lower variance. We show this in a controlled setting, wherein we train a series of increasingly complex teacher models on a synthetic problem. Here, the data is sampled from a marginal $\mathbb{P}(x)$ which is uniform on $[0, 1]^2$, and $\mathbb{P}(y = 1 \mid x)$ has a checkerboard pattern, so that positives and negatives are sprinkled in alternating squares; see Appendix C.1 for an illustration. We consider decision tree teachers of depth $d \in \{4, 5, 6, 7, 8\}$. Increasing $d$ reduces teacher bias, but increases variance (since deeper trees can better approximate $p^*$, but are more complex). For fixed $d$, we train a teacher on a training sample $S$ (with $N = 5000$). We distill its predictions to a depth 4 student tree, and compute the test set teacher MSE and student AUC-ROC over 100 trials. Figure 5 shows that at depth $d = 6$, the teacher achieves the best MSE approximation of $p^*$. In keeping with our analysis, this also corresponds to the teacher whose resulting student generalises the best. Note that at $d > 6$, the teacher has lower bias, but higher variance; the higher-bias $d = 6$ teacher achieves a better tradeoff in terms of MSE. See Appendix C.4 for additional results on a distinct problem.

Trading off bias for variance: ResNet (Figure 1). Recall that in Figure 1, we train teacher ResNets of varying depths on CIFAR-100, and distill these to a student ResNet of fixed depth 8. We see that teachers with better probabilities (in an MSE sense) generally yield better students. Further, even though the teacher model gets increasingly more accurate as its depth increases, improved accuracy does not correspond to improved MSE. Prior work has observed that mismatch between the sizes of the student and teacher can also affect distillation (Cho & Hariharan, 2019; Mirzadeh et al., 2020). To mitigate such confounders, in Figure 10 (Appendix), we extend Figure 1 to include students with depth 14 and 20, and find the general trends for depth 8 hold.

[Figure 5: teacher MSE and student AUC plotted against the teacher depth.]

Figure 5. Illustration of the bias-variance tradeoff in a controlled setting: one may favour a higher-bias teacher, i.e., one with comparatively worse probability estimates, if it has lower variance. On a synthetic problem, we consider a family of decision tree teachers of various depths, which are distilled to a depth 4 student. Increasing the teacher depth reduces the teacher bias, at the expense of increased variance. There is an optimal teacher depth $d = 6$ that balances these terms, and minimises the teacher MSE. This teacher MSE closely tracks student performance, per our analysis.
5. Applications of the Statistical View

Our statistical view has given conceptual insight into how distillation can improve classification. We now show a potential practical benefit of this view: it gives a simple, generic way to apply distillation to settings beyond classification. Specifically, by expressing objectives in terms of the Bayes class-probabilities, one may derive empirical estimates based on teacher model outputs. The results diverge from a naïve application of distillation, as we now illustrate.

5.1. Distillation for Bipartite Ranking

Given a distribution $\mathbb{P}$ over $\mathcal{X} \times \{\pm 1\}$, the bipartite ranking problem (Agarwal & Niyogi, 2009) involves learning a scorer $f : \mathcal{X} \to \mathbb{R}$ that minimises the pairwise disagreement,

$$\mathrm{PD}(f) := \mathbb{E}_{x \mid y = +1}\, \mathbb{E}_{x' \mid y = -1}\, [\![ f(x) < f(x') ]\!],$$

i.e., the probability that a randomly drawn positive scores lower than a randomly drawn negative. This is equivalently one minus the area under the ROC curve of $f$ (Agarwal et al., 2005), and measures how well $f$ distinguishes the positive from negative samples. Given a training sample $S = \{(x_n, y_n)\}_{n=1}^N \sim \mathbb{P}^N$, an empirical estimate of $\mathrm{PD}$ is

$$\widehat{\mathrm{PD}}(f; S) = \frac{1}{|S_+| \cdot |S_-|} \sum_{i \in S_+} \sum_{j \in S_-} [\![ f(x_i) < f(x_j) ]\!], \qquad (10)$$

where $S_+, S_-$ are the subsets of positive and negative samples. Intuitively, we consider pairs of samples with positive and negative labels, and assess the difference in their scores.

In a distillation context, given a powerful teacher model, one can construct a tighter approximation to the original risk. Indeed, since $\mathbb{P}(x \mid y) \propto \mathbb{P}(y \mid x) \cdot \mathbb{P}(x)$, we may write

$$\mathrm{PD}(f) \propto \mathbb{E}_{x, x' \sim \mu}\big[ p^*(x)\, (1 - p^*(x'))\, [\![ f(x) < f(x') ]\!] \big],$$

where $\mu$ is the marginal distribution over instances, and $p^*(x) := \mathbb{P}(y = +1 \mid x)$. Per the previous section, given a teacher model, we may use its probabilities $p^t$ in place of $p^*$ to obtain the distilled bipartite risk,

$$\widetilde{\mathrm{PD}}(f; S) = \frac{1}{N(N-1)} \sum_{i \in S} \sum_{j \in S \setminus \{i\}} p^t(x_i)\, (1 - p^t(x_j))\, [\![ f(x_i) < f(x_j) ]\!].$$

Observe that we consider all pairs of samples, as opposed to partitioning them into groups of positives and negatives as in (10). Further, each term involves two applications of smoothing with the teacher probabilities, as opposed to the standard single application in classification (cf. (4)). This point also arises in the following.
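A minimal sketch of the two estimates (not the authors' code); the 0-1 indicator below is suited to evaluation, whereas training would replace it with a differentiable surrogate such as a pairwise logistic loss.

```python
import numpy as np

def pairwise_disagreement(scores, labels):
    """Empirical estimate (10): fraction of positive-negative pairs ranked incorrectly."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] < neg[None, :]).mean()

def distilled_pairwise_disagreement(scores, teacher_pos_probs):
    """Distilled bipartite risk: every ordered pair (i, j), i != j, is weighted by
    p^t(x_i) * (1 - p^t(x_j)), the teacher's belief that i is positive and j is negative."""
    n = len(scores)
    weights = teacher_pos_probs[:, None] * (1.0 - teacher_pos_probs[None, :])
    incorrect = scores[:, None] < scores[None, :]
    mask = ~np.eye(n, dtype=bool)                     # exclude i == j
    return (weights * incorrect)[mask].sum() / (n * (n - 1))
```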
5.2. Distillation for Multiclass Retrieval

Given a distribution $\mathbb{P}$ over $\mathcal{X} \times [L]$, the multiclass retrieval problem (Lapin et al., 2018; Reddi et al., 2019) involves learning a predictor $f : \mathcal{X} \to \mathbb{R}^L$ such that for an example $(x, y)$, the top-ranked labels in $f(x) \in \mathbb{R}^L$ contain the true label $y$. One popular choice of loss for this task is the family of binary decoupled losses (Reddi et al., 2019),

$$\ell(y, f(x)) = \phi(f_y(x)) + \sum_{y' \neq y} \phi(-f_{y'}(x)), \qquad (11)$$

where $\phi : \mathbb{R} \to \mathbb{R}_+$ is a margin loss for binary classification, e.g., the hinge loss $\phi(z) = [1 - z]_+$. Intuitively, the first term encourages a high score for the positive label $y$, while the second encourages low scores for the negatives $y' \neq y$. For fixed $x \in \mathcal{X}$, the expected loss over draws of $y$ is:

$$p^*(x)^\top \ell(f(x)) = p^*(x)^\top \big( \phi(f(x)) - \phi(-f(x)) \big) + \mathbf{1}^\top \phi(-f(x)) = p^*(x)^\top \phi(f(x)) + (\mathbf{1} - p^*(x))^\top \phi(-f(x)) = \sum_{y \in [L]} \mathbb{P}(y \mid x)\, \phi(f_y(x)) + \sum_{y' \in [L]} (1 - \mathbb{P}(y' \mid x))\, \phi(-f_{y'}(x)), \qquad (12)$$

for the all-ones vector $\mathbf{1} \in \mathbb{R}^L$, where $p^*(x)$ is the vector of $\mathbb{P}(y \mid x)$ as before, and $\phi$ is applied elementwise. Compared to (11), the first term considers all labels $y \in [L]$ as smoothed "positives", akin to the standard application of distillation. Interestingly, the second term considers all labels $y' \in [L]$ as smoothed "negatives", with variable weights depending on $\mathbb{P}(y' \mid x)$. Intuitively, the loss pays more attention to those negatives $y'$ which are plausible alternate explanations for $x$.

Following our statistical perspective, on a finite sample $S = \{(x_n, y_n)\}_{n=1}^N$, one may replace these Bayes probabilities with teacher model estimates $p^t$, yielding:

$$\tilde{R}(f; S) = \frac{1}{N} \sum_{n \in [N]} \Big[ \sum_{y \in [L]} p^t(y \mid x_n)\, \phi(f_y(x_n)) + \sum_{y' \in [L]} z_{y'}(x_n)\, \phi(-f_{y'}(x_n)) \Big], \quad z_{y'}(x_n) := 1 - p^t(y' \mid x_n). \qquad (13)$$

Akin to the standard classification case, applying such differential weighting on the positive and negative labels can lead to a lower-variance estimate of the expected loss.

We may extend this to the popular softmax cross-entropy (cf. (2)),

$$\ell(y, f(x)) = -f_y(x) + \log\Big[ \sum_{y' \in [L]} e^{f_{y'}(x)} \Big].$$

Here, the inner summation may be regarded as penalising high scores for negative labels $y' \neq y$. Inspired by (12), we may differentially weight the negatives, yielding the loss:

$$\tilde{R}(f; S) = \frac{1}{N} \sum_{n \in [N]} \sum_{y \in [L]} p^t(y \mid x_n) \Big[ -f_y(x_n) + \log\Big( \sum_{y' \in [L]} z_{y'}(x_n)\, e^{f_{y'}(x_n)} \Big) \Big], \quad z_{y'}(x_n) := 1 - [\![ y' \neq y_n ]\!]\, p^t(y' \mid x_n). \qquad (14)$$

Note here that $z_{y_n}(x_n) = 1$, unlike in (13); this choice ensures the non-negativity of $\ell(y, f(x))$, since the loss can be seen as the log-loss under a weighted softmax distribution.

We reiterate that the above uses the teacher in two ways: the first is the standard use of distillation to smooth the positive labels. The second is a novel use of distillation to smooth the negative labels. Compared to the standard loss, for any candidate label $y$, we apply (instance- and label-dependent) weights to negative labels $y' \neq y$ when computing the loss. We thus term this objective negative-aware distillation.

Empirically, we have found superior performance by weighting negatives using unnormalised teacher probabilities, rather than $p^t$ directly. Specifically, one may use $1 - \sigma(a \cdot s^t_{y'}(x))$, where $s^t_{y'}(x)$ is the teacher logit, $\sigma(\cdot)$ is the sigmoid function, and $a > 0$ is a scaling parameter which may be tuned. Intuitively, compared to using $p^t$, such a weighting allows for multiple $y'$ to have high (or low) weights simultaneously. This is useful in retrieval scenarios, where there may be multiple relevant labels for a given $x \in \mathcal{X}$.
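A minimal sketch of the negative-aware objective (14) as reconstructed above (not the authors' implementation; using the normalised teacher probabilities for the negative weights is itself one of the choices the paper departs from, per the preceding paragraph):

```python
import numpy as np

def negative_aware_distillation_loss(student_logits, teacher_probs, labels):
    """Negative-aware distillation, per (14).
    student_logits, teacher_probs: [N, L]; labels: [N] observed labels y_n.
    Positives are smoothed by p^t(y|x_n); negatives y' != y_n are down-weighted in the
    softmax denominator by z_{y'}(x_n) = 1 - p^t(y'|x_n), with z_{y_n}(x_n) = 1."""
    n, num_labels = student_logits.shape
    z = 1.0 - teacher_probs
    z[np.arange(n), labels] = 1.0                         # observed label keeps full weight
    # log of the z-weighted partition function, computed stably
    shift = student_logits.max(axis=1, keepdims=True)
    log_z_partition = shift.squeeze(1) + np.log(
        (z * np.exp(student_logits - shift)).sum(axis=1))
    # per-candidate-label loss -f_y(x_n) + log(sum_y' z_y' e^{f_y'}), weighted by p^t(y|x_n)
    per_label = -student_logits + log_z_partition[:, None]
    return (teacher_probs * per_label).sum(axis=1).mean()
```

Swapping the definition of `z` for $1 - \sigma(a \cdot s^t_{y'}(x))$ on the raw teacher logits gives the unnormalised variant the paper reports working better in retrieval settings.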
5.3. Empirical Illustration

We defer an empirical analysis of our bipartite ranking formulation to Appendix C.6. We now assess the negative-aware distillation objective on benchmark datasets for multiclass retrieval, AMAZONCAT-13K and AMAZONCAT-670K (McAuley & Leskovec, 2013; Bhatia et al., 2015). We use a feedforward teacher model with a single (linear) hidden layer of width 512, trained to minimise the softmax cross-entropy. For the "student", we make the hidden layer width 8 for AMAZONCAT-13K and 64 for AMAZONCAT-670K (since the latter has many more labels). We compare training the student with one-hot labels, teacher logits (distillation), and teacher logits with additional negative smoothing in the softmax per (14). Our aim in doing so is to confirm that the core idea of negative-aware distillation (14), using a teacher to smooth negatives in addition to positives, can improve upon standard distillation. We compare all methods based on the precision@k for k in {1, 3, 5}, averaged over multiple runs.

Table 1. Precision@k of negative-aware distillation (NAD), standard distillation (SD) and the student baseline on the multiclass retrieval task. NAD improves performance over both techniques.

                   AMAZONCAT-13K              AMAZONCAT-670K
Method             P@1     P@3     P@5        P@1     P@3     P@5
Teacher            0.8495  0.7412  0.6109     0.3983  0.3598  0.3298
Student            0.7913  0.6156  0.4774     0.3307  0.3004  0.2753
Student + SD       0.8131  0.6363  0.4918     0.3461  0.3151  0.2892
Student + NAD      0.8560  0.7148  0.5715     0.3480  0.3161  0.2865

Table 1 summarises our findings. Distillation offers a small but consistent performance bump over the student baseline. Negative-aware distillation further improves upon this, especially for k = 1 and 3, confirming the value of weighing negatives differently. The gains are significant on AMAZONCAT-13K, where negative-aware distillation even improves upon the teacher model. Finally, we note that our use of (negative-aware) distillation here is subject to the same principles as the previous sections: e.g., temperature scaling on our teacher models improves their probabilistic calibration, and this tracks the student performance; see Appendix C.7.

5.4. Discussion and Implications

The above are two examples of how the statistical view of distillation facilitates its application to problems beyond multi-class classification. The key ingredient in each application is expressing the true expected loss in terms of the Bayes class-probabilities. The resulting objectives involve non-standard uses of the teacher compared to classic distillation, e.g., to weight both positive and negative labels. More broadly, one may follow a similar procedure for other problems with complex objectives; see Appendix B for additional examples, including robustness to label noise.

6. Concluding Remarks

Our statistical perspective on distillation builds on a simple observation: distilling with the Bayes class-probabilities yields a better estimate of the population risk. This provides conceptual insight into why distillation can help, and provides a simple, principled means of using distillation in settings beyond classification. There are several potential directions for future study. For example, combining our analysis with study of the other effects of distillation (e.g., on optimisation (Phuong & Lampert, 2019)) would be of interest. It is also of interest to study more fine-grained notions of bias and variance, e.g., the notion of regularisation samples in Zhou et al. (2021).

References

Agarwal, S. and Niyogi, P. Generalization bounds for ranking algorithms via algorithmic stability. J. Mach. Learn. Res., 10:441-474, June 2009. ISSN 1532-4435.

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6(14):393-425, 2005.

Ba, J. and Caruana, R. Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2654-2662. Curran Associates, Inc., 2014.

Bennett, G. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33-45, 1962.

Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730-738, 2015.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 2005.

Bucilǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pp. 535-541, New York, NY, USA, 2006. ACM.

Cho, J. H. and Hariharan, B. On the efficacy of knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4793-4801, 2019.

Dao, T., Kamath, G. M., Syrgkanis, V., and Mackey, L. Knowledge distillation as semiparametric inference. In International Conference on Learning Representations, 2021.
Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, 1996. ISBN 978-1-4612-0711-5.

Dong, B., Hou, J., Lu, Y., and Zhang, Z. Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network, 2019.

Duchi, J., Hashimoto, T., and Namkoong, H. Distributionally robust losses for latent covariate mixtures, 2020.

Foster, D. J., Greenberg, S., Kale, S., Luo, H., Mohri, M., and Sridharan, K. Hypothesis set stability and generalization. In Advances in Neural Information Processing Systems 32, pp. 6726-6736. Curran Associates, Inc., 2019.

Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 1602-1611, 2018.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321-1330, 2017.

Hinton, G. E., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.

Hsu, D., Ji, Z., Telgarsky, M., and Wang, L. Generalization bounds via distillation. In International Conference on Learning Representations, 2021.

Hüllermeier, E. and Waegeman, W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457-506, 2021.

Jain, H., Gidaris, S., Komodakis, N., Pérez, P., and Cord, M. QuEST: Quantized embedding space for transferring knowledge. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M. (eds.), Computer Vision - ECCV 2020, pp. 173-189, Cham, 2020. Springer International Publishing.

Ji, G. and Zhu, Z. Knowledge distillation in wide neural networks: Risk bound, data efficiency and imperfect teacher. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, 2020.

Kim, J., Park, S., and Kwak, N. Paraphrasing complex network: Network compression via factor transfer. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pp. 2765-2774, Red Hook, NY, USA, 2018. Curran Associates Inc.

Lapin, M., Hein, M., and Schiele, B. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1533-1554, July 2018.

Liu, Y., Jia, X., Tan, M., Vemulapalli, R., Zhu, Y., Green, B., and Wang, X. Search to distill: Pearls are everywhere but not the eyes, 2019.

Lopes, R. G., Fenu, S., and Starner, T. Data-free knowledge distillation for deep neural networks. CoRR, abs/1710.07535, 2017.

Lopez-Paz, D., Schölkopf, B., Bottou, L., and Vapnik, V. Unifying distillation and privileged information. In International Conference on Learning Representations (ICLR), November 2016.

Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Does label smoothing mitigate label noise?, 2020.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample-variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.
McAuley, J. and Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pp. 165-172, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450324090.

Mirzadeh, S.-I., Farajtabar, M., Li, A., and Ghasemzadeh, H. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. In AAAI, 2020.

Mobahi, H., Farajtabar, M., and Bartlett, P. L. Self-distillation amplifies regularization in Hilbert space. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33, 2020.

Müller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems 32, pp. 4696-4705, 2019.

Natarajan, N., Dhillon, I. S., Ravikumar, P. D., and Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS), pp. 1196-1204, 2013.

Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V., and Chakraborty, A. Zero-shot knowledge distillation in deep networks. CoRR, abs/1905.08114, 2019.

Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas, I. A modern take on the bias-variance tradeoff in neural networks. CoRR, abs/1810.08591, 2018. URL http://arxiv.org/abs/1810.08591.

Papernot, N., McDaniel, P. D., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, pp. 582-597, 2016.

Phuong, M. and Lampert, C. Towards understanding knowledge distillation. In Proceedings of the 36th International Conference on Machine Learning, pp. 5142-5151, 2019.

Radosavovic, I., Dollár, P., Girshick, R. B., Gkioxari, G., and He, K. Data distillation: Towards omni-supervised learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4119-4128, 2018.

Rahbar, A., Panahi, A., Bhattacharyya, C., Dubhashi, D., and Chehreghani, M. H. On the unreasonable effectiveness of knowledge distillation: Analysis in the kernel regime, 2020.

Reddi, S. J., Kale, S., Yu, F., Holtmann-Rice, D., Chen, J., and Kumar, S. Stochastic negative mining for learning with large output spaces. In Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 1940-1949. PMLR, 16-18 Apr 2019.

Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Rothfuss, J., Ferreira, F., Boehm, S., Walther, S., Ulrich, M., Asfour, T., and Krause, A. Noise regularization for conditional density estimation, 2019.

Rudin, C. The P-Norm Push: A simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10:2233-2271, Oct 2009.

Rusu, A. A., Colmenarejo, S. G., Gülçehre, Ç., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., and Hadsell, R. Policy distillation. In 4th International Conference on Learning Representations, ICLR 2016, 2016.
Scott, C., Blanchard, G., and Handy, G. Classification with asymmetric label noise: consistency and maximal denoising. In Conference on Learning Theory (COLT), pp. 489-511, 2013.

Senge, R., Bösner, S., Dembczyński, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., and Hüllermeier, E. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16-29, 2014. ISSN 0020-0255.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818-2826, 2016.

Tang, J., Shivanna, R., Zhao, Z., Lin, D., Singh, A., Chi, E. H., and Jain, S. Understanding and improving knowledge distillation. CoRR, abs/2002.03532, 2020.

Xu, Y., Xu, Y., Qian, Q., Li, H., and Jin, R. Towards understanding label smoothing. CoRR, abs/2006.11653, 2020.

Yalniz, I. Z., Jégou, H., Chen, K., Paluri, M., and Mahajan, D. Billion-scale semi-supervised learning for image classification. CoRR, abs/1905.00546, 2019. URL http://arxiv.org/abs/1905.00546.

Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. Rethinking bias-variance trade-off for generalization of neural networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10767-10777. PMLR, 13-18 Jul 2020.

Yim, J., Joo, D., Bae, J., and Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130-7138, July 2017.

Yuan, L., Tay, F. E. H., Li, G., Wang, T., and Feng, J. Revisiting knowledge distillation via label smoothing regularization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3902-3910. IEEE, 2020.

Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net, 2017.

Zhang, Z. and Sabuncu, M. R. Self-distillation as instance-specific label smoothing. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, 2020.

Zhou, H., Song, L., Chen, J., Zhou, Y., Wang, G., Yuan, J., and Zhang, Q. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. In International Conference on Learning Representations, 2021.