# Self-Distillation Amplifies Regularization in Hilbert Space

Hossein Mobahi (hmobahi@google.com), Google Research, Mountain View, CA, USA
Mehrdad Farajtabar (farajtabar@google.com), DeepMind, Mountain View, CA, USA
Peter L. Bartlett (bartlett@eecs.berkeley.edu), EECS Dept., University of California at Berkeley, Berkeley, CA, USA

**Abstract.** Knowledge distillation, introduced in the deep learning context, is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly iterate this loop a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held-out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and evolves solely by looping over training. To the best of our knowledge, there is no rigorous understanding of why this happens. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to $\ell_2$ regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.

## 1 Introduction

**Knowledge Distillation.** Knowledge distillation was introduced in the deep learning setting [13] as a method for transferring knowledge from one architecture (teacher) to another (student), with the student model often being smaller (see also [5] for earlier ideas). This is achieved by training the student model using the output probability distribution of the teacher model in addition to the original labels. The student model benefits from this dark knowledge (extra information in soft predictions) and often performs better than if it were trained on the actual labels. Various extensions of this approach have recently been proposed, where instead of output predictions the student tries to match other statistics of the teacher model, such as intermediate feature representations [27], Jacobian matrices [31], distributions [15], or Gram matrices [37]. Additional developments on knowledge distillation include extensions to Bayesian settings [17, 34], uncertainty preservation [33], reinforcement learning [14, 32, 9], online distillation [19], zero-shot learning [24], multi-step knowledge distillation [23], tackling catastrophic forgetting [20], transfer of relational knowledge [25], and adversarial distillation [35]. Recently, [26] analyzed why the student model is able to mimic the teacher model in knowledge distillation, and [21] presented a statistical perspective on distillation.

**Self-Distillation.** The special case where the teacher and student architectures are identical is called self-distillation. The idea is to feed the predictions of the trained model back in as new target values for retraining (and possibly iterate this loop a few times). It has been consistently observed that the self-distilled model often achieves higher accuracy on held-out data [8, 36, 2].
Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and evolves solely by looping over training. There have been some recent attempts to understand the mysteries around distillation. [11] empirically observed that the dark knowledge transferred by the teacher is localized mainly in the higher layers and does not affect the early (feature extraction) layers much. [8] interprets dark knowledge as importance weighting. [6] shows that early stopping is crucial for obtaining the dark knowledge of self-distillation. [1] empirically studies how inductive biases are transferred through distillation. Ideas similar to self-distillation have been used in areas besides modern machine learning, under different names such as diffusion and boosting, in both the statistics and image processing communities [22].

**Contributions.** Despite these interesting developments, why distillation can improve generalization remains elusive. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides a theoretical analysis of self-distillation. While originally observed in deep learning, we show that this is a more profound phenomenon that can occur even in classical regression settings, where we fit a nonlinear function to training data with models belonging to a Hilbert space and use $\ell_2$ regularization in this function space. In this setting we show that the self-distillation iterations progressively limit the number of basis functions used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance. More precisely, we show that self-distillation results in a non-conventional power iteration in which the linear operation changes dynamically; each step depends intricately on the results of earlier linear operations via a nonlinear recurrence. We also prove that using a lower training error across distillation steps generally improves the sparsity effect, and specifically we provide a closed-form bound on the sparsity level as the training error goes to zero. Finally, we discuss how our regularization results can be translated into generalization bounds.

**Organization.** In Section 2 we set up a variational formulation of nonlinear regression and discuss the existence of non-trivial solutions. Section 3 formalizes self-distillation in our setting. It then shows that the self-distillation iterations at some point collapse the solution, and provides a lower bound on the number of distillation rounds before the collapse is reached. In addition, it shows that the coefficients of the basis functions initially used to represent the solution progressively become sparser. Finally, it discusses how having the models operate in the near-interpolation regime ultimately achieves a higher sparsity level. Section 5 draws a connection between our setting and the NTK regime of neural networks, which motivates the subsequent experiments on deep neural networks in that section. Full proofs, as well as the code for reproducing the examples in Section 4 and the results in Section 5, are available in the supplementary appendix.

## 2 Problem Setup

We first introduce some notation. We denote a set by $\mathcal{A}$, a matrix by $\boldsymbol{A}$, a vector by $\boldsymbol{a}$, and a scalar by $a$ or $A$.
The $(i,j)$-th component of a matrix is denoted by $\boldsymbol{A}[i,j]$ and the $i$-th component of a vector by $\boldsymbol{a}[i]$. Also, $\|\cdot\|$ refers to the $\ell_2$ norm of a vector. We use $\triangleq$ to indicate equality by definition. A linear operator $L$ applied to a function $f$ is denoted by $[Lf]$, and its evaluation at a point $\boldsymbol{x}$ by $[Lf](\boldsymbol{x})$. For a positive definite matrix $\boldsymbol{A}$, we use $\kappa$ to refer to its condition number $\kappa \triangleq d_{\max}/d_{\min}$, where the $d$'s are eigenvalues of $\boldsymbol{A}$.

Consider a finite training set $\mathcal{D} \triangleq \cup_{k=1}^{K}\{(\boldsymbol{x}_k, y_k)\}$, where $\boldsymbol{x}_k \in \mathcal{X} \subseteq \mathbb{R}^d$ and $y_k \in \mathcal{Y} \subseteq \mathbb{R}$. Consider a space $\mathcal{F}$ of all admissible functions (as we define later in this section) from $\mathcal{X}$ to $\mathcal{Y}$. The goal is to use this training data to find a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that approximates the underlying mapping from $\mathcal{X}$ to $\mathcal{Y}$. We assume the function space $\mathcal{F}$ is rich enough to contain multiple functions that can fit the finite training data. Thus, the presence of an adequate inductive bias is essential to guide the training process towards solutions that generalize. We infuse such bias into training via regularization. Specifically, we study regression problems of the form¹ below, where $R: \mathcal{F} \rightarrow \mathbb{R}$ is a regularization functional and $\epsilon > 0$ is a desired loss tolerance:

$$f^{*} \triangleq \arg\min_{f \in \mathcal{F}} R(f) \quad \text{s.t.} \quad \frac{1}{K}\sum_{k}\big(f(\boldsymbol{x}_k) - y_k\big)^2 \leq \epsilon\,. \tag{1}$$

¹ Our choice of setting up learning as a constrained optimization rather than the unconstrained form $\frac{1}{K}\sum_{k}\big(f(\boldsymbol{x}_k)-y_k\big)^2 + c\,R(f)$ is motivated by the fact that we often have control over $\epsilon$ as a user-specified stopping criterion. In fact, in the era of over-parameterized models, we can often fit training data to a desired $\epsilon$-optimal loss value [38]. However, if we adopt the unconstrained setting, it is unclear what value of $c$ would correspond to a particular stopping criterion $\epsilon$.

We consider regularizers of the following type,

$$R(f) \triangleq \int_{\mathcal{X}}\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x})\, f(\boldsymbol{x}')\, d\boldsymbol{x}\, d\boldsymbol{x}'\,, \tag{2}$$

with $u$ such that $\forall f \in \mathcal{F}:\; R(f) \geq 0$, with equality only when $f(\boldsymbol{x}) = 0$. Without loss of generality, we assume $u$ is symmetric, $u(\boldsymbol{x}, \boldsymbol{x}') = u(\boldsymbol{x}', \boldsymbol{x})$. For a given $u$, the space $\mathcal{F}$ of admissible functions consists of those $f$ for which the double integral in (2) is bounded. The conditions we imposed on $R(f)$ imply that the operator $L$ defined as $[Lf] \triangleq \int_{\mathcal{X}} u(\boldsymbol{x}, \cdot)\, f(\boldsymbol{x})\, d\boldsymbol{x}$ has an empty null space². The symmetry and non-negativity conditions together are called Mercer's condition, and $u$ is called a kernel. Constructing $R$ via a kernel $u$ can cover a wide range of regularization forms, including $R(f) = \sum_{j=1}^{J} w_j \int_{\mathcal{X}} \big([P_j f](\boldsymbol{x})\big)^2\, d\boldsymbol{x}$, where $P_j$ is some linear operator (e.g. a differential operator to penalize non-smooth functions, as we will see in Section 4) and $w_j \geq 0$ is some weight, for $j = 1, \dots, J$ operators. Plugging $R(f)$ into the objective functional leads to the variational problem,

$$f^{*} \triangleq \arg\min_{f \in \mathcal{F}} \int_{\mathcal{X}}\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x})\, f(\boldsymbol{x}')\, d\boldsymbol{x}\, d\boldsymbol{x}' \quad \text{s.t.} \quad \frac{1}{K}\sum_{k}\big(f(\boldsymbol{x}_k) - y_k\big)^2 \leq \epsilon\,. \tag{3}$$

The Karush-Kuhn-Tucker (KKT) conditions for this problem yield,

$$f^{*}_{\lambda} \triangleq \arg\min_{f \in \mathcal{F}} \frac{\lambda}{K}\sum_{k}\big(f(\boldsymbol{x}_k) - y_k\big)^2 + \int_{\mathcal{X}}\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x})\, f(\boldsymbol{x}')\, d\boldsymbol{x}\, d\boldsymbol{x}' \tag{4}$$

$$\text{s.t.} \quad \lambda \geq 0\,, \quad \frac{1}{K}\sum_{k}\big(f^{*}(\boldsymbol{x}_k) - y_k\big)^2 \leq \epsilon\,, \quad \lambda\,\Big(\frac{1}{K}\sum_{k}\big(f^{*}(\boldsymbol{x}_k) - y_k\big)^2 - \epsilon\Big) = 0\,. \tag{5}$$

### 2.1 Existence of Non-Trivial Solutions

Stack the training labels into a vector,

$$\boldsymbol{y}_{K\times 1} \triangleq [\, y_1 \,|\, y_2 \,|\, \dots \,|\, y_K\,]^{T}\,. \tag{6}$$

It is obvious that when $\frac{1}{K}\|\boldsymbol{y}\|^2 \leq \epsilon$, then $f^{*}$ has the trivial solution $f^{*}(\boldsymbol{x}) = 0$; we refer to this case as the collapse regime. In the sequel, we focus on the more interesting case of $\frac{1}{K}\|\boldsymbol{y}\|^2 > \epsilon$. It is also easy to verify that the collapsed solution is tied to $\lambda = 0$:

$$\|\boldsymbol{y}\|^2 > K\epsilon \;\Rightarrow\; \lambda > 0\,. \tag{7}$$

Thus, by taking any $\lambda > 0$ that satisfies $\frac{1}{K}\sum_{k}\big(f^{*}_{\lambda}(\boldsymbol{x}_k) - y_k\big)^2 - \epsilon = 0$, the following $f^{*}_{\lambda}$ is an optimal solution to the problem (3), i.e. $f^{*} = f^{*}_{\lambda}$:

$$f^{*}_{\lambda} = \arg\min_{f \in \mathcal{F}} \frac{\lambda}{K}\sum_{k}\big(f(\boldsymbol{x}_k) - y_k\big)^2 + \int_{\mathcal{X}}\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x})\, f(\boldsymbol{x}')\, d\boldsymbol{x}\, d\boldsymbol{x}'\,. \tag{8}$$

### 2.2 Closed Form of f*

In this section we present a closed-form expression for (8).
Since we are considering $\lambda > 0$, without loss of generality we can divide the objective function in (8) by $\lambda$ and let $c \triangleq 1/\lambda$; obviously $c > 0$:

$$f^{*} = \arg\min_{f \in \mathcal{F}} \frac{1}{K}\sum_{k}\big(f(\boldsymbol{x}_k) - y_k\big)^2 + c \int_{\mathcal{X}}\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x})\, f(\boldsymbol{x}')\, d\boldsymbol{x}\, d\boldsymbol{x}'\,. \tag{9}$$

The variational problem (9) has appeared extensively in the machine learning context [10]. It has a known solution form, due to the representer theorem [29], which we present here as a proposition. However, we first need to introduce some definitions. Let $g(\boldsymbol{x}, \boldsymbol{t})$ be a function such that $\int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, g(\boldsymbol{x}', \boldsymbol{t})\, d\boldsymbol{x}' = \delta(\boldsymbol{x} - \boldsymbol{t})$, where $\delta$ is the Dirac delta. Such a $g$ is called the Green's function³ of the linear operator $L$, with $L$ being $[Lf](\boldsymbol{x}) \triangleq \int_{\mathcal{X}} u(\boldsymbol{x}, \boldsymbol{x}')\, f(\boldsymbol{x}')\, d\boldsymbol{x}'$. Let the matrix $\boldsymbol{G}_{K\times K}$ and the vector $\boldsymbol{g}_{\boldsymbol{x}\,K\times 1}$ be defined as,

$$\boldsymbol{G}[j,k] \triangleq \frac{1}{K}\, g(\boldsymbol{x}_j, \boldsymbol{x}_k)\,, \qquad \boldsymbol{g}_{\boldsymbol{x}}[k] \triangleq \frac{1}{K}\, g(\boldsymbol{x}, \boldsymbol{x}_k)\,. \tag{10}$$

² This assumption simplifies the exposition. If the null space is non-empty, one can still utilize it using [10].

³ We assume that the Green's function exists and is continuous. A detailed treatment of existence conditions is beyond the scope of this work and can be found in textbooks such as [7].

Figure 1: Schematic illustration of the self-distillation process for two iterations.

**Proposition 1** The variational problem (9) has a solution of the form,

$$f^{*}(\boldsymbol{x}) = \boldsymbol{g}_{\boldsymbol{x}}^{T}\,(c\boldsymbol{I} + \boldsymbol{G})^{-1}\,\boldsymbol{y}\,. \tag{11}$$

Notice that the matrix $\boldsymbol{G}$ is positive definite⁴. Since by definition $c > 0$, the inverse of the matrix $c\boldsymbol{I} + \boldsymbol{G}$ is well-defined. Also, because $\boldsymbol{G}$ is positive definite, it can be diagonalized as $\boldsymbol{G} = \boldsymbol{V}^{T}\boldsymbol{D}\boldsymbol{V}$, where the diagonal matrix $\boldsymbol{D}$ contains the eigenvalues of $\boldsymbol{G}$, denoted $d_1, \dots, d_K$, which are strictly greater than zero, and the matrix $\boldsymbol{V}$ contains the corresponding eigenvectors.

⁴ This property of $\boldsymbol{G}$ comes from the fact that $u$ is a positive definite kernel (definite rather than semi-definite, due to the empty null space assumption on the operator $L$), and thus so is its inverse (i.e. $g$). Since $g$ is a kernel, its associated Gram matrix is positive definite.

### 2.3 Bounds on the Multiplier c

Earlier we showed that any $c > 0$ that is a root of $\frac{1}{K}\sum_{k}\big(f^{*}_{c}(\boldsymbol{x}_k) - y_k\big)^2 - \epsilon = 0$ produces an optimal solution $f^{*}$ via (9). However, in the settings we are interested in, we do not know the underlying $c$ or $f^{*}$ beforehand; we have to relate them to the given training data instead. As we will see later in Proposition 3, knowledge of $c$ allows us to resolve the recurrence on $\boldsymbol{y}$ created by the self-distillation loop and obtain an explicit bound on $\|\boldsymbol{y}\|$ at each distillation round. Unfortunately, finding $c$ in closed form by seeking roots of $\frac{1}{K}\sum_{k}\big(f^{*}_{c}(\boldsymbol{x}_k) - y_k\big)^2 - \epsilon = 0$ w.r.t. $c$ is impossible, due to the nonlinear dependency of $f^{*}$ on $c$ caused by the matrix inversion; see (11). However, we can still provide bounds on the value of $c$, as shown in this section.

Throughout the analysis, it is sometimes easier to work with the rotated labels $\boldsymbol{V}\boldsymbol{y}$. Thus we define $\boldsymbol{z} \triangleq \boldsymbol{V}\boldsymbol{y}$. Note that any result on $\boldsymbol{z}$ can easily be converted back via $\boldsymbol{y} = \boldsymbol{V}^{T}\boldsymbol{z}$, as $\boldsymbol{V}$ is an orthogonal matrix. Trivially $\|\boldsymbol{z}\| = \|\boldsymbol{y}\|$. Our first step is to present a simplified form for the error term $\frac{1}{K}\sum_{k}\big(f^{*}(\boldsymbol{x}_k) - y_k\big)^2$ using the following proposition.

**Proposition 2** The following identity holds: $\frac{1}{K}\sum_{k}\big(f^{*}(\boldsymbol{x}_k) - y_k\big)^2 = \frac{1}{K}\sum_{k}\big(\frac{z_k\, c}{c + d_k}\big)^2$.

We now proceed to bound the roots of $h(c) \triangleq \frac{1}{K}\sum_{k}\big(\frac{z_k\, c}{c + d_k}\big)^2 - \epsilon$. Since we are considering $\|\boldsymbol{y}\|^2 > K\epsilon$, and thus by (7) $c > 0$, it is easy to construct the following lower and upper bounds on $h$:

$$\underline{h}(c) \triangleq \frac{1}{K}\sum_{k}\Big(\frac{z_k\, c}{c + d_{\max}}\Big)^2 - \epsilon\,, \qquad \overline{h}(c) \triangleq \frac{1}{K}\sum_{k}\Big(\frac{z_k\, c}{c + d_{\min}}\Big)^2 - \epsilon\,.$$

The roots of $\underline{h}$ and $\overline{h}$, namely $c_1$ and $c_2$, are easily derived:

$$c_1 = \frac{d_{\max}}{\frac{\|\boldsymbol{y}\|}{\sqrt{K\epsilon}} - 1}\,, \qquad c_2 = \frac{d_{\min}}{\frac{\|\boldsymbol{y}\|}{\sqrt{K\epsilon}} - 1}\,.$$

Since $\underline{h}(c) \leq h(c)$, the condition $\underline{h}(c_1) = 0$ implies that $h(c_1) \geq 0$. Similarly, since $\overline{h}(c) \geq h(c)$, the condition $\overline{h}(c_2) = 0$ implies that $h(c_2) \leq 0$. By the intermediate value theorem, due to the continuity of $h$ and the fact that $\|\boldsymbol{z}\| = \|\boldsymbol{y}\| > \sqrt{K\epsilon}$ (the non-collapse condition), there is a point $c$ between $c_1$ and $c_2$ at which $h(c) = 0$:

$$\frac{d_{\min}}{\frac{\|\boldsymbol{y}\|}{\sqrt{K\epsilon}} - 1} \;\leq\; c \;\leq\; \frac{d_{\max}}{\frac{\|\boldsymbol{y}\|}{\sqrt{K\epsilon}} - 1}\,. \tag{12}$$
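To make the preceding results concrete, the following is a minimal NumPy sketch (not the authors' released code) of how Proposition 1, Proposition 2, and the bracket (12) fit together: given training data, a Green's function `g`, and a tolerance $\epsilon$, it finds a root of the error condition for $c$ and returns the closed-form solution (11). The function name `fit_regularized` and the use of `scipy.optimize.brentq` are illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.optimize import brentq


def fit_regularized(x_train, y_train, g, eps):
    """Sketch of (11): f*(x) = g_x^T (c I + G)^{-1} y, with c found via Proposition 2.

    `g` is a Green's function g(x, x'); `eps` is the loss tolerance of problem (1).
    """
    K = len(x_train)
    # Gram matrix and eigendecomposition, eq. (10): G[j, k] = g(x_j, x_k) / K.
    G = np.array([[g(xj, xk) for xk in x_train] for xj in x_train]) / K
    d, V = np.linalg.eigh(G)                    # G = V diag(d) V^T, all d_k > 0 assumed
    z = V.T @ y_train                           # rotated labels (the paper's z = V y)

    r = np.linalg.norm(y_train) / np.sqrt(K * eps)
    assert r > 1.0, "collapse regime: ||y||^2 <= K*eps, the solution is f = 0"

    # Proposition 2: training error equals (1/K) sum_k (c z_k / (c + d_k))^2.
    h = lambda c: np.mean((c * z / (c + d)) ** 2) - eps
    # The bracket (12) guarantees a sign change of h, so a scalar root-finder suffices.
    c = brentq(h, d.min() / (r - 1.0), d.max() / (r - 1.0))

    alpha = np.linalg.solve(c * np.eye(K) + G, y_train)
    # f*(x) = g_x^T alpha with g_x[k] = g(x, x_k) / K, again per (10)-(11).
    return (lambda x: np.array([g(x, xk) for xk in x_train]) / K @ alpha), c
```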
## 3 Self-Distillation

Denote the prediction vector over the training points $\boldsymbol{x}_1, \dots, \boldsymbol{x}_K$ as

$$\boldsymbol{f}^{*}_{K\times 1} \triangleq [\, f^{*}(\boldsymbol{x}_1) \,|\, \dots \,|\, f^{*}(\boldsymbol{x}_K)\,]^{T} = \boldsymbol{V}^{T}\boldsymbol{D}(c\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{V}\boldsymbol{y}\,.$$

Self-distillation treats this prediction as the target labels for a new round of training, and repeats this process as shown in Figure 1. With a slight abuse of notation, we denote the $t$-th round of distillation by the subscript $t$, and refer to the standard training step (no self-distillation yet) as $t = 0$. Thus the standard training step has the form $\boldsymbol{f}^{*}_0 = \boldsymbol{V}^{T}\boldsymbol{D}(c_0\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{V}\boldsymbol{y}_0$, where $\boldsymbol{y}_0$ is the vector of ground-truth labels as defined in (6). Letting $\boldsymbol{y}_1 \triangleq \boldsymbol{f}^{*}_0$, we obtain the next model by applying the first round of self-distillation, $\boldsymbol{f}^{*}_1 = \boldsymbol{V}^{T}\boldsymbol{D}(c_1\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{V}\boldsymbol{y}_1$. In general, we have the recurrence,

$$\forall t \geq 1:\;\; \boldsymbol{y}_t = \boldsymbol{V}^{T}\boldsymbol{A}_{t-1}\boldsymbol{V}\boldsymbol{y}_{t-1}\,, \qquad \forall t \geq 0:\;\; \boldsymbol{A}_{t\,(K\times K)} \triangleq \boldsymbol{D}(c_t\boldsymbol{I} + \boldsymbol{D})^{-1}\,. \tag{13}$$

Note that $\boldsymbol{A}_t$ is also a diagonal matrix. Furthermore, since at the $t$-th distillation step everything is the same as in the initial step except the training labels, we can use Proposition 1 to express $f^{*}_t(\boldsymbol{x})$ as,

$$f^{*}_t(\boldsymbol{x}) = \boldsymbol{g}_{\boldsymbol{x}}^{T}(c_t\boldsymbol{I} + \boldsymbol{G})^{-1}\boldsymbol{y}_t = \boldsymbol{g}_{\boldsymbol{x}}^{T}\boldsymbol{V}^{T}\boldsymbol{D}^{-1}\Big(\prod\nolimits_{i=0}^{t}\boldsymbol{A}_i\Big)\boldsymbol{V}\boldsymbol{y}_0\,. \tag{14}$$

Observe that the only dynamic component in the expression for $f^{*}_t$ is $\prod_{i=0}^{t}\boldsymbol{A}_i$. In the following, we show how $\prod_{i=0}^{t}\boldsymbol{A}_i$ evolves over time. Specifically, we show that self-distillation progressively sparsifies the matrix $\prod_{i=0}^{t}\boldsymbol{A}_i$ at a provided rate. Also recall from Section 2.1 that in each step of self-distillation we require $\|\boldsymbol{y}_t\| > \sqrt{K\epsilon}$. If this condition breaks, the solution collapses to the zero function and subsequent rounds of self-distillation keep producing $f^{*}(\boldsymbol{x}) = 0$. In this section we present a lower bound on the number of iterations $t$ guaranteeing that all intermediate problems satisfy $\|\boldsymbol{y}_t\| > \sqrt{K\epsilon}$.

### 3.1 Unfolding the Recurrence

Our goal here is to understand how $\|\boldsymbol{y}_t\|$ evolves in $t$. Combining the equations in (13) gives $\boldsymbol{y}_t = \boldsymbol{V}^{T}\boldsymbol{D}(c_{t-1}\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{V}\boldsymbol{y}_{t-1}$. Multiplying both sides from the left by $\boldsymbol{V}$ gives $\boldsymbol{V}\boldsymbol{y}_t = \boldsymbol{V}\boldsymbol{V}^{T}\boldsymbol{D}(c_{t-1}\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{V}\boldsymbol{y}_{t-1}$, which is equivalent to,

$$\boldsymbol{z}_t = \boldsymbol{D}(c_{t-1}\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{z}_{t-1}\,. \tag{15}$$

Also, we can use the bounds on $c$ from (12) at any arbitrary round $t \geq 0$ and thus write,

$$\frac{d_{\min}}{\frac{\|\boldsymbol{z}_t\|}{\sqrt{K\epsilon}} - 1} \;\leq\; c_t \;\leq\; \frac{d_{\max}}{\frac{\|\boldsymbol{z}_t\|}{\sqrt{K\epsilon}} - 1}\,. \tag{16}$$

Combining the r.h.s. of (15) and (16), we obtain a recurrence solely in $\boldsymbol{z}$,

$$\boldsymbol{z}_t = \boldsymbol{D}\Bigg(\frac{\alpha_{t-1}}{\frac{\|\boldsymbol{z}_{t-1}\|}{\sqrt{K\epsilon}} - 1}\,\boldsymbol{I} + \boldsymbol{D}\Bigg)^{-1}\boldsymbol{z}_{t-1}\,, \tag{17}$$

where $d_{\min} \leq \alpha_t \leq d_{\max}$. We now establish a lower bound on the value of $\|\boldsymbol{z}_t\|$.

**Proposition 3** For any $t \geq 0$, if $\|\boldsymbol{z}_i\| > \sqrt{K\epsilon}$ for $i = 0, \dots, t$, then

$$\|\boldsymbol{z}_t\| \;\geq\; a^{t}(\kappa)\,\|\boldsymbol{z}_0\| - \sqrt{K\epsilon}\; b(\kappa)\,\frac{a^{t}(\kappa) - 1}{a(\kappa) - 1}\,,$$

where $a(x) \triangleq \frac{(r_0 - 1)^2 + x(2r_0 - 1)}{(r_0 - 1 + x)^2}$, $b(x) \triangleq \frac{r_0^2\, x}{(r_0 - 1 + x)^2}$, $r_0 \triangleq \frac{\|\boldsymbol{z}_0\|}{\sqrt{K\epsilon}}$, and $\kappa \triangleq \frac{d_{\max}}{d_{\min}}$.
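The recurrence (13)-(15) can be simulated directly in the eigenbasis of $\boldsymbol{G}$. The sketch below is illustrative rather than the paper's code; at every round it obtains the multiplier $c_t$ with the same root-finding step as in the sketch at the end of Section 2, applies one step of (15), and stops once the non-collapse condition $\|\boldsymbol{z}_t\| > \sqrt{K\epsilon}$ fails, the event quantified in the next subsection.

```python
import numpy as np
from scipy.optimize import brentq


def self_distill_spectrum(d, z0, eps, max_rounds=50):
    """Iterate z_t = D (c_{t-1} I + D)^{-1} z_{t-1}, eq. (15), until collapse.

    `d` holds the eigenvalues of G, `z0` the rotated labels V y_0.
    Returns the list [z_0, z_1, ...]; the iteration stops when ||z_t|| <= sqrt(K*eps),
    i.e. when the next optimal solution would be the zero function.
    """
    d, zs = np.asarray(d, float), [np.asarray(z0, float)]
    K = len(d)
    for _ in range(max_rounds):
        z = zs[-1]
        r = np.linalg.norm(z) / np.sqrt(K * eps)
        if r <= 1.0:                                             # collapse regime reached
            break
        h = lambda c: np.mean((c * z / (c + d)) ** 2) - eps      # Proposition 2
        c = brentq(h, d.min() / (r - 1.0), d.max() / (r - 1.0))  # bracket (16)
        zs.append(d / (c + d) * z)                               # one application of A_t
    return zs


# Toy example: ||z_t|| shrinks every round and eventually drops below sqrt(K*eps).
zs = self_distill_spectrum(d=[1.0, 0.3, 0.05, 0.01], z0=[1.0, 1.0, 1.0, 1.0], eps=1e-2)
print([round(float(np.linalg.norm(z)), 3) for z in zs])
```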
### 3.2 Guaranteed Number of Self-Distillation Rounds

By looking at the l.h.s. of (15), it is not hard to see that the value of $\|\boldsymbol{z}_t\|$ is decreasing in $t$. That is because $c_t$⁵, as well as the elements of the diagonal matrix $\boldsymbol{D}$, are strictly positive; hence $\boldsymbol{D}(c_{t-1}\boldsymbol{I} + \boldsymbol{D})^{-1}$ shrinks the magnitude of $\boldsymbol{z}_{t-1}$ in each iteration. Thus, starting from $\|\boldsymbol{z}_0\| > \sqrt{K\epsilon}$, as $\|\boldsymbol{z}_t\|$ decreases, at some point it falls below the value $\sqrt{K\epsilon}$ and the solution collapses. We now want to find out after how many rounds $t$ the solution collapse happens. Finding the exact such $t$ is difficult, but by setting a lower bound on $\|\boldsymbol{z}_t\|$ equal to $\sqrt{K\epsilon}$ and solving for $t$, calling the solution $\bar{t}$, we can guarantee the realization of at least $\bar{t}$ rounds in which the value of $\|\boldsymbol{z}_t\|$ remains above $\sqrt{K\epsilon}$. We can use the lower bound developed in Proposition 3 to find such a $\bar{t}$; this is shown in the following proposition. Note that in the near-interpolation regime, i.e. $\epsilon \approx 0$, the form of $\bar{t}$ simplifies to $\bar{t} \approx \frac{\|\boldsymbol{y}_0\|}{\kappa\sqrt{K\epsilon}}$.

**Proposition 4** Starting from $\|\boldsymbol{y}_0\| > \sqrt{K\epsilon}$, meaningful (non-collapsing) self-distillation is possible for at least $\bar{t}$ rounds.

⁵ $c_t > 0$, following from the assumption $\|\boldsymbol{z}_t\| > \sqrt{K\epsilon}$.

### 3.3 Evolution of Basis

Recall from (14) that the learned function after $t$ rounds of self-distillation has the form $f^{*}_t(\boldsymbol{x}) = \boldsymbol{g}_{\boldsymbol{x}}^{T}\boldsymbol{V}^{T}\boldsymbol{D}^{-1}\big(\prod_{i=0}^{t}\boldsymbol{A}_i\big)\boldsymbol{V}\boldsymbol{y}_0$. The only time-dependent part is thus the diagonal matrix $\boldsymbol{B}_t$ defined in (19). In this section we show how $\boldsymbol{B}_t$ evolves over time. Specifically, we claim that the matrix $\boldsymbol{B}_t$ becomes progressively sparser as $t$ increases.

$$\boldsymbol{B}_t \triangleq \prod\nolimits_{i=0}^{t}\boldsymbol{A}_i\,. \tag{19}$$

**Theorem 5** Suppose $\|\boldsymbol{y}_0\| > \sqrt{K\epsilon}$ and $t \leq \frac{\|\boldsymbol{y}_0\|}{\kappa\sqrt{K\epsilon}} - \kappa$. Then for any pair of diagonal entries of $\boldsymbol{D}$, namely $d_j$ and $d_k$ with $d_k > d_j$, inequality (20) holds: the ratio $\frac{\boldsymbol{B}_{t-1}[k,k]}{\boldsymbol{B}_{t-1}[j,j]}$ is bounded from below by a quantity of the form $\rho^{t}$, whose parenthesized base $\rho$ is strictly greater than $1$.

The above theorem suggests that, as $t$ increases, the smaller elements of $\boldsymbol{B}_{t-1}$ shrink faster and at some point become negligible compared to the larger ones. That is because the base of the r.h.s. expression in (20) is strictly greater than 1 under the assumption $d_k > d_j$, which implies that $\frac{\boldsymbol{B}_{t-1}[k,k]}{\boldsymbol{B}_{t-1}[j,j]}$ is increasing in $t$. Observe that if one were able to push $t \rightarrow \infty$, then only one entry of $\boldsymbol{B}_t$ (the one corresponding to $d_{\max}$) would remain significant relative to the others. Thus, the self-distillation process progressively sparsifies $\boldsymbol{B}_t$. This sparsification affects the expressiveness of the regression solution $f^{*}_t(\boldsymbol{x})$. To see this, use the definition of $f^{*}_t(\boldsymbol{x})$ from (14) to express it in the form (21), where we give names to the rotated and scaled basis $\boldsymbol{p}_{\boldsymbol{x}} \triangleq \boldsymbol{D}^{-1}\boldsymbol{V}\boldsymbol{g}_{\boldsymbol{x}}$ and the rotated vector $\boldsymbol{z}_0 \triangleq \boldsymbol{V}\boldsymbol{y}_0$. The solution $f^{*}_t$ is essentially represented by a weighted sum of the basis functions (the components of $\boldsymbol{p}_{\boldsymbol{x}}$). Thus, the number of significant diagonal entries of $\boldsymbol{B}_t$ determines the effective number of basis functions used to represent the solution.

$$f^{*}_t(\boldsymbol{x}) = \boldsymbol{g}_{\boldsymbol{x}}^{T}\boldsymbol{V}^{T}\boldsymbol{D}^{-1}\boldsymbol{B}_t\boldsymbol{V}\boldsymbol{y}_0 = \boldsymbol{p}_{\boldsymbol{x}}^{T}\boldsymbol{B}_t\boldsymbol{z}_0\,. \tag{21}$$

### 3.4 Self-Distillation versus Early Stopping

Broadly speaking, early stopping can be interpreted as any procedure that cuts convergence short of the optimal solution. Examples include reducing the number of iterations of the numerical optimizer (e.g. SGD), or increasing the loss tolerance threshold $\epsilon$. The former is not applicable to our setting, as our analysis is independent of the function parametrization and its numerical optimization; we therefore consider the second definition. This form of early stopping also has a regularization effect: by increasing $\epsilon$ in (1), the feasible set expands and thus it is possible to find functions with lower $R(f)$. However, we show here that the induced regularization is not equivalent to that of self-distillation. In fact, one can say that early stopping does the opposite of sparsification, as we show below. The function learned via loss-based early stopping can be expressed in our notation as $f^{*}_0$ (single training, no self-distillation) with a larger error tolerance $\epsilon$,

$$f^{*}_0(\boldsymbol{x}) = \boldsymbol{p}_{\boldsymbol{x}}^{T}\boldsymbol{B}_0\boldsymbol{z}_0 = \boldsymbol{p}_{\boldsymbol{x}}^{T}\boldsymbol{D}(c_0\boldsymbol{I} + \boldsymbol{D})^{-1}\boldsymbol{z}_0\,. \tag{22}$$

The effect of a larger $\epsilon$ on the value of $c_0$ is shown in (12). However, since $c_0$ is just a scalar applied to matrices, it does not provide any lever to increase the sparsity of $\boldsymbol{D}$. We now elaborate on the latter claim a bit more. Observe that, on the one hand, when $c_0$ is large, then $\boldsymbol{D}(c_0\boldsymbol{I} + \boldsymbol{D})^{-1} \approx \frac{1}{c_0}\boldsymbol{D}$, which essentially uses $\boldsymbol{D}$ and does not sparsify it further. On the other hand, if $c_0$ is small, then $\boldsymbol{D}(c_0\boldsymbol{I} + \boldsymbol{D})^{-1} \approx \boldsymbol{I}$, which is the densest possible diagonal matrix. Thus, at best, early stopping maintains the original sparsity pattern of $\boldsymbol{D}$, and otherwise makes it even denser.
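The contrast between Sections 3.3 and 3.4 can be seen numerically in a few lines. The sketch below uses a toy spectrum and a hypothetical sequence of multipliers $c_i$ (placeholders, not derived from any particular $\epsilon$) to compare the diagonal of $\boldsymbol{B}_t = \prod_i \boldsymbol{D}(c_i\boldsymbol{I}+\boldsymbol{D})^{-1}$ from (19) against the single early-stopping diagonal $\boldsymbol{D}(c_0\boldsymbol{I}+\boldsymbol{D})^{-1}$ from (22) with a comparably large multiplier.

```python
import numpy as np

d = np.array([1.0, 0.3, 0.05, 0.01])      # toy eigenvalues of G (diagonal of D)
cs = [0.02, 0.03, 0.05, 0.08]             # hypothetical multipliers c_0..c_3

# Self-distillation: B_t = prod_{i<=t} D (c_i I + D)^{-1}, eq. (19).
B = np.ones_like(d)
for t, c in enumerate(cs):
    B *= d / (c + d)
    print(f"self-distillation round {t}:", np.round(B / B.max(), 4))

# Early stopping: a single round with one larger multiplier (larger eps), eq. (22).
B_es = d / (sum(cs) + d)
print("early stopping            :", np.round(B_es / B_es.max(), 4))

# The self-distilled diagonal becomes markedly sparser: entries tied to the small
# eigenvalues decay geometrically across rounds (Theorem 5), whereas the single
# scalar c_0 of early stopping roughly preserves the relative pattern of D.
```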
### 3.5 Advantage of the Near-Interpolation Regime

As discussed in Section 3.3, one can think of $\frac{\boldsymbol{B}_{t-1}[k,k]}{\boldsymbol{B}_{t-1}[j,j]}$ as a sparsity measure (the larger, the sparser). Thus, we define a sparsity index based on the lower bound we developed for $\frac{\boldsymbol{B}_{t-1}[k,k]}{\boldsymbol{B}_{t-1}[j,j]}$ in Theorem 5. In fact, by taking the lowest value of that bound across all pairs satisfying $d_k > d_j$, and further assuming $d_1 < d_2 < \dots < d_K$, we can guarantee at least the sparsity level $S_{\boldsymbol{B}_{t-1}}$, defined in (23) as the minimum over $k \in \{1,2,\dots,K-1\}$ of the lower bound that Theorem 5 provides for $\frac{\boldsymbol{B}_{t-1}[k+1,k+1]}{\boldsymbol{B}_{t-1}[k,k]}$.

One may wonder what is the highest sparsity $S$ that self-distillation can attain. Since $\|\boldsymbol{y}_0\| > \sqrt{K\epsilon}$ and $d_{k+1} > d_k$, the base of the expression inside (23) is strictly greater than 1, and thus $S$ increases in $t$. However, the largest $t$ we can guarantee before a solution collapse (see Proposition 4) is $\bar{t} = \frac{\|\boldsymbol{y}_0\|}{\kappa\sqrt{K\epsilon}} - \kappa$. By plugging this $\bar{t}$ into the definition of $S$ in (23), we eliminate $t$ and obtain the largest guaranteed sparsity index $S_{\boldsymbol{B}_{\bar{t}-1}}$, shown in (24). In the next theorem, we show that $S_{\boldsymbol{B}_{\bar{t}-1}}$ always improves as $\epsilon$ gets smaller. Thus, if high sparsity is desired, one should set $\epsilon$ as small as possible. Note, however, that the value of $\epsilon$ cannot be identically zero, i.e. the exact interpolation regime, because then $\boldsymbol{f}^{*}_0 = \boldsymbol{y}_0$, and since $\boldsymbol{y}_1 = \boldsymbol{f}^{*}_0$, the self-distillation process would keep producing the same model in each round.

**Theorem 6** Suppose $\|\boldsymbol{y}_0\| > \sqrt{K\epsilon}$. Then the sparsity index $S_{\boldsymbol{B}_{\bar{t}-1}}$ (where $\bar{t} = \frac{\|\boldsymbol{y}_0\|}{\kappa\sqrt{K\epsilon}} - \kappa$ is the number of guaranteed self-distillation steps before solution collapse) decreases in $\epsilon$, i.e. a lower $\epsilon$ yields higher sparsity. Furthermore, in the limit $\epsilon \rightarrow 0$, the sparsity index $\lim_{\epsilon\rightarrow 0} S_{\boldsymbol{B}_{\bar{t}-1}}$ attains a closed form that depends only on $\kappa$ and $\min_{k\in\{1,2,\dots,K-1\}}\big(\frac{1}{d_k} - \frac{1}{d_{k+1}}\big)$.

### 3.6 Multiclass Extension

We can formulate multiclass classification by regressing to a one-hot encoding. Specifically, a problem with $Q$ classes can be modeled by $Q$ output functions $f_1, \dots, f_Q$. An easy extension of our analysis to this multiclass setting is to require the functions $f_1, \dots, f_Q$ to be smooth by applying the same regularization $R$ to each of them and then adding up these regularization terms. This way, the optimal function for each output unit can be solved for each $q = 1, \dots, Q$ as $f^{*}_q \triangleq \arg\min_{f_q \in \mathcal{F}} \frac{1}{K}\sum_{k}\big(f_q(\boldsymbol{x}_k) - y_{qk}\big)^2 + c_q\, R(f_q)$.

### 3.7 Generalization Bounds

Our results can easily be translated into generalization guarantees. Recall from (14) that the regression solution after $t$ rounds of self-distillation has the form $f^{*}_t(\boldsymbol{x}) = \boldsymbol{g}_{\boldsymbol{x}}^{T}\boldsymbol{V}^{T}\boldsymbol{D}^{-1}\big(\prod_{i=0}^{t}\boldsymbol{A}_i\big)\boldsymbol{V}\boldsymbol{y}_0$. We can show (proof in Appendix B) that there exists a positive definite kernel $g^{*}(\cdot, \cdot)$ such that performing standard kernel ridge regression with it over the same training data $\cup_{k=1}^{K}\{(\boldsymbol{x}_k, y_k)\}$ yields exactly the function $f^{*}_t$. Furthermore, we can show that the spectrum $d^{*}_1, \dots, d^{*}_K$ of the Gram matrix $\boldsymbol{G}^{*}[j,k] \triangleq \frac{1}{K}\, g^{*}(\boldsymbol{x}_j, \boldsymbol{x}_k)$ in the latter kernel regression problem relates to the spectrum of $\boldsymbol{G}$ via,

$$d^{*}_k = \Bigg(\frac{\prod_{i=0}^{t}(d_k + c_i)}{d_k^{t+1}} - 1\Bigg)^{-1}\,. \tag{25}$$

The identity (25) enables us to leverage existing generalization bounds for standard kernel ridge regression; these results often need only the spectrum of the Gram matrix. For example, Lemma 22 in [4] shows that the Rademacher complexity of the kernel class is proportional to $\sqrt{\operatorname{tr}(\boldsymbol{G}^{*})} = \sqrt{\sum_{k=1}^{K} d^{*}_k}$, and Theorem 8 of [4] translates that Rademacher complexity into a generalization bound. Note that $\frac{\prod_{i=0}^{t}(d_k + c_i)}{d_k^{t+1}}$ increases in $t$, which implies that $d^{*}_k$, and consequently $\operatorname{tr}(\boldsymbol{G}^{*})$, decreases in $t$.
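As a quick numerical check of this monotonicity, the sketch below evaluates (25) for a toy spectrum and a hypothetical sequence of multipliers $c_i$ (both placeholders, not taken from any fitted problem): every additional round shrinks each effective eigenvalue $d^{*}_k$, and hence $\operatorname{tr}(\boldsymbol{G}^{*})$, the capacity term entering the bound of [4].

```python
import numpy as np


def effective_spectrum(d, cs):
    """Spectrum d*_k of the equivalent kernel ridge regression, per (25):
    d*_k = 1 / ( prod_{i=0}^{t} (d_k + c_i) / d_k^(t+1)  -  1 )."""
    d = np.asarray(d, float)
    ratio = np.prod([(d + c) / d for c in cs], axis=0)
    return 1.0 / (ratio - 1.0)


d = [1.0, 0.3, 0.05, 0.01]                # toy eigenvalues of G
cs = [0.02, 0.03, 0.05, 0.08]             # hypothetical multipliers c_0..c_3
for t in range(len(cs)):
    d_star = effective_spectrum(d, cs[: t + 1])
    print(f"t = {t}: tr(G*) = {d_star.sum():.3f}")   # decreases with every round
```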
A more refined bound, in terms of the tail behavior of the eigenvalues $d^{*}_k$ (to better exploit the sparsity pattern), is Corollary 6.7 of [3], which provides a generalization bound that is affine in the quantity $\min_{k \in \{0,1,\dots,K\}}\Big(\frac{k}{K} + \sqrt{\frac{1}{K}\sum_{j=k+1}^{K} d^{*}_j}\Big)$, where the eigenvalues $d^{*}_k$, $k = 1, \dots, K$, are sorted in non-increasing order.

Figure 2: (a,b,c) Example with $R(f) \triangleq \int_0^1 \big(\frac{d^2}{dx^2}f(x)\big)^2 dx$. (a) Green's function associated with the kernel of $R$. (b) Noisy training samples (blue dots) from the underlying function (orange) $y = \sin(2\pi x)$; fitting without regularization leads to overfitting (blue curve). (c) Four rounds of self-distillation (blue, orange, green, red) with $\epsilon = 0.04$. (d) Evolution of the diagonal entries of the (diagonal) matrix $\boldsymbol{B}_t$ from (19) at distillation rounds $t = 0$ (leftmost) to $t = 3$ (rightmost). The number of training points is $K = 11$, so $\boldsymbol{B}_t$, which is a $K \times K$ diagonal matrix, has 11 entries on its diagonal, each corresponding to one of the bars in the chart.

## 4 Illustrative Example

Let $\mathcal{F}$ be the space of twice-differentiable functions that map $[0,1]$ to $\mathbb{R}$, $\mathcal{F} \triangleq \{f \mid f: [0,1] \rightarrow \mathbb{R}\}$. Define the linear operator $P: \mathcal{F} \rightarrow \mathcal{F}$ as $[Pf](x) \triangleq \frac{d^2}{dx^2}f(x)$, subject to the boundary conditions $f(0) = f(1) = f''(0) = f''(1) = 0$. The associated regularization functional becomes $R(f) \triangleq \int_0^1 \big(\frac{d^2}{dx^2}f(x)\big)^2 dx$. Observe that this regularizer encourages smoother $f$ by penalizing the second-order derivative of the function. The Green's function of the operator associated with the kernel of $R$, subject to the listed boundary conditions, is a spline [28] (see Figure 2-a):

$$g(x, x') = \frac{1}{6}\max\big((x - x')^3,\, 0\big) - \frac{1}{6}\,x\,(1 - x')\,(x^2 - 2x' + x'^2)\,.$$

Now consider training points $(x_k, y_k)$ sampled from the function $y = \sin(2\pi x)$. Let the $x_k$ be evenly spaced in the interval $[0,1]$ with steps of $0.1$, and $y_k = \sin(2\pi x_k) + \eta$, where $\eta$ is a zero-mean normal random variable with $\sigma = 0.5$ (Figure 2-b). As shown in Figure 2-c, the regularization induced by self-distillation initially improves the quality of the fit, but beyond that point additional rounds of self-distillation over-regularize and lead to underfitting. We also computed the diagonal matrix $\boldsymbol{B}_t$ (see (19) for the definition) at each self-distillation round $t$, for $t = 0, \dots, 3$ (after that, the solution collapses). Recall from (21) that the entries of this matrix can be thought of as the coefficients of the basis functions used to represent the solution. As predicted by our analysis, self-distillation regularizes the solution by sparsifying these coefficients; this is evident in Figure 2-d, where the smaller coefficients shrink faster.
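The example above can be reproduced end to end with a short script. The following sketch is not the paper's released code: the random seed is arbitrary, and we sample 11 evenly spaced interior points rather than including the endpoints $x = 0$ and $x = 1$ (where this Green's function vanishes identically, which would make $\boldsymbol{G}$ singular). It fits the noisy sine data and then performs self-distillation rounds, reporting the fit against the original labels $\boldsymbol{y}_0$ at each round.

```python
import numpy as np
from scipy.optimize import brentq


def g_spline(x, xp):
    """Green's function of the operator behind R(f) = int (f'')^2 on [0, 1]
    with boundary conditions f(0) = f(1) = f''(0) = f''(1) = 0 (cubic spline)."""
    return max((x - xp) ** 3, 0.0) / 6.0 - x * (1.0 - xp) * (x * x - 2.0 * xp + xp * xp) / 6.0


rng = np.random.default_rng(0)                       # arbitrary seed
x_train = np.arange(1, 12) / 12.0                    # 11 evenly spaced interior points
y_0 = np.sin(2 * np.pi * x_train) + 0.5 * rng.standard_normal(x_train.size)
K, eps = x_train.size, 0.04

G = np.array([[g_spline(a, b) for b in x_train] for a in x_train]) / K
d, V = np.linalg.eigh(G)

y_t = y_0.copy()
for t in range(4):                                   # rounds t = 0..3, as in Figure 2
    r = np.linalg.norm(y_t) / np.sqrt(K * eps)
    if r <= 1.0:
        print(f"round {t}: collapse, the solution is f = 0")
        break
    z = V.T @ y_t
    h = lambda c: np.mean((c * z / (c + d)) ** 2) - eps        # Proposition 2
    c = brentq(h, d.min() / (r - 1.0), d.max() / (r - 1.0))    # bracket (12)
    alpha = np.linalg.solve(c * np.eye(K) + G, y_t)            # coefficients of (11)
    y_t = G @ alpha                                            # f_t evaluated on the training points
    print(f"round {t}: mean squared error vs y_0 = {np.mean((y_t - y_0) ** 2):.4f}")
```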
## 5 Experiments

In our experiments, we aim to empirically evaluate our theoretical analysis in the setting of deep networks. Although our theoretical results apply to Hilbert space rather than deep neural networks, recent findings show that at least very wide neural networks (the NTK regime) can be viewed as a reproducing kernel Hilbert space [16]. We adopt a clear and simple setup that is easy to reproduce (see the provided code) and is also lightweight enough to run more than 10 rounds of self-distillation. Readers interested in stronger baselines are referred to [8, 36, 2]; however, these works are limited to one or two rounds of self-distillation. The ability to run self-distillation for a larger number of rounds allows us to demonstrate the eventual decline of the test performance. To the best of our knowledge, this is the first time that the performance-decline regime has been observed. The initial improvement and later continuous decline are consistent with our theory, which shows that rounds of self-distillation continuously amplify the regularization effect: while initially this may benefit generalization, at some point the excessive regularization leads to under-fitting.

We use ResNet [12] and VGG [30] architectures and train them on the CIFAR-10 and CIFAR-100 datasets [18]. Training details and additional results are left to the appendix. Each curve in the plots corresponds to 10 runs from randomly initialized weights, where each run is a chain of self-distillation steps indicated on the x-axis. In the plots, a point represents the average and the envelope around it reflects the standard deviation. Any training accuracy reported here is based on assessing the model $f_t$ at the $t$-th self-distillation round on the original training labels $\boldsymbol{y}_0$.

Figure 3: (Top/Bottom) Accuracy of self-distillation steps using ResNet with $\ell_2$/cross-entropy loss. (Left two plots / right two plots) Test and train accuracy on CIFAR-10/CIFAR-100.

We first train the neural network using the $\ell_2$ loss, where the error is defined as the difference between the predictions (softmax over the logits) and the target labels. These results are in concordance with the regularization viewpoint of self-distillation. The theory suggests that self-distillation progressively amplifies the underlying regularization effect. As such, we expect the training accuracy (over $\boldsymbol{y}_0$) to drop in each self-distillation round. Test accuracy may go up if training can benefit from the amplified regularization; however, from the theory we expect the test accuracy to go down at some point due to over-regularization and thus underfitting. Both of these phenomena are observed in the four left plots of Figure 3. Although our theory only applies to the $\ell_2$ loss, we empirically observed similar phenomena for cross-entropy, as shown in the four right plots of Figure 3. We have included additional plots in the appendix showing the performance of $\ell_2$ distillation on CIFAR-100 using the VGG network, indicating that the theory and empirical findings are not tied to a specific architecture and apply beyond ResNet. In the appendix we also show that self-distillation and early stopping have different regularization effects.
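For completeness, the outer loop of the protocol described in this section can be sketched in a framework-agnostic way as follows. The helper `train_model`, which would train a freshly initialized network of the fixed architecture with the $\ell_2$ loss against the given targets and return a function producing logits, is a hypothetical placeholder; only the chaining of soft targets and the evaluation against the original labels $\boldsymbol{y}_0$ are taken from the setup above.

```python
import numpy as np


def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def self_distillation_chain(train_model, x_train, y_onehot, x_test, y_test, rounds=10):
    """Chain of self-distillation steps.

    `train_model(x, targets)` is a hypothetical helper: it trains a randomly
    initialized network of the fixed architecture on `targets` (here, with an
    l2 loss between softmax outputs and targets) and returns a callable that
    maps inputs to logits. Round t = 0 is standard training on the one-hot
    labels; every later round regresses onto the previous round's soft outputs.
    Accuracies are always measured against the ORIGINAL labels, as in Figure 3.
    """
    targets, history = y_onehot, []
    for t in range(rounds + 1):
        model = train_model(x_train, targets)
        soft_train = softmax(model(x_train))
        train_acc = np.mean(soft_train.argmax(axis=1) == y_onehot.argmax(axis=1))
        test_acc = np.mean(softmax(model(x_test)).argmax(axis=1) == y_test)
        history.append((t, train_acc, test_acc))
        targets = soft_train            # predictions become the next round's targets
    return history
```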
## 6 Conclusion

In this work, we presented a rigorous analysis of self-distillation for ridge regression in a Hilbert space of functions. We showed that self-distillation iterations in the setting we studied cannot continue indefinitely; at some point the solution collapses to zero. We provided a lower bound on the number of meaningful (non-collapsed) distillation iterations. In addition, we proved that self-distillation acts as a regularizer that progressively employs fewer basis functions to represent the solution. We discussed the difference between the regularization effect induced by self-distillation and that of early stopping. We also showed that operating in the near-interpolation regime facilitates the regularization effect. We discussed how our regression setting resembles the NTK view of wide neural networks, and thus may provide some insight into how self-distillation works in deep learning. We hope that our work can be used as a stepping stone to broader settings; in particular, studying the cross-entropy loss as well as other forms of regularization are interesting directions for further research.

## Broader Impact

We believe that this paper is categorized as fundamental and theoretical research and is not targeted at any specific application area. The insights and theory developed here may inspire novel algorithms and further investigations into knowledge distillation and, more generally, neural network regularization and generalization. Consequently, this may lead to better training algorithms with lower training time, computational cost, or energy consumption. The research presented here can be used in many different application areas, and a particular use may have either positive or negative implications. We are, however, not aware of any immediate short-term negative impact of this research.

## Acknowledgement

We would like to thank our colleagues at Google Research for their feedback and comments: Moshe Dubiner, Pierre Foret, Sergey Ioffe, Yiding Jiang, Alan Mackey, Matt Streeter, and Andrey Zhmoginov.

## References

[1] Samira Abnar, Mostafa Dehghani, and Willem Zuidema. Transferring inductive biases through knowledge distillation, 2020.

[2] Sungsoo Ahn, Shell Xu Hu, Andreas C. Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9155-9163, 2019.

[3] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

[5] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 535-541, New York, NY, USA, 2006. Association for Computing Machinery.

[6] Bin Dong, Jikai Hou, Yiping Lu, and Zhihua Zhang. Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network. arXiv:1910.01255, 2019.

[7] D. G. Duffy. Green's Functions with Applications. Applied Mathematics. CRC Press, 2001.

[8] Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1602-1611, 2018.

[9] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-conquer reinforcement learning. In International Conference on Learning Representations, 2018.

[10] Federico Girosi, Michael Jones, and Tomaso Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.

[11] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In International Conference on Learning Representations, 2019.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.

[13] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[14] Zhang-Wei Hong, Prabhat Nagarajan, and Guilherme Maeda. Collaborative inter-agent knowledge distillation for reinforcement learning, 2020.

[15] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. CoRR, abs/1707.01219, 2017.
[16] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pages 8580-8589, 2018. Curran Associates Inc.

[17] Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems 28, pages 3438-3446. Curran Associates, Inc., 2015.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[19] Xu Lan, Xiatian Zhu, and Shaogang Gong. Knowledge distillation by on-the-fly native ensemble. In Advances in Neural Information Processing Systems 31, pages 7517-7527. Curran Associates, Inc., 2018.

[20] Zhizhong Li and Derek Hoiem. Learning without forgetting. In ECCV, 2016.

[21] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, and Sanjiv Kumar. Why distillation helps: A statistical perspective, 2020.

[22] Peyman Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. IEEE Signal Processing Magazine, 30(1):106-128, 2013.

[23] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. AAAI 2020, abs/1902.03393, 2020.

[24] Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. Zero-shot knowledge distillation in deep networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4743-4751, Long Beach, California, USA, 2019. PMLR.

[25] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3962-3971, 2019.

[26] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5142-5151, Long Beach, California, USA, 2019. PMLR.

[27] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.

[28] Helene Charlotte Rytgaard. Statistical models for robust spline smoothing. Master's thesis, University of Copenhagen, 2016.

[29] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Lecture Notes in Computer Science, Vol. 2111, pages 416-426, Berlin, Germany, 2001. Springer.

[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[31] Suraj Srinivas and François Fleuret. Knowledge transfer with Jacobian matching. CoRR, abs/1803.00443, 2018.

[32] Yee Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 4496-4506. Curran Associates, Inc., 2017.
[33] Linh Tran, Bastiaan S. Veeling, Kevin Roth, Jakub Swiatkowski, Joshua V. Dillon, Jasper Snoek, Stephan Mandt, Tim Salimans, Sebastian Nowozin, and Rodolphe Jenatton. Hydra: Preserving ensemble diversity for model distillation, 2020.

[34] Meet P. Vadera and Benjamin M. Marlin. Generalized Bayesian posterior expectation distillation for deep neural networks, 2020.

[35] Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. KDGAN: Knowledge distillation with generative adversarial networks. In Advances in Neural Information Processing Systems 31, pages 775-786. Curran Associates, Inc., 2018.

[36] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L. Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pages 5628-5635. AAAI Press, 2019.

[37] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7130-7138, 2017.

[38] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.