# Generalization Bounds via Distillation

Published as a conference paper at ICLR 2021

Daniel Hsu (Columbia University, New York City; djhsu@cs.columbia.edu)
Ziwei Ji, Matus Telgarsky, Lan Wang (University of Illinois, Urbana-Champaign; {ziweiji2,mjt,lanwang2}@illinois.edu)

ABSTRACT

This paper theoretically investigates the following empirical phenomenon: given a high-complexity network with poor generalization bounds, one can distill it into a network with nearly identical predictions but low complexity and vastly smaller generalization bounds. The main contribution is an analysis showing that the original network inherits this good generalization bound from its distillation, assuming the use of well-behaved data augmentation. This bound is presented both in an abstract and in a concrete form, the latter complemented by a reduction technique to handle modern computation graphs featuring convolutional layers, fully-connected layers, and skip connections, to name a few. To round out the story, a (looser) classical uniform convergence analysis of compression is also presented, as well as a variety of experiments on cifar10 and mnist demonstrating similar generalization performance between the original network and its distillation.

1 OVERVIEW AND MAIN RESULTS

Generalization bounds are statistical tools which take as input various measurements of a predictor on training data, and output a performance estimate for unseen data; that is, they estimate how well the predictor generalizes to unseen data. Despite extensive development spanning many decades (Anthony & Bartlett, 1999), there is growing concern that these bounds are not only disastrously loose (Dziugaite & Roy, 2017), but worse, that they do not correlate with the underlying phenomena (Jiang et al., 2019b), and even that the basic method of proof is doomed (Zhang et al., 2016; Nagarajan & Kolter, 2019). As an explicit demonstration of the looseness of these bounds, Figure 1 calculates bounds for a standard ResNet architecture achieving test errors of respectively 0.008 and 0.067 on mnist and cifar10; the observed generalization gap is 10^{-1}, while standard generalization techniques upper bound it with 10^{15}.

Contrary to this dilemma, there is evidence that these networks can often be compressed or distilled into simpler networks, while still preserving their output values and low test error. Meanwhile, these simpler networks exhibit vastly better generalization bounds: again referring to Figure 1, those same networks from before can be distilled with hardly any change to their outputs, while their bounds reduce by a factor of roughly 10^{10}. Distillation is widely studied (Buciluǎ et al., 2006; Hinton et al., 2015), but usually the original network is discarded and only the final distilled network is preserved. The purpose of this work is to carry the good generalization bounds of the distilled network back to the original network; in a sense, the explicit simplicity of the distilled network is used as a witness to the implicit simplicity of the original network. The main contributions are as follows.

The main theoretical contribution is a generalization bound for the original, undistilled network which scales primarily with the generalization properties of its distillation, assuming that well-behaved data augmentation is used to measure the distillation distance. An abstract version of this bound is stated in Lemma 1.1, along with a sufficient data augmentation technique in Lemma 1.2.
A concrete version of the bound, suitable to handle the ResNet architecture in Figure 1, is described in Theorem 1.3. Handling sophisticated architectures with only minor proof alterations is another contribution of this work, and is described alongside Theorem 1.3. This abstract and concrete analysis is sketched in Section 3, with full proofs deferred to appendices.

Rather than using an assumption on the distillation process (e.g., the aforementioned well-behaved data augmentation), this work also gives a direct uniform convergence analysis, culminating in Theorem 1.4. This is presented partially as an open problem or cautionary tale, as its proof is vastly more sophisticated than that of Theorem 1.3, but ultimately results in a much looser analysis. This analysis is sketched in Section 3, with full proofs deferred to appendices.

Figure 1: Generalization bounds throughout distillation. (a) ResNet8 trained on cifar10. (b) ResNet8 trained on mnist. These two subfigures track a sequence of increasingly distilled/compressed ResNet8 networks along their horizontal axes, respectively for cifar10 and mnist data. This horizontal axis measures distillation distance Φ_{γ,m}, as defined below in eq. (1.1). The bottom curves measure various training and testing errors, whereas the top two curves measure respectively a generalization bound presented here (cf. Theorem 1.3 and Lemma 3.1), and a generalization measure. Notably, the top two curves drop throughout a long interval during which test error remains small. For further experimental details, see Section 2.

While this work is primarily theoretical, it is motivated by Figure 1 and related experiments: Figures 2 to 4 demonstrate that not only does distillation improve generalization upper bounds, but moreover it makes them sufficiently tight to capture intrinsic properties of the predictors, for example removing the usual bad dependence on width in these bounds (cf. Figure 3). These experiments are detailed in Section 2.

1.1 AN ABSTRACT BOUND VIA DATA AUGMENTATION

This subsection describes the basic distillation setup and the core abstract bound based on data augmentation, culminating in Lemmas 1.1 and 1.2; a concrete bound follows in Section 1.2.

Given a multi-class predictor f : R^d → R^k, distillation finds another predictor g : R^d → R^k which is simpler, but close in distillation distance Φ_{γ,m}, meaning the softmax outputs φ_γ are close on average over a set of points (z_i)_{i=1}^m:

Φ_{γ,m}(f, g) := (1/m) Σ_{i=1}^m ‖φ_γ(f(z_i)) − φ_γ(g(z_i))‖_1,  where  φ_γ(f(z)) ∝ exp(f(z)/γ).   (1.1)

The quantity γ > 0 is sometimes called a temperature (Hinton et al., 2015). Decreasing γ increases sensitivity near the decision boundary; in this way, it is naturally related to the concept of margins in generalization theory, as detailed in Appendix B. Due to these connections, the use of softmax is beneficial in this work, though not completely standard in the literature (Buciluǎ et al., 2006).
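As a quick reference point, eq. (1.1) is straightforward to compute from logits; the following minimal numpy sketch does so, with function names and toy inputs that are illustrative rather than taken from any released code.

```python
import numpy as np

def softmax_temperature(logits, gamma):
    """Row-wise softmax with temperature gamma: phi_gamma(v) = softmax(v / gamma)."""
    z = logits / gamma
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_distance(logits_f, logits_g, gamma):
    """Phi_{gamma,m}(f, g): average L1 gap between tempered softmax outputs on m points."""
    diff = softmax_temperature(logits_f, gamma) - softmax_temperature(logits_g, gamma)
    return np.abs(diff).sum(axis=1).mean()

# toy usage: m = 4 augmentation points, k = 3 classes
rng = np.random.default_rng(0)
f_out = rng.normal(size=(4, 3))
g_out = f_out + 0.01 * rng.normal(size=(4, 3))  # a distillation close to f
print(distillation_distance(f_out, g_out, gamma=0.1))
```

A smaller temperature gamma magnifies differences near the decision boundary, mirroring the margin connection discussed above.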
We can now outline Figure 1 and the associated empirical phenomenon which motivates this work. (Please see Section 2 for further details on these experiments.) Consider a predictor f which has good test error but bad generalization bounds; by treating the distillation distance Φ_{γ,m}(f, g) as an objective function and increasingly regularizing g, we obtain a sequence of predictors (g_0, . . . , g_t), where g_0 = f, which trade off between distillation distance and predictor complexity. The curves in Figure 1 are produced in exactly this way, and demonstrate that there are predictors nearly identical to the original f which have vastly smaller generalization bounds. Our goal here is to show that this is enough to imply that f in turn must also have good generalization bounds, despite its apparent complexity.

To sketch the idea, by a bit of algebra (cf. Lemma A.2), we can upper bound error probabilities with expected distillation distances and errors:

Pr_{x,y}[arg max_{y'} f(x)_{y'} ≠ y] ≤ 2 E_x ‖φ_γ(f(x)) − φ_γ(g(x))‖_1 + 2 E_{x,y} (1 − φ_γ(g(x))_y).

The next step is to convert these expected errors into quantities over the training set. The last term is already in a form we want: it depends only on g, so we can apply uniform convergence with the low complexity of g. (Measured over the training set, this term is the distillation error in Figure 1.) The expected distillation distance term is problematic, however. Here are two approaches.

1. We can directly apply uniform convergence; for instance, this approach was followed by Suzuki et al. (2019), and a more direct approach is followed here to prove Theorem 1.4. Unfortunately, it is unclear how this technique can avoid paying significantly for the high complexity of f.

2. The idea in this subsection is to somehow trade off computation for the high statistical cost of the complexity of f. Specifically, notice that Φ_{γ,m}(f, g) only relies upon the marginal distribution of the inputs x, and not their labels. This subsection will pay computation to estimate Φ_{γ,m} with extra samples via data augmentation, offsetting the high complexity of f.

We can now set up and state our main distillation bound. Suppose we have a training set ((x_i, y_i))_{i=1}^n drawn from some measure µ, with marginal distribution µ_X on the inputs x. Suppose we also have (z_i)_{i=1}^m drawn from a data augmentation distribution ν_n, the subscript referring to the fact that it depends on (x_i)_{i=1}^n. Our analysis works when ‖dµ_X/dν_n‖_∞, the ratio between the two densities, is finite. If it is large, then one can tighten the bound by sampling more from ν_n, which is a computational burden; explicit bounds on this term will be given shortly in Lemma 1.2.

Lemma 1.1. Let temperature parameter γ > 0 be given, along with sets of multiclass predictors F and G. Then with probability at least 1 − 2δ over an iid draw of data ((x_i, y_i))_{i=1}^n from µ and (z_i)_{i=1}^m from ν_n, every f ∈ F and g ∈ G satisfy

Pr[arg max_{y'} f(x)_{y'} ≠ y] ≤ 2 ‖dµ_X/dν_n‖_∞ Φ_{γ,m}(f, g) + (2/n) Σ_{i=1}^n (1 − φ_γ(g(x_i))_{y_i}) + Õ( ‖dµ_X/dν_n‖_∞ (k/γ)(Rad_m(F) + Rad_m(G)) + (k/γ) Rad_n(G) + ‖dµ_X/dν_n‖_∞ √(ln(1/δ)/m) + √(ln(1/δ)/n) ),

where Rademacher complexities Rad_n and Rad_m are defined in Section 1.4.

A key point is that the Rademacher complexity Rad_m(F) of the complicated functions F has a subscript m, which explicitly introduces a factor 1/m in the complexity definition (cf. Section 1.4). As such, sampling more from the data augmentation measure can mitigate this term, and leave the complexity of the distillation class G as the dominant term. Of course, this also requires ‖dµ_X/dν_n‖_∞ to be reasonable. As follows is one data augmentation scheme (and assumption on the marginal distribution µ_X) which ensures this.

Lemma 1.2. Let (x_i)_{i=1}^n be a data sample drawn iid from µ_X, and suppose the corresponding density p is supported on [0, 1]^d and is Hölder continuous, meaning |p(x) − p(x′)| ≤ C_α ‖x − x′‖^α for some C_α ≥ 0 and α ∈ [0, 1]. Define a data augmentation measure ν_n via the following sampling procedure. With probability 1/2, sample z uniformly within [0, 1]^d. Otherwise, select a data index i ∈ [n] uniformly, and sample z from a Gaussian centered at x_i and having covariance σ²I where σ := n^{−1/(2α+d)}. Then with probability at least 1 − 1/n over the draw of (x_i)_{i=1}^n,

‖dµ_X/dν_n‖_∞ ≤ 4 + O( √(ln n) / n^{α/(2α+d)} ).
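For concreteness, the sampling procedure of Lemma 1.2 can be written in a few lines; the sketch below assumes inputs lie in [0, 1]^d and takes the Hölder exponent α as a user-supplied parameter, with all names illustrative.

```python
import numpy as np

def sample_augmentation(X, m, alpha=1.0, rng=None):
    """Draw m points from nu_n: with probability 1/2 sample uniformly on [0,1]^d,
       otherwise sample a Gaussian with covariance sigma^2 I around a random training point,
       where sigma = n^(-1/(2*alpha+d)) as in Lemma 1.2."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    sigma = n ** (-1.0 / (2 * alpha + d))
    Z = np.empty((m, d))
    for t in range(m):
        if rng.random() < 0.5:
            Z[t] = rng.uniform(0.0, 1.0, size=d)
        else:
            i = rng.integers(n)
            Z[t] = X[i] + sigma * rng.normal(size=d)
    return Z

# toy usage: n = 100 training inputs in [0,1]^2, m = 1000 augmentation samples
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(100, 2))
Z = sample_augmentation(X_train, m=1000, alpha=1.0, rng=rng)
```

The uniform component handles low-density regions, while the Gaussian component acts like a kernel density estimate in high-density regions, matching the two-case proof sketch in Section 3.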
Though the idea is not pursued here, there are other ways to control ‖dµ_X/dν_n‖_∞, for instance via an independent sample of unlabeled data; Lemma 1.1 is agnostic to these choices.

1.2 A CONCRETE BOUND FOR COMPUTATION GRAPHS

This subsection gives an explicit complexity bound which starts from Lemma 1.1, but bounds ‖dµ_X/dν_n‖_∞ via Lemma 1.2, and also includes an upper bound on Rademacher complexity which can handle the ResNet, as in Figure 1. A side contribution of this work is the formalism to easily handle these architectures, detailed as follows.

Canonical computation graphs are a way to write down feedforward networks which include dense linear layers, convolutional layers, skip connections, and multivariate gates, to name a few, all while allowing the analysis to look roughly like a regular dense network. The construction applies directly to batches: given an input batch X ∈ R^{n×d}, the output X_i of layer i is defined inductively as

X_0^T := X^T,   X_i^T := σ_i( [W_iΠ_iD_i | F_i] X_{i−1}^T ) = σ_i( [ W_iΠ_iD_iX_{i−1}^T ; F_iX_{i−1}^T ] ),

where: σ_i is a multivariate-to-multivariate ρ_i-Lipschitz function (measured over minibatches on either side with Frobenius norm); F_i is a fixed matrix, for instance an identity mapping as in a residual network's skip connection; D_i is a fixed diagonal matrix selecting certain coordinates, for instance the non-skip part in a residual network; Π_i is a Frobenius-norm projection of a full minibatch; W_i is a weight matrix, the trainable parameters; and [W_iΠ_iD_i | F_i] denotes row-wise concatenation of W_iΠ_iD_i and F_i.

As a simple example of this architecture, a multi-layer skip connection can be modeled by including identity mappings in all relevant fixed matrices F_i, and also including identity mappings in the corresponding coordinates of the multivariate gates σ_i. As a second example, note how to model convolution layers: each layer outputs a matrix whose rows correspond to examples, but nothing prevents the batch size from changing between layers; in particular, the multivariate activation before a convolution layer can reshape its output to have each row correspond to a patch of an input image, whereby the convolution filter is now a regular dense weight matrix.

A fixed computation graph architecture G(ρ, b, r, s) has associated hyperparameters (ρ, b, r, s), described as follows: ρ_i is the Lipschitz constant of the (multivariate) gate σ_i, as described before; r_i is a norm bound ‖W_i^T‖_{2,1} ≤ r_i (the sum of the 2-norms of the rows); b_i√n (where n is the input batch size) is the radius of the Frobenius-norm ball which Π_i projects onto; and s_i is the operator norm of X ↦ [W_iΠ_iD_iX^T | F_iX^T]. While the definition is intricate, it can not only model basic residual networks, but is also sensitive enough to have s_i = 1 and r_i = 0 when residual blocks are fully zeroed out, an effect which indeed occurs during distillation.
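The layer map above can be expressed compactly in code. The following numpy sketch implements one canonical layer with an identity skip connection as the fixed matrix F_i and a residual-style gate; the shapes, the gate, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frobenius_project(A, radius):
    """Pi_i: project a minibatch onto the Frobenius-norm ball of the given radius."""
    norm = np.linalg.norm(A)
    return A if norm <= radius else A * (radius / norm)

def residual_gate(stacked):
    """A residual-style multivariate gate sigma_i: ReLU on the trained half of the
       row-wise stack, plus the skip half (a rho_i-Lipschitz map of the stacked minibatch)."""
    h = stacked.shape[0] // 2
    return np.maximum(stacked[:h], 0.0) + stacked[h:]

def graph_layer(Xprev_T, W, D, F, radius, gate=residual_gate):
    """One canonical layer X_i^T = sigma_i([W_i Pi_i D_i | F_i] X_{i-1}^T):
       stack the trained branch W (Pi (D Xprev^T)) on top of the fixed branch F Xprev^T."""
    trained = W @ frobenius_project(D @ Xprev_T, radius)
    fixed = F @ Xprev_T
    return gate(np.vstack([trained, fixed]))

# toy usage: a width-8 residual-style block applied to a batch of 5 examples
rng = np.random.default_rng(0)
X_T = rng.normal(size=(8, 5))          # columns are examples, matching the X^T convention
W = 0.1 * rng.normal(size=(8, 8))      # trainable weights
D = np.eye(8)                          # D_i: feed all coordinates to the trained branch
F = np.eye(8)                          # F_i: identity skip connection
out_T = graph_layer(X_T, W, D, F, radius=np.linalg.norm(X_T))
print(out_T.shape)                     # (8, 5)
```

Zeroing out W corresponds to the fully pruned residual blocks mentioned above: the layer reduces to the fixed skip branch.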
Theorem 1.3. Let temperature parameter γ > 0 be given, along with multiclass predictors F, and a computation graph architecture G. Then with probability at least 1 − 2δ over an iid draw of data ((x_i, y_i))_{i=1}^n from µ and (z_i)_{i=1}^m from ν_n, every f ∈ F satisfies

Pr[arg max_{y'} f(x)_{y'} ≠ y] ≤ inf_{(b, r, s)} inf_{g ∈ G(ρ, b, r, s)} ( 2 ‖dµ_X/dν_n‖_∞ Φ_{γ,m}(f, g) + (2/n) Σ_{i=1}^n (1 − φ_γ(g(x_i))_{y_i}) + Õ( ‖dµ_X/dν_n‖_∞ (k/γ)( Rad_m(F) + Rad_m(G(ρ, b, r, s)) ) + (k/γ) Rad_n(G(ρ, b, r, s)) ) ).

Under the conditions of Lemma 1.2, ignoring an additional failure probability 1/n, ‖dµ_X/dν_n‖_∞ ≤ 4 + O( √(ln n) / n^{α/(2α+d)} ).

A proof sketch of this bound appears in Section 3, with full details deferred to appendices. The proof is a simplification of the covering number argument from (Bartlett et al., 2017a); for another computation graph formalism designed to work with the covering number arguments from (Bartlett et al., 2017a), see the generalization bounds due to Wei & Ma (2019).

Figure 2: Performance of stable rank bound (cf. Theorem 1.4). (a) Comparison of bounds on cifar10. (b) Width dependence with Theorem 1.4, for widths {64, 256, 1024}. Figure 2a compares Theorem 1.4 to Lemma 3.1 and the VC bound (Bartlett et al., 2017b), and Figure 2b normalizes the margin histogram by Theorem 1.4, showing an unfortunate failure of width independence (cf. Figure 3). For details and a discussion of margin histograms, see Section 2.

1.3 A UNIFORM-CONVERGENCE APPROACH TO DISTILLATION

In this section, we derive a Rademacher complexity bound on F whose proof internally uses compression; specifically, it first replaces f with a narrower network g, and then uses a covering number bound sensitive to network size to control g. The proof analytically chooses g's width based on the structure of f and also the provided data, and this data dependence incurs a factor which causes the familiar 1/√n rate to worsen to 1/n^{1/4} (which appears as ‖X‖_F/n^{3/4}). This proof is much more intricate than the proofs coming before, cannot handle general computation graphs, and also ignores the beneficial structure of the softmax.

Theorem 1.4. Let data matrix X ∈ R^{n×d} be given, and let F denote networks of the form x ↦ σ_L(W_L · · · σ_1(W_1x)) with spectral norms ‖W_i‖_2 ≤ s_i, 1-Lipschitz and 1-homogeneous activations σ_i, Frobenius norms ‖W_i‖_F ≤ R_i, and width at most m. Then

Rad(F) = Õ( (‖X‖_F / n^{3/4}) · ∏_i s_i · [ Σ_i (R_i/s_i)^{4/5} ]^{5/4} · [ Σ_i ln R_i ]^{1/4} ).

The term R_i/s_i is the square root of the stable rank of weight matrix W_i, and is a desirable quantity in a generalization bound: it scales more mildly with width than terms like ‖W_i^T‖_{2,1} and ‖W_i‖_F·√(width), which often appear (the former appears in Theorem 1.3 and Lemma 3.1). Another stable rank bound was developed by Suzuki et al. (2019), but has an extra mild dependence on width. As depicted in Figure 2, however, this bound is not fully width-independent. Moreover, we can compare it to Lemma 3.1 throughout distillation, and not only does this bound not capture the power of distillation, but also, eventually its bad dependence on n causes it to lose out to Lemma 3.1.
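For reference, the quantities contrasted here are cheap to compute; the following sketch compares the stable rank of Theorem 1.4 with the width-sensitive (2,1)-norm appearing in Theorem 1.3 and Lemma 3.1, on illustrative random matrices.

```python
import numpy as np

def stable_rank(W):
    """sr(W) = ||W||_F^2 / ||W||_2^2, at most rank(W) and often far smaller."""
    fro2 = np.linalg.norm(W) ** 2
    spec2 = np.linalg.norm(W, 2) ** 2      # ord=2 gives the largest singular value
    return fro2 / spec2 if spec2 > 0 else 0.0

def norm_21(W_T):
    """||W^T||_{2,1}: sum of Euclidean norms of the rows of W^T (grows with width)."""
    return np.linalg.norm(W_T, axis=1).sum()

rng = np.random.default_rng(0)
for width in (64, 256, 1024):
    W = rng.normal(size=(width, width)) / np.sqrt(width)
    print(width, round(stable_rank(W), 1), round(norm_21(W.T), 1))
```

On such random matrices the (2,1)-norm grows with width while the stable rank stays comparatively tame, which is the motivation for Theorem 1.4 even though Figure 2b shows it does not fully deliver width independence.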
1.4 ADDITIONAL NOTATION

Given data (z_i)_{i=1}^n, the Rademacher complexity of univariate functions H is

Rad(H) := E_ε sup_{h ∈ H} (1/n) Σ_{i=1}^n ε_i h(z_i),   where the ε_i are i.i.d. Uniform({−1, +1}).

Rademacher complexity is the most common tool in generalization theory (Shalev-Shwartz & Ben-David, 2014), and is incorporated in Lemma 1.1 due to its convenience and wide use. To handle multivariate (multiclass) outputs, the definition is overloaded via the worst-case labels as

Rad_n(F) := sup_{y ∈ [k]^n} Rad({(x, y) ↦ f(x)_y : f ∈ F}).

This definition is for mathematical convenience, but overall not ideal; Rademacher complexity seems to have difficulty dealing with such geometries. Regarding norms, ‖·‖ = ‖·‖_F will denote the Frobenius norm, and ‖·‖_2 will denote the spectral norm.

Figure 3: Width independence. (a) Margins before distillation, for widths {64, 256, 1024}. (b) Margins after distillation, with distillation distances Φ ≈ 0.035, 0.026, 0.026 for widths 64, 256, 1024 respectively. Fully-connected 6-layer networks of widths {64, 256, 1024} were trained on mnist until training error zero; the margin histograms, normalized by the generalization bound in Lemma 3.1, all differ, and are close to zero. After distillation, the margin distributions are far from zero and nearly the same. In the distillation legend, the second term Φ_{γ,m} denotes the distillation distance, as defined in Equation (1.1). Experiment details and an explanation of margin histograms appear in Section 2.

2 ILLUSTRATIVE EMPIRICAL RESULTS

This section describes the experimental setup, and the main experiments: Figure 1 showing progressive distillation, Figure 2 comparing Theorem 1.4, Lemma 3.1 and VC dimension, Figure 3 showing width independence after distillation, and Figure 4 showing the effect of random labels.

Experimental setup. As sketched before, networks were trained in a standard way on either cifar10 or mnist, and then distilled by trading off between complexity and distillation distance Φ_{γ,m}. Details are as follows.

1. Training the initial network f. In Figures 1 and 2a, the architecture was a ResNet8 based on one used in (Coleman et al., 2017), and achieved test errors 0.067 and 0.008 on cifar10 and mnist, respectively, with no changes to the setup and a modest amount of training; the training algorithm was Adam; this and most other choices followed the scheme in (Coleman et al., 2017) to achieve a competitively low test error on cifar10. In Figures 2b, 3 and 4, a 6-layer fully-connected network was used (width 8192 in Figure 2b, widths {64, 256, 1024} in Figure 3, width 256 in Figure 4), and vanilla SGD was used to optimize.

2. Training the distillation network g. Given f and a regularization strength λ_j, each distillation g_j was found via approximate minimization of the objective

g ↦ Φ_{γ,m}(f, g) + λ_j · Complexity(g).   (2.1)

In more detail, first g_0 was initialized to f (g and f always used the same architecture) and optimized via eq. (2.1) with λ_0 set to roughly risk(f)/Complexity(f), and thereafter g_{j+1} was initialized to g_j and found by optimizing eq. (2.1) with λ_{j+1} := 2λ_j (a schematic version of this loop appears below). The optimization method was the same as the one used to find f. The term Complexity(g) was some computationally reasonable approximation of Lemma 3.1: for Figures 2b, 3 and 4, it was just Σ_i ‖W_i^T‖_{2,1}, but for Figures 1 and 2a, it also included a tractable surrogate for the product of the spectral norms, which greatly helped distillation performance with these deeper architectures.

In Figures 2b, 3 and 4, a full regularization sequence was not shown, only a single g_j. This was chosen with a simple heuristic: amongst all (g_j)_{j≥1}, pick the one whose 10% margin quantile is largest (see the definition and discussion of margins below).
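The distillation loop around eq. (2.1) can be sketched schematically as follows; this toy version uses a single linear layer and a squared-error surrogate in place of Φ_{γ,m} so that the gradient stays one line, and every name here is illustrative rather than the authors' code.

```python
import numpy as np

def complexity(W):
    """Complexity surrogate used for Figures 2b, 3 and 4: ||W^T||_{2,1}."""
    return np.linalg.norm(W, axis=0).sum()   # sum of 2-norms of the rows of W^T

def distill_sequence(W_f, X, lam0, rounds=8, steps=500, lr=1e-3):
    """Schematic loop around eq. (2.1) for a single linear layer g(x) = x W:
       warm-start g_{j+1} at g_j, minimize distance-to-f plus lam_j * Complexity, double lam_j."""
    teacher = X @ W_f                        # fixed outputs of f on the sample
    W, lam, seq = W_f.copy(), lam0, [W_f.copy()]
    for _ in range(rounds):
        for _ in range(steps):
            grad_fit = X.T @ (X @ W - teacher) / len(X)   # squared-error surrogate of Phi
            grad_reg = W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)
            W -= lr * (grad_fit + lam * grad_reg)
        seq.append(W.copy())
        lam *= 2.0
    return seq

# toy usage: distill a random 20-dimensional, 5-class linear "teacher"
rng = np.random.default_rng(0)
X, W_f = rng.normal(size=(256, 20)), rng.normal(size=(20, 5))
gs = distill_sequence(W_f, X, lam0=0.01)
print([round(complexity(W), 2) for W in gs])   # complexity shrinks as lam_j doubles
```

The essential structure is the warm start plus the doubling schedule for λ_j, which produces the sequence (g_0, . . . , g_t) traced along the horizontal axes of Figures 1 and 2a.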
Margin histograms. Figures 2b, 3 and 4 all depict margin histograms, a flexible tool to study the individual predictions of a network on all examples in a training set (see for instance (Schapire & Freund, 2012) for their use in studying boosting, and (Bartlett et al., 2017a; Jiang et al., 2019a) for their use in studying deep networks). Concretely, given a predictor g ∈ G, the prediction on every example is replaced with a real scalar called the normalized margin via

(x_i, y_i) ↦ ( g(x_i)_{y_i} − max_{j ≠ y_i} g(x_i)_j ) / Rad_n(G),

where Rad_n(G) is the Rademacher complexity (cf. Section 1.4), and then the histogram of these n scalars is plotted, with the horizontal axis values thus corresponding to normalized margins. By using Rademacher complexity as normalization, these margin distributions can be compared across predictors and even data sets, and give a more fine-grained analysis of the quality of the generalization bound. This normalization choice was first studied in (Bartlett et al., 2017a), where it was also mentioned that this normalization allows one to read off generalization bounds from the plot. Here, it also suggests reasonable values for the softmax temperature γ.

Figure 4: Label randomization. (a) Permuting different fractions of labels: {0%, 25%, 50%, 75%, 100%} of the labels were permuted across the respective experiments, with distillation distances Φ ≈ 0.003, 0.004, 0.006, 0.009, 0.012 from least to most randomized. (b) Zoomed in. In all cases, the margin distribution is collapsed to zero. For details, including an explanation of margin histograms, see Section 2.

Figure 1: effect of distillation on generalization bounds. This figure was described before; briefly, a highlight is that in the initial phase, training and testing errors hardly change while bounds drop by a factor of nearly 10^{10}. Regarding "generalization measure", this term appears in studies of quantities which correlate with generalization, but are not necessarily rigorous generalization bounds (Jiang et al., 2019b; Dziugaite et al., 2020); in this specific case, the product of Frobenius norms requires a dense ReLU network (Golowich et al., 2018), and is invalid for the ResNet (e.g., a complicated ResNet with a single identity residual block yields a value 0 by this measure).

Figure 2a: comparison of Theorem 1.4, Lemma 3.1 and VC bounds. Theorem 1.4 was intended to internalize distillation, but as in Figure 2a, clearly a subsequent distillation still greatly reduces the bound. While initially the bound is better than Lemma 3.1 (which does not internalize distillation), eventually the n^{1/4} factor causes it to lose out. Also note that eventually the bounds beat the VC bound, which has been identified as a surprisingly challenging baseline (Arora et al., 2018).

Figure 3: width independence. Prior work has identified that generalization bounds are quite bad at handling changes in width, even if predictions and test error don't change much (Nagarajan & Kolter, 2019; Jiang et al., 2019b; Dziugaite et al., 2020). This is captured in Figure 3a, where the margin distributions (see above) with different widths are all very different, despite similar test errors. However, following distillation, the margin histograms in Figure 3b are nearly identical! That is to say: distillation not only decreases loose upper bounds as before, it tightens them to the point where they capture intrinsic properties of the predictors.
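Concretely, the margin-histogram computation amounts to the following minimal numpy sketch; the toy logits and the normalizer value are placeholders, with the normalizer standing in for Rad_n(G) (or the bound of Lemma 3.1).

```python
import numpy as np

def normalized_margins(logits, labels, normalizer):
    """Per-example margin g(x_i)_{y_i} - max_{j != y_i} g(x_i)_j, divided by a complexity
       normalizer so histograms are comparable across predictors and data sets."""
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    rest = logits.copy()
    rest[np.arange(n), labels] = -np.inf
    return (correct - rest.max(axis=1)) / normalizer

# toy usage: random logits; a real run would use a trained g and its complexity bound
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
labels = rng.integers(10, size=1000)
margins = normalized_margins(logits, labels, normalizer=50.0)
hist, edges = np.histogram(margins, bins=20)      # the plotted histogram
print(np.quantile(margins, 0.10))                 # the 10% quantile used as a selection heuristic
```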
Figure 2b: failure of width independence with Theorem 1.4. The bound in Theorem 1.4 was designed to internalize compression, and there was some hope of this due to the stable rank term. Unfortunately, Figure 2b shows that it doesn't quite succeed: while the margin histograms are less separated than for the undistilled networks in Figure 3a, they are still visibly separated, unlike the post-distillation histograms in Figure 3b.

Figure 4: random labels. A standard sanity check for generalization bounds is whether they can reflect the difficulty of fitting random labels (Zhang et al., 2016). While it has been empirically shown that Rademacher bounds do sharply reflect the presence of random labels (Bartlett et al., 2017a, Figures 2 & 3), the effect is amplified with distillation: even randomizing just 25% shrinks the margin distribution significantly.

3 ANALYSIS OVERVIEW AND SKETCH OF PROOFS

This section sketches all proofs, and provides further context and connections to the literature. Full proof details appear in the appendices.

3.1 ABSTRACT DATA AUGMENTATION BOUNDS IN SECTION 1.1

As mentioned in Section 1.1, the first step of the proof is to apply Lemma A.2 to obtain

Pr_{x,y}[arg max_{y'} f(x)_{y'} ≠ y] ≤ 2 E_x ‖φ_γ(f(x)) − φ_γ(g(x))‖_1 + 2 E_{x,y} (1 − φ_γ(g(x))_y);

this step is similar to how the ramp loss is used with margin-based generalization bounds, a connection which is discussed in Appendix B. Section 1.1 also mentioned that the last term is easy: φ_γ is (1/γ)-Lipschitz, and we can peel it off and only pay the Rademacher complexity associated with g ∈ G. With data augmentation, the first term is also easy:

E_x ‖φ_γ(f(x)) − φ_γ(g(x))‖_1 = ∫ ‖φ_γ(f(z)) − φ_γ(g(z))‖_1 dµ_X(z) = ∫ ‖φ_γ(f(z)) − φ_γ(g(z))‖_1 (dµ_X/dν_n)(z) dν_n(z) ≤ ‖dµ_X/dν_n‖_∞ ∫ ‖φ_γ(f(z)) − φ_γ(g(z))‖_1 dν_n(z),

and now we may apply uniform convergence to ν_n rather than µ_X. In the appendix, this proof is handled with a bit more generality, allowing arbitrary norms, which may help in certain settings. All together, this leads to a proof of Lemma 1.1.

For the explicit data augmentation estimate in Lemma 1.2, the proof breaks into roughly two cases: low-density regions, where the uniform sampling gives the bound, and high-density regions, where the Gaussian sampling gives the bound. In the latter case, the Gaussian sampling in expectation behaves as a kernel density estimate, and the proof invokes a standard bound (Jiang, 2017).

3.2 CONCRETE DATA AUGMENTATION BOUNDS IN SECTION 1.2

The main work in this proof is the following generalization bound for computation graphs, which follows the proof scheme from (Bartlett et al., 2017a), though simplified in various ways, owing mainly to the omission of general matrix norm penalties on weight matrices, and the omission of the reference matrices. The reference matrices were a technique to center the weight norm balls away from the origin; a logical place to center them was at initialization. However, in this distillation setting, it is in fact most natural to center everything at the origin, and apply regularization and shrink to a well-behaved function (rather than shrinking back to the random initialization, which after all defines a complicated function). The proof also features a simplified (2, 1)-norm matrix covering proof (cf. Lemma C.3).
Lemma 3.1. Let data X ∈ R^{n×d} be given. Let computation graph G be given, where Π_i projects to Frobenius-norm balls of radius b_i√n, and ‖W_i^T‖_{2,1} ≤ r_i, and ‖[W_iΠ_iD_i | F_i]‖_2 ≤ s_i, and Lip(σ_i) ≤ ρ_i, and all layers have width at most m. Then for every ε > 0 there exists a covering set M satisfying

sup_{g ∈ G} min_{X̂ ∈ M} ‖g(X^T) − X̂‖ ≤ ε   and   ln |M| ≤ (2^{4/3} n ln(2m²) / ε²) · ( Σ_i ( r_i b_i ρ_i ∏_{l>i} s_l ρ_l )^{2/3} )³.

Consequently, a standard Dudley entropy integral yields Rad(G) = Õ( √(ln(2m²)/n) · ( Σ_i ( r_i b_i ρ_i ∏_{l>i} s_l ρ_l )^{2/3} )^{3/2} ).

From there, the proof of Theorem 1.3 follows via Lemmas 1.1 and 1.2, and many union bounds.

3.3 DIRECT UNIFORM CONVERGENCE APPROACH IN THEOREM 1.4

As mentioned before, the first step of the proof is to sparsify the network, specifically each matrix product. Concretely, given weights W_i of layer i, letting X_{i−1}^T denote the input to this layer, then

W_i X_{i−1}^T = Σ_{j=1}^m (W_i e_j)(X_{i−1} e_j)^T.

Written this way, it seems natural that the matrix product should "concentrate", and that considering all m outer products should not be necessary. Indeed, exactly such an approach has been followed before to analyze randomized matrix multiplication schemes (Sarlos, 2006). As there is no goal of high probability here, the analysis is simpler, and follows from the Maurey lemma (cf. Lemma C.1), as is used in the (2, 1)-norm matrix covering bound in Lemma C.3.

Lemma 3.2. Let a network be given with 1-Lipschitz homogeneous activations σ_i and weight matrices (W_1, . . . , W_L) of maximum width m, along with data matrix X ∈ R^{n×d} and desired widths (k_1, . . . , k_L). Then there exists a sparsified network output, recursively defined via X̂_0^T := X^T and X̂_i^T := Π_i σ_i(W_i M_i X̂_{i−1}^T), where M_i is a diagonal sampling matrix built from a multiset S_i of k_i = |S_i| coordinates with entrywise-controlled scaling terms, Π_i denotes projection onto the Frobenius-norm ball of radius ‖X‖_F ∏_{j≤i} ‖W_j‖_2, and, up to an absolute constant,

‖σ_L(W_L · · · σ_1(W_1 X^T)) − X̂_L^T‖_F ≤ ‖X‖_F ( ∏_j ‖W_j‖_2 ) Σ_i √( ‖W_i‖²_F / (k_i ‖W_i‖²_2) ).

The statement of this lemma (in its full form, with the explicit scaling terms spelled out in the appendix) is lengthy and detailed because the exact guts of the construction are needed in the subsequent generalization proof. Specifically, now that there are few nodes, a generalization bound sensitive to narrow networks can be applied. On the surface, it seems reasonable to apply a VC bound, but this approach did not yield a rate better than n^{−1/6}, and also had an explicit dependence on the depth of the network, times other terms visible in Theorem 1.4. Instead, the approach here, aiming for a better dependence on n and also no explicit dependence on network depth, was to produce an ℓ_∞-norm covering number bound (see (Long & Sedghi, 2019) for a related approach), with some minor adjustments (indeed, the ℓ_∞-norm parameter covering approach was applied to obtain a Frobenius-norm bound, as in Lemma 3.1). Unfortunately, the magnitudes of weight matrix entries must be controlled for this to work (unlike the VC approach), and this necessitated the detailed form of Lemma 3.2 above.

To close with a few pointers to the literature, as Lemma 3.2 is essentially a pruning bound, it is potentially of independent interest; see for instance the literature on lottery tickets and pruning (Frankle & Carbin, 2019; Frankle et al., 2020; Su et al., 2020). Secondly, there is already one generalization bound in the literature which exhibits spectral norms, due to (Suzuki et al., 2019); unfortunately, it also has an explicit dependence on network width.
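The sparsification idea of Section 3.3 can be illustrated directly: approximate W X^T by importance-sampling a few of its rank-one terms (W e_j)(X e_j)^T, as in randomized matrix multiplication. The numpy sketch below mirrors the importance weighting of Lemma C.2 and is illustrative only.

```python
import numpy as np

def sparsified_product(W, X_T, k, rng=None):
    """Approximate W @ X_T by sampling k of the m rank-one terms (W e_j)(e_j^T X_T),
       importance-weighted by beta_j = ||W e_j||^2 / ||W||_F^2, with unbiased rescaling."""
    rng = np.random.default_rng() if rng is None else rng
    m = W.shape[1]
    beta = np.linalg.norm(W, axis=0) ** 2 / np.linalg.norm(W) ** 2
    idx = rng.choice(m, size=k, p=beta)
    approx = np.zeros((W.shape[0], X_T.shape[1]))
    for j in idx:
        approx += np.outer(W[:, j], X_T[j]) / (k * beta[j])   # 1/(k beta_j) rescaling
    return approx

# the error decays roughly like ||W||_F ||X||_F / sqrt(k)
rng = np.random.default_rng(0)
W, X_T = rng.normal(size=(32, 256)), rng.normal(size=(256, 64))
for k in (16, 64, 256):
    err = np.linalg.norm(W @ X_T - sparsified_product(W, X_T, k, rng))
    print(k, round(err, 2))
```

As k grows the approximation improves at a 1/√k rate, which is the mechanism behind the width reduction in Lemma 3.2.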
ACKNOWLEDGMENTS

MT thanks Vaishnavh Nagarajan for helpful discussions and suggestions. ZJ and MT are grateful for support from the NSF under grant IIS-1750051, and from NVIDIA under a GPU grant.

REFERENCES

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. 2018. arXiv:1802.05296 [cs.LG].

Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, pp. 6240-6249, 2017a.

Peter L. Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. 2017b. arXiv:1703.02930 [cs.LG].

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, pp. 535-541, 2006.

Cody A. Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. DAWNBench: An end-to-end deep learning benchmark and competition. In NIPS ML Systems Workshop, 2017.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. 2017. arXiv:1703.11008 [cs.LG].

Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and Daniel M. Roy. In search of robust measures of generalization. In NeurIPS, 2020.

Dylan J. Foster and Alexander Rakhlin. ℓ∞ vector contraction for Rademacher complexity. 2019. arXiv:1911.06468 [cs.LG].

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. 2019. arXiv:1803.03635 [cs.LG].

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? 2020. arXiv:2009.08576 [cs.LG].

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In COLT, 2018.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2015. arXiv:1503.02531 [stat.ML].

Heinrich Jiang. Uniform convergence rates for kernel density estimation. In ICML, 2017.

Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. In ICLR, 2019a. arXiv:1810.00113 [stat.ML].

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. 2019b. arXiv:1912.02178 [cs.LG].

Philip M. Long and Hanie Sedghi. Generalization bounds for deep convolutional neural networks. 2019. arXiv:1905.12600 [cs.LG].

Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. 2019. arXiv:1902.04742 [cs.LG].

Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit), pp. 1-12, 1980.

Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In FOCS, pp. 143-152, 2006.

Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D. Lee. Sanity-checking pruning methods: Random tickets can win the jackpot. 2020. arXiv:2009.11094 [cs.LG].

Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura.
Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. 2019. ar Xiv:1909.11274 [cs.LG]. Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via lipschitz augmentation. 2019. ar Xiv:1905.03684 [cs.LG]. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2016. ar Xiv:1611.03530 [cs.LG]. Published as a conference paper at ICLR 2021 A PROOFS FOR SECTION 1.1 The first step is an abstract version of Lemma 1.1 which does not explicitly involve the softmax, just bounded functions. Lemma A.1. Let classes of bounded functions F and G be given with F f : X [0, 1]k and G g : X [0, 1]k. Let conjugate exponents 1/p + 1/q = 1 be given. Then with probability at least 1 2δ over the draw of ((xi, yi))n i=1 from µ and (zi)m i=1 from νn, for every f F and g G, i=1 g(xi)yi + 2Radn (x, y) 7 g(x)y : g G + 3 i=1 f(zi) g(zi) p p + 3 n z 7 min{1, f(z) g(z) p p} : f F, g G o !1/p n z 7 min{1, f(z) g(z) p p} : f F, g G o Radm({z 7 f(z)y : f F}) + Radm({z 7 g(z)y : g G}) . Proof of Lemma A.1. To start, for any f F and g G, write Ef(x)y = E(f(x) g(x))y + Eg(x)y. The last term is easiest, and let s handle it first: by standard Rademacher complexity arguments (Shalev-Shwartz & Ben-David, 2014), with probability at least 1 δ, every g G satisfies i=1 g(xi)yi + 2Radn({(x, y) 7 g(x)y : g G}) + 3 For the first term, since f : X [0, 1]k and g : X [0, 1]k, by H older s inequality E(f(x) g(x))y = Z min{1, (f(x) g(x))y} dµ(x, y) Z min{1, f(x) g(x) p} dµ(x, y) = Z min{1, f(x) g(x) p}dµX dνn (x) dνn(x) min 1, f g p Lp(νn) Once again invoking standard Rademacher complexity arguments (Shalev-Shwartz & Ben-David, 2014), with probability at least 1 δ, every mapping z 7 min{1, f(z) g(z) p p} where f F and g G satisfies Z min{1, f(z) g(z) p p} dνn(z) 1 i=1 min{1, f(zi) g(zi) p p} + 3 n z 7 min{1, f(z) g(z) p p} : f F, g G o . Combining these bounds and unioning the two failure events gives the first bound. Published as a conference paper at ICLR 2021 For the final Rademacher complexity estimate, first note r 7 min{1, r} is 1-Lipschitz and can be peeled off, thus n z 7 min{1, f(z) g(z) p p} : f F, g G o n z 7 f(z) g(z) p p : f F, g G o = Eϵ sup f F g G i=1 ϵi f(zi) g(zi) p p y =1 Eϵ sup f F g G i=1 ϵi|f(zi) g(zi)|p y y =1 m Radm n z 7 |f(z) g(z)|p y : f F, g G o . Since f and g have range [0, 1]k, then (f g)y has range [ 1, 1] for every y , and since r 7 |r|p is p-Lipschitz over [ 1, 1] (for any p [1, ), combining this with the Lipschitz composition rule for Rademacher complexity and also the fact that a Rademacher random vector ϵ { 1}m is distributionally equivalent to its coordinate-wise negation ϵ, then, for every y [k], Radm({z 7 |f(z) g(z)|p y : f F, g G}) p Radm({z 7 (f(z) g(z))y : f F, g G}) m Eϵ sup f F sup g G i=1 ϵi(f(zi) g(zi))y m Eϵ sup f F i=1 ϵif(zi)y + p m Eϵ sup g G i=1 ϵig(zi)y = p Radm({z 7 f(z)y : f F}) + p Radm({z 7 g(z)y : g G}). To prove Lemma 1.1, it still remains to collect a few convenient properties of the softmax. Lemma A.2. For any v Rk and y {1, . . . , k}, 2(1 φγ(v))y 1[y = arg max i vi]. Moreover, for any functions F with F f : X Rk, Radn (x, y) 7 φγ(f(x))y : f F = e O k γ Radn(F) Proof. For the first property, let v Rk be given, and consider two cases. If y = arg maxi vi, then φγ(v) [0, 1]k implies 2 1 φγ(v) y 0 = 1[y = arg max i vi]. 
On the other hand, if y = arg maxi vi, then φγ(v)y 1/2, and y 1 = 1[y = arg max i vi]. The second part follows from a multivariate Lipschitz composition lemma for Rademacher complexity due to (Foster & Rakhlin, 2019, Theorem 1); all that remains to prove is that v 7 φγ(v)y is (1/γ)-Lipschitz with respect to the ℓ norm for any v Rk and y [k]. To this end, note that d dvy φγ(v)y = exp(v/γ)y P j =y exp(v/γ)j γ(P j exp(v/γ)j)2 , d dvi =y φγ(v)y = exp(v/γ)y exp(v/γ)i j exp(v/γ)j)2 , Published as a conference paper at ICLR 2021 and therefore φγ(v)y 1 = 2 exp(v/γ)y P j =y exp(v/γ)j γ(P j exp(v/γ)j)2 1 and thus, by the mean value theorem, for any u Rk and v Rk, there exists z [u, v] such that φγ(v)y φγ(u)y = φγ(z)y, v u v u φγ(v)y 1 1 and in particular v 7 φγ(v)/y is (1/γ)-Lipschitz with respect to the ℓ norm. Applying the aforementioned Lipschitz composition rule (Foster & Rakhlin, 2019, Theorem 1), Radn (x, y) 7 φγ(f(x))y : f F = e O k γ Radn(F) Lemma 1.1 now follows by combining Lemmas A.1 and A.2. Proof of Lemma 1.1. Define ψ := 1 φγ. The bound follows by instantiating Lemma A.1 with p = 1 and the two function classes QF := {(x, y) 7 ψ(f(x)y) : f F} and QG := {(x, y) 7 ψ(g(x)y) : g G}, combining its simplified Rademacher upper bounds with the estimates for Radm(QF) and Radm(QG) and Radn(QG) from Lemma A.2, and by using Lemma A.2 to lower bound the left hand side with Eψ(f(x))y = E(1 φγ(f(x))y) 1 21 h arg max y f(x)y = y i , and lastly noting that i=1 ψ(f(zi)) ψ(g(zi)) 1 = 1 i=1 1 φγ(f(zi)) 1 + φγ(g(zi)) 1 = Φγ,m(f, g). To complete the proofs for Section 1.1, it remains to handle the data augmentation error, namely the term dµX/dνn . This proof uses the following result about Gaussian kernel density estimation. Lemma A.3 (See (Jiang, 2017, Theorem 2 and Remark 8)). Suppose density p is α-H older continuous, meaning |p(x) p(x )| Cα x x α for some Cα 0 and α [0, 1]. There there exists a constant C 0, depending on α, Cα, maxx Rd p(x), and the dimension, but independent of the sample size, so that with probability at least 1 1/n, the Gaussian kernel density estimate with bandwidth σ2I where σ = n 1/(2α+d) satisfies sup x Rd |p(x) pn(x)| C ln(n) n2α/(2α+d) . The proof of Lemma 1.2 follows. Proof of Lemma 1.2. The proposed data augmentation measure νn has a density pn,β over [0, 1]d, and it has the form pn,β(x) = β + (1 β)pn(x), where β = 1/2, and pn is the kernel density estimator as described in Lemma A.3, whereby |pn(x) p(x)| ϵn := O ln n nα/(2α+d) The proof proceeds to bound dµX/dνn = p/pn,β by considering three cases. Published as a conference paper at ICLR 2021 If x [0, 1]d, then p(x) = 0 by the assumption on the support of µX , whereas pn,β(x) pn(x)/2 > 0, thus p(x)/pn,β(x) = 0. If x [0, 1]d and p(x) 2ϵn, then pn,β(x) (1 β)p(x) ϵn) ϵn/2, and p(x) pn,β(x) = 1 + p(x) pn,β(x) pn,β(x) + (1 β)|p(x) pn(x)| 1 + βp(x) (1 β)(p(x) ϵn) + (1 β)ϵn 1 + β (1 β)(1 ϵn/p(x)) + 1 If x [0, 1]d and p(x) < 2ϵn, since pn,β(x) β = 1/2, then p(x) pn,β(x) < 2ϵn Combining these cases, dµX/dνn = p/pn,β max{4, 4ϵn} 4 + 4ϵn. B REPLACING SOFTMAX WITH STANDARD MARGIN (RAMP) LOSS The proof of Lemma 1.1 was mostly a reduction to Lemma A.1, which mainly needs bounded functions; for the Rademacher complexity estimates, the Lipschitz property of φγ was used. As such, the softmax can be replaced with the (1/γ)-Lipschitz ramp loss as is standard from margin-based generalization theory (e.g., in a multiclass version as appears in (Bartlett et al., 2017a)). 
Specifically, define Mγ : Rk [0, 1]k for any coordinate j as Mγ(v)j := ℓγ(vj arg max y =j vy ), where ℓγ(z) := γ z (0, γ), 0 z γ. We now have 1[arg maxy f(x)y ] Mγ(f(x))y without a factor of 2 as in Lemma A.2, and can plug it into the general lemma in Lemma A.1 to obtain the following corollary. Corollary B.1. Let temperature (margin!) parameter γ > 0 be given, along with sets of multiclass predictors F and G. Then with probability at least 1 2δ over an iid draw of data ((xi, yi))n i=1 from µ and (zi)n i=1 from νn, every f F and g G satisfy Pr[arg max y f(x)y = y] dµX i=1 Mγ(f) Mγ(g) 1 + 1 i=1 Mγ(g(xi))yi Radm(F) + Radm(G) + k γ Radn(G) Proof. Overload function composition notation to sets of functions, meaning Mγ F = (x, y) 7 Mγ(f(x))y : f F . First note that Mγ is (2/γ)-Lipschitz with respect to the ℓ norm, and thus, applying the multivariate Lipschitz composition lemma for Rademacher complexity (Foster & Rakhlin, 2019, Theorem 1) just as in the proof for the softmax in Lemma A.2, Radm(Mγ F) = e O k γ Radm(F) Published as a conference paper at ICLR 2021 with similar bounds for Radm(Mγ G) and Radn(Mγ G). The desired statement now follows by combining these Rademacher complexity bounds with Lemma 1.1 applied to Mγ F and Mγ G, and additionally using 1[arg maxy f(x)y = y] Mγ(f(x))y. C SAMPLING TOOLS The proofs of Lemma 3.1 and Lemma 3.2 both make heavy use of sampling. Lemma C.1 (Maurey (Pisier, 1980)). Suppose random variable V is almost surely supported on a subset S of some Hilbert space, and let (V1, . . . , Vk) be k iid copies of V . Then there exist ( ˆV1, . . . , ˆVk) Sk with EV 1 E V1,...,Vk h E V 2 F EV 2 F i 1 k E V 2 F 1 k sup ˆV S ˆV 2 F . Proof of Lemma C.1. The first inequality is via the probabilistic method. For the remaining inequalities, by expanding the square multiple times, E V1,...,Vk E V1,...,Vk EV Vi, EV Vj k EV1 V1 EV 2 h E V 2 F EV 2 F i 1 k E V 2 F 1 k sup ˆV S ˆV 2 F . A first key application of Lemma C.1 is to sparsify products, as used in Lemma 3.2. Lemma C.2. Let matrices A Rd m and B Rn m be given, along with sampling budget k. Then there exists a selection (i1, . . . , ik) of indices and a corresponding diagonal sampling matrix M with at most k nonzero entries satisfying M := A 2 F k eije T ij Aeij 2 and AB T AMB T 2 1 Proof of Lemma C.2. For convenience, define columns ai := Aei and bi := Bei for i {1, . . . , m}. Define importance weighting βi := ( ai / A F)2, whereby P i βi = 1, and let V be a random variable with Pr h V = β 1 i aib T i i = βi, i=1 β 1 i aib T iβi = i=1 (Aei)(Bei) T = A m X i=1 eie T i B T = A [I] B T = AB, i=1 β 2 i aib T i 2 F βi = i=1 β 1 i ai 2 bi 2 = i=1 A 2 F bi 2 = A 2 F B 2 F . By Lemma C.1, there exist indices (i1, . . . , ik) and matrices ˆVj := β 1 ij aijb T ij with AB T 1 h A 2 F B 2 F AB 2 F i 1 k A 2 F B 2 F . To finish, by the definition of M, j β 1 ij (Aeij)(Beij) T = A j β 1 ij eije T ij B T = A [M] B T. Published as a conference paper at ICLR 2021 A second is to cover the set of matrices W satisfying a norm bound W T 2,1 r. The proof here is more succinct and explicit than the one in (Bartlett et al., 2017a, Lemma 3.2). Lemma C.3 (See also (Bartlett et al., 2017a, Lemma 3.2)). Let norm bound r 0, X Rn d, and integer k be given. Define a family of matrices sleile T jl Xejl : sl { 1}, il {1, . . . , n}, jl {1, . . . , d} |M| (2nd)k, sup W T 2,1 r min ˆ W M WX T ˆWX T 2 F r2 X 2 F k . Proof. Let W Rm d be given with W T 2,1 r. 
Define sij := Wij/|Wij|, and note i,j eie T i Weje T j X T = X i,j ei Wij(Xej) T = X |Wij| Xej 2 r X F | {z } =:qij r X Fsijei(Xej)T Xej | {z } =:Uij Note by Cauchy-Schwarz that X i,j qij 1 r X F j W 2 ij X F = W T 2,1 X F potentially with strict inequality, thus q is not a probability vector. To remedy this, construct probability vector p from q by adding in, with equal weight, some Uij and its negation, so that the above summation form of WX T goes through equally with p and with q. Now define iid random variables (V1, . . . , Vk), where Pr[Vl = Uij] = pij, i,j pij Uij = X i,j qij Uij = WX T, F r X F = |sij| ei 2 2 r X F = r X F, i,j pij Uij 2 X ij pijr2 X 2 F = r2 X 2 F . By Lemma C.1, there exist ( ˆV1, . . . , ˆVk) Sk with WX T 1 k E V1 2 r2 X 2 F k . Furthermore, the matrices ˆVl have the form sleil(Xejl)T sleile T jl Xejl X T =: ˆWX T, where ˆW M. Lastly, note |M| has cardinality at most (2nd)k. D PROOFS FOR SECTION 1.2 The bulk of this proof is devoted to establishing the Rademacher bound for computation graphs in Lemma 3.1; thereafter, as mentioned in Section 3, it suffices to plug this bound and the data augmentation bound in Lemma 1.2 into Lemma 1.1, and apply a pile of union bounds. As mentioned in Section 3, this proof follows the scheme laid out in (Bartlett et al., 2017a), with simplifications due to the removal of reference matrices and some norm generality. Published as a conference paper at ICLR 2021 Proof of Lemma 3.1. Let cover scale ϵ and per-layer scales (ϵ1, . . . , ϵL) be given; the proof will develop a covering number parameterized by these per-layer scales, and then optimize them to derive the final covering number in terms of ϵ. From there, a Dudley integral will give the Rademacher bound. Define bi := bi n for convenience. As in the statement, recursively define X T 0 := X T, X T i := σi [WiΠi Di| Fi]X T i 1 . The proof will recursively construct an analogous cover via ˆX T 0 := X T, ˆX T i := σi [ ˆWiΠi Di| Fi] ˆX T i 1 , where the choice of ˆWi depends on ˆXi 1, and thus the total cover cardinality will product (and not simply sum) across layers. Specifically, the cover Ni for ˆWi is given by Lemma C.3 by plugging in Πi Di ˆX T i 1 F bi, and thus it suffices to choose cover cardinality k := r2 i b2 i ϵ2 i , whereby min ˆ Wi Ni WiΠi Di ˆX T i 1 ˆWiΠi Di ˆX T i 1 ϵi. By this choice (and the cardinality estimate in Lemma C.3, the full cover N satisfies i ln |Ni| X r2 i b2 i ϵ2 i ln(2m2). To optimize the parameters (ϵ1, . . . , ϵL), the first step is to show via induction that X T i ˆX T i F X l=j+1 slρl. The base case is simply X T 0 ˆX T = X T X T = 0, thus consider layer i > 0. Using the inductive formula for ˆXi and the cover guarantee on ˆWi, X T i ˆX T i = σi([WiΠi Di| Fi]X T i 1) σi([ ˆWiΠi Di| Fi] ˆX T i 1) ρi [WiΠi Di| Fi]h X T i 1 [ ˆWiΠi Di| Fi] ˆX T i 1 ρi [WiΠi Di| Fi]X T i 1 [WiΠi Di| Fi] ˆX T i 1 + ρi [WiΠi Di| Fi] ˆX T i 1 [ ˆWiΠi Di| Fi] ˆX T i 1 ρi [WiΠi Di| Fi] 2 X T i 1 ˆX T i 1 + ρi [(Wi ˆWi)Πi Di ˆX T i 1| (Fi Fi) ˆX T i 1] l=j+1 slρl + ρi (Wi ˆWi)Πi Di ˆX T i 1 l=j+1 slρl + ρiϵi X l=j+1 slρl. To balance (ϵ1, . . . 
, ϵL), it suffices to minimize a Lagrangian corresponding to the cover size subject to an error constraint, meaning where αi := r2 i b2 i ln(2m2), βi := ρi l=i+1 slρl, whose unique critical point for ϵ > 0 implies the choice 1/3 where Z := 1 i (2αiβ2 i )1/3, Published as a conference paper at ICLR 2021 whereby X T L ˆX T L ϵ automatically, and ln |N| Z2 X r2 i b2 i ln(2m2) (2αi/βi)2/3 i r2/3 i b2/3 i β2/3 i ln(2m2)1/3 i r2/3 i b2/3 i ln(2m2)1/3β2/3 i = 24/3 ln(2m2) as desired, with τ introduced for convenience in what is to come. For the Rademacher complexity estimate, by a standard Dudley entropy integral (Shalev-Shwartz & Ben-David, 2014), setting ˆτ := max{τ, 1/3} for convenience, n Rad(G) inf ζ 4ζ n+12 Z n ˆτϵ d ϵ = inf ζ 4ζ n+12ˆτ ln(ϵ) n ζ = inf ζ 4ζ n+12ˆτ(ln n ln ζ), which is minimized at ζ = 3ˆτ/ n, whereby n Rad(G) 12ˆτ + 6ˆτ ln n 12ˆτ ln(3ˆτ/ n) = 12ˆτ(1 ln(3ˆτ)) 12ˆτ 12τ + 4. This now gives the proof of Theorem 1.3. Proof of Theorem 1.3. With Lemma 1.1, Lemma 1.2, and Lemma 3.1 out of the way, the main work of this proof is to have an infimum over distillation network hyperparameters ( b, r, s) on the right hand side, which is accomplished by dividing these hyperparameters into countably many shells, and unioning over them. In more detail, divide ( b, r, s) into shells as follows. Divide each bi and ri into shells of radius increasing by one, meaning meaning for example the first shell for bi has bi 1, and the jth shell has bi (j 1, j], and similarly for ri; moreover, associate the jth shell with prior weight qj(bi) := (j(j + 1)) 1, whereby P j 1 qj(bi) = 1. Meanwhile, for si use a finer grid where the first shell has si 1/L, and the jth shell has si ((j 1)/L, j/L), and again the prior weight is qj(si) = (j(j + 1)) 1. Lastly, given a full set of grid parameters ( b, r, s), associate prior weight q( b, r, s) equal to the product of the individual prior weight, whereby the sum of the prior weights over the entire product grid is 1. Enumerate this grid in any way, and define failure probability δ( b, r, s) := δ q( b, r, s). Next consider some fixed grid shell with parameters ( b , r , s ) and let H denote the set of networks for which these parameters form the tightest shell, meaning that for any g H with parameters ( b, r, s), then ( b , r , s ) ( b + 1, r + 1, s + 1) component-wise. As such, by Lemma 1.1, with probability at least 1 δ( b , r , s ), each g H satisfies Pr[arg max y f(x)y = y] 2 dµX Φγ,m(f, g) + 2 i=1 (1 φγ(g(xi))yi Radm(F) + Radm(H) + k γ Radn(H) ln(q( b , r , s )) + ln(1/δ) Published as a conference paper at ICLR 2021 To simplify this expression, first note by Lemma 3.1 and the construction of the shells (relying in particular on the finer grid for si to avoid a multiplicative factor L) that Radm(H) = e O l=i+1 s lρl (ri + 1)(bi + 1)ρi l=i+1 (sl + 1/L)ρl and similarly for Radm(H) (the only difference being m replaces n). Secondly, to absorb the term ln(q( b , r , s )), noting that ln(a) ln(γ2) + (a γ2)/(γ2), and also using ρi 1, then ln(q( r , b , s )) = O i (ri + 1)2(bi + 1)2((si + 1)L)2 L ln L + ln Y i r2/3 i b2/3 i s2/3 i i ln(r2/3 i b2/3 i ) + ln Y L + ln(γ2) + 1 r2/3 i b2/3 i + Y Pr[arg max y f(x)y = y] 2 dµX Φγ,m(f, g) + 2 i=1 (1 φγ(g(xi))yi Radm(F) + 6 Since h H was arbitrary, the bound may be wrapped in infg H. Similarly, unioning bounding away the failure probability for all shells, since this particular shell was arbitrary, an infimum over shells can be added, which gives the final infimum over ( b, r, s). 
The last touch is to apply Lemma 1.2 to bound dµX/dνn . E PROOF OF STABLE RANK BOUND, THEOREM 1.4 The first step is to establish the sparsification lemma in Lemma 3.2, which in turn sparsifies each matrix product, cannot simply invoke Lemma C.2: pre-processing is necessary to control the elementwise magnitudes of the resulting matrix. Throughout this section, define the stable rank of a matrix W as sr(W) := W 2 F / W 2 2 (or 0 when W = 0). Published as a conference paper at ICLR 2021 Lemma E.1. Let matrices A Rd m and B Rn m be given, along with sampling budget k. Then there exists a selection (i1, . . . , ik) of indices and a corresponding diagonal sampling matrix M with at most k nonzero entries satisfying Zijeije T ij aij where Zij A F k , and AB T AMB T 2 4 Proof. Let τ > 0 be a parameter to be optimized later, and define a subset of indices S := {i {1, . . . , m} : Aei τ}, with Sc := {1, . . . , m} \ S. Let Aτ denote the matrix obtained by zeroing out columns not in S, meaning i S (Aei)e T i, AB T AτB T F A Aτ B B s X i Sc Aei 2 τ m B . Applying Lemma C.2 to AτBT gives eije T ij Aτeij 2 = Zijeije T ij Aτeij such that AτB T AτMB T 2 1 k Aτ 2 B 2, where Zij is specified by these equalities. To simplify, note Aτ A , and AτM = AM. Combining the two inequalities, AB T AMB T AB T AτB T + AτB T AτMB T τ m B + 1 To finish, setting τ := A / mk gives the bound, and ensures that the scaling term Zij satisfies, for any ij S, k Aτeij A 2 F kτ = A F With this tool in hand, the proof of Lemma 3.2 is as follows. Proof of Lemma 3.2. Let Xj denote the network output after layer j, meaning X T 0 := X T, X T j := σj(Wj X T j 1), X T j F = σj(Wj X T j 1) σj(0) F Wj X T j 1 F Wj 2 X T j 1 F X F Y The proof will inductively choose sampling matrices (M1, . . . , ML) as in the statement and construct ˆX T 0 := X T, ˆX T j := Πjσj(Wj Mj ˆX T j 1), where Πj denotes projection onto the Frobenius-norm ball of radius X F Q i j Wi 2 (whereby Πj Xj = Xj), satisfying Xj ˆXj F X F which gives the desired bound after plugging in j = L. Published as a conference paper at ICLR 2021 Proceeding with the inductive construction, the base case is direct since ˆX0 = X = X0 and X0 ˆX0 F = 0, thus consider some j > 0. Applying Lemma E.1 to the matrix multiplication Wj ˆXj 1 with kj samples, there exists a multiset of Sj coordinates and a corresponding sampling matrix Mj, as specified in the statement, satisfying Wj ˆX T j 1 Wj Mj ˆX T j 1 F 1 p kj Wj F ˆXj 1 F 1 p kj Wj F X F Y Using the choice ˆX T j := Πjσj(Wj Mj ˆX T j 1), Xj ˆXj F = σj(Wj X T j 1) Πjσj(Wj Mj ˆX T j 1) F Wj X T j 1 Wj Mj ˆX T j 1 F = Wj X T j 1 Wj ˆX T j 1 + Wj ˆX T j 1 Wj Mj ˆX T j 1 F Wj X T j 1 Wj ˆX T j 1 F + Wj ˆX T j 1 Wj Mj ˆX T j 1 F Xj 1 ˆXj 1 F + 1 p kj Wj F X F Y as desired. To prove Theorem 1.4 via Lemma 3.2, the first step is a quick tool to cover matrices element-wise. Lemma E.2. Let A denote matrices with at most k2 nonzero rows and k1 nonzero columns, entries bounded in absolute value by b, and total number of rows and columns each at most m. Then there exists a cover set M A satisfying |M| mk1+k2 2b k1k2 !k1k2 , and sup A A min ˆ A M A ˆA F ϵ. Proof. Consider some fixed set of k2 nonzero rows and k1 nonzero columns, and let M0 denote the covering set obtained by gridding the k1 k2 entries at scale ϵ k1k2 , whereby For any A A with these specific nonzero rows and columns, the ˆA M0 obtained by rounding each nonzero entry of A to the nearest grid element gives i,j (Aij ˆAij)2 X 1 k1k2 = ϵ2. 
The final cover M is now obtained by unioning copies of M0 for all m k1 m k2 mk1+k2 possible submatrices of size k2 k1. The proof of Theorem 1.4 now carefully combines the preceding pieces. Published as a conference paper at ICLR 2021 Proof of Theorem 1.4. The proof proceeds in three steps, as follows. 1. A covering number is estimate for sparsified networks, as output by Lemma 3.2. 2. A covering number for general networks is computed by balancing the error terms from Lemma 3.2 and its cover computed here. 3. This covering number is plugged into a Dudley integral to obtain the desired Rademacher bound. Proceeding with this plan, let ( ˆX T 0, . . . , ˆX T L) be the layer outputs (and network input) exactly as provided by Lemma 3.2. Additionally, define diagonal matrices Dj := P l ! Sj+1 ele T l (with DL = I, where the ! denotes unique inclusion; these matrices capture the effect of the subsequent sparsification, and can be safely inserted after each Wj without affecting ˆX T j , meaning ˆX T j = Πjσj(Wj Mj ˆX T j 1) = Πjσj(Dj Wj Mj ˆX T j 1). Let per-layer cover precisions (ϵ1, . . . , ϵL) be given, which will be optimized away later. This proof will inductively construct X T 0 := X T, X T j := Πjσj( Wj X T j 1), where Wj is a cover element for Dj Wj Mj, and inductively satisfying ˆX T j X T j X Fmj/2 X To construct the per-layer cover elements Wj, first note by the form of Mj (and the scaling Zi provided by Lemma 3.2) that b := max i,l (Dj Wj Mj)l,i max i Wj Mjei Zi Wjei Wjei Wj F r m Consequently, by Lemma E.2, there exists a cover Cj of matrices of the form Dj Wj Mj satisfying |Cj| mkj+kj 1 2b p !kjkj 1 mkj+kj 1 2 Wj F p and the closest cover element Wj Cj to Dj Wj Mj satisfies Dj Wj Mj Wj F ϵj. Proceeding with the induction, the base case has ˆX T 0 X T 0 = X T X T = 0, thus consider j > 0. The first step is to estimate the spectral norm of Dj Wj Mj, which can be coarsely upper bounded via Dj Wj Mj 2 2 Dj Wj Mj 2 F X i Wj Mjei 2 X i Wj 2 F m kj 1 Wj 2 F m. By the form of ˆXj and Xj, ˆX T j X T j = Πjσj(Dj Wj Mj ˆX T j 1) Πjσj( Wj X T j 1) Dj Wj Mj ˆX T j 1 Wj X T j 1 Dj Wj Mj ˆX T j 1 Dj Wj Mj X T j 1 + Dj Wj Mj X T j 1 Wj X T j 1 Dj Wj Mj 2 ˆX T j 1 X T j 1 F + Dj Wj Mj Wj 2 X T j 1 F m Wj F h X Fm(j 1)/2 X Wj F i + ϵj X F Y Published as a conference paper at ICLR 2021 which establishes the desired bound on the error. The next step is to optimize kj. Let ϵ > 0 be arbitrary, and set ϵ 1 j := ϵ 12L m X F Q i =j Wi F, whereby ˆX T L X T L F ϵ 2, |Cj| mkj+kj 1 4m L p The overall network cover N is the product of the covers for all layers, and thus has cardinality satisfying j ln |Cj| 2 X j kj ln m + X j kjkj 1 ln j kj ln m + X To choose (k1, . . . , k L), letting X T L denote the output of the original unsparsified network, note firstly that the full error bound satisfies X T L X T L X T L ˆX T L + ˆX T L X T L 2, where αi := X F To choose ki, the approach here is to minimize a Lagrangian corresponding to the cover cardinality, subject to the total cover error being ϵ. Simplifying the previous expressions and noting 2kjkj 1 k2 j + k2 j 1, whereby the dominant term in ln |N| is P j k2 j, consider Lagrangian L(k1, . . . , kl, λ) := X which has critical points when each ki satisfies k5/2 i αi = λ thus ki := α2/5 i /Z with Z := ϵ2/(2 P j α4/5 j )2. 
As a sanity check (since it was baked into the Lagrangian), plugging this into the cover error indeed gives X T L X T L X i α4/5 i + ϵ To upper bound the cover cardinality, first note that X i α4/5 i = 4 i α4/5 i 5 , ln |N| = e O i k2 i i h X i ln Wi F i ϵ4 where β = e O X 4 F h Y j Wj 4 2 i h X i sr(Wi)2/5i5 h X i ln Wi F i . The final step is to apply a Dudley entropy integral (Shalev-Shwartz & Ben-David, 2014), which gives n Rad(F) = inf ζ 4ζ n + 12 Z n 4ζ n + 12 1 Published as a conference paper at ICLR 2021 Dropping the negative term gives an expression of the form aζ + b/ζ, which is convex in ζ > 0 and has critical point at ζ2 = b/a, which after plugging back in gives an upper bound 2 ab, meaning n Rad(F) 2 4 n 12 p Dividing by n and expanding the definition of β gives the final Rademacher complexity bound.