# Relative Flatness and Generalization

Henning Petzka (Lund University, Sweden) henning.petzka@math.lth.se
Michael Kamp (CISPA Helmholtz Center for Information Security, Germany, and Monash University, Australia) michael.kamp@monash.edu
Linara Adilova (Ruhr University Bochum, Germany, and Fraunhofer IAIS)
Cristian Sminchisescu (Lund University, Sweden, and Google Research, Switzerland)
Mario Boley (Monash University, Australia)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While it has been empirically observed that flatness measures consistently correlate strongly with generalization, it is still an open theoretical problem why and under which circumstances flatness is connected to generalization, in particular in light of reparameterizations that change certain flatness measures but leave generalization unchanged. We investigate the connection between flatness and generalization by relating it to the interpolation from representative data, deriving notions of representativeness and feature robustness. These notions allow us to rigorously connect flatness and generalization and to identify conditions under which the connection holds. Moreover, they give rise to a novel, but natural, relative flatness measure that correlates strongly with generalization, simplifies to ridge regression for ordinary least squares, and solves the reparameterization issue.

## 1 Introduction

Flatness of the loss curve has been identified as a potential predictor for the generalization abilities of machine learning models [6, 10, 11]. In particular for neural networks, it has been repeatedly observed that generalization performance correlates with measures of flatness, i.e., measures that quantify the change in loss under perturbations of the model parameters [4, 8, 16, 21, 34, 39, 41, 44]. In fact, Jiang et al. [14] perform a large-scale empirical study and find that flatness-based measures have a higher correlation with generalization than alternatives like weight norms, margin-, and optimization-based measures. It is an open problem why and under which circumstances this correlation holds, in particular in the light of negative results on reparametrizations of ReLU neural networks [5]: these reparameterizations change traditional measures of flatness, yet leave the model function and its generalization unchanged, making these measures unreliable. We present a novel and rigorous approach to understanding the connection between flatness and generalization by relating it to the interpolation from representative samples. Using this theory we, for the first time, identify conditions under which flatness explains generalization. At the same time, we derive a measure of relative flatness that simplifies to ridge/Tikhonov regularization for ordinary least squares [36], and resolves the reparametrization issue for ReLU networks [5] by appropriately taking the norm of parameters into account, as suggested by Neyshabur et al. [25].

Figure 1: Decomposition of f = ψ ∘ φ into a feature extractor φ and a model ψ for neural networks.

Figure 2: Overview: We theoretically connect a notion of representative data with a notion of feature robustness and a novel measure of flatness of the loss surface.
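To make the decomposition in Figure 1 concrete, here is a minimal NumPy sketch that splits a small multilayer perceptron into a feature extractor φ (all layers up to a chosen feature layer) and a model ψ(w, φ(x)) = g(wφ(x)) acting on top of it. The layer sizes, the ReLU activations, and the softmax output g are illustrative assumptions of this sketch, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: input, two hidden layers, feature layer of dimension m, output.
sizes = [8, 16, 16, 4, 3]
weights = [rng.normal(scale=0.3, size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]

def phi(x):
    """Feature extractor: all layers up to the chosen feature layer, with ReLU activations."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return h                               # feature vector of dimension m = sizes[-2]

def g(z):
    """Output map; here an (assumed) softmax for classification."""
    e = np.exp(z - z.max())
    return e / e.sum()

def psi(w, features):
    """Model on top of the feature space: psi(w, phi(x)) = g(w @ phi(x))."""
    return g(w @ features)

x = rng.normal(size=sizes[0])
w = weights[-1]                            # the last-layer weights play the role of w
print(psi(w, phi(x)))                      # f(x) = psi(w, phi(x)) = g(w phi(x))
```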
Formally, we connect flatness of the loss surface to the generalization gap E_gen(f, S) = E(f) − E_emp(f, S) of a model f : X → Y from a model class H with respect to a twice differentiable loss function ℓ : Y × Y → ℝ+ and a finite sample set S ⊂ X × Y, where

$$E(f) = \mathbb{E}_{(x,y)\sim D}\big[\ell(f(x), y)\big] \qquad\text{and}\qquad E_{emp}(f, S) = \frac{1}{|S|}\sum_{(x,y)\in S} \ell(f(x), y).$$

That is, E_gen(f, S) is the difference between the risk E(f) and the empirical risk E_emp(f, S) of f on a finite sample set S drawn iid according to a data distribution D on X × Y.

To connect flatness to generalization, we start by decomposing the generalization gap into two terms: a representativeness term that quantifies how well a distribution D can be approximated using distributions with local support around sample points, and a feature robustness term describing how small changes of feature values affect the model's loss. Here, feature value refers to the features implicitly represented by the model, i.e., we consider models that can be expressed as f(x) = ψ(w, φ(x)) = g(wφ(x)) with a feature extractor φ and a model ψ (which includes linear and kernel models, as well as most neural networks, see Fig. 1). With this decomposition, we measure the generalization ability of a particular model by how well its interpolation between samples in feature space fits the underlying data distribution. We then connect feature robustness (a property of the feature space) to flatness (a property of the parameter space) using the following key identity: multiplicative perturbations in feature space by arbitrary matrices A ∈ ℝ^{m×m} correspond to perturbations in parameter space, i.e.,

$$\psi(w, \phi(x) + A\phi(x)) = g\big(w(\phi(x) + A\phi(x))\big) = g\big((w + wA)\phi(x)\big) = \psi(w + wA, \phi(x)). \tag{1}$$

Using this key equation, we show that feature robustness is approximated by a novel, but natural, loss-Hessian-based relative flatness measure under the assumption that the distribution can be approximated by locally constant labels. Under this assumption, and if the data is representative, flatness is the main predictor of generalization (see Fig. 2 for an illustration). This offers an explanation for the correlation of flatness with generalization on many real-world data distributions for image classification [14, 21, 26], where the assumption of locally constant labels is reasonable (the definition of adversarial examples [35] even hinges on this assumption). This dependence on locally constant labels has not been uncovered by previous theoretical analyses [26, 37]. Moreover, we show that the resulting relative flatness measure is invariant to linear reparameterization and has a stronger correlation with generalization than other flatness measures [14, 21, 26]. Other measures have been proposed that similarly achieve invariance under reparameterizations [21, 37], but the Fisher-Rao norm [21] lacks a strong theoretical connection to generalization, while our measure has a more natural form than normalized sharpness [37] and, for neural networks, considers only a single layer, given by the decomposition of f (including the possibility of choosing the input layer when φ = id_X). An extended comparison to related work is provided in Appdx. A.

The limitations of our analysis are as follows. We assume a noise-free setting where for each x ∈ X there is a unique y = y(x) ∈ Y such that P_{(x,y)∼D}(y | x) = 1, and this assumption is also extended to the feature space of the given model, i.e., we assume that φ(x) = φ(x′) implies y(x) = y(x′) for all x, x′ ∈ X, and write y(x) = y(φ(x)).
Moreover, we assume that the marginal distribution D_X is described by a density function p_D(x), that f(x) = ψ(w, φ(x)) = g(wφ(x)) is a local minimizer of the empirical risk on S, and that g, ψ, φ are twice differentiable.

Quantifying the representativeness of a dataset precisely is challenging since the data distribution is unknown. Using results from density estimation, we derive a worst-case bound on representativeness for all data distributions that fulfill mild regularity assumptions in feature space φ(X), i.e., a smooth density function p_{φ(D)} such that the integrals $\int_{z\in\phi(X)} \nabla^2 p_{\phi(D)}(z)\,\|z\|^2\, dz$ and $\int_{z\in\phi(X)} p_{\phi(D)}(z)/\|z\|^m\, dz$ are well-defined and finite. This yields a generalization bound incorporating flatness. In contrast to the common bounds of statistical learning theory, the bound depends on the feature dimension. The dimension-dependence is a result of the interpolation approach (applying density estimation uniformly over all distributions that satisfy the mild regularity assumptions). The bound is consistent with the no-free-lunch theorem and the convergence rate derived by Belkin et al. [2] for a model based on interpolations. In practical settings, representativeness can be expected to be much smaller than the worst-case bound, which we demonstrate by a synthetic example in Sec. 6. Generally, it is a bound that remains meaningful in the interpolation regime [1, 3, 24], where traditional measures of generalization based on the empirical risk and model class complexity are uninformative [22, 42].

Contribution. In summary, this paper rigorously connects flatness of the loss surface to generalization and shows that this connection requires feature representations such that labels are (approximately) locally constant, which is also validated in a synthetic experiment (Sec. 6). The empirical evaluation shows that this flatness and an approximation to representativeness can tightly bound the generalization gap. Our contributions are: (i) the rigorous connection of flatness and generalization; (ii) novel notions of representativeness and feature robustness that capture the extent to which a model's interpolation between samples fits the data distribution; and (iii) a novel flatness measure that is layer- and neuron-wise reparameterization invariant, reduces to ridge regression for ordinary least squares, and outperforms state-of-the-art flatness measures on CIFAR10.

## 2 Representativeness

In this section, we formalize when a sample set S is representative for a data distribution D.

Partitioning the input space. We choose a partition {V_i | i = 1, ..., |S|} of X such that each element V_i of this partition contains exactly one of the samples x_i from S. The distribution can then be described by a set of densities $p_i(x) = \frac{1}{\alpha_i}\, p_D(x)\, \mathbf{1}_{V_i}(x)$ with support contained in V_i (where 1_{V_i}(x) = 1 if x ∈ V_i and 0 otherwise) and with normalizing factor $\alpha_i = \int_{V_i} p_D(x)\, dx$. Then the risk decomposes as $E(f) = \sum_{i=1}^{|S|} \alpha_i\, \mathbb{E}_{x\sim p_i}[\ell(f(x), y(x))]$. Since x_i ∈ V_i for each i, we can change variables and consider density functions λ*_i(ξ) = p_i(x_i + ξ) with support in a neighborhood around the origin of X. The risk then decomposes as

$$E(f) = \sum_{i=1}^{|S|} \alpha_i\, \mathbb{E}_{\xi\sim\lambda^*_i}\big[\ell(f(x_i + \xi), y(x_i + \xi))\big]. \tag{2}$$

Starting from this identity, we formalize an approximation to the risk: in a practical setting, the distribution p_D is unknown and hence, in the decomposition (2), we have unknown densities λ*_i and unknown normalization factors α_i. We assume that each neighborhood contributes equally to the loss, i.e., we approximate each α_i with 1/|S|.
Then, given a sample set S and an |S|-tuple Λ = (λ_i)_{1≤i≤|S|} of local probability density functions on X with support supp(λ_i) in a neighborhood around the origin 0_X, we call the pair (S, Λ) ϵ-representative for D with respect to a model f and loss ℓ if |E_Rep(f, S, Λ)| ≤ ϵ, where

$$E_{Rep}(f, S, \Lambda) = E(f) - \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda_i}\big[\ell(f(x_i + \xi), y(x_i + \xi))\big]. \tag{3}$$

If the partitions V_i and the distributions λ_i are all chosen optimally, so that the approximation α_i = 1/|S| is exact and λ_i = λ*_i, then E_Rep(f, S, Λ) = 0 by (2). If the support of each λ_i is shrunk to the origin, so that λ_i = δ_0 is a Dirac delta function, then E_Rep(f, S, Λ) = E_gen(f, S) equals the generalization gap. For density functions with an intermediate support, the generalization gap can be decomposed into representativeness and the expected deviation of the loss around the sample points:

$$E_{gen}(f, S) = E_{Rep}(f, S, \Lambda) + \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda_i}\big[\ell(f(x_i + \xi), y(x_i + \xi)) - \ell(f(x_i), y_i)\big].$$

The main idea of our approach to understanding generalization is to use this equality and to control both representativeness and the expected loss deviations for a suitable |S|-tuple of distributions Λ.

From input to feature space. An interesting aspect of ϵ-representativeness is that it can be considered in a feature space instead of the input space. For a model f = (ψ ∘ φ) : X → Y, we can apply our notion to the feature space φ(X) (see Fig. 1 for an illustration). This leads to the notion of ϵ-representativeness in feature space, defined for an |S|-tuple Λ^φ = (λ^φ_i)_{1≤i≤|S|} of densities on φ(X) by replacing x_i with φ(x_i) in (3); we denote it by E^φ_Rep(f, S, Λ^φ). By measuring representativeness in a feature space, this becomes a notion of both the data and the feature representation. In particular, it assumes that a target output function y(φ(x)) also exists for the feature space. We can then decompose the generalization gap E_gen(f, S) of f = (ψ ∘ φ) into

$$E_{gen}(f, S) = E^\phi_{Rep}(f, S, \Lambda^\phi) + \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda^\phi_i}\big[\ell(\psi(\phi(x_i) + \xi), y(\phi(x_i) + \xi)) - \ell(f(x_i), y_i)\big].$$

The second term is determined by how the loss changes under small perturbations in the feature space for the samples in S. As before, for λ^φ_i = δ_0 the term in the bracket vanishes and E^φ_Rep(f, S, Λ^φ) = E_gen(f, S). But the decomposition becomes more interesting for distributions with support of nonzero measure around the origin. If the true distribution can be interpolated efficiently in feature space from the samples in S with suitable λ^φ_i, so that E^φ_Rep(f, S, Λ^φ) ≈ 0, then the second term approximately equals the generalization gap, and the generalization gap can be estimated from local properties in feature space around sample points.

## 3 Feature Robustness

Having decomposed the generalization gap into a representativeness term and a second term of loss deviations, we now develop a novel notion of feature robustness that is able to bound the second term for specific families of distributions Λ using the key equation (1). Our definition of feature robustness for a model f = (ψ ∘ φ) : X → Y depends on a small number δ > 0, a sample set S, and a feature selection defined by a matrix A ∈ ℝ^{m×m} of operator norm ‖A‖ ≤ 1. With feature perturbations φ_A(x) = (I + A)φ(x) and

$$E^\phi_F(f, S, A) := \frac{1}{|S|} \sum_{i=1}^{|S|} \Big[\ell\big(\psi(\phi_A(x_i)), y(\phi_A(x_i))\big) - \ell(f(x_i), y_i)\Big], \tag{4}$$

the definition of feature robustness is given as follows.

Definition 1. Let ℓ : Y × Y → ℝ+ denote a loss function, ϵ and δ two positive (small) real numbers, S ⊂ X × Y a finite sample set, and A ∈ ℝ^{m×m} a matrix. A model f(x) = (ψ ∘ φ)(x) with φ(X) ⊂ ℝ^m is called ((δ, S, A), ϵ)-feature robust if E^φ_F(f, S, αA) ≤ ϵ for all 0 ≤ α ≤ δ.
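As an illustration of Definition 1 and of the key identity (1), the following minimal NumPy sketch estimates E^φ_F(f, S, αA) from (4) for a toy model and a fixed feature matrix A, and checks that the feature perturbation (I + A)φ(x) can equivalently be absorbed into the weights as w + wA. The toy feature extractor, the squared loss, and the treatment of labels as locally constant are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 4, 2, 32                     # feature dimension, outputs, sample size (illustrative)

phi = lambda x: np.tanh(x)             # toy feature extractor (assumption)
g = lambda z: z                        # identity output map, so psi(w, v) = w @ v
loss = lambda pred, y: 0.5 * np.sum((pred - y) ** 2)   # squared loss (assumption)

X = rng.normal(size=(n, m))
w = rng.normal(size=(d, m))
Y = np.stack([g(w @ phi(x)) for x in X])               # labels: here the model fits them exactly

def feature_robustness(w, A, alpha):
    """Empirical E^phi_F(f, S, alpha*A) from (4), with phi_A(x) = (I + alpha*A) phi(x).
    Labels are treated as locally constant, y(phi_A(x_i)) = y_i (an assumption of this sketch)."""
    diffs = []
    for x, y in zip(X, Y):
        v = phi(x)
        perturbed = g(w @ (v + alpha * (A @ v)))
        diffs.append(loss(perturbed, y) - loss(g(w @ v), y))
    return float(np.mean(diffs))

A = rng.normal(size=(m, m))
A /= np.linalg.norm(A, 2)              # normalize to operator norm ||A|| = 1
print("E^phi_F(f, S, 0.1*A) =", feature_robustness(w, A, 0.1))

# Key identity (1): perturbing features by A*phi(x) equals perturbing weights by w*A.
v = phi(X[0])
assert np.allclose(g(w @ (v + A @ v)), g((w + w @ A) @ v))
```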
More generally, for a probability distribution 𝒜 on perturbation matrices in ℝ^{m×m}, we define

$$E^\phi_F(f, S, \mathcal{A}) = \mathbb{E}_{A\sim\mathcal{A}}\big[E^\phi_F(f, S, A)\big],$$

and call the model ((δ, S, 𝒜), ϵ)-feature robust on average over 𝒜 if E^φ_F(f, S, α𝒜) ≤ ϵ for 0 ≤ α ≤ δ.

Given a feature extractor φ, feature robustness measures the performance of ψ when feature values are perturbed (with the feature extractor φ kept fixed). This local robustness at sample points differs from the robustness of Xu and Mannor [40], which requires a data-independent partitioning of the input space. The matrix A in feature robustness determines which feature values are perturbed. For each sample, the perturbation is linear in the expression of the feature. Thereby, we only perturb features that are relevant for the output for a given sample and leave feature values unchanged that are not expressed. For φ mapping into an intermediate layer of a neural network, traditionally the activation values of a neuron are considered as feature values, which corresponds to a choice of A as a projection matrix. However, it was shown by Szegedy et al. [35] that, for any other direction v ∈ ℝ^m with ‖v‖ = 1, the values ⟨φ(x), v⟩ obtained from the projection of φ(x) onto v can likewise be semantically interpreted as a feature. This motivates the consideration of general feature matrices A.

Distributions on feature matrices induce distributions on the feature space. Feature robustness is defined in terms of feature matrices (suitable for an application of (1) to connect perturbations of features with perturbations of weights), while the approach exploiting representative data from Section 2 considers distributions on feature vectors, cf. (3). To connect feature robustness to the notion of ϵ-representativeness, we specify for any distribution 𝒜 on matrices A ∈ ℝ^{m×m} an |S|-tuple Λ_𝒜 = (λ_i) of probability density functions λ_i on the feature space ℝ^m with support containing the origin. Multiplication of a feature matrix with a feature vector φ(x_i) defines a feature selection Aφ(x_i), and for each z ∈ ℝ^m there is some feature matrix A with φ(x_i) + z = φ(x_i) + Aφ(x_i) (unless φ(x_i) = 0). Our choice of distributions λ_i on ℝ^m are therefore distributions induced via multiplication of feature vectors φ(x_i) ∈ ℝ^m with matrices A ∈ ℝ^{m×m} sampled from a distribution 𝒜 on feature matrices. Formally, we assume that a Borel measure µ_𝒜 is defined by a probability distribution 𝒜 on matrices in ℝ^{m×m}. We then define Borel measures µ_i on ℝ^m by µ_i(C) = µ_𝒜({A | Aφ(x_i) ∈ C}) for Borel sets C ⊆ ℝ^m. Then λ_i is the probability density function defined by the Borel measure µ_i. As a result, we have for each i that

$$\mathbb{E}_{A\sim\mathcal{A}}\big[\ell(\psi(\phi_A(x_i)), y(\phi_A(x_i)))\big] = \mathbb{E}_{z\sim\lambda_i}\big[\ell(\psi(\phi(x_i) + z), y(\phi(x_i) + z))\big].$$

Feature robustness and generalization. With this construction and a distribution Λ_𝒜 on the feature space induced by a distribution 𝒜 on feature matrices, we have that

$$E(f) = E_{emp}(f, S) + E^\phi_{Rep}(f, S, \Lambda_\mathcal{A}) + E^\phi_F(f, S, \mathcal{A}). \tag{5}$$

Here, 𝒜 can be any distribution on feature matrices, which can be chosen suitably to control how well the corresponding mixture of local distributions approximates the true distribution. The third term then measures how robust the model is in expectation over feature changes for A ∼ 𝒜. In particular, if E^φ_Rep(f, S, Λ_𝒜) ≈ 0, then E_gen(f, S) ≈ E^φ_F(f, S, 𝒜) and the generalization gap is determined by feature robustness.

We end this section by illustrating how distributions on feature matrices induce natural distributions on the feature space. The example will serve in Sec. 5 to deduce a bound on E^φ_Rep(f, S, Λ_𝒜) from kernel density estimation.
Example: Truncated isotropic normal distributions are induced by a suitable distribution on feature matrices. We consider probability distributions K_{δ‖φ(x_i)‖} on feature vectors z ∈ ℝ^m in the feature space, defined by densities k_{δ‖φ(x_i)‖}(0, z) with smooth rotation-invariant kernels of bounded support and bandwidth h,

$$k_h(z_i, z) = \frac{1}{h^m}\, k\!\left(\frac{\|z_i - z\|}{h}\right) \mathbf{1}_{\|z_i - z\| \le h}.$$

For ReLU networks, the model function is unchanged under layer-wise reparameterizations that multiply the weights of one layer by some α > 0 and divide the weights of another layer k ≠ l by α, but common measures of the loss Hessian change. Our measure fixes this issue in relating flatness to generalization, since the change of the loss Hessian is compensated by the multiplication with scalar products of weight matrices, and it is therefore invariant under layer-wise reparameterizations [cf. 26]. It is also invariant to neuron-wise reparameterizations, i.e., multiplying all incoming weights of a neuron by a positive number α and dividing all outgoing weights by α [23], except for neuron-wise reparameterizations of the feature layer φ_l. Using a simple preprocessing step (a neuron-wise reparameterization with the variance over the sample), our proposed measure becomes independent of all neuron-wise reparameterizations.

Theorem 4. Let σ_i denote the variance of the i-th coordinate of φ_l(x) over samples x ∈ S, and let V = diag(σ_1, ..., σ_{n_{l−1}}). If the relative flatness measure κ^l_Tr is applied to the representation

$$f(x) = w^L\sigma\Big(\dots\sigma\big(w^l V\, \sigma\big(V^{-1} w^{l-1}\sigma(\dots\sigma(w^1 x + b^1)\dots) + V^{-1} b^{l-1}\big) + b^l\big)\dots\Big) + b^L,$$

then κ^l_Tr is invariant under all neuron-wise (and layer-wise) reparameterizations.

We now connect flatness with feature robustness: relative flatness approximates feature robustness for a model at a local minimum of the empirical risk, when labels are approximately constant in neighborhoods of the training samples (φ(x), y) ∈ φ(S) in feature space.

Theorem 5. Consider a model f(x, w) = g(wφ(x)) as above, a loss function ℓ and a sample set S, and let O_m ⊂ ℝ^{m×m} denote the set of orthogonal matrices. Let δ be a positive (small) real number and let w = ω ∈ ℝ^{d×m} denote parameters at a local minimum of the empirical risk on a sample set S. If the labels satisfy y(φ_{δA}(x_i)) = y(φ(x_i)) = y_i for all (x_i, y_i) ∈ S and all ‖A‖ ≤ 1, then f(x, ω) is ((δ, S, O_m), ϵ)-feature robust on average over O_m for

$$\epsilon = \frac{\delta^2}{2m}\,\kappa^\phi_{Tr}(\omega) + O(\delta^3).$$

Applying the theorem to Eq. 5 implies that if the data is representative, i.e., E^φ_Rep(f, S, Λ_{𝒜_δ}) ≈ 0 for the distribution 𝒜_δ of Prop. 2, then E_gen(f(·, ω), S) ≈ (δ²/2m) κ^φ_Tr(ω) + O(δ³). The assumption of locally constant labels in Thm. 5 can be relaxed to approximately locally constant labels without unraveling the theoretical connection between flatness and feature robustness. Appendix B investigates the consequences of dropping even the assumption of approximately locally constant labels.

## 5 Flatness and Generalization

Combining the results from Sections 2 to 4, we connect flatness to the generalization gap when the distribution can be represented by smooth probability densities on a feature space with approximately locally constant labels. By approximately locally constant labels we mean that, for small δ, the loss in δ‖φ(x_i)‖-neighborhoods around the feature vector of a training sample x_i is approximated (on average over all training samples) by the loss for the constant label y(x_i) on these neighborhoods. This and the following theorem connecting flatness and generalization are made precise in Appendix D.4.

Theorem 6 (informal).
Consider a model f(x, w) = g(wφ(x)) as above, a loss function ℓ and a sample set S, let m denote the dimension of the feature space defined by φ, and let δ be a positive (small) real number. Let ω denote a local minimizer of the empirical risk on a sample set S. If the distribution D has a smooth density p^φ_D on the feature space ℝ^m with approximately locally constant labels around the points x ∈ S, then it holds with probability 1 over sample sets S that

$$E_{gen}(f(\cdot, \omega), S) \le |S|^{-\frac{2}{4+m}} \left( \frac{\kappa^\phi_{Tr}(\omega)}{2m} + C_1(p^\phi_D, L) + C_2(p^\phi_D, L) \right)$$

up to higher orders in |S|^{−1}, for constants C_1, C_2 that depend only on the distribution p^φ_D induced in feature space by φ, the chosen |S|-tuple Λ_δ, and the maximal loss L.

To prove Theorem 6, we bound both ϵ-representativeness and feature robustness in Eq. 5. For that, the main idea is that the family of distributions considered in Proposition 2 has three key properties: (i) it provides an explicit link between the distributions on feature matrices 𝒜_δ used in feature robustness and the family of distributions Λ_δ of ϵ-representativeness (Proposition 2); (ii) it allows us to bound feature robustness using Thm. 5; and (iii) it is simple enough that it allows us to use standard results of kernel density estimation (KDE) to bound representativeness.

Figure 3: The correlation between flatness and generalization increases with the degree of locally constant labels.

Figure 4: Approximation of representativeness via KDE together with relative flatness leads to a tight generalization bound.

Our bound suffers from the curse of dimensionality, but with respect to the chosen feature space instead of the (usually much larger) input space. The dependence on the dimension is a result of using KDE uniformly over all distributions satisfying mild regularity assumptions. In practice, for a given distribution and sample set S, representativeness can be much smaller, which we showcase in a toy example in Sec. 6. In the so-called interpolation regime, where datasets with arbitrarily randomized labels can be fit by the model class, the obtained convergence rate is consistent with the no-free-lunch theorem and the convergence rate derived by Belkin et al. [2] for an interpolation technique using nearest neighbors. A combination of our approach with prior assumptions on the hypotheses or the algorithm, in accordance with statistical learning theory, could potentially achieve faster convergence rates. Our theory is instead based solely on interpolation and aims to understand the role of flatness (a local property) in generalization: if the data is representative in feature layers and if the distribution can be approximated by locally constant labels in these layers, then flatness of the empirical risk surface approximates the generalization gap. Conversely, Equation 7 shows that flatness measures the performance under perturbed features only when labels are kept constant. As a result, we offer an explanation for the often observed correlation between flatness and generalization: real-world data distributions for classification are benign in the sense that small perturbations in feature layers do not change the target class, i.e., they can be approximated by locally constant labels. (Note that the definition of adversarial examples hinges on this assumption of locally constant labels.) In that case, feature robustness is approximated by flatness of the loss surface.
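To make the flatness side of this statement concrete, the following NumPy sketch computes a trace-based relative flatness score for the last layer of a toy model. It assumes that the relative flatness measure takes the form κ^φ_Tr(w) = Σ_{s,s'} ⟨w_s, w_{s'}⟩ Tr(H_{s,s'}), where w_s are the rows of the chosen layer's weight matrix and H_{s,s'} are the corresponding blocks of the Hessian of the empirical risk; this form is suggested by the discussion of scalar products of weight matrices compensating the loss Hessian, but the precise Definition 3, the toy regression model, and the finite-difference Hessian are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 4, 3, 64                       # feature dim, outputs, samples (illustrative)

Phi = rng.normal(size=(n, m))            # assume the features phi(x_i) are given
Ytrue = rng.normal(size=(n, d))          # regression targets (assumption)

def emp_risk(w_flat):
    """Empirical squared-error risk of the last layer psi(w, phi) = w @ phi."""
    w = w_flat.reshape(d, m)
    preds = Phi @ w.T
    return 0.5 * np.mean(np.sum((preds - Ytrue) ** 2, axis=1))

# Least-squares minimizer of the empirical risk (a local, here global, minimum).
w = np.linalg.lstsq(Phi, Ytrue, rcond=None)[0].T          # shape (d, m)

def hessian(fun, x0, eps=1e-4):
    """Dense finite-difference Hessian of fun at x0 (fine for small d*m)."""
    k = x0.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k); e_i[i] = eps
            e_j = np.zeros(k); e_j[j] = eps
            H[i, j] = (fun(x0 + e_i + e_j) - fun(x0 + e_i - e_j)
                       - fun(x0 - e_i + e_j) + fun(x0 - e_i - e_j)) / (4 * eps ** 2)
    return H

H = hessian(emp_risk, w.reshape(-1))                      # (d*m, d*m) loss Hessian

# kappa = sum_{s,s'} <w_s, w_s'> * Tr(H_{s,s'}), with H_{s,s'} the (m x m) block of the
# Hessian taken with respect to rows w_s and w_s' of the layer's weight matrix.
kappa = 0.0
for s in range(d):
    for sp in range(d):
        block = H[s * m:(s + 1) * m, sp * m:(sp + 1) * m]
        kappa += np.dot(w[s], w[sp]) * np.trace(block)
print("relative flatness (trace form):", kappa)
```

For this squared-loss example the off-diagonal blocks H_{s,s'} with s ≠ s' vanish and the score reduces to ‖w‖²_F multiplied by the average squared feature norm, in line with the paper's remark that the measure simplifies to a ridge (Tikhonov) penalty for ordinary least squares.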
If the given data and its feature representation are further ϵ-representative for small ϵ ≈ 0, then flatness becomes the main contributor to the generalization gap, leading to their noisy, but steady, correlation.

## 6 Empirical Validation

We empirically validate the assumptions and consequences of the theoretical results derived above (code is available at https://github.com/kampmichael/relativeFlatnessGeneralization). For that, we first show on a synthetic example that the empirical correlation between flatness and generalization decreases if labels are not locally constant, up to the point where they are no longer correlated. We then show that the novel relative flatness measure correlates strongly with generalization, also in the presence of reparameterizations. Finally, we show in a synthetic experiment that while representativeness cannot be computed without knowing the true data distribution, it can in practice be approximated. This approximation (although technically not a bound anymore) tightly bounds the generalization gap.

Synthetic data distributions for binary classification are generated by sampling 4 Gaussian distributions in feature space (two for each class) with a given distance between their means (class separation). We then sample a dataset S^φ in feature space, train a linear classifier ψ on the sample, randomly draw the weights of a 4-layer MLP φ, and generate the input data as S = (φ^{−1}(S^φ_x), S^φ_y). This yields a dataset S and a model f = ψ ∘ φ such that φ(S) has a given class separation. Details on the experiments are provided in Appdx. C.

Locally constant labels: To validate the necessity of locally constant labels, we measure the correlation between the proposed relative flatness measure and the generalization gap for varying degrees of locally constant labels, as measured by the class separation on the synthetic datasets. For each chosen class separation, we sample 100 random datasets of size 500 on which we measure relative flatness and the generalization gap. Fig. 3 shows the average correlation for different degrees of locally constant labels, showing that the higher the degree, the more correlated flatness is with generalization. If labels are not locally constant, flatness does not correlate with generalization.

Figure 5: The generalization gap for various local minima correlates more strongly with relative flatness than with standard flatness, the Fisher-Rao norm, a PAC-Bayes based measure, and the weight norm (points correspond to local minima).

Approximating representativeness: While representativeness cannot be calculated without knowing the data distribution, it can be approximated from the training sample S by the error of a density estimation on that sample. For that, we use multiple random splits of S into a training set S_train and a test set S_test, train a kernel density estimator on S_train, and measure its error on S_test. Again, details can be found in Appdx. C. The lower the class separation of the synthetic datasets, the harder the learning problem and the less representative a random sample will be. For each sample and its distribution, we compute the generalization gap and the approximation to the generalization bound. The results in Fig. 4 show that the approximated generalization bound tightly bounds the generalization error (note that this approximation is technically not a bound anymore). Moreover, as expected, the bound decreases the easier the learning problems become.
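As a sketch of how such an approximation could be computed in practice, the following uses scikit-learn to repeatedly split a sample, fit a kernel density estimate on one part, and score it on the held-out part. The Gaussian kernel, the fixed bandwidth, and the use of the average held-out negative log-likelihood as the error measure are assumptions of this sketch; the paper's exact protocol is given in Appendix C.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(3)

# Stand-in for the feature representation phi(x) of a training sample S (two blobs, one per class).
S_phi = np.vstack([rng.normal(loc=-1.0, size=(250, 4)),
                   rng.normal(loc=+1.0, size=(250, 4))])

def representativeness_proxy(features, bandwidth=0.5, n_splits=10):
    """Average held-out negative log-likelihood of a KDE fit on random splits of the sample.
    Lower values indicate that held-out points are well covered by densities estimated from
    the rest of the sample, i.e., the sample is (by this proxy) more representative."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(features):
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(features[train_idx])
        scores.append(-kde.score_samples(features[test_idx]).mean())
    return float(np.mean(scores))

print("representativeness proxy:", representativeness_proxy(S_phi))
```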
Relative flatness correlates with generalization: We validate the correlation of relative flatness with the generalization gap in practice by measuring it for 110 different local minima obtained via different learning setups (initialization, learning rate, batch size, and optimization algorithm) of LeNet5 [19] on CIFAR10 [18]. We compare this correlation to the classical Hessian-based flatness measure using the trace of the loss Hessian, the Fisher-Rao norm [21], the PAC-Bayes flatness measure that performed best in the extensive study of Jiang et al. [14], and the L2-norm of the weights. The results in Fig. 5 show that relative flatness indeed has a higher correlation than all competing measures. Of these measures, only the Fisher-Rao norm is reparameterization invariant, but it shows the weakest correlation in the experiment. In Appdx. C we show how reparameterizations of the network significantly reduce the correlation for measures that are not reparameterization invariant.

## 7 Discussion and Conclusion

Contributing to the trustworthiness of machine learning, this paper provides a rigorous connection between flatness and generalization. As is to be expected for a local property, our association between flatness and generalization requires the sample and its representation in feature layers to be representative of the target distribution. But our derivation uncovers a second, usually overlooked condition: flatness of the loss surface measures the performance of a model close to training points when labels are kept locally constant. If a data distribution violates this, then flatness cannot be a good indicator of generalization.

Whenever we consider feature representations other than the input features, the derivation of our results makes one strong assumption: the existence of a target output function y(φ(x)) on the feature space φ(X). By moving assumptions on the distribution from the input space to the feature space, we achieve a bound based on interpolation that depends on the dimension of the feature layer instead of the input space. Hence, we assume that the feature representation is reasonable and does not lose information that is necessary for predicting the output. To achieve faster convergence rates independent of any involved dimensions, future work could aim to combine our interpolation approach with a prior-based approach from statistical learning theory.

Our measure of relative flatness may still be improved in future work. Better estimates for the generalization gap are possible by improving the representativeness of the local distributions in two ways: the support shape of the local distributions can be improved, and their volume parameter δ can be chosen optimally. Both improvements will affect the derivation of the measure of relative flatness as an estimate of feature robustness for the corresponding distributions on feature matrices. Whereas different support shapes change the trace to a weighted average of the Hessian eigenvalues, the volume parameter can provide a correcting scaling factor. Both approaches seem promising to us, as our relative measure from Definition 3 already outperforms the competing measures of flatness in our empirical validation.

## Acknowledgements

Cristian Sminchisescu was supported by the European Research Council Consolidator grant SEED, CNCS-UEFISCDI (PN-III-P4-ID-PCE-2016-0535, PN-III-P4-ID-PCCF-2016-0180), the EU Horizon 2020 grant DE-ENIGMA (688835), and SSF. Mario Boley was supported by the Australian Research Council (under DP210100045).
We would like to thank Julia Rosenzweig, Dorina Weichert, Jilles Vreeken, Thomas Gärtner, Asja Fischer, Tatjana Turova, and Alexandru Aleman for the great discussions.

## References

[1] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

[2] Mikhail Belkin, Daniel J. Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.

[3] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of the International Conference on Learning Representations, 2017.

[5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1019–1028. JMLR.org, 2017.

[6] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. AAAI, 2017.

[7] Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, pages 8430–8441, 2018.

[8] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the International Conference on Learning Representations, 2021.

[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. PMLR, 2010.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems, pages 529–536, 1995.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[12] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence, 2018.

[13] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[14] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In Proceedings of the International Conference on Learning Representations, 2020.

[15] M. C. Jones, I. J. McKay, and T.-C. Hu. Variable location and scale kernel density estimation. Annals of the Institute of Statistical Mathematics, 46(3):521–535, 1994.

[16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations, 2017.
[17] Steven G. Krantz and Harold R. Parks. Geometric integration theory. Springer Science and Business Media, 2008.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[19] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. Technical report, AT&T Labs, 2010.

[20] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[21] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[22] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, pages 11611–11622, 2019.

[23] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, volume 28, 2015.

[24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Workshop contribution at the International Conference on Learning Representations, 2015.

[25] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

[26] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[27] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In Proceedings of the International Conference on Learning Representations, 2018.

[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[30] Henning Petzka, Linara Adilova, Michael Kamp, and Cristian Sminchisescu. A reparameterization-invariant flatness measure for deep neural networks. In Workshop on Science meets Engineering of Deep Learning at NeurIPS, 2019.

[31] Akshay Rangamani, Nam H. Nguyen, Abhishek Kumar, Dzung T. Phan, Sang H. Chin, and Trac D. Tran. A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434, 2019.

[32] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. Workshop contribution at the International Conference on Learning Representations, 2019.
[33] Bernard W. Silverman. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, 1986.

[34] Xu Sun, Zhiyuan Zhang, Xuancheng Ren, Ruixuan Luo, and Liangyou Li. Exploring the vulnerability of deep neural networks: A study of parameter corruption. arXiv preprint arXiv:2006.05620, 2020.

[35] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[36] Andrey Nikolayevich Tikhonov, A. Goncharsky, V. V. Stepanov, and Anatolij Grigorevic Yagola. Numerical Methods for the Solution of Ill-Posed Problems. Mathematics and Its Applications. Springer Netherlands, 1995.

[37] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In Proceedings of the 37th International Conference on Machine Learning, pages 9636–9647, 2020.

[38] Huan Wang, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402, 2018.

[39] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems, volume 33, pages 2958–2969, 2020.

[40] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

[41] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, volume 32, 2019.

[42] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations, 2017.

[43] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, and Tomaso Poggio. Theory of deep learning IIb: Optimization properties of SGD. arXiv preprint arXiv:1801.02254, 2018.

[44] Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. arXiv preprint arXiv:2010.04925, 2020.