# Relative Flatness and Generalization

Henning Petzka (Lund University, Sweden) henning.petzka@math.lth.se
Michael Kamp (CISPA Helmholtz Center for Information Security, Germany, and Monash University, Australia) michael.kamp@monash.edu
Linara Adilova (Ruhr University Bochum, Germany, and Fraunhofer IAIS)
Cristian Sminchisescu (Lund University, Sweden, and Google Research, Switzerland)
Mario Boley (Monash University, Australia)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Flatness of the loss curve is conjectured to be connected to the generalization ability of machine learning models, in particular neural networks. While it has been empirically observed that flatness measures consistently correlate strongly with generalization, it is still an open theoretical problem why and under which circumstances flatness is connected to generalization, in particular in light of reparameterizations that change certain flatness measures but leave generalization unchanged. We investigate the connection between flatness and generalization by relating it to the interpolation from representative data, deriving notions of representativeness and feature robustness. These notions allow us to rigorously connect flatness and generalization and to identify conditions under which the connection holds. Moreover, they give rise to a novel, but natural, relative flatness measure that correlates strongly with generalization, simplifies to ridge regression for ordinary least squares, and solves the reparameterization issue.

## 1 Introduction

Flatness of the loss curve has been identified as a potential predictor for the generalization abilities of machine learning models [6, 10, 11]. In particular for neural networks, it has been repeatedly observed that generalization performance correlates with measures of flatness, i.e., measures that quantify the change in loss under perturbations of the model parameters [4, 8, 16, 21, 34, 39, 41, 44]. In fact, Jiang et al. [14] perform a large-scale empirical study and find that flatness-based measures have a higher correlation with generalization than alternatives like weight norms, margin-, and optimization-based measures. It is an open problem why and under which circumstances this correlation holds, in particular in the light of negative results on reparametrizations of ReLU neural networks [5]: these reparameterizations change traditional measures of flatness, yet leave the model function and its generalization unchanged, making these measures unreliable. We present a novel and rigorous approach to understanding the connection between flatness and generalization by relating it to the interpolation from representative samples. Using this theory we, for the first time, identify conditions under which flatness explains generalization. At the same time, we derive a measure of relative flatness that simplifies to ridge/Tikhonov regularization for ordinary least squares [36], and resolves the reparametrization issue for ReLU networks [5] by appropriately taking the norm of parameters into account, as suggested by Neyshabur et al. [25].

Figure 1: Decomposition of f = ψ ∘ φ into a feature extractor φ and a model ψ for neural networks.

Figure 2: Overview: We theoretically connect a notion of representative data with a notion of feature robustness and a novel measure of flatness of the loss surface.
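To make the decomposition in Figure 1 concrete, here is a minimal NumPy sketch that splits a small multilayer perceptron into a feature extractor φ (all layers up to a chosen feature layer) and a model ψ(w, φ(x)) = g(wφ(x)) acting on top of it. The layer sizes, the ReLU activations, and the softmax output g are illustrative assumptions of this sketch, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: input, two hidden layers, feature layer of dimension m, output.
sizes = [8, 16, 16, 4, 3]
weights = [rng.normal(scale=0.3, size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]

def phi(x):
    """Feature extractor: all layers up to the chosen feature layer, with ReLU activations."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return h                               # feature vector of dimension m = sizes[-2]

def g(z):
    """Output map; here an (assumed) softmax for classification."""
    e = np.exp(z - z.max())
    return e / e.sum()

def psi(w, features):
    """Model on top of the feature space: psi(w, phi(x)) = g(w @ phi(x))."""
    return g(w @ features)

x = rng.normal(size=sizes[0])
w = weights[-1]                            # the last-layer weights play the role of w
print(psi(w, phi(x)))                      # f(x) = psi(w, phi(x)) = g(w phi(x))
```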
Formally, we connect flatness of the loss surface to the generalization gap E_gen(f, S) = E(f) − E_emp(f, S) of a model f : X → Y from a model class H with respect to a twice differentiable loss function ℓ : Y × Y → ℝ+ and a finite sample set S ⊂ X × Y, where

$$E(f) = \mathbb{E}_{(x,y)\sim D}\big[\ell(f(x), y)\big] \qquad\text{and}\qquad E_{emp}(f, S) = \frac{1}{|S|}\sum_{(x,y)\in S} \ell(f(x), y).$$

That is, E_gen(f, S) is the difference between the risk E(f) and the empirical risk E_emp(f, S) of f on a finite sample set S drawn iid according to a data distribution D on X × Y.

To connect flatness to generalization, we start by decomposing the generalization gap into two terms: a representativeness term that quantifies how well a distribution D can be approximated using distributions with local support around sample points, and a feature robustness term describing how small changes of feature values affect the model's loss. Here, feature value refers to the features implicitly represented by the model, i.e., we consider models that can be expressed as f(x) = ψ(w, φ(x)) = g(wφ(x)) with a feature extractor φ and a model ψ (which includes linear and kernel models, as well as most neural networks, see Fig. 1). With this decomposition, we measure the generalization ability of a particular model by how well its interpolation between samples in feature space fits the underlying data distribution. We then connect feature robustness (a property of the feature space) to flatness (a property of the parameter space) using the following key identity: multiplicative perturbations in feature space by arbitrary matrices A ∈ ℝ^{m×m} correspond to perturbations in parameter space, i.e.,

$$\psi(w, \phi(x) + A\phi(x)) = g\big(w(\phi(x) + A\phi(x))\big) = g\big((w + wA)\phi(x)\big) = \psi(w + wA, \phi(x)). \tag{1}$$

Using this key equation, we show that feature robustness is approximated by a novel, but natural, loss-Hessian-based relative flatness measure under the assumption that the distribution can be approximated by locally constant labels. Under this assumption, and if the data is representative, flatness is the main predictor of generalization (see Fig. 2 for an illustration). This offers an explanation for the correlation of flatness with generalization on many real-world data distributions for image classification [14, 21, 26], where the assumption of locally constant labels is reasonable (the definition of adversarial examples [35] even hinges on this assumption). This dependence on locally constant labels has not been uncovered by previous theoretical analyses [26, 37]. Moreover, we show that the resulting relative flatness measure is invariant to linear reparameterization and has a stronger correlation with generalization than other flatness measures [14, 21, 26]. Other measures have been proposed that similarly achieve invariance under reparameterizations [21, 37], but the Fisher-Rao norm [21] lacks a strong theoretical connection to generalization, while our measure has a more natural form than normalized sharpness [37] and, for neural networks, considers only a single layer, given by the decomposition of f (including the possibility of choosing the input layer when φ = id_X). An extended comparison to related work is provided in Appdx. A.

The limitations of our analysis are as follows. We assume a noise-free setting where for each x ∈ X there is a unique y = y(x) ∈ Y such that P_{(x,y)∼D}(y | x) = 1, and this assumption is also extended to the feature space of the given model, i.e., we assume that φ(x) = φ(x′) implies y(x) = y(x′) for all x, x′ ∈ X, and write y(x) = y(φ(x)).
Moreover, we assume that the marginal distribution D_X is described by a density function p_D(x), that f(x) = ψ(w, φ(x)) = g(wφ(x)) is a local minimizer of the empirical risk on S, and that g, ψ, φ are twice differentiable.

Quantifying the representativeness of a dataset precisely is challenging since the data distribution is unknown. Using results from density estimation, we derive a worst-case bound on representativeness for all data distributions that fulfill mild regularity assumptions in feature space φ(X), i.e., a smooth density function p_{φ(D)} such that the integrals $\int_{z\in\phi(X)} \nabla^2 p_{\phi(D)}(z)\,\|z\|^2\, dz$ and $\int_{z\in\phi(X)} p_{\phi(D)}(z)/\|z\|^m\, dz$ are well-defined and finite. This yields a generalization bound incorporating flatness. In contrast to the common bounds of statistical learning theory, the bound depends on the feature dimension. The dimension-dependence is a result of the interpolation approach (applying density estimation uniformly over all distributions that satisfy the mild regularity assumptions). The bound is consistent with the no-free-lunch theorem and the convergence rate derived by Belkin et al. [2] for a model based on interpolations. In practical settings, representativeness can be expected to be much smaller than the worst-case bound, which we demonstrate by a synthetic example in Sec. 6. Generally, it is a bound that remains meaningful in the interpolation regime [1, 3, 24], where traditional measures of generalization based on the empirical risk and model class complexity are uninformative [22, 42].

Contribution. In summary, this paper rigorously connects flatness of the loss surface to generalization and shows that this connection requires feature representations such that labels are (approximately) locally constant, which is also validated in a synthetic experiment (Sec. 6). The empirical evaluation shows that this flatness and an approximation to representativeness can tightly bound the generalization gap. Our contributions are: (i) the rigorous connection of flatness and generalization; (ii) novel notions of representativeness and feature robustness that capture the extent to which a model's interpolation between samples fits the data distribution; and (iii) a novel flatness measure that is layer- and neuron-wise reparameterization invariant, reduces to ridge regression for ordinary least squares, and outperforms state-of-the-art flatness measures on CIFAR10.

## 2 Representativeness

In this section, we formalize when a sample set S is representative for a data distribution D.

Partitioning the input space. We choose a partition {V_i | i = 1, ..., |S|} of X such that each element V_i of this partition contains exactly one of the samples x_i from S. The distribution can then be described by a set of densities $p_i(x) = \frac{1}{\alpha_i}\, p_D(x)\, \mathbf{1}_{V_i}(x)$ with support contained in V_i (where 1_{V_i}(x) = 1 if x ∈ V_i and 0 otherwise) and with normalizing factor $\alpha_i = \int_{V_i} p_D(x)\, dx$. Then the risk decomposes as $E(f) = \sum_{i=1}^{|S|} \alpha_i\, \mathbb{E}_{x\sim p_i}[\ell(f(x), y(x))]$. Since x_i ∈ V_i for each i, we can change variables and consider density functions λ*_i(ξ) = p_i(x_i + ξ) with support in a neighborhood around the origin of X. The risk then decomposes as

$$E(f) = \sum_{i=1}^{|S|} \alpha_i\, \mathbb{E}_{\xi\sim\lambda^*_i}\big[\ell(f(x_i + \xi), y(x_i + \xi))\big]. \tag{2}$$

Starting from this identity, we formalize an approximation to the risk: in a practical setting, the distribution p_D is unknown and hence, in the decomposition (2), we have unknown densities λ*_i and unknown normalization factors α_i. We assume that each neighborhood contributes equally to the loss, i.e., we approximate each α_i with 1/|S|.
Then, given a sample set S and an |S|-tuple Λ = (λ_i)_{1≤i≤|S|} of local probability density functions on X with support supp(λ_i) in a neighborhood around the origin 0_X, we call the pair (S, Λ) ϵ-representative for D with respect to a model f and loss ℓ if |E_Rep(f, S, Λ)| ≤ ϵ, where

$$E_{Rep}(f, S, \Lambda) = E(f) - \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda_i}\big[\ell(f(x_i + \xi), y(x_i + \xi))\big]. \tag{3}$$

If the partitions V_i and the distributions λ_i are all chosen optimally, so that the approximation α_i = 1/|S| is exact and λ_i = λ*_i, then E_Rep(f, S, Λ) = 0 by (2). If the support of each λ_i is shrunk to the origin, so that λ_i = δ_0 is a Dirac delta function, then E_Rep(f, S, Λ) = E_gen(f, S) equals the generalization gap. For density functions with an intermediate support, the generalization gap can be decomposed into representativeness and the expected deviation of the loss around the sample points:

$$E_{gen}(f, S) = E_{Rep}(f, S, \Lambda) + \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda_i}\big[\ell(f(x_i + \xi), y(x_i + \xi)) - \ell(f(x_i), y_i)\big].$$

The main idea of our approach to understanding generalization is to use this equality and to control both representativeness and the expected loss deviations for a suitable |S|-tuple of distributions Λ.

From input to feature space. An interesting aspect of ϵ-representativeness is that it can be considered in a feature space instead of the input space. For a model f = (ψ ∘ φ) : X → Y, we can apply our notion to the feature space φ(X) (see Fig. 1 for an illustration). This leads to the notion of ϵ-representativeness in feature space, defined for an |S|-tuple Λ^φ = (λ^φ_i)_{1≤i≤|S|} of densities on φ(X) by replacing x_i with φ(x_i) in (3); we denote it by E^φ_Rep(f, S, Λ^φ). By measuring representativeness in a feature space, this becomes a notion of both the data and the feature representation. In particular, it assumes that a target output function y(φ(x)) also exists for the feature space. We can then decompose the generalization gap E_gen(f, S) of f = (ψ ∘ φ) into

$$E_{gen}(f, S) = E^\phi_{Rep}(f, S, \Lambda^\phi) + \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{E}_{\xi\sim\lambda^\phi_i}\big[\ell(\psi(\phi(x_i) + \xi), y(\phi(x_i) + \xi)) - \ell(f(x_i), y_i)\big].$$

The second term is determined by how the loss changes under small perturbations in the feature space for the samples in S. As before, for λ^φ_i = δ_0 the term in the bracket vanishes and E^φ_Rep(f, S, Λ^φ) = E_gen(f, S). But the decomposition becomes more interesting for distributions with support of nonzero measure around the origin. If the true distribution can be interpolated efficiently in feature space from the samples in S with suitable λ^φ_i, so that E^φ_Rep(f, S, Λ^φ) ≈ 0, then the second term approximately equals the generalization gap, and the generalization gap can be estimated from local properties in feature space around sample points.

## 3 Feature Robustness

Having decomposed the generalization gap into a representativeness term and a second term of loss deviations, we now develop a novel notion of feature robustness that is able to bound the second term for specific families of distributions Λ using the key equation (1). Our definition of feature robustness for a model f = (ψ ∘ φ) : X → Y depends on a small number δ > 0, a sample set S, and a feature selection defined by a matrix A ∈ ℝ^{m×m} of operator norm ‖A‖ ≤ 1. With feature perturbations φ_A(x) = (I + A)φ(x) and

$$E^\phi_F(f, S, A) := \frac{1}{|S|} \sum_{i=1}^{|S|} \Big[\ell\big(\psi(\phi_A(x_i)), y(\phi_A(x_i))\big) - \ell(f(x_i), y_i)\Big], \tag{4}$$

the definition of feature robustness is given as follows.

Definition 1. Let ℓ : Y × Y → ℝ+ denote a loss function, ϵ and δ two positive (small) real numbers, S ⊂ X × Y a finite sample set, and A ∈ ℝ^{m×m} a matrix. A model f(x) = (ψ ∘ φ)(x) with φ(X) ⊂ ℝ^m is called ((δ, S, A), ϵ)-feature robust if E^φ_F(f, S, αA) ≤ ϵ for all 0 ≤ α ≤ δ.
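As an illustration of Definition 1 and of the key identity (1), the following minimal NumPy sketch estimates E^φ_F(f, S, αA) from (4) for a toy model and a fixed feature matrix A, and checks that the feature perturbation (I + A)φ(x) can equivalently be absorbed into the weights as w + wA. The toy feature extractor, the squared loss, and the treatment of labels as locally constant are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 4, 2, 32                     # feature dimension, outputs, sample size (illustrative)

phi = lambda x: np.tanh(x)             # toy feature extractor (assumption)
g = lambda z: z                        # identity output map, so psi(w, v) = w @ v
loss = lambda pred, y: 0.5 * np.sum((pred - y) ** 2)   # squared loss (assumption)

X = rng.normal(size=(n, m))
w = rng.normal(size=(d, m))
Y = np.stack([g(w @ phi(x)) for x in X])               # labels: here the model fits them exactly

def feature_robustness(w, A, alpha):
    """Empirical E^phi_F(f, S, alpha*A) from (4), with phi_A(x) = (I + alpha*A) phi(x).
    Labels are treated as locally constant, y(phi_A(x_i)) = y_i (an assumption of this sketch)."""
    diffs = []
    for x, y in zip(X, Y):
        v = phi(x)
        perturbed = g(w @ (v + alpha * (A @ v)))
        diffs.append(loss(perturbed, y) - loss(g(w @ v), y))
    return float(np.mean(diffs))

A = rng.normal(size=(m, m))
A /= np.linalg.norm(A, 2)              # normalize to operator norm ||A|| = 1
print("E^phi_F(f, S, 0.1*A) =", feature_robustness(w, A, 0.1))

# Key identity (1): perturbing features by A*phi(x) equals perturbing weights by w*A.
v = phi(X[0])
assert np.allclose(g(w @ (v + A @ v)), g((w + w @ A) @ v))
```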
More generally, for a probability distribution 𝒜 on perturbation matrices in ℝ^{m×m}, we define

$$E^\phi_F(f, S, \mathcal{A}) = \mathbb{E}_{A\sim\mathcal{A}}\big[E^\phi_F(f, S, A)\big],$$

and call the model ((δ, S, 𝒜), ϵ)-feature robust on average over 𝒜 if E^φ_F(f, S, α𝒜) ≤ ϵ for 0 ≤ α ≤ δ.

Given a feature extractor φ, feature robustness measures the performance of ψ when feature values are perturbed (with the feature extractor φ kept fixed). This local robustness at sample points differs from the robustness of Xu and Mannor [40], which requires a data-independent partitioning of the input space. The matrix A in feature robustness determines which feature values are perturbed. For each sample, the perturbation is linear in the expression of the feature. Thereby, we only perturb features that are relevant for the output for a given sample and leave feature values unchanged that are not expressed. For φ mapping into an intermediate layer of a neural network, traditionally the activation values of a neuron are considered as feature values, which corresponds to a choice of A as a projection matrix. However, it was shown by Szegedy et al. [35] that, for any other direction v ∈ ℝ^m with ‖v‖ = 1, the values ⟨φ(x), v⟩ obtained from the projection of φ(x) onto v can likewise be semantically interpreted as a feature. This motivates the consideration of general feature matrices A.

Distributions on feature matrices induce distributions on the feature space. Feature robustness is defined in terms of feature matrices (suitable for an application of (1) to connect perturbations of features with perturbations of weights), while the approach exploiting representative data from Section 2 considers distributions on feature vectors, cf. (3). To connect feature robustness to the notion of ϵ-representativeness, we specify for any distribution 𝒜 on matrices A ∈ ℝ^{m×m} an |S|-tuple Λ_𝒜 = (λ_i) of probability density functions λ_i on the feature space ℝ^m with support containing the origin. Multiplication of a feature matrix with a feature vector φ(x_i) defines a feature selection Aφ(x_i), and for each z ∈ ℝ^m there is some feature matrix A with φ(x_i) + z = φ(x_i) + Aφ(x_i) (unless φ(x_i) = 0). Our choice of distributions λ_i on ℝ^m are therefore distributions induced via multiplication of feature vectors φ(x_i) ∈ ℝ^m with matrices A ∈ ℝ^{m×m} sampled from a distribution 𝒜 on feature matrices. Formally, we assume that a Borel measure µ_𝒜 is defined by a probability distribution 𝒜 on matrices in ℝ^{m×m}. We then define Borel measures µ_i on ℝ^m by µ_i(C) = µ_𝒜({A | Aφ(x_i) ∈ C}) for Borel sets C ⊆ ℝ^m. Then λ_i is the probability density function defined by the Borel measure µ_i. As a result, we have for each i that

$$\mathbb{E}_{A\sim\mathcal{A}}\big[\ell(\psi(\phi_A(x_i)), y(\phi_A(x_i)))\big] = \mathbb{E}_{z\sim\lambda_i}\big[\ell(\psi(\phi(x_i) + z), y(\phi(x_i) + z))\big].$$

Feature robustness and generalization. With this construction and a distribution Λ_𝒜 on the feature space induced by a distribution 𝒜 on feature matrices, we have that

$$E(f) = E_{emp}(f, S) + E^\phi_{Rep}(f, S, \Lambda_\mathcal{A}) + E^\phi_F(f, S, \mathcal{A}). \tag{5}$$

Here, 𝒜 can be any distribution on feature matrices, which can be chosen suitably to control how well the corresponding mixture of local distributions approximates the true distribution. The third term then measures how robust the model is in expectation over feature changes for A ∼ 𝒜. In particular, if E^φ_Rep(f, S, Λ_𝒜) ≈ 0, then E_gen(f, S) ≈ E^φ_F(f, S, 𝒜) and the generalization gap is determined by feature robustness.

We end this section by illustrating how distributions on feature matrices induce natural distributions on the feature space. The example will serve in Sec. 5 to deduce a bound on E^φ_Rep(f, S, Λ_𝒜) from kernel density estimation.
Example: Truncated isotropic normal distributions are induced by a suitable distribution on feature matrices. We consider probability distributions K_{δ‖φ(x_i)‖} on feature vectors z ∈ ℝ^m in the feature space, defined by densities k_{δ‖φ(x_i)‖}(0, z) with smooth rotation-invariant kernels of bounded support and bandwidth h,

$$k_h(z_i, z) = \frac{1}{h^m}\, k\!\left(\frac{\|z_i - z\|}{h}\right) \mathbf{1}_{\|z_i - z\| \le h}.$$

For ReLU networks, the model function is unchanged under layer-wise reparameterizations that multiply the weights of one layer by some α > 0 and divide the weights of another layer k ≠ l by α, but common measures of the loss Hessian change. Our measure fixes this issue in relating flatness to generalization, since the change of the loss Hessian is compensated by the multiplication with scalar products of weight matrices, and it is therefore invariant under layer-wise reparameterizations [cf. 26]. It is also invariant to neuron-wise reparameterizations, i.e., multiplying all incoming weights of a neuron by a positive number α and dividing all outgoing weights by α [23], except for neuron-wise reparameterizations of the feature layer φ_l. Using a simple preprocessing step (a neuron-wise reparameterization with the variance over the sample), our proposed measure becomes independent of all neuron-wise reparameterizations.

Theorem 4. Let σ_i denote the variance of the i-th coordinate of φ_l(x) over samples x ∈ S, and let V = diag(σ_1, ..., σ_{n_{l−1}}). If the relative flatness measure κ^l_Tr is applied to the representation

$$f(x) = w^L\sigma\Big(\dots\sigma\big(w^l V\, \sigma\big(V^{-1} w^{l-1}\sigma(\dots\sigma(w^1 x + b^1)\dots) + V^{-1} b^{l-1}\big) + b^l\big)\dots\Big) + b^L,$$

then κ^l_Tr is invariant under all neuron-wise (and layer-wise) reparameterizations.

We now connect flatness with feature robustness: relative flatness approximates feature robustness for a model at a local minimum of the empirical risk, when labels are approximately constant in neighborhoods of the training samples (φ(x), y) ∈ φ(S) in feature space.

Theorem 5. Consider a model f(x, w) = g(wφ(x)) as above, a loss function ℓ and a sample set S, and let O_m ⊂ ℝ^{m×m} denote the set of orthogonal matrices. Let δ be a positive (small) real number and let w = ω ∈ ℝ^{d×m} denote parameters at a local minimum of the empirical risk on a sample set S. If the labels satisfy y(φ_{δA}(x_i)) = y(φ(x_i)) = y_i for all (x_i, y_i) ∈ S and all ‖A‖ ≤ 1, then f(x, ω) is ((δ, S, O_m), ϵ)-feature robust on average over O_m for

$$\epsilon = \frac{\delta^2}{2m}\,\kappa^\phi_{Tr}(\omega) + O(\delta^3).$$

Applying the theorem to Eq. 5 implies that if the data is representative, i.e., E^φ_Rep(f, S, Λ_{𝒜_δ}) ≈ 0 for the distribution 𝒜_δ of Prop. 2, then E_gen(f(·, ω), S) ≈ (δ²/2m) κ^φ_Tr(ω) + O(δ³). The assumption of locally constant labels in Thm. 5 can be relaxed to approximately locally constant labels without unraveling the theoretical connection between flatness and feature robustness. Appendix B investigates the consequences of dropping even the assumption of approximately locally constant labels.

## 5 Flatness and Generalization

Combining the results from Sections 2 to 4, we connect flatness to the generalization gap when the distribution can be represented by smooth probability densities on a feature space with approximately locally constant labels. By approximately locally constant labels we mean that, for small δ, the loss in δ‖φ(x_i)‖-neighborhoods around the feature vector of a training sample x_i is approximated (on average over all training samples) by the loss for the constant label y(x_i) on these neighborhoods. This and the following theorem connecting flatness and generalization are made precise in Appendix D.4.

Theorem 6 (informal).
Consider a model f(x, w) = g(wφ(x)) as above, a loss function ℓ and a sample set S, let m denote the dimension of the feature space defined by φ, and let δ be a positive (small) real number. Let ω denote a local minimizer of the empirical risk on a sample set S. If the distribution D has a smooth density p^φ_D on the feature space ℝ^m with approximately locally constant labels around the points x ∈ S, then it holds with probability 1 over sample sets S that

$$E_{gen}(f(\cdot, \omega), S) \le |S|^{-\frac{2}{4+m}} \left( \frac{\kappa^\phi_{Tr}(\omega)}{2m} + C_1(p^\phi_D, L) + C_2(p^\phi_D, L) \right)$$

up to higher orders in |S|^{−1}, for constants C_1, C_2 that depend only on the distribution p^φ_D induced in feature space by φ, the chosen |S|-tuple Λ_δ, and the maximal loss L.

To prove Theorem 6, we bound both ϵ-representativeness and feature robustness in Eq. 5. For that, the main idea is that the family of distributions considered in Proposition 2 has three key properties: (i) it provides an explicit link between the distributions on feature matrices 𝒜_δ used in feature robustness and the family of distributions Λ_δ of ϵ-representativeness (Proposition 2); (ii) it allows us to bound feature robustness using Thm. 5; and (iii) it is simple enough that it allows us to use standard results of kernel density estimation (KDE) to bound representativeness.

Figure 3: The correlation between flatness and generalization increases with the degree of locally constant labels.

Figure 4: Approximation of representativeness via KDE together with relative flatness leads to a tight generalization bound.

Our bound suffers from the curse of dimensionality, but with respect to the chosen feature space instead of the (usually much larger) input space. The dependence on the dimension is a result of using KDE uniformly over all distributions satisfying mild regularity assumptions. In practice, for a given distribution and sample set S, representativeness can be much smaller, which we showcase in a toy example in Sec. 6. In the so-called interpolation regime, where datasets with arbitrarily randomized labels can be fit by the model class, the obtained convergence rate is consistent with the no-free-lunch theorem and the convergence rate derived by Belkin et al. [2] for an interpolation technique using nearest neighbors. A combination of our approach with prior assumptions on the hypotheses or the algorithm, in accordance with statistical learning theory, could potentially achieve faster convergence rates. Our theory is instead based solely on interpolation and aims to understand the role of flatness (a local property) in generalization: if the data is representative in feature layers and if the distribution can be approximated by locally constant labels in these layers, then flatness of the empirical risk surface approximates the generalization gap. Conversely, Equation 7 shows that flatness measures the performance under perturbed features only when labels are kept constant. As a result, we offer an explanation for the often observed correlation between flatness and generalization: real-world data distributions for classification are benign in the sense that small perturbations in feature layers do not change the target class, i.e., they can be approximated by locally constant labels. (Note that the definition of adversarial examples hinges on this assumption of locally constant labels.) In that case, feature robustness is approximated by flatness of the loss surface.
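To make the flatness side of this statement concrete, the following NumPy sketch computes a trace-based relative flatness score for the last layer of a toy model. It assumes that the relative flatness measure takes the form κ^φ_Tr(w) = Σ_{s,s'} ⟨w_s, w_{s'}⟩ Tr(H_{s,s'}), where w_s are the rows of the chosen layer's weight matrix and H_{s,s'} are the corresponding blocks of the Hessian of the empirical risk; this form is suggested by the discussion of scalar products of weight matrices compensating the loss Hessian, but the precise Definition 3, the toy regression model, and the finite-difference Hessian are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 4, 3, 64                       # feature dim, outputs, samples (illustrative)

Phi = rng.normal(size=(n, m))            # assume the features phi(x_i) are given
Ytrue = rng.normal(size=(n, d))          # regression targets (assumption)

def emp_risk(w_flat):
    """Empirical squared-error risk of the last layer psi(w, phi) = w @ phi."""
    w = w_flat.reshape(d, m)
    preds = Phi @ w.T
    return 0.5 * np.mean(np.sum((preds - Ytrue) ** 2, axis=1))

# Least-squares minimizer of the empirical risk (a local, here global, minimum).
w = np.linalg.lstsq(Phi, Ytrue, rcond=None)[0].T          # shape (d, m)

def hessian(fun, x0, eps=1e-4):
    """Dense finite-difference Hessian of fun at x0 (fine for small d*m)."""
    k = x0.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k); e_i[i] = eps
            e_j = np.zeros(k); e_j[j] = eps
            H[i, j] = (fun(x0 + e_i + e_j) - fun(x0 + e_i - e_j)
                       - fun(x0 - e_i + e_j) + fun(x0 - e_i - e_j)) / (4 * eps ** 2)
    return H

H = hessian(emp_risk, w.reshape(-1))                      # (d*m, d*m) loss Hessian

# kappa = sum_{s,s'} <w_s, w_s'> * Tr(H_{s,s'}), with H_{s,s'} the (m x m) block of the
# Hessian taken with respect to rows w_s and w_s' of the layer's weight matrix.
kappa = 0.0
for s in range(d):
    for sp in range(d):
        block = H[s * m:(s + 1) * m, sp * m:(sp + 1) * m]
        kappa += np.dot(w[s], w[sp]) * np.trace(block)
print("relative flatness (trace form):", kappa)
```

For this squared-loss example the off-diagonal blocks H_{s,s'} with s ≠ s' vanish and the score reduces to ‖w‖²_F multiplied by the average squared feature norm, in line with the paper's remark that the measure simplifies to a ridge (Tikhonov) penalty for ordinary least squares.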
If the given data and its feature representation are further ϵ-representative for small ϵ ≈ 0, then flatness becomes the main contributor to the generalization gap, leading to their noisy, but steady, correlation.

## 6 Empirical Validation

We empirically validate the assumptions and consequences of the theoretical results derived above (code is available at https://github.com/kampmichael/relativeFlatnessGeneralization). For that, we first show on a synthetic example that the empirical correlation between flatness and generalization decreases if labels are not locally constant, up to the point where they are no longer correlated. We then show that the novel relative flatness measure correlates strongly with generalization, also in the presence of reparameterizations. Finally, we show in a synthetic experiment that while representativeness cannot be computed without knowing the true data distribution, it can in practice be approximated. This approximation (although technically not a bound anymore) tightly bounds the generalization gap.

Synthetic data distributions for binary classification are generated by sampling 4 Gaussian distributions in feature space (two for each class) with a given distance between their means (class separation). We then sample a dataset S^φ in feature space, train a linear classifier ψ on the sample, randomly draw the weights of a 4-layer MLP φ, and generate the input data as S = (φ^{−1}(S^φ_x), S^φ_y). This yields a dataset S and a model f = ψ ∘ φ such that φ(S) has a given class separation. Details on the experiments are provided in Appdx. C.

Locally constant labels: To validate the necessity of locally constant labels, we measure the correlation between the proposed relative flatness measure and the generalization gap for varying degrees of locally constant labels, as measured by the class separation on the synthetic datasets. For each chosen class separation, we sample 100 random datasets of size 500 on which we measure relative flatness and the generalization gap. Fig. 3 shows the average correlation for different degrees of locally constant labels, showing that the higher the degree, the more correlated flatness is with generalization. If labels are not locally constant, flatness does not correlate with generalization.

Figure 5: The generalization gap for various local minima correlates more strongly with relative flatness than with standard flatness, the Fisher-Rao norm, a PAC-Bayes based measure, and the weight norm (points correspond to local minima).

Approximating representativeness: While representativeness cannot be calculated without knowing the data distribution, it can be approximated from the training sample S by the error of a density estimation on that sample. For that, we use multiple random splits of S into a training set S_train and a test set S_test, train a kernel density estimator on S_train, and measure its error on S_test. Again, details can be found in Appdx. C. The lower the class separation of the synthetic datasets, the harder the learning problem and the less representative a random sample will be. For each sample and its distribution, we compute the generalization gap and the approximation to the generalization bound. The results in Fig. 4 show that the approximated generalization bound tightly bounds the generalization error (note that this approximation is technically not a bound anymore). Moreover, as expected, the bound decreases the easier the learning problems become.
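As a sketch of how such an approximation could be computed in practice, the following uses scikit-learn to repeatedly split a sample, fit a kernel density estimate on one part, and score it on the held-out part. The Gaussian kernel, the fixed bandwidth, and the use of the average held-out negative log-likelihood as the error measure are assumptions of this sketch; the paper's exact protocol is given in Appendix C.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(3)

# Stand-in for the feature representation phi(x) of a training sample S (two blobs, one per class).
S_phi = np.vstack([rng.normal(loc=-1.0, size=(250, 4)),
                   rng.normal(loc=+1.0, size=(250, 4))])

def representativeness_proxy(features, bandwidth=0.5, n_splits=10):
    """Average held-out negative log-likelihood of a KDE fit on random splits of the sample.
    Lower values indicate that held-out points are well covered by densities estimated from
    the rest of the sample, i.e., the sample is (by this proxy) more representative."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(features):
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(features[train_idx])
        scores.append(-kde.score_samples(features[test_idx]).mean())
    return float(np.mean(scores))

print("representativeness proxy:", representativeness_proxy(S_phi))
```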
Relative flatness correlates with generalization: We validate the correlation of relative flatness with the generalization gap in practice by measuring it for 110 different local minima obtained via different learning setups (initialization, learning rate, batch size, and optimization algorithm) of LeNet5 [19] on CIFAR10 [18]. We compare this correlation to the classical Hessian-based flatness measure using the trace of the loss Hessian, the Fisher-Rao norm [21], the PAC-Bayes flatness measure that performed best in the extensive study of Jiang et al. [14], and the L2-norm of the weights. The results in Fig. 5 show that relative flatness indeed has a higher correlation than all competing measures. Of these measures, only the Fisher-Rao norm is reparameterization invariant, but it shows the weakest correlation in the experiment. In Appdx. C we show how reparameterizations of the network significantly reduce the correlation for measures that are not reparameterization invariant.

## 7 Discussion and Conclusion

Contributing to the trustworthiness of machine learning, this paper provides a rigorous connection between flatness and generalization. As is to be expected for a local property, our association between flatness and generalization requires the sample and its representation in feature layers to be representative of the target distribution. But our derivation uncovers a second, usually overlooked condition: flatness of the loss surface measures the performance of a model close to training points when labels are kept locally constant. If a data distribution violates this, then flatness cannot be a good indicator of generalization.

Whenever we consider feature representations other than the input features, the derivation of our results makes one strong assumption: the existence of a target output function y(φ(x)) on the feature space φ(X). By moving assumptions on the distribution from the input space to the feature space, we achieve a bound based on interpolation that depends on the dimension of the feature layer instead of the input space. Hence, we assume that the feature representation is reasonable and does not lose information that is necessary for predicting the output. To achieve faster convergence rates independent of any involved dimensions, future work could aim to combine our interpolation approach with a prior-based approach from statistical learning theory.

Our measure of relative flatness may still be improved in future work. Better estimates for the generalization gap are possible by improving the representativeness of the local distributions in two ways: the support shape of the local distributions can be improved, and their volume parameter δ can be chosen optimally. Both improvements will affect the derivation of the measure of relative flatness as an estimate of feature robustness for the corresponding distributions on feature matrices. Whereas different support shapes change the trace to a weighted average of the Hessian eigenvalues, the volume parameter can provide a correcting scaling factor. Both approaches seem promising to us, as our relative measure from Definition 3 already outperforms the competing measures of flatness in our empirical validation.

## Acknowledgements

Cristian Sminchisescu was supported by the European Research Council Consolidator grant SEED, CNCS-UEFISCDI (PN-III-P4-ID-PCE-2016-0535, PN-III-P4-ID-PCCF-2016-0180), the EU Horizon 2020 grant DE-ENIGMA (688835), and SSF. Mario Boley was supported by the Australian Research Council (under DP210100045).
We would like to thank Julia Rosenzweig, Dorina Weichert, Jilles Vreeken, Thomas Gärtner, Asja Fischer, Tatjana Turova, and Alexandru Aleman for the great discussions.

## References

[1] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

[2] Mikhail Belkin, Daniel J. Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.

[3] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[4] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of the International Conference on Learning Representations, 2017.

[5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1019–1028. JMLR.org, 2017.

[6] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. AAAI, 2017.

[7] Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, pages 8430–8441, 2018.

[8] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the International Conference on Learning Representations, 2021.

[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. PMLR, 2010.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems, pages 529–536, 1995.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[12] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In 34th Conference on Uncertainty in Artificial Intelligence, 2018.

[13] Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

[14] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In Proceedings of the International Conference on Learning Representations, 2020.

[15] M. C. Jones, I. J. McKay, and T.-C. Hu. Variable location and scale kernel density estimation. Annals of the Institute of Statistical Mathematics, 46(3):521–535, 1994.

[16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations, 2017.
[17] Steven G. Krantz and Harold R. Parks. Geometric integration theory. Springer Science and Business Media, 2008.

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[19] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. Technical report, AT&T Labs, 2010.

[20] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.

[21] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[22] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, pages 11611–11622, 2019.

[23] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, volume 28, 2015.

[24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Workshop contribution at the International Conference on Learning Representations, 2015.

[25] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

[26] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[27] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In Proceedings of the International Conference on Learning Representations, 2018.

[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[30] Henning Petzka, Linara Adilova, Michael Kamp, and Cristian Sminchisescu. A reparameterization-invariant flatness measure for deep neural networks. In Workshop on Science meets Engineering of Deep Learning at NeurIPS, 2019.

[31] Akshay Rangamani, Nam H. Nguyen, Abhishek Kumar, Dzung T. Phan, Sang H. Chin, and Trac D. Tran. A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434, 2019.

[32] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. Workshop contribution at the International Conference on Learning Representations, 2019.
[33] Bernard W. Silverman. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman and Hall, 1986.

[34] Xu Sun, Zhiyuan Zhang, Xuancheng Ren, Ruixuan Luo, and Liangyou Li. Exploring the vulnerability of deep neural networks: A study of parameter corruption. arXiv preprint arXiv:2006.05620, 2020.

[35] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[36] Andrey Nikolayevich Tikhonov, A. Goncharsky, V. V. Stepanov, and Anatolij Grigorevic Yagola. Numerical Methods for the Solution of Ill-Posed Problems. Mathematics and Its Applications. Springer Netherlands, 1995.

[37] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In Proceedings of the 37th International Conference on Machine Learning, pages 9636–9647, 2020.

[38] Huan Wang, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402, 2018.

[39] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems, volume 33, pages 2958–2969, 2020.

[40] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

[41] Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, volume 32, 2019.

[42] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations, 2017.

[43] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, and Tomaso Poggio. Theory of deep learning IIb: Optimization properties of SGD. arXiv preprint arXiv:1801.02254, 2018.

[44] Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. arXiv preprint arXiv:2010.04925, 2020.