# Towards Robust Out-of-Distribution Generalization Bounds via Sharpness

Published as a conference paper at ICLR 2024

Yingtian Zou, Kenji Kawaguchi, Yingnan Liu, Jiashuo Liu, Mong-Li Lee, Wynne Hsu
School of Computing, National University of Singapore
Institute of Data Science, National University of Singapore
Department of Computer Science & Technology, Tsinghua University, China

ABSTRACT

Generalizing to out-of-distribution (OOD) data or unseen domains, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measures between source and target domains but fail to consider the optimization properties of the learned model. As empirically shown in recent work, the sharpness of the learned minimum influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data change under domain shift, which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which yields better OOD guarantees for robust algorithms. It also provides theoretical backing for the observation that "flat minima lead to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound that takes robustness into consideration, resulting in a tighter bound than non-robust guarantees. Our findings are supported by experiments on a ridge regression model, as well as experiments on deep learning classification tasks.

1 INTRODUCTION

Machine learning systems are typically trained on data from a given distribution and achieve good performance on new, unseen data that follows the same distribution as the training data. Out-of-Distribution (OOD) generalization requires machine learning systems trained on a source domain to generalize to unseen data or target domains whose distributions differ from the source domain. A myriad of algorithms (Sun & Saenko, 2016; Arjovsky et al., 2019; Sagawa et al., 2019; Koyama & Yamaguchi, 2020; Pezeshki et al., 2021; Ahuja et al., 2021) aim to learn components that are invariant under distribution shift. Optimization-based methods such as (El Ghaoui & Lebret, 1997; Duchi & Namkoong, 2018; Liu et al., 2021; Rame et al., 2022) focus on maximizing robustness by optimizing the worst-case error over an uncertainty set of distributions. While these methods are sophisticated, they do not always perform better than Empirical Risk Minimization (ERM) when evaluated across different datasets (Gulrajani & Lopez-Paz, 2021; Wiles et al., 2022). This raises the question of how to understand the OOD generalization of algorithms and which criteria should be used to select models that are provably better (Gulrajani & Lopez-Paz, 2021). These questions highlight the need for more theoretical research in the field of OOD generalization (Ye et al., 2021).

To characterize the generalization gap between the source domain and the target domain, a canonical method (Blitzer et al., 2007) from domain adaptation theory decouples this gap into an In-Distribution (ID) generalization term and a hypothesis-specific Out-of-Distribution (OOD) distance. However, this distance is based on the notion of VC-dimension (Kifer et al., 2004), resulting in a loose bound due to the large size of the hypothesis class in modern overparameterized neural networks.
Subsequent works improve the bound based on Rademacher complexity (Du et al., 2017), whereas Germain et al. (2016) improve the bound based on PAC-Bayes. Unlike the present paper, these works did not consider algorithmic robustness, which has a natural interpretation and clear advantages under distribution shift.

In this work, we consider algorithmic robustness to derive the OOD generalization bound. The key idea is to partition the input space into K non-overlapping subspaces such that the difference in the model's loss between any pair of points in each subspace is bounded by some constant ϵ. Within each subspace, any distributional shift is considered subtle for a robust model and thus has less impact on OOD generalization. Figure 1 illustrates this with two distributions where the target domain has a distributional shift from the source domain. Compared to existing non-robust OOD generalization bounds (Zhao et al., 2018), our new generalization error does not depend on the hypothesis size, which is more reliable in the overparameterized regime. Our goal is to measure the generalizability of a model by considering how robust it is to this shift, and thereby to achieve a tighter bound than existing works.

Figure 1: An example of a target distribution (red) directly translated from a source distribution (blue). The 1D density reflects the marginal distribution. Unlike existing works (left), we divide the distributions into disjoint partitions, as a small change in distribution is negligible for a robust model (right). The sharpness of the model decides the tolerance of change and thus affects the partitions. If two sub-distributions S, T have small shifts such that they fall into the same partition (red grid), their distance measure d'(S, T) under robustness will be zero.

Although robustness captures the tolerance to distributional shift, it is intractable to compute the robustness constant ϵ due to the inaccessibility of the target distribution. The robustness definition in Xu & Mannor (2012) indicates that the loss landscape induced by the model's parameters is closely tied to its robustness. To gain a deeper understanding of robustness, we further study the learned model from an optimization perspective. As shown in (Lyu et al., 2022; Petzka et al., 2021), when the loss landscape is "flat", generalization is good, which is also observed in OOD settings (Izmailov et al., 2018; Cha et al., 2021). However, the relationship between robustness and this geometric property of the loss landscape, termed sharpness, remains an open question. In this paper, we establish a provable dependence between robustness and sharpness for ReLU random neural network classes. This allows us to replace the robustness constant ϵ with the sharpness of a learned model, which is computed only from the training dataset and thus addresses the problem mentioned above. Our result on the interplay between robustness and sharpness can be applied to both ID and OOD generalization bounds. We also show an example that generalizes our result beyond our assumptions and validate it empirically. Our main contributions can be summarized as follows:

- We propose a new framework for Out-of-Distribution/Out-of-Domain generalization bounds. In this framework, we use robustness to capture the tolerance of distribution shift, which generally leads to tighter upper bounds.
- We reveal the underlying connection between robustness and the sharpness of the loss landscape, and use this connection to enrich our robust OOD bounds for one-hidden-layer ReLU NNs. This is the first optimization-based bound for Out-of-Distribution/Domain generalization.
- We study two cases, ridge regression and classification, which support and generalize our main theorem well. All the experimental results corroborate our findings.

2 PRELIMINARY

Notations. We use $[n]$ to denote the integer set $\{i\}_{i=1}^n$. We use $\|\cdot\|$ to denote the $\ell_2$-norm (Euclidean norm). In vector form, $w_i$ denotes the $i$-th instance while $w^j$ is the $j$-th element of the vector $w$, and we use $|w|$ for the element-wise absolute value of the vector $w$. We use $n, d$ for the training set size and input dimension. $O$ is the Big-O notation.

2.1 PROBLEM FORMULATION

Consider a source domain and a target domain of the OOD generalization problem, where we use $D_S, D_T$ to represent the source and target distributions respectively. Let each $D$ be the probability measure of a sample $z$ from the sample space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ with $\mathcal{X} \subseteq \mathbb{R}^d$. In the source domain, we have a training set $S = \{z_i\}_{i=1}^n$, $i \in [n]$, $z_i \sim D_S$, and the goal is to learn a model $f \in \mathcal{F}$ from $S$ with parameters $\theta \in \Theta$, where $f: \Theta \times \mathcal{X} \to \mathbb{R}$, that generalizes well. Given a loss function $\ell: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, the expected risk over the source distribution $D_S$ and its empirical counterpart are
$$L_S(f_\theta) \triangleq \mathbb{E}_{z \sim D_S}[\ell_\theta(z)] = \mathbb{E}_{z \sim D_S}[\ell(f(\theta, x), y)], \qquad \hat{L}_S(f_\theta) \triangleq \frac{1}{n} \sum_{z_i \in S} \ell_\theta(z_i),$$
where we use $\ell_\theta(z)$ as shorthand for $\ell(f(\theta, x), y)$. OOD generalization measures the gap between the target domain expected risk $L_T(f_\theta)$ and the source domain empirical risk $\hat{L}_S(f_\theta)$, which involves two parts: (1) the in-domain generalization error gap between the empirical risk and the expected risk $L_S(f_\theta)$ in the source domain; (2) the out-of-domain distance between the source and target domains. A model-agnostic example in Zhao et al. (2018) gives the following uniform bound:

Proposition 2.1 (Zhao et al. (2018), Theorems 2 & 3.2). With hypothesis class $\mathcal{F}$ and pseudo dimension $\mathrm{Pdim}(\mathcal{F}) = d'$, and unlabeled empirical datasets $\hat{D}_S$ and $\hat{D}_T$ of size $n$ each drawn from the source and target distributions, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$L_T(f) \le \hat{L}_S(f) + \frac{1}{2} d_{\mathcal{F}\Delta\mathcal{F}}\big(\hat{D}_T; \hat{D}_S\big) + O\Big(\sqrt{|\mathcal{F}|/n}\Big),$$
where $d_{\mathcal{F}\Delta\mathcal{F}}(\hat{D}_T; \hat{D}_S) := 2 \sup_{A_f \in \mathcal{A}_{\{f(x) \oplus f'(x): f, f' \in \mathcal{F}\}}} \big| P_{\hat{D}_S}[A_f] - P_{\hat{D}_T}[A_f] \big|$ and $\oplus$ is the XOR operator. The specific form of the $O\big(\sqrt{|\mathcal{F}|/n}\big)$ term is given in Proposition C.6 in the appendix.

2.2 ALGORITHMIC ROBUSTNESS

Definition 2.2 (Robustness, Xu & Mannor (2012)). A learning model $f_\theta$ trained on a training set $S$ is $(K, \epsilon(\cdot))$-robust, for $K \in \mathbb{N}$, if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^K$, such that for all $s \in S$,
$$s, z \in C_i,\ i \in [K] \ \Longrightarrow\ |\ell_\theta(s) - \ell_\theta(z)| \le \epsilon(S).$$

This definition captures the robustness of the model in terms of the input. Within each partitioned set $C_i$, the loss difference between any sample $z$ belonging to $C_i$ and a training sample $s \in C_i$ is upper bounded by the robustness constant $\epsilon(S)$. The generalization result given by Xu & Mannor (2012) provides a framework to bound the expected risk by the empirical risk via algorithmic robustness; it is stated in Appendix C. Based on this framework, we are able to reframe the existing OOD generalization theory. A small numerical sketch of this robustness notion is given below.
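To make Definition 2.2 concrete, the following minimal sketch estimates the robustness constant ε(S) of a trained model for one particular, assumed choice of partition (points assigned to the nearest of K random anchors); this partition, the `loss_fn` signature, and `X_pool` as a stand-in for probe samples from Z are illustrative assumptions, not the construction used in the paper's proofs.

```python
import numpy as np

def robustness_epsilon(loss_fn, X_train, y_train, X_pool, y_pool, K=20, seed=0):
    """Estimate the (K, eps)-robustness constant of Definition 2.2 for ONE
    illustrative partition: every point is assigned to its nearest of K random
    anchor points (a stand-in for the disjoint sets {C_i}). eps(S) is the
    largest |loss(s) - loss(z)| over pairs (s, z) sharing a cell, with s in S.
    `loss_fn(X, y)` is assumed to return per-sample losses of the trained model."""
    rng = np.random.default_rng(seed)
    anchors = X_pool[rng.choice(len(X_pool), size=K, replace=False)]

    def cell(X):  # index of the nearest anchor for each row of X
        dists = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    c_tr, c_po = cell(X_train), cell(X_pool)
    l_tr, l_po = loss_fn(X_train, y_train), loss_fn(X_pool, y_pool)

    eps = 0.0
    for i in range(K):
        s_mask, z_mask = c_tr == i, c_po == i
        if s_mask.any() and z_mask.any():
            gap = np.abs(l_tr[s_mask][:, None] - l_po[z_mask][None, :]).max()
            eps = max(eps, gap)
    return eps
```

A finer partition (larger K) generally shrinks this estimate, at the cost of a larger concentration term in the bounds that follow.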
3 MAIN RESULTS

In this section, we propose a new Out-of-Distribution (OOD) generalization bound for robust algorithms, a setting that has not been extensively studied yet. We then compare our result to the existing domain shift bound in Proposition 2.1 and discuss its implications for OOD and domain generalization problems by considering algorithmic robustness. To further explain the introduced robustness, we connect it to the sharpness of the minimum (a widely studied geometric property in optimization) by showing a rigorous dependence between robustness and sharpness. This interplay gives us a better understanding of the OOD generalization problem and, meanwhile, provides more information in the final generalization bound. Detailed assumptions are clarified in Appendix B.1.

3.1 ROBUST OOD GENERALIZATION BOUND

The main concern in OOD generalization is to measure the domain shift of a learned model. However, existing methods fail to consider intrinsic properties of the model, such as robustness. Definition 2.2 gives us a new robustness measurement for the model trained on dataset $S$, where the training set $S$ is a collection of i.i.d. data pairs $(x, y)$ sampled from the source distribution $D_S$ with size $n$. The measurement provides the intuition that if a test sample from the target domain is similar to a specific group of training samples, their losses will be similar as well. In other words, the model's robustness is a reflection of its ability to generalize to unseen data. Our first theorem shows that by utilizing this "robustness" measurement, we can handle domain shifts more effectively by setting a tolerance range for distribution changes. This robustness measurement therefore provides a useful tool for addressing OOD generalization.

Theorem 3.1. Let $\hat{D}_T$ be the empirical distribution of size $n$ drawn from $D_T$. Assume that the loss $\ell$ is upper bounded by $M$. With probability at least $1 - \delta$ (over the choice of the samples $S$), for every $f_\theta$ trained on $S$ satisfying $(K, \epsilon(S))$-robustness, we have
$$L_T(f_\theta) \le \hat{L}_S(f_\theta) + M\, d_{(\epsilon,K)}(S, \hat{D}_T) + 2\epsilon(S) + 3M \sqrt{\frac{2K \ln 2 + 2 \ln(2/\delta)}{n}}, \tag{1}$$
where the total variation distance $d_{(\epsilon,K)}$ for discrete empirical distributions is defined by
$$\forall i \in [K],\quad n_i(S) := \#(z \in S \cap C_i), \qquad d_{(\epsilon,K)}(S, \hat{D}_T) := \sum_{i=1}^{K} \left| \frac{n_i(S)}{n} - \frac{n_i(\hat{D}_T)}{n} \right|,$$
and $n_i(S), n_i(\hat{D}_T)$ are the numbers of samples from $S$ and $\hat{D}_T$ that fall into the set $C_i$, respectively.

Remark. The result can be decomposed into an in-domain generalization term and an out-of-domain distance $|L_T(f_\theta) - L_S(f_\theta)|$ (please refer to Lemma C.1). Both of them depend on the robustness $\epsilon(S)$. See the proof in Appendix C. The last three terms on the RHS of (1) are the distribution distance, the robustness constant, and the error term, respectively. Unlike traditional distribution measures, we partition the sample space and measure the distributional shift separately within the K sub-groups instead of measuring it point-wise. We argue that $d_{(\epsilon,K)}(S, \hat{D}_T)$ can have zero measure if all small changes happen within the same partition; a 2D illustrative case is shown in Figure 1. Under these circumstances, our distribution distance term will be significantly smaller than that of Proposition 2.1, as the target distribution is essentially a perturbation of the source distribution. As a robust OOD generalization measure, our bound characterizes how robust the learned model is to negligible distributional perturbations. A toy computation of $d_{(\epsilon,K)}$ is sketched below.
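The distance $d_{(\epsilon,K)}$ in Theorem 3.1 is simply a total-variation distance between the two histograms induced by the partition. The sketch below computes it for a generic `cell` assignment function (for instance the anchor-based partition from the previous sketch); the particular partition is an illustrative assumption rather than something prescribed by the theorem.

```python
import numpy as np

def d_eps_K(cell, X_source, X_target, K):
    """Total-variation distance of Theorem 3.1 between the empirical
    histograms of the source set S and the target sample over a fixed
    partition {C_i}_{i=1}^K. `cell(X)` maps inputs to cell indices in [0, K)."""
    n_s = np.bincount(cell(X_source), minlength=K) / len(X_source)
    n_t = np.bincount(cell(X_target), minlength=K) / len(X_target)
    return np.abs(n_s - n_t).sum()

# If the target sample is a small perturbation of the source and every
# perturbed point stays in its original cell, d_eps_K evaluates to 0,
# which is the situation illustrated in Figure 1.
```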
To prevent the bound from expanding excessively, we also propose an alternative tailored to non-robust algorithms ($K \to \infty$) as follows.

Corollary 3.2. If $K \to \infty$, the domain shift bound $|L_T(f_\theta) - L_S(f_\theta)|$ can be replaced by the distribution distance in Proposition 2.1, where
$$|L_T(f_\theta) - L_S(f_\theta)| \le \frac{1}{2} d_{\mathcal{F}\Delta\mathcal{F}}(S; D_T) \le \frac{1}{2} d_{\mathcal{F}\Delta\mathcal{F}}(S; \hat{D}_T) + O\Big(\sqrt{\tfrac{d' \ln n + \ln(1/\delta)}{n}}\Big),$$
where the pseudo dimension is $\mathrm{Pdim}(\mathcal{F}) = d'$ (the explicit form of the $O(\cdot)$ term is given in the proof in Appendix C.1).

The proof is in Appendix C.1. As indicated by Theorem 3.1, when $K \to \infty$, using an infinite number of partitions of the data distribution renders the robustness term meaningless. However, Corollary C.7 suggests that our bound can be replaced by $d_{\mathcal{F}\Delta\mathcal{F}}(D_S; D_T)$ in the limit of infinite $K$. This avoids computing a vacuous bound for non-robust algorithms. In summary, Theorem 3.1 presents a novel approach for quantifying distributional shifts by incorporating the concept of robustness. Our framework is particularly beneficial when a robust algorithm is able to adapt to local shifts in the distribution. Additionally, our data-dependent result remains valid and useful in the overparameterized regime, since $K$ does not depend on the model size.

3.2 SHARPNESS AND ROBUSTNESS

Clearly, robustness is inherently tied to the optimization properties of a model, particularly the curvature of the loss landscape. One direct approach to characterizing this geometric curvature, referred to as "sharpness," is to analyze the Hessian matrix (Foret et al., 2020; Cohen et al., 2021). Recent research (Petzka et al., 2021) has shown that the concept of "relative flatness", the sharpness in this paper, has a strong correlation with model generalization. However, the impact of relative flatness on OOD generalization remains uncertain, even in the convex setting. To address this problem, we investigate the interplay between robustness and sharpness. With the following definition of sharpness, we endeavor to establish an OOD generalization bound rooted in optimization principles.

Definition 3.3 (Sharpness, Petzka et al. (2021)). For a twice differentiable loss function $L(w) = \sum_{s \in S} \ell_w(s)$, $w \in \mathbb{R}^m$, with a sample set $S$, the sharpness is defined by
$$\kappa(w, S, A) := \langle w, w \rangle \, \mathrm{tr}\big(H_{S,A}(w)\big), \tag{4}$$
where $H_{S,A}$ is the Hessian matrix of the loss $L(w)$ w.r.t. $w$ with hypothesis set $A$ and input set $S$.

As per Definition 3.3, sharpness is characterized by the sum of all the eigenvalues of the Hessian matrix, scaled by the parameter norm. Each eigenvalue of the Hessian reflects the rate of change of the loss derivative in the corresponding eigenspace. Therefore, a smaller value of $\kappa$ indicates a flatter minimum. Cha et al. (2021) suggest that flatter minima improve OOD generalization, but do not deliver an elaborate analysis of the Hessian matrix.

In this section, we begin with random ReLU neural networks parameterized by $\theta = (\{a_i\}_{i \in [m]}, w)$, where $w = [w_1, \dots, w_m]$ is the trainable parameter. Let $A = [a_1, \dots, a_m]$; the whole function class is defined as
$$\Big\{ f(w, A, x) \triangleq \frac{1}{\sqrt{m}} \sum_{i=1}^{m} w_i \, \sigma(\langle x, a_i \rangle) : w_i \in \mathbb{R},\ a_i \sim \mathrm{Unif}\big(\mathbb{S}^{d-1}(\sqrt{d})\big),\ i \in [m] \Big\}, \tag{5}$$
where $\sigma(\cdot)$ is the ReLU activation function and the $a_i$ are random vectors uniformly distributed on the hypersphere of radius $\sqrt{d}$, whose surface is a $(d-1)$-manifold. We then consider any convex loss function $\ell(f(w, A, x), y): \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$. The corresponding empirical minimizer in the source domain is $\hat{w} = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(f(w, A, x_i), y_i)$. With $\hat{w}$, we are interested in the loss geometry over the sample domain ($\ell_{\hat{w},A}(z)$ for short). Intuitively, a flatter minimum on the loss landscape is expected to be more robust to varying inputs. Suppose the sample space $\mathcal{Z}$ can be partitioned into $K$ disjoint sets. For each set $C_i$, $i \in [K]$, the loss difference is upper bounded by $\epsilon(S, A)$. Given $z \in S$, we have
$$\epsilon(S, A) \triangleq \max_{i \in [K]} \sup_{z, z' \in C_i} \big| \ell_{\hat{w},A}(z) - \ell_{\hat{w},A}(z') \big|. \tag{6}$$
As an alternative form of robustness, $\epsilon(S, A)$ in (6) captures the "maximum" loss difference between any two samples in each partition and depends on the convexity and smoothness of the loss function in the input domain. The sketch following this paragraph shows how the sharpness in (4) can be computed for this function class.
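As a concrete illustration of Definition 3.3, the sketch below evaluates κ(ŵ, S, A) = ⟨ŵ, ŵ⟩ tr(H) for the random-feature ReLU model of (5) using an exact autograd Hessian; the squared loss, the 1/√m scaling, and the tiny problem sizes are illustrative assumptions rather than choices fixed by the paper.

```python
import torch

torch.manual_seed(0)
n, d, m = 50, 10, 64                                 # tiny illustrative sizes
A = torch.randn(m, d)
A = A / A.norm(dim=1, keepdim=True) * d ** 0.5       # rows ~ sphere of radius sqrt(d)
X = torch.randn(n, d)
y = torch.randn(n)

def empirical_loss(w):
    # Random-feature ReLU model of (5) with a squared loss (illustrative choice).
    preds = torch.relu(X @ A.T) @ w / m ** 0.5
    return ((preds - y) ** 2).sum()

# Fit w with plain gradient descent to obtain an approximate minimizer w_hat.
w = torch.zeros(m, requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    empirical_loss(w).backward()
    opt.step()

# Sharpness of Definition 3.3: kappa = <w, w> * tr(Hessian_w of the loss).
w_hat = w.detach().clone()
H = torch.autograd.functional.hessian(empirical_loss, w_hat)
kappa = w_hat.dot(w_hat) * torch.diagonal(H).sum()
print(f"kappa(w_hat, S, A) = {kappa.item():.4f}")
```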
Given a training set $S$ and any initialization $w_0$, the robustness $\epsilon(S, A)$ of a learned model $f_{\hat{w}}$ is determined. It explicitly reflects the smoothness of the loss function in each (pre-)partitioned set. Nevertheless, its connection to the sharpness of the loss function in parameter space remains unclear. To address this gap, we establish a connection between sharpness and robustness in Theorem 3.4. Notably, this interplay has implications not only for OOD but also for in-distribution generalization.

Theorem 3.4. Assume that for any $A$, the loss function $\ell_{\hat{w},A}(z)$ w.r.t. the sample $z$ satisfies $L$-Hessian Lipschitz continuity (refer to Definition B.2) within every set $C_i$, $i \in [K]$. Let $z_i(A) = \arg\max_{z \in C_i \cap S} \ell_{\hat{w},A}(z)$. Define $\mathcal{M}_i$ to be the set of global minima in $C_i$, and suppose there exists $z_i^*(A) \in \mathcal{M}_i$ such that, for some $\rho_i(L) > 0$, $\|z_i(A) - z_i^*(A)\| \le \rho_i(L)/L$ almost surely. Let $\rho_{\max}(L) = \max\{\rho_i(L), i \in [K]\}$. Then, for $\|x\|^2 \le R(d)$ and some $\tilde{n} \le n$, $\tilde{n} \in \mathbb{N}_+$, with probability at least $p$ over $\{a_i\}_{i=1}^m \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ (the explicit form of $p$, which involves $\frac{2}{\pi}\arccos(\cdot)$ and $R(d)$, is given in Appendix D), we have
$$\epsilon(S, A) \le \frac{\rho_{\max}(L)^2}{\tilde{n}} \, \kappa(\hat{w}, S, A) + \frac{4\rho_{\max}(L)^3}{3}. \tag{7}$$

Remark. Given the training set $S$, we can estimate a factor $\hat{n}$ such that $\tilde{n} \ge \hat{n}$ by comparing the maximum Hessian norm w.r.t. $z_j$ to the sum of all the Hessian norms over $\{z_i\}_{i \in [n]}$. Note that the smoothness condition only applies within every partitioned set (locally), which is much weaker than requiring the loss function to satisfy it globally. We also discuss the difference between our results and Petzka et al. (2021) in Appendix F. The family of loss functions to which our theorem applies can be found in Appendix B.1.

Corollary 3.5. Let $\hat{w}_{\min}$ be the minimum value of $|\hat{w}|$. Suppose $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $\partial^2 \ell(f(\hat{w}, A, x), y)/\partial f^2$ is bounded in $[-M_1, M_2]$. If $m = \mathrm{Poly}(d)$, $d > 2$, and $\rho_{\max}(L) < (\hat{w}_{\min}^2 \sigma(d, m) - M_1)/(2d)$, then taking the expectation over all $x_j \in S$, $j \in [n]$, and all $a_i \in A \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $i \in [m]$, we have
$$\mathbb{E}_{S,A}\big[\epsilon(S, A)\big] \le \mathbb{E}_{S,A}\Big[\frac{7\rho_{\max}(L)^2}{n} \big(\kappa(\hat{w}, S, A) + M_2\big)\Big], \tag{8}$$
where $\sigma(d, m) = \mathbb{E}_{a \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))} \lambda_{\min}\big(\sum_{i=1}^m a_i a_i^\top G_{ii}\big) > 0$ is the minimum eigenvalue and $G_{ii}$ is the product constant of Gegenbauer polynomials (the definition can be found in Appendix B).

See the proof in Appendix D. From Theorem 3.4 we can see that the robustness constant $\epsilon(S, A)$ is (pointwise) upper bounded by the sharpness of the learned model, as measured by the quantity $\kappa(\hat{w}, S, A)$, and by the parameter $\rho_{\max}(L)$. It should be noted that $\rho_{\max}(L)$ depends on the partition: as the number of partitions increases, the region $\rho_{\max}(L)$ of the input domain becomes smaller, thereby making sharpness the dominant term in reflecting the model's robustness within each small local area. In Corollary 3.5, we show a stronger connection when the partition satisfies some conditions. Overall, this bound states that the larger the sharpness $\kappa(\hat{w}, S, A)$ of the model, the larger the upper bound on the robustness parameter $\epsilon(S, A)$. This result aligns with the intuition that a sharper model is more prone to overfitting the training domain and is less robust in an unseen domain. While the dependency is not exact, it can still be regarded as an alternative approach that avoids explicit computation of the intractable robustness term. By substituting this upper bound on $\epsilon(S, A)$ into Theorem 3.1, we derive a sharpness-based OOD generalization bound. This implies that the OOD generalization error will, with high probability, be small if the learned model is flat enough.
Unlike existing works, our generalization bound provides more information about how optimization properties influence performance when generalizing to OOD data. It bridges the gap between robustness and sharpness and can also be generalized to non-OOD learning problems. Moreover, we provide a better theoretical grounding for the empirical observation that a flat minimum improves domain generalization (Cha et al., 2021) by pinpointing a clear dependence on sharpness.

3.3 CASE STUDY

To better demonstrate the relationship between sharpness and robustness, we provide two specific examples: (1) linear ridge regression; (2) two-layer diagonal neural networks for classification.

Example 3.6. In ridge regression models, $\epsilon(S, A)$ has an inverse relationship to the regularization parameter $\beta$: as $\beta$ increases, the minimum is likely to be flatter (smaller $\kappa$) and the learned model less sensitive (smaller $\epsilon$). Following the previous notation, there exists $c_1 > 0$ such that $\epsilon(S, A) \le c_1 \kappa(\hat{\theta}, S) + o_d$, where $o_d$ has a smaller order than $\kappa(\hat{\theta}, S)$ for large $d$ (proof in Appendix E.1).

As suggested in Ali et al. (2019), let us consider a generic response model $y \mid \theta^* \sim \mathcal{N}(X\theta^*, \sigma^2 I)$, where $X \in \mathbb{R}^{n \times d}$, $d > n$. The least-squares empirical minimizer of the ridge regression problem is
$$\hat{\theta} = \arg\min_\theta \frac{1}{2n}\|X\theta - y\|^2 + \frac{\beta}{2}\|\theta\|^2 = \big(X^\top X + n\beta I_d\big)^{-1} X^\top y = \frac{1}{n} M^{-1} X^\top y. \tag{9}$$
Let $S$ be the training set. It is straightforward to obtain the sharpness of a quadratic loss function:
$$\kappa(\hat{\theta}, S) = \|\hat{\theta}\|^2 \, \mathrm{tr}\big(X^\top X / n + \beta I_d\big) = \|\hat{\theta}\|^2 \, \mathrm{tr}(M). \tag{10}$$
Both of the above equations depend on the same matrix $M = X^\top X / n + \beta I_d$. For fixed training samples $X$, we have $\kappa(\hat{\theta}, S) = O(\beta^{-1})$ in the limit of large $\beta$. It is then clear that a higher penalty $\beta$ leads to a flatter minimum. This intuition is rigorously proven in Appendix E.1. According to Theorem 3.1 and Theorem 3.4, a flatter minimum is likely associated with a lower robustness constant $\epsilon(S, A)$, and thus enjoys a lower OOD generalization error gap. In ridge regression, this phenomenon is reflected by the regularization coefficient $\beta$: in general, the larger $\beta$ is, the lower the sharpness $\kappa(\hat{\theta}, S)$ and the variance are. As a consequence, a larger $\beta$ learns a more robust model, resulting in a lower OOD generalization error gap. This is verified later in the distributional shift experiments shown in Figure 2; a minimal numerical sketch follows.
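The following minimal numpy sketch of Example 3.6 computes the closed-form ridge solution of (9) and the sharpness (10) for several regularization strengths; the data-generating choices (dimensions, Gaussian design, the particular β grid) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                   # overparameterized: d > n
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
y = X @ theta_star

for beta in [1.0, 10.0, 100.0, 1000.0]:
    # Closed-form ridge minimizer of (9).
    theta_hat = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)
    # Sharpness of the quadratic objective, eq. (10): ||theta||^2 * tr(X^T X / n + beta I).
    M = X.T @ X / n + beta * np.eye(d)
    kappa = np.linalg.norm(theta_hat) ** 2 * np.trace(M)
    print(f"beta={beta:7.1f}  kappa={kappa:10.4f}")
# For large beta, ||theta_hat||^2 shrinks roughly like 1/beta^2 while tr(M) grows
# only linearly in beta, so kappa decays roughly like 1/beta, matching the
# O(1/beta) behaviour discussed in Example 3.6.
```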
Example 3.7. We consider a classification problem using a 2-layer diagonal linear network with the exp-loss. The robustness $\epsilon(S, A)$ has a relationship similar to Theorem 3.4: given the training set $S$, after $t > T_\epsilon$ iterations there exists $c_2 > 0$ such that $\epsilon(S, A) \le c_2 \sup_{t \ge T_\epsilon} \kappa(\theta(t), S)$.

In addition to the regression and linear models, we obtain a similar relationship for 2-layer diagonal linear networks, which are commonly used in the kernel and rich regimes as well as in intermediate settings (Moroshko et al., 2020). Example 3.7 demonstrates that the relationship also holds when the model is well trained, even though the exp-loss does not satisfy the PŁ condition. By extending our theorems to these more complex frameworks, we go beyond our initial assumptions and offer insights into broader applications. Later experiments on non-linear NNs also support our statements. However, we still need a unified theorem for general function classes with fewer assumptions.

4 RELATED WORK

Despite various methods (Sun & Saenko, 2016; Sagawa et al., 2019; Shi et al., 2021; Shafieezadeh Abadeh et al., 2015; Li et al., 2018; Cha et al., 2021; Du et al., 2020; Zhang et al., 2022) that have been proposed to overcome the poor generalization brought by unknown distribution shifts, the underlying principles and theories still remain underexplored. As pointed out in Redko et al. (2020); Miller et al. (2021), different tasks that address distributional shifts, such as domain adaptation, OOD, and domain generalization, are collectively referred to as "transfer transductive learning" and share similar generalization theories. In general, the desired generalization bound is split into an In-Distribution/Domain (ID) generalization error and an Out-of-Distribution/Domain (OOD) distance. Since Blitzer et al. (2007) established a VC-dimension-based framework to estimate the domain shift gap by a divergence term, many subsequent works have made efforts to improve this term over the following decades, such as Discrepancy (Mansour et al., 2009), Wasserstein measurement (Courty et al., 2017; Shen et al., 2018), Integral Probability Metrics (IPM) (Zhang et al., 2019b; Ye et al., 2021), and β-divergence (Germain et al., 2016). Among them, new generalization tools like PAC-Bayes, Rademacher Complexity, and Stability have also been applied. However, few of them discuss how sharpness reacts to data distributional shifts. Beyond this canonical framework, Ye et al. (2021) reformulate the OOD generalization problem and provide a generalization bound using the concepts of "variation" and "informativeness." The causal framework proposed in Peters et al. (2017); Rojas-Carulla et al. (2018) focuses on the impact of interventions on robust optimization over test distributions. However, none of these frameworks consider the optimization process of a model and how it affects OOD generalization. Inspired by previous investigations of the effect of sharpness on ID generalization (Lyu et al., 2022; Petzka et al., 2021), recent work in Cha et al. (2021) found that flatter minima can also improve OOD generalization. Nevertheless, they lack a sufficient theoretical foundation for the relationship between the "sharpness" of a model and OOD generalization, and end with a union bound over Blitzer et al. (2007)'s result. In this paper, we aim to provide a rigorous examination of this relationship.

5 EXPERIMENTS

In light of space constraints, we present only a portion of our experimental results to support the validity of our theorems and findings. For comprehensive results, please refer to Appendix G.

5.1 RIDGE REGRESSION UNDER DISTRIBUTIONAL SHIFT

Following Duchi & Namkoong (2021), we investigate ridge regression under distributional shift. We randomly generate $\theta_0^* \in \mathbb{R}^d$ on the sphere, and generate data from the following process: $X \overset{iid}{\sim} \mathcal{N}(0, 1)$, $y = X\theta_0^*$. To simulate distributional shift, we randomly generate a unit vector $\theta_0^{*\perp}$ perpendicular to $\theta_0^*$. Let $\theta_0^*, \theta_0^{*\perp}$ be the basis vectors; the shifted ground truth is then computed from this basis as $\theta_\alpha^* = \theta_0^* \cos(\alpha) + \theta_0^{*\perp} \sin(\alpha)$. For the source domain, we use $\theta_0^*$ as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. By minimizing the objective function in (9), we obtain the empirical optimum $\hat{\theta}$. Then we gradually shift the distribution by increasing $\alpha$ to obtain different target domains; a data-generation sketch is given below.
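The sketch below reproduces this data-generating protocol in numpy: a source task at α = 0, a closed-form ridge solution fit on the source, and test losses evaluated on targets whose ground truth θ*_α is rotated by α; the specific dimension, sample sizes, and β are illustrative assumptions (the paper uses gradient descent rather than the closed form).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, beta = 30, 50, 500, 1.0       # illustrative sizes

# Orthonormal pair (theta_0, theta_perp) spanning the rotation plane.
theta_0 = rng.standard_normal(d); theta_0 /= np.linalg.norm(theta_0)
v = rng.standard_normal(d)
theta_perp = v - (v @ theta_0) * theta_0
theta_perp /= np.linalg.norm(theta_perp)

# Source domain (alpha = 0): fit ridge regression, objective (9).
X_tr = rng.standard_normal((n_train, d))
y_tr = X_tr @ theta_0
theta_hat = np.linalg.solve(X_tr.T @ X_tr + n_train * beta * np.eye(d), X_tr.T @ y_tr)

# Target domains: ground truth rotated by alpha, theta_alpha = cos(a) theta_0 + sin(a) theta_perp.
for alpha in np.linspace(0.0, np.pi, 7):
    theta_alpha = np.cos(alpha) * theta_0 + np.sin(alpha) * theta_perp
    X_te = rng.standard_normal((n_test, d))
    y_te = X_te @ theta_alpha
    test_loss = np.mean((X_te @ theta_hat - y_te) ** 2)
    print(f"alpha={alpha:4.2f} rad  test MSE={test_loss:7.4f}")
# The test loss grows with alpha and peaks near alpha = pi (about 3 rad),
# the maximal shift, mirroring Figure 2.
```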
As the distribution shifts, the test loss $\ell(\hat{\theta}, y_\alpha)$ increases. As shown in Figure 2, the test loss culminates at around 3 rad due to the maximal distribution shift. Comparing different levels of regularization, we find that a larger L2-penalty $\beta$ yields a lower OOD generalization error, shown as the darker purple lines. This plot bears out our intuition from the previous section. As stated in the aforementioned case, the sharpness of ridge regression should depend inversely on $\beta$. Correspondingly, we compute the sharpness using definition equation (4), averaged over ten different runs. For each trial, we use the same training and test data for every $\beta$. The sharpness of each ridge regressor is shown in the legend of Figure 2. As we can see, a larger $\beta$ leads to lower sharpness.

Figure 2: OOD test losses increase along the distributional shift. The X-axis is the shifting angle $\alpha$ and the Y-axis is the test loss of the model trained on the distribution with $\alpha = 0$. Lines are average test losses and shaded regions are variances over 10 trials. Larger regularization $\beta$ (darker color) causes a lower increase in test loss and smaller sharpness.

5.2 SHARPER MINIMA HURT OOD GENERALIZATION

In our results, we proved that the upper bound on the OOD generalization error involves the sharpness of the trained model. Here we empirically verify this theoretical insight. We follow the experimental setting of DomainBed (Gulrajani & Lopez-Paz, 2021). To easily compute the sharpness, we choose a 4-layer MLP on the Rotated MNIST dataset, where Rotated MNIST is a rotation of the MNIST handwritten digit dataset (LeCun, 1998) with angles ranging over [0°, 15°, 30°, 45°, 60°, 75°]. In this codebase, each environment refers to selecting one domain (a specific rotation angle) as the test domain/OOD test set while training on all other domains. After obtaining the trained model of each environment, we compute the sharpness using all domain training sets based on the implementation of Petzka et al. (2021). To this end, we plot the performances of Empirical Risk Minimization (ERM), SWAD (Cha et al., 2021), Mixup (Yan et al., 2020), and GroupDRO (Sagawa et al., 2019) with 6 seeds each. Then we measure the sharpness of all these minima. Figure 3 shows the relationship between model sharpness and out-of-domain accuracy. The tendency is clear: flat minima give better OOD performance. In general, different environments cannot be plotted together due to different training sets. However, we found that the middle 4 environments are similar tasks and thus plot them together for a clearer trend. In addition, different algorithms lead to different feature scales, which may affect the scale of the sharpness. To address this, we align their scales when putting them together. For more individual results, please refer to Figure 8 in the appendix.

Figure 3: The relationship between out-of-domain test accuracy and model sharpness on the Rotated MNIST dataset. Here we show 4 different OOD environments: 15°, 30°, 45°, 60° rotations as the OOD test set, respectively. Each marker denotes a minimum of an algorithm with a specific seed. The marker style indicates the models trained in the same environment.
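For models larger than the ridge example, forming the full Hessian as in the earlier sketch is impractical; a common workaround, which is an assumption on our part rather than the exact routine of Petzka et al. (2021), is to estimate tr(H) with Hutchinson-style Rademacher probes and double backpropagation, as sketched below for an arbitrary PyTorch model.

```python
import torch

def sharpness_estimate(model, loss_fn, data_loader, n_probes=10):
    """Estimate kappa = ||w||^2 * tr(H) of Definition 3.3 for `model`,
    where H is the Hessian of the empirical loss w.r.t. all parameters.
    tr(H) is approximated by Hutchinson's estimator E_v[v^T H v] with
    Rademacher probes v, computed via double backprop."""
    params = [p for p in model.parameters() if p.requires_grad]
    w_sq_norm = sum((p.detach() ** 2).sum() for p in params)

    # Empirical loss over the (small, illustrative) loader.
    loss = sum(loss_fn(model(x), y) for x, y in data_loader)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    trace = 0.0
    for _ in range(n_probes):
        vs = [torch.randint(0, 2, g.shape, device=g.device).float() * 2 - 1 for g in grads]
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        trace += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
    trace /= n_probes
    return (w_sq_norm * trace).item()
```

Restricting `params` to the last fully-connected layer gives the last-layer variant used for the larger-scale experiments in Appendix A.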
5.3 COMPARISON OF GENERALIZATION BOUNDS

To analyze our generalization bounds, we follow the toy example experiments in Sagawa et al. (2020). In this experiment, the distribution shift terms and generalization error terms can be explicitly computed. Furthermore, their synthetic experiment considers spurious correlation across distribution shifts, which is now a common formulation of OOD generalization (Wald et al., 2021; Aubin et al., 2021; Yao et al., 2022). Consider data $x = [x_{\mathrm{core}}, x_{\mathrm{spu}}]$ consisting of two features: a core feature and a spurious feature. The features are generated from the following rule:
$$x_{\mathrm{core}} \mid y \sim \mathcal{N}\big(y \mathbf{1}, \sigma_{\mathrm{core}}^2 I_d\big), \qquad x_{\mathrm{spu}} \mid a \sim \mathcal{N}\big(a \mathbf{1}, \sigma_{\mathrm{spu}}^2 I_d\big),$$
where $y \in \{-1, 1\}$ is the label and $a \in \{-1, 1\}$ is the spurious attribute. Data with $y = a$ form the majority group of size $n_{\mathrm{maj}}$, and data with $y = -a$ form the minority group of size $n_{\mathrm{min}}$. The total number of training points is $n = n_{\mathrm{maj}} + n_{\mathrm{min}}$. The spurious correlation probability $p_{\mathrm{maj}} = \frac{n_{\mathrm{maj}}}{n}$ defines the probability of $y = a$ in the training data. In testing, we always have $p_{\mathrm{maj}} = 0.5$. The metric, worst-group error (Sagawa et al., 2019), is defined as
$$\mathrm{Err}_{\mathrm{wg}}(w) := \max_{i \in [4]} \mathbb{E}_{x, y \mid g_i}\big[\ell_{0\text{-}1}(w; (x, y))\big],$$
where $\ell_{0\text{-}1}$ is the 0-1 loss in binary classification. Here we compare the robustness of our proposed OOD generalization bound and the baseline in Proposition 2.1. We also give comparisons to other baselines, such as the PAC-Bayes DA bound, in Appendix G. The sketch below instantiates this data-generating process.
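A minimal numpy instantiation of the Sagawa et al. (2020)-style generator and of the worst-group 0-1 error is sketched below; the classifier (plain logistic regression trained by gradient descent) and the particular constants are illustrative assumptions rather than the exact experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                                        # dimension of each feature block

def make_data(n, p_maj, s_core=10.0, s_spu=1.0):
    """x = [x_core, x_spu]; the majority group has a = y, the minority a = -y."""
    y = rng.choice([-1, 1], size=n)
    a = np.where(rng.random(n) < p_maj, y, -y)
    x_core = y[:, None] + s_core * rng.standard_normal((n, d))
    x_spu = a[:, None] + s_spu * rng.standard_normal((n, d))
    return np.hstack([x_core, x_spu]), y, a

def worst_group_error(w, X, y, a):
    """Worst-group 0-1 error over the four (y, a) groups."""
    errs = []
    for yy in (-1, 1):
        for aa in (-1, 1):
            g = (y == yy) & (a == aa)
            if g.any():
                errs.append(np.mean(np.sign(X[g] @ w) != yy))
    return max(errs)

# Train logistic regression by gradient descent on a p_maj = 0.9 training set.
X_tr, y_tr, _ = make_data(n=500, p_maj=0.9)
w = np.zeros(2 * d)
for _ in range(2000):
    grad = -(y_tr / (1 + np.exp(y_tr * (X_tr @ w)))) @ X_tr / len(y_tr)
    w -= 0.01 * grad

# Evaluate on a test set without spurious correlation (p_maj = 0.5).
X_te, y_te, a_te = make_data(n=4000, p_maj=0.5)
print("worst-group test error:", worst_group_error(w, X_te, y_te, a_te))
```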
Figure 4: Spurious-feature synthetic experiment. Each dot represents a trained model. The dashed curves are smoothed functions fit to the test data points. The baseline is Proposition 2.1. (a), (d): the generalization error of the logistic regression models with increasing model size/correlation probability. (b): concentration error term in the domain shift bound. (e): comparison of distribution distance bounds. (c), (f): comparisons of generalization bounds. Note that model size > 500 is the overparameterized regime. The further the correlation probability is from 0.5, the greater the distributional shift.

Along model size. We plot the generalization error of the random-feature logistic regression as the model size increases in Figure 4(a). In this experiment, we follow the hyperparameter setup of Sagawa et al. (2020) by setting the number of points $n = 500$, data dimension $2d = 200$ with 100 for each feature, majority fraction $p_{\mathrm{maj}} = 0.9$, and noises $\sigma_{\mathrm{spu}}^2 = 1$, $\sigma_{\mathrm{core}}^2 = 100$. The worst-group error turns out to be nearly the same as the model size increases. However, in Figure 4(b), the error term in the domain shift bound of Proposition 2.1 keeps increasing as the model size increases. In contrast, our domain shift bound of order K is independent of the model size, which addresses the limitation of their bound. We follow Kawaguchi et al. (2022) to compute K as an inverse image of the ϵ-covering in a randomly projected space (see details in the appendix). We set the same value K = 1,000 in our experiment. Different from the baseline, K is data-dependent and leads to a constant concentration error term as the model size increases. Analogously, our OOD generalization bound does not explode as the model size increases (shown in Figure 4(c)).

Along distribution shift. In addition, we are interested in characterizing OOD generalization when the test distribution shifts from the training distribution, by varying the correlation probability $p_{\mathrm{maj}}$ during data generation. As shown in Figure 4(d), when $p_{\mathrm{maj}} = 0.5$ there is no distributional shift between training and test data, since no spurious feature is correlated with the training labels. Thus, as $p_{\mathrm{maj}} < 0.5$ increases toward 0.5, the training and test distributions align more and more closely, resulting in an initial decrease in the test error of the worst-case group. However, as $p_{\mathrm{maj}} > 0.5$ deviates from 0.5, the spurious features induce a shift in the distribution. This deviation is likely to impact the worst-case group differently, leading to an increase in the test error. As displayed in Figure 4(e) and Figure 4(f), our distribution distance and generalization bound capture the distribution shifts while being tighter than the baseline.

6 CONCLUSION

In this paper, we provide a more interpretable and informative theory to understand Out-of-Distribution (OOD) generalization. Based on the notion of robustness, we propose a robust OOD bound that effectively captures algorithmic robustness in the presence of shifting data distributions. In addition, our in-depth analysis of the relationship between robustness and sharpness further illustrates that sharpness has a negative impact on generalization. Overall, our results advance the understanding of OOD generalization and the principles that govern it.

7 ACKNOWLEDGEMENT

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-001-2A), (AISG Award No: AISG2-TC-2023-010-SGIL) and the Singapore Ministry of Education Academic Research Fund Tier 1 (Award No: T1 251RES2207). We would appreciate the contributions of Fusheng Liu, Ph.D. student at NUS.

REFERENCES

Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:3438-3450, 2021.

Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1370-1378. PMLR, 2019.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Benjamin Aubin, Agnieszka Słowik, Martin Arjovsky, Leon Bottou, and David Lopez-Paz. Linear unit-tests for invariance discovery. arXiv preprint arXiv:2102.10867, 2021.

Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging, 2018.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.

John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. Advances in Neural Information Processing Systems, 20, 2007.

Jean-Christophe Bourin, Eun-Young Lee, and Minghua Lin. Positive matrices partitioned into a small number of hermitian blocks. Linear Algebra and its Applications, 438(5):2591-2598, 2013.

Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. SWAD: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405-22418, 2021.

Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 2017.

Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabás Póczos. Hypothesis transfer learning via transformation functions. Advances in Neural Information Processing Systems, 30, 2017.

Yingjun Du, Xiantong Zhen, Ling Shao, and Cees GM Snoek. MetaNorm: Learning to normalize few-shot batches across domains. In International Conference on Learning Representations, 2020.

John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378-1406, 2021.

Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 1997.

Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.

Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning. PMLR, 2013.

Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A new PAC-Bayesian perspective on domain adaptation. In International Conference on Machine Learning. PMLR, 2016.

Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

Kenji Kawaguchi, Zhun Deng, Kyle Luh, and Jiaoyang Huang. Robustness implies generalization via data-dependent generalization bounds. In International Conference on Machine Learning. PMLR, 2022.

Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB, volume 4, pp. 180-191. Toronto, Canada, 2004.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637-5664. PMLR, 2021.

Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. 2020.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542-5550, 2017.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. Heterogeneous risk minimization. In International Conference on Machine Learning, pp. 6804-6814. PMLR, 2021.

Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. In Advances in Neural Information Processing Systems, 2022.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667-766, 2022.

John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning. PMLR, 2021.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

Edward Moroshko, Blake E Woodworth, Suriya Gunasekar, Jason D Lee, Nati Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Advances in Neural Information Processing Systems, 33:22182-22193, 2020.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.

Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. Advances in Neural Information Processing Systems, 34:18420-18432, 2021.

Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256-1272, 2021.

Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pp. 18347-18377. PMLR, 2022.

Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv preprint arXiv:2004.11829, 2020.

Herbert Robbins. A remark on Stirling's formula. The American Mathematical Monthly, 62(1):26-29, 1955.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 2018.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In International Conference on Machine Learning. PMLR, 2020.

Soroosh Shafieezadeh Abadeh, Peyman M Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. Advances in Neural Information Processing Systems, 2015.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.

Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443-450. Springer, 2016.
Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Advances in Neural Information Processing Systems, 2021.

Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre-Alvise Rebuffi, Ira Ktena, Krishnamurthy Dj Dvijotham, and Ali Taylan Cemgil. A fine-grained analysis on distribution shift. In International Conference on Learning Representations, 2022.

Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391-423, 2012.

Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.

Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. arXiv preprint arXiv:2201.00299, 2022.

Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 2021.

Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1524-1534. PMLR, 2019a.

Xingxuan Zhang, Linjun Zhou, Renzhe Xu, Peng Cui, Zheyan Shen, and Haoxin Liu. Towards unsupervised domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4910-4920, 2022.

Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning. PMLR, 2019b.

Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in Neural Information Processing Systems, 31, 2018.

Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pp. 4140-4149. PMLR, 2017.

A ADDITIONAL EXPERIMENTS

A.1 SHARPNESS VS. OOD GENERALIZATION ON PACS AND WILDS-CAMELYON17

Figure 5: The relationship between out-of-distribution (OOD) test accuracy on the test environment and model sharpness (of the last FC layer) on the WILDS-Camelyon17 dataset. Each marker denotes a model trained using ERM with different seeds and hyperparameters.

Figure 6: The relationship between OOD test accuracy and model sharpness on the PACS dataset. Each marker denotes a model trained using ERM with different seeds and hyperparameters. The marker style shows the out-of-distribution test environment of the model.

To evaluate our theorem more deeply, we examine the relationship between our defined sharpness and OOD generalization error on larger-scale real-world datasets, WILDS-Camelyon17 (Bandi et al., 2018; Koh et al., 2021) and PACS (Li et al., 2017). The WILDS-Camelyon17 dataset includes 455,954 tumor and normal tissue slide images from five hospitals (environments). One of the hospitals is assigned as the test environment by the dataset publisher. Distribution shift arises from variations in patient population, slide staining, and image acquisition.
The PACS dataset contains 9,991 images of 7 objects in 4 visual styles (environments): art painting, cartoon, photo, and sketch. Following the common setup in Gulrajani & Lopez-Paz (2021), each environment is used as a test environment in turn. We follow the practice in Petzka et al. (2021) to compute the sharpness using the Hessian matrix of the last Fully-Connected (FC) layer of each model. For the WILDS-Camelyon17 dataset, we test the sharpness of 18 ERM models trained with different random seeds and hyperparameters. Figure 5 shows the result. For the PACS dataset, we run 60 ERM models with different random seeds and hyperparameters for each test environment. To obtain a clearer correlation, we align the points from the 4 environments by their mean performance. Figure 6 shows the result. From the two figures, we can observe a clear correlation between sharpness and out-of-distribution (OOD) accuracy: sharpness tends to hurt the OOD performance of the model. The result is consistent with what we report in Figure 3, and shows that the correlation between sharpness and OOD accuracy can also be observed on large-scale datasets.

B NOTATIONS AND DEFINITIONS

Notations. We use $[n]$ to denote the integer set $\{i\}_{i=1}^n$. $\|\cdot\|$ represents the $\ell_2$-norm $\|\cdot\|_2$ for short. Without loss of generality, we use $\ell(f(\theta, x), y)$ for the loss function of model $f_\theta$ on the data pair $z = (x, y)$, which is denoted as $\ell_\theta(z)$, and we use $n, d$ for the training set size and input dimension. Note that we generally follow the notations of the original papers.

- $L_S, L_T$: expected risks of the source and target domains, respectively. The corresponding empirical versions are $\hat{L}_S, \hat{L}_T$.
- $\{C_i\}_{i=1}^K$: the $K$ partitions of the sample space, where $C_i$ denotes each partitioned set.
- $D_S, D_T$: distributions of the source and target domains. Their sampled datasets are denoted $\hat{D}_S, \hat{D}_T$ accordingly.
- $\theta$: in our setting, $\theta = (w, \{a_i\}_{i=1}^m)$ denotes the model parameters, where $w$ is the trainable parameter and $\{a_i\}_{i=1}^m$ are the random features (we also use $A = [a_1, \dots, a_m]$ as shorthand in many places). $\hat{w}$ is the minimizer of the empirical loss.
- $M$: upper bound of the loss function.
- $S = \{(x_i, y_i)\}_{i=1}^n$ / $X = [x_1, \dots, x_n]$: training data of size $n$.

Definition B.1 (Robustness, Xu & Mannor (2012)). A learning algorithm $\mathcal{A}$ on training set $S$ is $(K, \epsilon(\cdot))$-robust, for $K \in \mathbb{N}$, if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_k\}_{k=1}^K$, such that for all $s \in S$, $z \in \mathcal{Z}$,
$$s, z \in C_k,\ k \in [K] \ \Longrightarrow\ |\ell(\mathcal{A}_S, s) - \ell(\mathcal{A}_S, z)| \le \epsilon(S, A).$$

Definition B.2 (Hessian Lipschitz continuity). A twice differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ has an $L_i$-Lipschitz continuous Hessian on the domain $C_i$ if, for all vectors $x, y \in C_i$,
$$\|\nabla^2 f(y) - \nabla^2 f(x)\| \le L_i \|y - x\|,$$
where $L_i > 0$ depends on the input domain $C_i$ and $\|\cdot\|$ is the $\ell_2$ norm. Then, over all $K$ domains $\cup_{i=1}^K C_i$, let $L := \max\{L_i \mid i \in [K]\}$ be the uniform Lipschitz constant, so that
$$\|\nabla^2 f(y) - \nabla^2 f(x)\| \le L_i \|y - x\| \le L \|y - x\|, \quad \forall i \in [K],\ \forall (x, y) \in C_i,$$
which is uniformly bounded by $L$.

Lemma B.3 (Hessian Lipschitz Lemma). If $f$ is twice differentiable and has an $L$-Lipschitz continuous Hessian, then
$$\Big| f(y) - f(x) - \langle \nabla f(x), y - x \rangle - \frac{1}{2} \big\langle \nabla^2 f(x)(y - x), (y - x) \big\rangle \Big| \le \frac{L}{6} \|y - x\|^3.$$

Gegenbauer Polynomials. We briefly define Gegenbauer polynomials here; details can be found in the Appendix of Mei & Montanari (2022). First, we denote $\mathbb{S}^{d-1}(r) = \{x \in \mathbb{R}^d : \|x\| = r\}$, the sphere of radius $r$, which is a $(d-1)$-manifold. Let $\tau_d$ be the uniform probability measure on $\mathbb{S}^{d-1}(\sqrt{d})$, and denote the inner product in the functional space $L^2([-\sqrt{d}, \sqrt{d}], \mu_d)$ as $\langle \cdot, \cdot \rangle_{L^2}$, where
$$\langle f, g \rangle_{L^2} := \int_{-\sqrt{d}}^{\sqrt{d}} f(x) g(x)\, \mu_d(dx).$$
For any function $\sigma \in L^2([-\sqrt{d}, \sqrt{d}], \tau_d)$, where $\tau_d$ is the distribution of $\langle x, y \rangle / \sqrt{d}$ for $x, y \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, the orthogonal basis $\{Q_t^{(d)}\}$ forms the Gegenbauer polynomials of degree $t$ ($t \ge 0$), and its spherical-harmonics coefficients $\lambda_{d,t}(\sigma)$ can be expressed as
$$\lambda_{d,t}(\sigma) = \int_{[-\sqrt{d}, \sqrt{d}]} \sigma(x)\, Q_t^{(d)}(\sqrt{d}\, x)\, \tau_d(dx);$$
then the Gegenbauer expansion
$$\sigma(x) = \sum_{t=0}^{\infty} \lambda_{d,t}(\sigma)\, N_{d,t}\, Q_t^{(d)}(\sqrt{d}\, x)$$
holds in the $L^2([-\sqrt{d}, \sqrt{d}], \tau_d)$ sense, where $N_{d,t}$ is a normalization factor depending on the norm of the input.

B.1 ASSUMPTIONS

We discuss and list all assumptions used in our theorems. The purpose is to offer clarity regarding the specific assumptions required for each theorem and to ensure that the assumptions made in our theorems are well-founded and reasonable, reinforcing the validity and reliability of our results.

OOD Generalization. (Setting): Given a full sample space $\mathcal{Z}$, the source and target distributions are two different measures over this whole sample domain $\mathcal{Z}$. The purpose is to study robust algorithms in the OOD generalization setting. (Assumptions): For any sample $z \in \mathcal{Z}$, the loss function is bounded, $\ell_\theta(z) \in [0, M]$. This assumption generally follows the original paper Xu & Mannor (2012). While it is possible to relax this assumption and derive improved bounds, our primary objective is to formulate a framework for robust OOD generalization and establish a clear connection with the optimization properties of the model.

Robustness and Sharpness. (Setting): In order to give a fine-grained analysis, we follow the common choice of a two-layer ReLU neural network function class, which is widely analyzed in the literature, e.g., the Neural Tangent Kernel, the non-kernel (rich) regime (Moroshko et al., 2020), and random feature models. Among them, we select the following random feature model as our function class:
$$\Big\{ f(w, A, x) \triangleq \frac{1}{\sqrt{m}} \sum_{i=1}^{m} w_i \, \sigma(\langle x, a_i \rangle) : w_i \in \mathbb{R},\ a_i \sim \mathrm{Unif}\big(\mathbb{S}^{d-1}(\sqrt{d})\big),\ i \in [m] \Big\},$$
where $m$ is the hidden size and $A = [a_1, \dots, a_m]$ contains random vectors uniformly distributed on the hypersphere of radius $\sqrt{d}$, whose surface is a $(d-1)$-manifold. $\sigma(a^\top x) = (a^\top x)\, \mathbb{I}\{a^\top x \ge 0\}$ denotes the ReLU activation function, where $\mathbb{I}$ is the indicator function, and $w = [w_1, \dots, w_m]$ is the trainable parameter. We consider the common loss functions: (1) homogeneous losses in regression; (2) (binary) cross-entropy loss; (3) negative log-likelihood (NLL) loss.

(Assumptions): (i) Let $C_i$, $i \in [K]$, be any set from the whole partition $\cup_{i=1}^K C_i$. We assume that for all $z \in C_i$, the loss function $\ell_{\hat{w},A}(z)$ satisfies $L$-Hessian Lipschitz continuity for all $i \in [K]$ (see Definition B.2). Note that we only require this assumption to hold within each partition, instead of holding globally; this smoothness and convexity condition is essentially local convexity, which is a weak assumption for most function classes. (ii) Consider an optimization problem in each partition $C_i$, $i \in [K]$. Let one of the training points $z_i(A) \in S \cap C_i$ be the initial point and let $z_i^*(A) \in \mathcal{M}_i$ be the corresponding nearest local minimum, where $\mathcal{M}_i$ is the set of local minima of partition $C_i$. For some $\rho_i(L) > 0$, we assume $\|z_i(A) - z_i^*(A)\| \le \rho_i(L)/L$ holds a.s. This ensures that the Hessian norm $\|H(z_i(A))\|$ admits a lower bound; similar conditions and estimations can be found in Zhang et al. (2019a). (iii) To simplify the computation of probabilities, we assume $\xi_i = a_i^\top x$ obeys a rotationally invariant distribution. (iv) For Corollary 3.5, we make the additional assumption that the loss function $\ell(\cdot)$ satisfies a boundedness condition: for $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, the second derivative $\partial^2 \ell(f(\hat{w}, A, x), y)/\partial f^2$ with respect to its argument $f(\hat{w}, A, x)$ should be bounded in $[-M_1, M_2]$.
Note that we consider the case where the data $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, while $m = \mathrm{Poly}(d)$ ensures the positive definiteness of $\sum_{i=1}^m a_i a_i^\top \in \mathbb{R}^{d \times d}$ almost surely.

C PROOF OF THE DOMAIN SHIFT BOUND

Lemma C.1. Let $\hat{D}_T$ be the empirical distribution of size $n$ drawn from $D_T$. The loss $\ell$ is upper bounded by $M$. With probability at least $1 - \delta$ (over the choice of the samples), for every $f_\theta$ trained on $S$, we have
$$L_T(f_\theta) \le L_S(f_\theta) + M\, d_{(\epsilon,K)}(S, \hat{D}_T) + \epsilon(S) + 2M \sqrt{\frac{2K \ln 2 + 2\ln(1/\delta)}{n}},$$
where
$$\forall i \in [K],\quad n_i(S) := \#(z \in S \cap C_i), \qquad d_{(\epsilon,K)}(S, \hat{D}_T) := \sum_{i=1}^{K} \left| \frac{n_i(S)}{n} - \frac{n_i(\hat{D}_T)}{n} \right|,$$
and $n_i(S), n_i(\hat{D}_T)$ are the numbers of samples from $S$ and $\hat{D}_T$ that fall into the set $C_i$, respectively.

Proof. In the following generalization statement, we use $\ell(f_\theta, z)$ to denote the error obtained with input $z$ and hypothesis function $f_\theta$, for better illustration. By definition we have
$$L_T(f_\theta) - L_S(f_\theta) := \mathbb{E}_{z' \sim D_T} \ell(f_\theta, z') - \mathbb{E}_{z \sim D_S} \ell(f_\theta, z). \tag{13}$$
Then we make the $K$ partitions for the source distribution $D_S$. Let $n_i$ be the number of points falling into the partition $C_i$, where $(n_1, \dots, n_K)$ is a multinomial random vector with parameters $(p_s(C_1), \dots, p_s(C_K))$. We use the parallel notation for the target distribution $D_T$ with $(p_t(C_1), \dots, p_t(C_K))$. Since
$$\mathbb{E}_{z \sim D_S} \ell(f_\theta, z) = \sum_{i=1}^{K} \mathbb{E}_{z \sim D_S}\big(\ell(f_\theta, z) \mid z \in C_i\big)\, p_s(C_i), \qquad \mathbb{E}_{z' \sim D_T} \ell(f_\theta, z') = \sum_{i=1}^{K} \mathbb{E}_{z' \sim D_T}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\, p_t(C_i),$$
we have
$$\begin{aligned}
L_T(f_\theta) - L_S(f_\theta) &= \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\, p_t(C_i) - \mathbb{E}\big(\ell(f_\theta, z) \mid z \in C_i\big)\, p_s(C_i) \\
&= \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\big(p_t(C_i) - p_s(C_i)\big) + \sum_{i=1}^{K} \big[\mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big) - \mathbb{E}\big(\ell(f_\theta, z) \mid z \in C_i\big)\big]\, p_s(C_i) \\
&\le \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\big(p_t(C_i) - p_s(C_i)\big) + \epsilon(S, A).
\end{aligned}$$
Suppose we sample empirical distributions $S, \hat{D}_T$ of size $n$ each from $D_S$ and $D_T$, respectively; $(n_1, \dots, n_K)$ are the counts of samples in each $C_i$, and we use the parallel notation $n_i'$ for the target distribution. Recall
$$d_{(\epsilon,K)}(S, \hat{D}_T) := \sum_{i=1}^{K} \left| \frac{n_i(S)}{n} - \frac{n_i(\hat{D}_T)}{n} \right|.$$
Further, we have
$$\begin{aligned}
\sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\big(p_t(C_i) - p_s(C_i)\big)
&\le \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big) \left| \frac{n_i(S)}{n} - \frac{n_i(\hat{D}_T)}{n} \right| \\
&\quad + \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big) \left| p_t(C_i) - \frac{n_i(\hat{D}_T)}{n} \right| + \sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big) \left| p_s(C_i) - \frac{n_i(S)}{n} \right|.
\end{aligned}$$
With the Bretagnolle-Huber-Carol inequality, with probability at least $1 - \delta$,
$$\sum_{i=1}^{K} \left| \frac{n_i}{n} - p(C_i) \right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.$$
Integrating these two inequalities, we have
$$\sum_{i=1}^{K} \mathbb{E}\big(\ell(f_\theta, z') \mid z' \in C_i\big)\big(p_t(C_i) - p_s(C_i)\big) \le M\, d_{(\epsilon,K)}(S, \hat{D}_T) + 2M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.$$
In summary, with probability $1 - \delta$ we have
$$L_T(f_\theta) \le L_S(f_\theta) + M\, d_{(\epsilon,K)}(S, \hat{D}_T) + \epsilon(S) + 2M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}},$$
which completes the proof.

With this result on the domain (distribution) shift and the relationship between sharpness and robustness, we can move forward to the final OOD generalization error bound. First, we state the in-distribution robustness bound of Xu & Mannor (2012) as follows.

Lemma C.2 (Xu & Mannor (2012)). Assume that for all $h \in \mathcal{H}$ and $z \in \mathcal{Z}$, the loss is upper bounded by $M$, i.e., $\ell(h, z) \le M$. If the learning algorithm $\mathcal{A}$ is $(K, \epsilon(\cdot))$-robust, then for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. draw of $n$ samples $S = (z_i)_{i=1}^n$, it holds that
$$\mathbb{E}_z[\ell(\mathcal{A}_S, z)] \le \frac{1}{n} \sum_{i=1}^{n} \ell(\mathcal{A}_S, z_i) + \epsilon(S) + M \sqrt{\frac{2K \ln 2 + 2\ln(1/\delta)}{n}}.$$

As the conclusive result, we briefly prove the following by combining Lemma C.2 and Lemma C.1.

Theorem C.3 (Restatement of Theorem 3.1). Let $\hat{D}_T$ be the empirical distribution of size $n$ drawn from $D_T$. The loss $\ell$ is upper bounded by $M$. With probability at least $1 - \delta$ (over the choice of the samples), for every $f_\theta$ trained on $S$, we have
$$L_T(f_\theta) \le \hat{L}_S(f_\theta) + M\, d_{(\epsilon,K)}(S, \hat{D}_T) + 2\epsilon(S) + 3M \sqrt{\frac{2K \ln 2 + 2 \ln(2/\delta)}{n}}.$$

Proof.
Firstly, with Lemma C.1 and probability as least 1 δ LT (fθ) LS(fθ) + Md(ϵ,K)( ˆDS, ˆDT ) + 2M 2K ln 2 + 2 ln(2/δ) Secondly, with Lemma C.2 (Xu & Mannor (2012) Theorem 3) and probability as least 1 δ LS(fθ) ˆLS(fθ) ϵ(S) + M 2K ln 2 + 2 ln(2/δ) By taking the union bound, we conclude our final result that with probability at least 1 δ LT (fθ) ˆLS(fθ) + 3M 2K ln 2 + 2 ln(2/δ) n + 2ϵ(S) + Md(ϵ,K)(S, ˆDT ) (22) Here ϵ(S) is the robustness constant that we can replace with any sharpness measure. Published as a conference paper at ICLR 2024 C.1 PROOF TO COROLLARY 3.2 Definition C.4. d F F (DT ; DS) := 2 sup A(f) AF F | Pr DS(A(f)) Pr DT (A(f))| and F F is defined as: F F := {f(x) f (x) : f, f F} where is the XOR operator e.g. I(f (x) = f(x)). Lemma C.5 (Lemma 2 Zhao et al. (2018)). If Pdim(F) = d , then VCdim(F F) 2d . Proposition C.6 (Zhao et al. (2018)). Let F be a hypothesis class with pseudo dimension Pdim(F) = d . If b DS is the empirical distributions generated with n i.i.d.. samples from source domain, and b DT is the empirical distribution on the target domain generated from n samples without labels, then with probability at least 1 δ, for all f F, we have: LT (f) b LS(f) + E + 2d F F b DT ; b DS + 4 2d ln(2n) + ln 4 δ n | {z } (Empirical div Error) where E = b LS(f ) + b LT (f ) is the total error of best hypothesis f over source and target domain. Proof. With Lemma 4 (Zhao et al., 2018), we have LT (f) b LS(f) + 1 where E = inf f F LS(f ) + LT (f ). Lemma 6 (Zhao et al., 2018), which is actually Lemma 1 in (Ben-David et al., 2010), shows the following results d F F (DT ; DS) d F F b DT ; b DS + 4 VCdim(F F) ln(2n) + ln 2 As suggested in Zhao et al. (2018), VCdim(F F) is at most 2d . Further, with Theorem 2 (Ben David et al., 2010), we have at probability at least 1 δ LT (f) LS(f) + 1 2d F F b DT ; b DS + 4 VCdim(F F) ln(2n) + ln 2 2d F F b DT ; b DS + 4 2d ln(2n) + ln 2 Using in-domain generalization error Lemma 11.6 (Mohri et al., 2018), with probability at least 1 δ 2 the result is LS(f) b LS(f) + M δ 2n Note in Zhao et al. (2018), the M = 1 for the normalized regression loss. Combine them all, we conclude the proof. Corollary C.7. If K , M = 1, domain shift bound |LT (fθ) LS(fθ)| will be reduced to (Empirical div Error) in Proposition C.6 where |LT (fθ) LS(fθ)| 1 2d F F(DS; DT ) (Empirical div Error) (25) Published as a conference paper at ICLR 2024 Proof. According to Theorem C.3, we have LT (fθ) LS(fθ) = i E(ℓ(fθ, z )|z Ci)pt(Ci) E(ℓ(fθ, z)|z Ci)ps(Ci) (26) If K , let s define a domain that U := S i=1 Ci. The equation (26) will be LT (fθ) LS(fθ) = Z z U ℓ(fθ, z )pt(z )dz Z z U ℓ(fθ, z)ps(z)dz z DT ℓ(fθ, z )pt(z )dz Z z DS ℓ(fθ, z)ps(z)dz = Ez DT ℓ(fθ, z ) Ez DSℓ(fθ, z). In this case, we have, |LT (fθ) LS(fθ)| = |Ez DT ℓ(fθ, z ) Ez DSℓ(fθ, z)| 0 |Pr DT (ℓ(f(θ, x ), y ) > t) dt Pr DS (ℓ(f(θ, x), y) > t) dt| 0 |Pr DT (ℓ(f(θ, x ), y ) > t) Pr DS (ℓ(f(θ, x), y) > t)| dt (M = 1) sup t [0,1] sup f(θ, ) F |Pr DT (ℓ(f(θ, x ), y ) > t) Pr DS (ℓ(f(θ, x), y) > t)| sup A(f) AF F |Pr DT (A(f)) Pr DS(A(f))| 2d F F(DS; DT ) (Empirical div error) where AF F represents a learning algorithm under the hypothesis F F = {f(x) f (x) : f, f F}, which completes the proof. D SHARPNESS AND ROBUSTNESS Lemma D.1 (positive definiteness of Hessian). Let ˆwmin be the minimum value of | ˆ w| and x = arg minx Unif(Sd 1( d)) ℓ(f( ˆ w, A, x), y). 
For any A = (a1, ..., am), ai Unif(Sd 1( d)) i [m] denote σ(d, m) = λmin(Pm i=1 aia i Gii) > 0 be the minimum eigenvalue, where Gij = P t=0 λ2 d,t(σ)N 2 d,t Q(d) t ( ai, aj / d) is the polynomial product constant. If m = Poly(d), the hessian H(x ) can be lower bound by Ex Unif(Sd 1( d))H(x ) ˆw2 min σ(d, m) M1 d Id. (29) Proof. As suggested in Lemma D.6 of (Zhong et al., 2017), we have a similar result to bound the local positive definiteness of Hessian. By previous definition, the Hessian w.r.t. x has a following partial order H(x ) = D2 f(x , y ) j=1 ˆwi ˆwjaia j σ (a i x )σ (a j x ) j=1 ˆwi ˆwjaia j σ (a i x )σ (a j x ) ˆw2 min M1 d j=1 ˆwi ˆwjaia j σ (a i x )σ (a j x ) Published as a conference paper at ICLR 2024 For the Re LU activation function, we further have σ(a x ) σ (a x ) (31) We extend the σ L2([ d], τd) (where τd is the distribution of x1, x2 / d) by Gegenbauer polynomials that t=0 λd,t(σ)Nd,t Q(d) t ( Let A = (a1, ..., am) Rm d. We assume x Unif(Sd 1( d)). Lemma C.7 in (Mei & Montanari, 2022), suggests that U = Ex Unif(Sd 1( d))[σ( ai, x / d)σ( ai, x / i,j [m] Rm m (33) which shows matrix U is a positive definite matrix. Similarly, taking the expectation over x , terms in RHS of (30) bracket can be rewritten as Ex Unif(Sd 1( j=1 aia j σ (a i x )σ (a j x ) j=1 aia j Ex [σ(a i x / d)σ(a j x / Besides, we have the following property of Gegenbauer polynomials, 1. For x, y Sd 1( d) D Q(d) j ( x, ), Q(d) k ( y, ) E d),γd) = 1 Nd,k δjk Q(d) k ( x, y ). 2. For x, y Sd 1( Q(d) k ( x, y ) = 1 Nd,k i=1 Y (d) ki (x)Y (d) ki (y). where spherical harmonics {Y (d) lj }1 j Nd,l forms an orthonormal basis which gives the following results Ex [σ(a i x / d)σ(a j x / t=0 λ2 d,t(σ)N 2 d,t Ex Q(d) t ( ai, x / d)Q(d) t ( aj, x / t=0 λ2 d,t(σ)N 2 d,t Q(d) t ( ai, aj / d) = Gij < . (35) Hence, we have j=1 aia j Ex [σ(a i x / d)σ(a j x / i=1 aia i Gii + O(1/d)Var(a) (36) Since m = Poly(d), and {a}i [m] are i.i.d, then rank Pm i=1 aia i Gii = rank AA = d. Let σ(d, m) = Eaλmin(Pm i=1 aia i Gii) > 0 we have Ex H(x ) ˆw2 min σ(d, m) M1 d Id (37) Published as a conference paper at ICLR 2024 Lemma D.2. Let SK k=1 Ck be the whole domain, the notion of (ϵ, K)-robustness is described by ϵ(S, A) max Ci SK k=1 Ck sup z,z i Ci,z S |ℓˆ w,A(z) ℓˆ w,A(z i)| . Define Mi be the set of global minima in Ci, where Mi {z(A)|z(A) = min z Ci ℓˆ w,A(z)} suppose for some maximum training loss point zi(A) max z Ci S ℓˆ w,A(z) ℓˆ w,A(z i (A)) there z i (A) where z i (A) arg min z Mi z zi(A) such that zi(A) z (A) ρi(L) L almost surely hold for any A Unif(Sd 1( d)) and for any A, ℓˆ w,A(z) is L-Hessian Lipschitz continuous. Then the ϵ(S, A) can be bounded by ϵ(S, A) max i [K] ρi(L)2 2ℓˆ w,A(zi(A)) + 4ρi(L) Proof. Let z S be a collection of (x, y) from the training set S and z i denote any collection from the set Ci. We define local minima set Mi (which is the global minima set of Ci). Assume that for some maximum point zi(A) maxz Ci S ℓˆ w,A(z), there exists a z i (A) Mi almost surely for all A Unif(Sd 1( d)) such that z i (A) = arg min z Mi f := {z Mi : zi(A) z } s.t. zi(A) z ρi(L) By definition, ϵ(S, A) can be rewritten as ϵ(S, A) = max i [K] sup z,z i Ci,z S |ℓˆ w,A(z) ℓˆ w,A(z i)| = max i [K] sup z Ci S,z Mi ℓˆ w,A(z) ℓˆ w,A(z ) = max i [K] ℓˆ w,A(zi(A)) ℓˆ w,A(z i (A)). 
According to Lemma B.3, we have ϵ(S, A) = max i [K] ℓˆ w,A(zi(A)) ℓˆ w,A(z i (A)) (i) max i [K] ℓˆ w,A(z i (A)), zi(A) z i (A) 2 2ℓˆ w,A(z i (A))(zi(A) z i (A)), zi(A) z i (A) + L 6 zi(A) z 3 = max i [K] 1 2 2ℓˆ w,A(z i (A))(zi(A) z i (A)), zi(A) z i (A) + L 6 zi(A) z i (A) 3 max i [K] 1 2 2ℓˆ w,A(z i (A)) zi(A) z i (A) 2 6 zi(A) z i (A) 3 (Cauchy-Schwarz) (40) where (i) support by the fact ℓˆ w,A(z ) = 0. With Lipschitz continuous Hessian we have 2ℓˆ w,A(z i (A)) L zi(A) z i (A) + 2ℓˆ w,A(zi(A)) . (41) Published as a conference paper at ICLR 2024 Overall, we have ϵ(S, A) max i [K] 1 2 2ℓˆ w,A(z i (A)) zi(A) z i (A) 2 + L 6 zi(A) z i (A) 3 max i [K] 1 2 2ℓˆ w,A(zi(A)) + L zi(A) z i (A) zi(A) z i (A) 2 6 zi(A) z i (A) 3 max i [K] 1 2 2ℓˆ w,A(zi(A)) + ρi(L) ρi(L)2 L2 + ρi(L)3 = max i [K] ρi(L)2 2L2 2ℓˆ w,A(zi(A)) + 2ρi(L)3 which completes the proof. Lemma D.3 (Lemma 2.1 Bourin et al. (2013)). For every matrix in M+ n+m partitioned into blocks, we have a decomposition A X X B = U A 0 0 0 U + V 0 0 0 B for some unitaries U, V Mn+m. Lemma D.4. Then, given an arbitrary partitioned positive semi-definite matrix, s A s + B s for all symmetric norms. Proof. In lemma D.3 we have A X X B = U A 0 0 0 U + V 0 0 0 B for some unitaries U, V Mn+m. The result then follows from the simple fact that symmetric norms are non-decreasing functions of the singular values where f = s : M 7 R, we have f U A 0 0 0 U + f V 0 0 0 B Lemma D.5. For a Unif(Sd 1( d)) and x are some vector Rd with norm x p R(d) d, we have P( x, a 2 a 2) min πR(d) exp 1 4d 9 Proof. We can replace the unit vector of a with e by P( x, a 2 a 2) = P( x, e 2 1) (44) Similarly, we can replace x by unit vector s such that P( x, e 2 1) = P s, e 2 1 R(d) Solving s, e 2 = 1 R(d), we get s, e 2 = cos2 ϕ = 1 R(d) ϕ = arccos 1 p Published as a conference paper at ICLR 2024 In this case, the probability will converge to 1 as R(d) increases. As is known to us, surface area of Sd 1 equals Ad = rd 1 2πd/2 An area Cd of the spherical cap equals Acap d (r) = Z ϕ 0 Ad 1(r sin θ)rdθ = 2π(d 1)/2 0 sind 2 θdθ. (48) where Γ n 1 2 = (2(n 1))! 22(n 1)(n 1)! π. 1) When d = 1, almost surely we have P((x e)2 e2) = P(x2 1) = 1. (49) 2) When d = 2, we have a S2 where S2 is a circle r = 1 and the probability is the angle between s, e how much the vectors span within the circle where P s, e 2 1 R(d) = 2 R ϕ 0 rdθ 3) When d 3, the probability equals | s, e | 1 p = Acap d (r) 1 2Ad = 1 Acap d (r) 1 2Ad (51) where Acap d (r) is the remaining area of cutting the hyperspherical caps in half of the sphere, Acap d (r) = 2π(d 1)/2 ϕ sind 2 θdθ = 2π(d 1)/2 3-a) If d is even then Γ d 2 = (d 2)! π 2d 2 d 2 1 !2 = π 2d 2 Robbins bounds (Robbins, 1955) imply that for any positive integer d πd exp 1 8d 1 πd exp 1 8d + 1 So we have 2Acap d (r) Ad 2Γ d 2d 2 exp 1 4d 9 cos ϕ 2d 4 π exp 1 4d 9 cos ϕ. Published as a conference paper at ICLR 2024 So the probability will be P s, e 2 1 R(d) 2d 4 π exp 1 4d 9 cos ϕ. (56) Suppose R(d) = kd, k > 1, we have πkd exp 1 4d 9 = 2 kπ . (57) 3-b) Similarly, if d is odd, then Γ d 1 2 = 2d 2 d 3 (d 2)! π = 2d 2 If d = 3 then P s, e 2 1 R(d) 2 π cos ϕ = 1 1 p R(d) . (59) If d > 3 then Robbins bounds imply that πΓ d 1 d 2 exp 1 4d 11 Thus, the probability will be at least P s, e 2 1 R(d) π(d 3) exp 1 4d 11 cos ϕ. 
(61) To simplify the result, we compare the minimum probability that for d 3 P s, e 2 1 R(d) π(d 3) exp 1 4d 11 π(d 3) exp 1 4d 11 π(d 3)R(d) exp 1 4d 11 πR(d) exp 1 4d 9 Overall, we have d N+, P s, e 2 1 R(d) πR(d) exp 1 4d 9 D.1 PROOF TO THEOREM 3.4 Proof. Let Ad = (ai)i [d] i.i.d. Unif(Sd 1( d)). We consider the random Re LU NN function class to be Frelu(Ad) = f(w, A, x) = 1 i=1 wiσ x ai : wi R, i [m] where A = [a1, ..., am] Rd m. The empirical minimizer of the source domain is ˆ w = min w Rd 1 n xi,yi S ℓ(f(w, A, xi), yi) = 1 zi S ℓˆ w,A(zi). (64) Published as a conference paper at ICLR 2024 Then with the chain rule, the first derivative of any input x at ˆ w will be xℓ(f( ˆ w, A, x), y) = ℓ(f( ˆ w, A, x), y) f( ˆ w, A, x) ℓf( ˆ w, A, x) σ(Ax) σ(Ax) = D1 f(x, y) i=1 ˆwiai I a i x 0 = D1 f(x, y) d ˆ w σ (Ax) where a short notation D1 f(x, y) denotes the first order directional derivative of f( ˆ w, A, x) σ (Ax) Rm d is the Jacobian matrix w.r.t. input x. Apparently, the second order derivative is represented as D2 f(x, y), thus the Hessian will be 2 xℓ(f( ˆ w, A, x), y) = D2 f(x, y) j=1 ˆwi ˆwjaia j I a i x 0 , a j x 0 = D2 f(x, y) d σ (Ax) ˆ w ˆ w σ (Ax) Similarly, we have 2 yℓ(f( ˆ w, A, x), y) = D2 y(x, y) (sgn(y))2 D2 f(x, y) (67) where sgn(y) is the sign function. holds under our choice of the family of loss functions. 1. Homogeneity in regression, i.e. L1, MSE, MAE, Huber Loss, we have |D2 y(x, y)| = |D2 f(x, y)|; 2. (Binary) Cross-Entropy Loss: D2 y(x, y) = 2 c=1 exp(xc) 3. Negative Log Likelihood (NLL) loss: D2 y(x, y) = 0. Besides, as a convex loss function, D2 y(x, y) 0. Hence, the range of D2 y(x, y) will be [0, D2 f(x, y)]. To combine with robustness, we denote z = (x; y), Rd+1. Therefore, the Hessian of z will be H(z|S, A) := 2 xℓ(f( ˆ w, A, x), y) 2ℓ(f( ˆ w,A,x),y)) y x 2ℓ(f( ˆ w,A,x),y) y x 2 y(ℓ(f( ˆ w, A, x), y)) With Lemma D.4, the spectral norm of Hessian z will be bounded by H(z) 2 xℓ(f( ˆ w, A, x), y) + 2 yℓ(f( ˆ w, A, x), y) . (69) The first term in (69) can be further bounded by D2 f(x, y)σ (Ax) ˆ w ˆ w σ (Ax) |D2 f(x, y)| ˆ w ˆ w σ (Ax)σ (Ax) = D2 f(x, y) ˆ w 2 σ (Ax)σ (Ax) (70) where the convexity of loss functions x, y, D2 f(x, y) 0 supports the last equation. The right term has the facts that σ (Ax)σ (Ax) σ (Ax)σ (Ax) F = tr σ (Ax)σ (Ax) . (71) Published as a conference paper at ICLR 2024 In summary, we have the following inequality: H(z) D2 f(x, y) 1 d ˆ w 2 σ (Ax)σ (Ax) + 1 D2 f(x, y) 1 d ˆ w 2 tr σ (Ax)σ (Ax) + 1 = D2 f(x, y) d ˆ w 2 m X j=1 aj 2 I a j x 0 + 1 In Lemma D.2, it depends on some zi = (xi, yi) S Ci that ϵ(S, A) max i [K] ρi(L)2 H(zi(A)) + 4ρi(L) max i [K] ρi(L)2 D2 f(xi, yi) j=1 aj 2 I a j xt 0 + 1 D2 f(xk, yk) d ˆ w 2 m X j=1 aj 2 I a j xk 0 + 4ρmax(L) where ρmax(L) = max{ρi(L)}K i=0, the oκ = O(ℓ (f, xk, yk) ˆ w 2 Pm j=1 aj 2 I a j xk 0 d/m) is a smaller order term compared to first term, since m d. Last equality, the maximum can be taken as we find maximum (xk, yk) ˆC {Ci}i [K]. Because xk, yk S is one of the training sample k [n], there must exist n [0, n] that D2 f(xk, yk) ˆ w 2 m X j=1 aj 2 I a j xk 0 D2 f(xk, yk) j=1 aj 2I{a j xk 0}. Recall that the sharpness of parameter ˆ w is defined by κ( ˆ w, S, A) := ˆ w 2 tr[HS,A( ˆ w)] j=1 D2 f(xj, yj) tr σ(Axj)σ(Axj) D2 f(xj, yj) i=1 (a i xj)2I a i xj 0 . Let the ξi = a i x D(ξ) and the expectation of E(ξi > 0) = qi where D(ξ) is some rotationally invariant distribution, i.e. uniform or normal distribution. 
Under this circumstance, the sample mean of ξi still obeys the same family distribution as D(tξ). Thus, we have j=1 ξi I {ξi 0} j=1 aj 2I {ξi 0} = P((a x)2 a 2) = Exp(x) With Lemma D.5, we have at least a probability at P((a x)2 a 2) = min ( 2 π arccos R(d) 1 πR(d) e 1 4d 9 Published as a conference paper at ICLR 2024 the following inequality holds, ϵ(S, A) ρmax(L)2 n κ( ˆ w, S, A) + 4ρmax(L) κ( ˆ w, S, A) + 4 3ρmax(L) (78) D.2 PROOF TO COROLLARY 3.5 Corollary D.6 (Restatement of Corollary 3.5). Let ˆwmin be the minimum value of | ˆ w|. Suppose x Unif(Sd 1( d)) and | 2ℓ(f( ˆ w, A, x), y)/ f 2| is bounded by [ M1, M2]. If m > d, ρmax(L) < ( ˆw2 min M1 σ(d, m))/(2d) for any A = (a1, ..., am), ai Unif(Sd 1( d)) i [m], taking expectation over all xj Unif(Sd 1( d)) in S, we have ES,A [ϵ(S, A)] ES,A 7ρmax(L)2 n κ( ˆ w, S, A) + M2 . (79) where σ(d, m) = Eaλmin(Pm i=1 aia i Gii) > 0 is the minimum eigenvalue and Gii is product constant of Gegenbauer polynomials t=0 λ2 d,t(σ)N 2 d,t Q(d) t ( ai, aj / Proof. In our main theorem, with some probability, we have the following relation ϵ(S, A) ρmax(L)2 n κ( ˆ w, S, A) + 4ρmax(L) So, we are concerned about the relation between the κ( ˆ w, S, A) second term. If ρi(L) < κ( ˆ w, S, A), we may say the RHS is dominated by sharpness term κ( ˆ w, S, A) as well as the main effect is taken by the sharpness. As suggested in Lemma D.1, we have Ex Unif(Sd 1( d))H(x ) ˆw2 min M1 σA(m, d) where x is the global minimum over the whole set. As defined in (38), the following condition holds true z i (A) Mi, zi(A) z i (A) ρi(L) and with Hessian Lipschitz, the relation is almost surely for arbitrary x that Ez i (A) H(zi(A)) H(z i (A)) L zi(A) z i (A) ρi(L) ˆw2 min M1 σ(d, m) Ez i (A) 1 2 H(z i (A)) . Obviously, H(z i (A)) > 2ρi(L). Following Lemma A.2 of Zhang et al. (2019a), for z i (A), z(A) Ci, we have a similar result that σmin(H(z(A))) σmin(H(z i (A)) H(z(A)) H(z i (A)) ρi(L) (83) where σmin denotes the minimum singular value. With Lemma D.2, we know that ES,Aϵ(S, A) ES,A max i [K] ρi(L)2 H(zi(A)) + 4ρi(L) Published as a conference paper at ICLR 2024 We also have another condition that |D2 f(x, y)| := 2ℓ(f( ˆ w, A, x), y) [ M1, M2], x, y. (85) Combine all these results, we finally have ES,Aϵ(S, A) ES,A max i [K] ρi(L)2 ES,A 7ρmax(L)2 D2 f(xk, yk) d ˆ w 2 m X j=1 aj 2 I a j xk 0 + M2 Recall the definition of κ( ˆ w, S, A) in the main theorem that ES,Aκ( ˆ w, S, A) E{x}n ˆ w 2 1 D2 f(xj, yj) i=1 (a i xj)2I a i xj 0 (87) Look at the second sum, we have E{xj}n,{ai}m i=1 (a i xj)2I a i xj 0 = i=1 Exj Eai ai 2 xj 2 cos2(β)I a i xj 0 i=1 Exj Eai ai 2 I a i xj 0 d cos2(β). Suppose x and a are i.i.d. from Unif(Sd 1(1)), let u = x, a , we have a well-known result that Eu[u2] = E[ x, a 2] = E[ x 2 a 2 cos2(β)] = E[cos2(β)] = 1 d, x, a Rd (89) Therefore, in (88), m X i=1 Exj Eai ai 2 I a i xj 0 d cos2(β) = i=1 Exj Eai ai 2 I a i xj 0 (90) and we have (based on proof of main theorem), ES,Aϵ(S, A) D2 f(xj, yj) i=1 Exj Eai ai 2I{a i xj 0} + M2 ES,A 7ρmax(L)2 n κ( ˆ w, S, A) + M2 E CASE STUDY To better illustrate our theorems, we here give two different cases for clearly picturing intuition. The first case is the very basic model, ridge regression. As is known to us, ridge regression provides a straightforward way (by punishing the ℓ2 norm of the weights) to reduce the "variance" of the model in order to avoid overfitting. In this case, this mechanism is equivalent to reducing the model s sharpness. Example E.1. 
In ridge regression models, the robustness constant ϵ has a reverse relationship to regularization parameter β where β , the more probably flatter minimum κ and less sensitivity ϵ of the learned model could be. Follow the previous notation that ϵ(S, A) denotes the robustness and κ(ˆθ, S) is the sharpness on training set S, then we have τ > 0, c (0, n], ϵ(S, A) cκ(ˆθ, S) + od where od is a much smaller order than κ(ˆθ, S). Published as a conference paper at ICLR 2024 E.1 RIDGE REGRESSION We consider a generic response model as stated in Ali et al. (2019). y|θ (Xθ , σ2I) ridge regression minimization problem is defined by min θ 1 2n Xθ y 2 + β 2 θ 2, X Rn d, n < d. (92) The least-square solution of ridge regression is ˆθ = X X + nβI 1 X y. (93) With minimizer ˆθ, we now focus on its geometry w.r.t. x. Let S = {z}n i = (X, y) be the training set, (Z, Σ, ρ) be a measure space. Consider the bounded sample set Z such that M > 0, ρ(Z) < + . (94) The Z can be partitioned into K disjoint sets {Ci}i [K]. By definition, we have robustness defined by each partition Ci, z, z Ci, ℓ(ˆθ, z) ℓ(ˆθ, z ) ϵ(S, A). (95) For this convex function ℓ(ˆθ, z), we have the following upper bound in the whole sample domain ϵ(S) = max i [K] sup z,z Ci ℓ(ˆθ, z) ℓ(ˆθ, z ) = max i [K] sup z Ci ℓ(ˆθ, z) ℓ(ˆθ, z i Ci) sup zj Z S ℓ(ˆθ, zj) ℓ(ˆθ, z ) where the z i , z are the global minimum point in Ci and whole domain Z = SK i Ci, respectively. zj is a training data point that has the maximum loss difference from the optimum. Specifically, it as well as the augmented form of ˆθ can be expressed as z = [x1, ..., xd, y] , ˆθ+ = [ˆθ1, ..., ˆθd, 1] , Rd+1. Then the loss difference can be rewritten as ℓˆθ+(z) = (ˆθ +z)2 = (ˆθ x y)2 H(ℓˆθ+(z))is P.S.D matrix. It is a convex function with regards to z such that ϵ(S) sup zj S ℓˆθ+(zj) ℓˆθ+(z ) = sup zj S ℓˆθ+(z ) (zj z ) + 1 2(zj z ) H(ℓˆθ(z ))(zj z ) 1 2(zj z ) H(ℓˆθ+(z ))(zj z ) 1 2 H(ℓˆθ+(z )) zj z 2 where the second equality is supported by convexity and the third equality is due to ℓˆθ+(zj) = 0. Further, with Lemma D.4, we have H(ℓˆθ+(z )) = ˆθˆθ 2ℓˆ θ+(zj)) y x 2ℓˆ θ+(zj)) ˆθˆθ + 1. (98) Published as a conference paper at ICLR 2024 In (97), we can also bound the norm of the input difference by zj z 2 xj x 2 + (yj y )2 (99) and then for simplicity, assume x 2 xj 2 we have ϵ(S) sup xj S ˆθˆθ + 1 xj x 2 + (yj y )2 ˆθ 2 + 1 xj 2 + x 2 + (yj y )2 ˆθ 2 xj 2 + x 2 + O(d) ˆθ 2 xj 2 + O(d) where ˆθ 2 xj 2 = O(d2) is the dominate term for large d. Now, let s look at the relation to sharpness. By definition, κ(ˆθ, S) = ˆθ 2 tr Hˆθ(ℓ(ˆθ, S)) = ˆθ 2 tr X X n + βI . (101) n + βI = tr X X + tr (βI) = tr XX j xj 2 + β, (102) κ(ˆθ, S) = ˆθ 2 1 j xj 2 + β ˆθ 2. (103) As is known to us, the "variance" of ridge estimator Ali et al. (2019) can be defined by Var(ˆθ) tr ˆθˆθ = tr XT X + nβI 1 XT yy X XT X + nβI 1 . (104) Note that ˆθˆθ is a PSD with rank(ˆθˆθ ) = 1, thus it has only one eigenvalue λ(ˆθˆθ ) = ˆθ 2 > 0. Var(θ) = tr ˆθˆθ = ˆθ 2 = O(β 2) (105) By definition, the covariance matrix E[yy ] is a diagonal matrix with entries of σ2. Averagely, we have tr ˆθˆθ = σ2 tr h X X + nβI 1 X X X X + nβI 1i (λi(X X/n) + β)2 n + βI 1 are simultaneously diagonalizable and commutable. Therefore, the greater β is, the smaller tr ˆθˆθ is. Conclusions From our above analysis, we have the following conditions. Upper bound of robustness ϵ(S) supxj S ˆθ 2 xj 2 + O(d). Sharpness expression κ(ˆθ, S) = ˆθ 2 1 n P xj,yj S xj 2 + β . Published as a conference paper at ICLR 2024 Variance expression Var(θ) = ˆθ 2. 
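As a numerical illustration of the three expressions summarized above, the following minimal sketch (with placeholder data sizes) computes the closed-form ridge solution and evaluates the sharpness κ(θ̂, S) = ||θ̂||² (1/n Σ_j ||x_j||² + β) for several values of β.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                      # n < d, as in the setup above (placeholder sizes)
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

def ridge_solution(X, y, beta):
    n, d = X.shape
    # theta_hat = (X^T X + n * beta * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

def sharpness(theta_hat, X, beta):
    # kappa(theta_hat, S) = ||theta_hat||^2 * (mean_j ||x_j||^2 + beta)
    return np.sum(theta_hat**2) * (np.mean(np.sum(X**2, axis=1)) + beta)

for beta in [0.01, 0.1, 1.0, 10.0]:
    th = ridge_solution(X, y, beta)
    print(f"beta={beta:5.2f}  ||theta||^2={np.sum(th**2):.4f}  kappa={sharpness(th, X, beta):.4f}")
```

Increasing β shrinks ||θ̂||² roughly as O(β⁻²), so the computed κ decreases with β, which matches the discussion that follows.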
1) First, let s discuss how β influences sharpness. κ(ˆθ, S) = ˆθ 2 tr Hˆθ(ℓ(ˆθ, S)) = y X X X + nβI 2 X y xj,yj S xj 2 + β As dictated in above equation, the κ(ˆθ, S) = O(β 1) where sharpness holds an inverse relationship to β. 2) Now, it s trivial to get the relationship between robustness and sharpness by combining the first two points. Because xj, yj S in supermum is one of the training samples there exists a constant c < n and od = o(d2) such that ˆθ 2 xj 2 + od ˆθ 2 xj 2 + cβ ˆθ 2 + od = cκ(ˆθ, S) + od. This relation is consistent with Theorem 3.4 where the robustness is upper bounded by n times sharpness κ(ˆθ, S). Besides, the relation is simpler here, where robustness only depends on sharpness without other coefficients before κ(ˆθ, S). 3) Finally, the relation between β and robustness will be Var(θ) xj 2 + od = cσ2 (λi(X X/n) + β)2 (λi(X X/n) + β)2 Pd j λj(X X/n) where the ϵ(S) somehow is the order of O(β 2) in the limit of β. It s clear to us that the greater β is, the less sensitive (more robust) model we can get. In practical, we show that "over-robust" may hurt the model s performance (we show the detail empirically in Figure 7). E.2 2-LAYER DIAGONAL LINEAR NETWORK CLASSIFICATION However, our main theorem assumes the loss function satisfying Polyak-Łojasiewicz (PŁ) condition. To extend our result to a more general case, here we study a 2-layer diagonal linear network classification problem whose loss is exponential-based and not satisfied the PŁ condition. Example E.2. We consider a classification problem using a 2-layer diagonal linear network with exp-loss. The robustness ϵ(S, A) has a similar relationship in Theorem 3.4. Given training set S, after iterations t > Tϵ, C2 > 0, ϵ(S, A) C2 supt Tϵ κ(θ(t), S). Given a training set S = (X, y), X = [x1, ..., xn], X Rd n. A depth-2 diagonal linear networks with parameters u = [u+, u ] R2d specified by: f(u, x) = u2 + u2 , x We consider exponential loss where L(t) = 1 n Pn i=1 exp( x i θ(t)yi) and yi { 1, 1}. It has the same tail behavior and thus similar asymptotic properties as the logistic or cross-entropy loss. WLOG, Published as a conference paper at ICLR 2024 we assume i [n] : yi = 1 such that xi = yixi. Suggest by Moroshko et al. (2020), we have θ(t) = 2α2 sinh 4X Z t where r(t) = 1 nexp( X θ(t)) and r(t) 1 = L(t). Note that θ2(t) + 4α41 X exp X θ(t) = A(t)Xr(t) (110) where A(t) = diag 4 p θ2(t) + 4α41 . Part i, sharpness. First derivative and Hessian can be obtained by θL(t) = (Xr(t)) Hθ(L(t)) = (Xr(t)) i=1 ri(t)Xi X i . (111) With the definition of sharpness, we can get κ(θ(t), S) = θ(t) 2 tr(Hθ[L(t)]) = θ(t) 2 tr i=1 ri(t)Xi X i = θ(t) 2 tr i=1 ri(t)X i Xi = θ(t) 2 n X i=1 ri(t) xi 2 ! Part ii, robustness. Now, let s use the same discussion of the robustness constant in the previous case. Follow the previous definition of ϵ(S, A), after some iteration number Tϵ it is defined by ϵ(S, A) := sup i [n],t Tϵ |nrj(t) r (t)| (113) where nrj(t) is the (denormalized) point-wise loss of xj and r (t) denotes the minimum loss of point x . There exists n < n, we have ϵ(S, A) = sup t Tϵ n ( r(t) 1 r (t)) sup t Tϵ n r(t) 1. (114) Let xmin = mini [n] xi , the above equation ϵ(S, A) 1 xmin 2 sup t Tϵ (n r(t) xmin 2) xmin 2 sup t Tϵ i=1 ri(t) xmin 2 ! xmin 2 sup t Tϵ i=1 ri(t) xi 2 ! Part iii, connection. Compare the last part of (112) and (115), we found that robustness and sharpness depend on the same term. Further, we can say that for any step t, θ(t) 2 will have the upper bound. From Lemma 11 of Moroshko et al. 
(2020), we have the following inequality, θ(t) 2α2 sinh x 2γ2 2α2 γ(t) (116) where L(t) = exp( γ(t)). Then, we can bound the θ(t) 2 via: C1 θ(t) 2 d θ(t) 2 = 4dα4 sinh2 x 2γ2 2α2 γ(t) < (117) Published as a conference paper at ICLR 2024 Note that C1 > 0 In summary, we have C1 xmin sup t Tϵ κ(θ(t), S) C2 sup t Tϵ κ(θ(t), S) (118) where C2 = n C1 xmin > 0 is a constant. Part iv, sanity check asymptotic. Asymptotically, as t , L(t) 0 while θ(t) 2 will be explode. So if sharpness κ(θ(t), X) , then it will fail to imply robustness. κ(θ(t), S) = θ(t) 2 n X i=1 ri(t) xi 2 ! i=1 ri(t) xmax 2 ! = θ(t) 2L(θ(t)) xmax 2 = κmax(θ(t), S) Let κmax(θ(t), X) be a upper bound of κ(θ(t), X) at any time step t. The dynamics will be dκmax(θ(t), S) dt = θκmax(θ(t), S) θ(t) = xmax 2 tr(XX ) θ(t) 2 L(t) + 2L(t)θ(t) θ(t) = xmax 2 tr(XX ) θ(t) 2(Xr(t)) A(t)Xr(t) + 2L(t)θ(t) A(t)Xr(t) = xmax 2 tr(XX ) 2L(t)θ(t) θ(t) 2(Xr(t)) A(t)Xr(t) (120) As t , we have L(t) 0, thus it is easy to converge that lim t dκmax(θ(t), S) dt = xmax 2 θ(t) 2(Xr(t)) A(t)Xr(t) = xmax 2 θ(t) 2 L(t) = 0 (121) As we can see, the dynamics of the derivative of κmax(θ(t), X) is decreasing to 0 as t which means the sharpness κ(θ(t), X) will be upper bounded by a converged number κ(θ(t ), X) < C < . So sharpness will not explode. F COMPARISON TO FEATURE ROBUSTNESS F.1 THEIR RESULTS Definition F.1 (Feature robustness Petzka et al. Petzka et al. (2021)). Let ℓ: Y Y R+denote a loss function, ϵ and δ two positive (small) real numbers, S X Y a finite sample set, and A Rm m a matrix. A model f(x) = (ψ ϕ)(x) with ϕ(X) Rm is called ((δ, S, A), ϵ)-feature robust, if Eϕ F(f, S, αA) ϵ for all 0 α δ. More generally, for a probability distribution A on perturbation matrices in Rm, we define Eϕ F(f, S, A) = EA A h Eϕ F(f, S, A) i , and call the model ((δ, S, A), ϵ)-feature robust on average over A, if Eϕ F(f, S, αA) ϵ for 0 α δ Theorem F.2 (Theorem 5 Petzka et al. Petzka et al. (2021)). Consider a model f(x, w) = g(wϕ(x)) as above, a loss function ℓand a sample set S, and let Om Rm m denote the set of orthogonal matrices. Let δ be a positive (small) real number and w = ω Rd m denote parameters at a local minimum of the empirical risk on a sample set S. If the labels satisfy that y (ϕδA (xi)) = y (ϕ (xi)) = yi for all (xi, yi) S and all A 1, then f(x, ω) is ((δ, S, Om) , ϵ)-feature robust on average over Om for ϵ = δ2 2mκϕ(ω) + O δ3 . Published as a conference paper at ICLR 2024 Proof sketch 1. With assumption that y[ϕδA(xi)] = yi for all (xi, yi) S and all A 1, feature perturbation around w is Eϕ F(f, S, αA) + Eemp(w, S) = Eemp(w + αw A, S) 2. Since Taylor expansion for local minimum w = ω will only remains second order term, thus Eemp(w + αw A, S) = Eemp(ω, S) + α2 s,t=1 (ωs A)Hs,t(ω, ϕ(S))(ωt A) 3. With basic algebra, one can easily get EA Om h Eϕ F(f, S, αA) i δ2 s,t=1 EA Om h (ωs A) Hs,t (ωt A)T i + O δ3 s,t=1 ωs, ωt Tr (Hs,t) + O δ3 2mκϕ(ω) + O δ3 F.2 COMPARISON TO OUR RESULTS We first summarize the commonalities and differences between their results and ours: Both of us consider the robustness of the model. But they define the feature robustness while we study the loss robustness Xu & Mannor (2012) which has been studied for many years. They consider a non-standard generalization gap by decomposing it into representativeness and the expected deviation of the loss around the sample points. But we strive to integrate sharpness into the general generalization guarantees. For point 1, their defined feature robustness trivially depends on the sharpness. 
This is because the sharpness (i.e., the curvature information) is defined precisely by the perturbation region around the point of interest: from step 2 of the proof sketch above, the Hessian with respect to ω is exactly the second-order term in the expansion of the perturbed empirical risk. We therefore think this definition provides less information about the optimization landscape. In contrast, we consider loss robustness for two reasons: 1) it is easy to compute in practice, without first finding the orthogonal matrices O_m; 2) it highlights the dependence on the data manifold. For point 2, we integrate this optimization property (sharpness) into standard generalization frameworks in order to obtain a clearer interplay. Unlike feature robustness, robustness defined through the loss function is easier to analyze with generalization tools, because it is hard and vague to define a "feature" in general. Besides, our result also benefits data-dependent bounds (Xu & Mannor, 2012; Kawaguchi et al., 2022).

G EXPERIMENTS AND DETAILS

G.1 RIDGE REGRESSION WITH DISTRIBUTIONAL SHIFTING

As stated before, we follow Duchi & Namkoong (2021) to investigate ridge regression under distributional shift. We randomly generate θ*_0 ∈ R^d on the sphere and generate data from

X ~iid N(0, 1), y = X θ*_0. (122)

To simulate distributional shift, we randomly generate a unit vector θ*_⊥ perpendicular to θ*_0. Let θ*_0, θ*_⊥ be the basis vectors; the shifted ground truth is then computed from this basis as θ*_α = θ*_0 cos(α) + θ*_⊥ sin(α). For the source domain, we use θ*_0 as our training distribution. We randomly sample 50 data points and train a linear model with gradient descent for 3000 iterations. Starting from α = 0, we gradually increase α to generate different distributional shifts.

From the left panel of Figure 7 we can see that a larger penalty leads to a smaller increase in test loss as β ranges from 0 to 2. Since we consider a cyclic shift of the label space, α = 180° corresponds to the maximum shift and thus the largest loss increase. According to our analysis of ridge regression, a larger β means a flatter minimum and more robustness, resulting in better OOD generalization. This experiment verifies our theoretical analysis. However, it is important to note that too large an ℓ2 regularization coefficient hurts performance: as shown in the right panel of Figure 7, underfitting (indicated by brown colors) appears when β > 4.

Figure 7: OOD test losses increase along the distributional shift. The X-axis is the shifting angle α and the Y-axis is the test loss of the model trained on the distribution with α = 0. Left: larger regularization β causes a lower increase in test loss (darker purple lines have lower test losses). Right: too large an ℓ2 penalty coefficient β causes poor fitting and thus fails to generalize on both ID and OOD datasets (orange lines). Best viewed in color.

G.2 ADDITIONAL RESULTS ON SHARPNESS

Here we show additional experimental results on Rotated MNIST, a rotated version of the MNIST handwritten digit dataset with angles ranging over 0°, 15°, 30°, 45°, 60°, 75° (6 domains/distributions). For each environment, we select one domain as the test domain and train the model on all other domains; OOD generalization test accuracy is then reported on the test domain. In our experiments, we run each algorithm with 12 different seeds, obtaining 12 trained models at different minima; a simplified sketch of one way to estimate a trace-based sharpness for such models is given below.
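The following is a minimal PyTorch-style sketch of such an estimate: it uses Hutchinson's trick to approximate tr(H) of the training loss with respect to the parameters and multiplies by the squared parameter norm. It is not a faithful reimplementation of Algorithm 1 (which operates on a chosen feature layer with its own normalization), and the toy model and data below are placeholders.

```python
import torch
import torch.nn as nn

def hutchinson_trace(loss, params, n_probes=20):
    """Estimate tr(H) of `loss` w.r.t. `params` via Hutchinson's estimator."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    estimate = 0.0
    for _ in range(n_probes):
        # Rademacher probe vector v in {-1, +1}^p
        v = torch.randint(0, 2, flat_grad.shape, dtype=flat_grad.dtype) * 2 - 1
        # Hessian-vector product: grad of (g^T v) w.r.t. params equals H v
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        flat_hv = torch.cat([h.reshape(-1) for h in hv])
        estimate += torch.dot(v, flat_hv).item()
    return estimate / n_probes

# Toy usage on a small MLP (placeholder data and architecture).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(128, 20), torch.randint(0, 10, (128,))
loss = nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
w_sq_norm = sum((p.detach() ** 2).sum().item() for p in params)
sharpness_proxy = w_sq_norm * hutchinson_trace(loss, params)
print(f"sharpness proxy ||w||^2 * tr(H) ~= {sharpness_proxy:.3f}")
```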
Then we compute the sharpness (see Algorithm 1) for all these models and plot them in Figure 8. For algorithms, we choose Empirical Risk Minimization (ERM) and Stochastic Weight Averaging (SWA) as the plain OOD generalization algorithms, shown in the first column. Among robust optimization algorithms, we choose Group Distributionally Robust Optimization (Sagawa et al., 2019). We also choose CORAL (Sun & Saenko, 2016) as a multi-source domain generalization algorithm. Across these different types of out-of-domain generalization algorithms, we can conclude that sharpness affects the test accuracy on the OOD dataset. Experimental configurations are listed in Table 1. For each model, we run 5000 iterations and take the last checkpoint as the test model. To ease the computational burden, we use a 3-layer MLP to compute the sharpness.

Algorithms | Optimizer | lr    | WD | batch size | MLP size | eta  | MMD γ
ERM (SWA)  | Adam      | 0.001 | 0  | 64         | 265*3    | -    | -
DRO        | Adam      | 0.001 | 0  | 64         | 265*3    | 0.01 | -
CORAL      | Adam      | 0.001 | 0  | 64         | 265*3    | -    | 1

Table 1: Hyperparameters used for the different DG algorithms in the experiments.

From Figure 8 we can see that sharpness has an inverse relationship with out-of-domain generalization performance for every model in each individual environment. To make this clearer, we also plot the similar tasks from environments 1 to 4 together in the last row, where the tendency is visible for all algorithms. This verifies our Theorem 3.1. Note that the algorithms have different feature scales, so one may need to normalize the results of the different algorithms when plotting them together.

Algorithm 1 Pseudocode of model sharpness computation
Require: feature layer f(x), training loss ℓ
Ensure: sharpness S
  Get the Jacobian matrix w.r.t. the feature layer: J = ∇ℓ(f(x), y)
  for each gradient vector J_i in J do
    Compute the Hessian w.r.t. element (i, l) of f(x) by taking the gradient of J_i w.r.t. f_l(x)
  end for
  Store the Hessian in the variable H
  Initialize sharpness S = 0
  for i in feature_layer.shape[0] do
    for j in feature_layer.shape[0] do
      Retrieve the Hessian value of the (i, j) element via h ← H[:, j, i, :]
      Sharpness s_ij ← Trace(h) · f_ij(x)²
      S ← S + s_ij
    end for
  end for

G.3 COMPARE OUR ROBUST BOUND TO OTHER OOD GENERALIZATION BOUNDS

G.3.1 COMPUTATION OF OUR BOUNDS

First, we follow Kawaguchi et al. (2022) to compute K via the inverse image of an ϵ-covering in a randomly projected space. The main idea is to partition the input space through a projected space with a transformation matrix A. The specific steps are: (1) To generate a random matrix A, we i.i.d. sample each entry from the uniform distribution U(0, 1). (2) Each row of the random matrix A ∈ R^{3×d} is then normalized so that Ax ∈ [0, 1]^3, i.e., A_ij ← A_ij / Σ_{j=1}^{d} A_ij. (3) After generating the random matrix A, we use the ϵ-covering of the space of u = Ax to define the pre-partition {C̃_i}_{i=1}^{K}.

G.3.2 COMPUTATION OF THE PAC-BAYES BOUND

We follow the definition of the expected dis_ρ in Germain et al. (2013), where
Definition G.1. Let H be a hypothesis class. For any marginal distributions DS and DT over X and any distribution ρ on H, the domain disagreement dis_ρ(DS, DT) between DS and DT is defined by
dis_ρ(DS, DT) := E_{h,h'∼ρ²} [ R_{DT}(h, h') − R_{DS}(h, h') ].
Since dis_ρ(DS, DT) is defined as an expected distance, we can compute its empirical version according to their theoretical upper bound as follows.
Proposition G.2 (Germain et al. (2013), Theorem 3).
For any distributions DS and DT over X, any set of hypothesis H, any prior distribution π over H, any δ (0, 1], and any real number α > 0, with a probability at least 1 δ over the choice of S T (DS DT )m, for every ρ on H, we have disρ (DS, DT ) 2α h disρ(S, T) + 2KL(ρ π) ln 2 δ m α + 1 i 1 . With disρ(S, T), we can then compute the final generalization bound by the following inequality ρ on H, RPT (Gρ) RPT Gρ T RPS (Gρ) + disρ (DS, DT ) + RDT Gρ, Gρ T + RDS Gρ, Gρ T Published as a conference paper at ICLR 2024 ERM CORAL DRO Out-of-Domain Generalization Accuracy Figure 8: Domain generalization test accuracy on Rotated MNIST. From top to the bottom: environment 0, 1, 2, 3, 4, 5 with angles: [0 , 15 , 30 , 45 , 60 , 75 ] and a plot together. Each column shows the 12 runs of each algorithm. Published as a conference paper at ICLR 2024 with ρ T = argminρ RPT (Gρ) is the best target posterior, and RD Gρ, Gρ T = Eh ρEh ρ T RD (h, h ). Note that we ignore the expected errors over the best hypothesis by assuming the RD Gρ, Gρ T = 0. We apply the same operation in E of Proposition C.6 as well. G.3.3 COMPARISONS In this section, we add some additional experiments on comparing to other baselines, i.e. PAC-Bayes bounds Germain et al. (2013). As shown in the first row of Figure 9, our robust framework has a smaller distribution distance in the bound compared to the two baselines when increasing the model size. In the second row, we have similar results in final generalization bounds. From the third and fourth rows we can see, our bound is tighter than baselines when suffering distributional shifts. Published as a conference paper at ICLR 2024 Figure 9: Additional spurious feature synthetic experiment. Each dot represents a trained model. The dash curves are the smoothed function fit by the test data points. The baseline is Proposition 2.1. (a),(d),(g),(j): the generalization error of the logistic regression models with increasing the model size/correlation probability. (b), (e): distribution distances for PAC-Bayes Germain et al. (2013) along model size increases. (h), (k): distribution distances for PAC-Bayes Germain et al. (2013) along model distribution shift. (c), (f), (i), (l): comparisons among baselines Proposition C.6, Germain et al. (2013) and ours. Note that model size > 500 is the overparameterized regime. The further the correlation probability is from 0.5, the greater the distributional shift is.
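For completeness, here is a minimal sketch of the partition construction described in G.3.1 together with the count-based distance. We read d_(ε,K)(S, D̂_T) as (1/n) Σ_i |n_i(S) − n_i(D̂_T)|, as in Lemma C.1; the toy data, the cell size ε, and the projection dimension are placeholders.

```python
import numpy as np

def make_projection(d, k=3, rng=None):
    """Random matrix A in R^{k x d} with Uniform(0,1) entries, rows normalized."""
    rng = rng or np.random.default_rng(0)
    A = rng.uniform(0.0, 1.0, size=(k, d))
    return A / A.sum(axis=1, keepdims=True)   # A @ x stays in [0,1]^k for x in [0,1]^d

def cell_counts(X, A, eps):
    """Count how many points fall into each eps-cell of the projected space."""
    u = np.clip(X @ A.T, 0.0, 1.0 - 1e-12)
    idx = np.floor(u / eps).astype(int)        # integer cell index per coordinate
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    return {tuple(c): n for c, n in zip(cells, counts)}

def robust_distance(X_src, X_tgt, A, eps):
    """d_(eps,K)(S, D_T_hat) = (1/n) * sum_i |n_i(S) - n_i(D_T_hat)|."""
    cs, ct = cell_counts(X_src, A, eps), cell_counts(X_tgt, A, eps)
    n = len(X_src)
    keys = set(cs) | set(ct)
    return sum(abs(cs.get(k, 0) - ct.get(k, 0)) for k in keys) / n

# Toy usage: source vs. a slightly shifted target, both in [0,1]^d.
rng = np.random.default_rng(1)
d, n = 50, 1000
X_src = rng.uniform(0, 1, (n, d))
X_tgt = np.clip(X_src + 0.05 * rng.standard_normal((n, d)), 0, 1)
A = make_projection(d)
print(robust_distance(X_src, X_tgt, A, eps=0.25))
```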