# Spurious Feature Diversification Improves Out-of-Distribution Generalization

Published as a conference paper at ICLR 2024

Yong Lin, Lu Tan, Yifan Hao, Ho Nam Wong, Hanze Dong, Weizhong Zhang, Yujiu Yang, Tong Zhang

The Hong Kong University of Science and Technology, Tsinghua University, Fudan University, University of Illinois Urbana-Champaign

Generalization to out-of-distribution (OOD) data is a critical challenge in machine learning. Ensemble-based methods, like weight space ensembles that interpolate model parameters, have been shown to achieve superior OOD performance. However, the underlying mechanism for their effectiveness remains unclear. In this study, we closely examine WiSE-FT, a popular weight space ensemble method that interpolates between a pre-trained and a fine-tuned model. We observe an unexpected FalseFalseTrue phenomenon, in which WiSE-FT successfully corrects many cases where each individual model makes incorrect predictions, which contributes significantly to its OOD effectiveness. To gain further insights, we conduct theoretical analysis in a multi-class setting with a large number of spurious features. Our analysis predicts the above phenomenon and further shows that ensemble-based models reduce prediction errors in OOD settings by utilizing a more diverse set of spurious features. Contrary to the conventional wisdom that focuses on learning invariant features for better OOD performance, our findings suggest that incorporating a large number of diverse spurious features weakens their individual contributions, leading to improved overall OOD generalization performance. Additionally, our findings provide the first explanation for the mysterious phenomenon of weight space ensembles outperforming output space ensembles in OOD.
Empirically, we demonstrate the effectiveness of utilizing diverse spurious features on a MultiColorMNIST dataset, and our experimental results are consistent with the theoretical analysis. Building upon the new theoretical insights into the efficacy of ensemble methods, we further identify an issue of WiSE-FT caused by the overconfidence of fine-tuned models in OOD situations. This overconfidence magnifies the fine-tuned model's incorrect prediction, leading to deteriorated OOD ensemble performance. To remedy this problem, we propose a novel method called BAlaNced averaGing (BANG) to mitigate the overconfidence problem, which significantly enhances the OOD performance of WiSE-FT.

Equal contribution. Corresponding to: Yong Lin

1 INTRODUCTION

Machine learning has seen significant advancements recently. However, the assumption that testing samples follow the same distribution as training samples, known as the independent and identically distributed (IID) assumption, can be violated in real-world applications. When a machine learning model encounters novel testing samples that it hasn't seen during training, it faces the out-of-distribution (OOD) generalization problem.

Ensemble-based models (EBM) have achieved significant success in addressing OOD problems in recent years. Specifically, denote the input as $x$ and the model as $f_\theta$ with parameter $\theta$. Given two models $f_{\bar\theta}$ and $f_{\tilde\theta}$, existing EBM works typically consider the output space ensemble (OSE), which outputs $f_{\bar\theta}(x) + f_{\tilde\theta}(x)$, and the weight space ensemble (WSE), which outputs $f_{(\bar\theta+\tilde\theta)/2}(x)$. WSE is also called weight averaging in the literature. Wortsman et al. (2022); Wortsman et al.; Rame et al. (2022) show that EBM can significantly improve the OOD performance and that WSE outperforms OSE. Many works, e.g., Cha et al. (2021); Rame et al. (2022); Arpit et al. (2022); Rame et al.; Wortsman et al.; Tian et al. (2023); Kumar et al. (2022), adopt WSE to repeatedly improve the SOTA performance on many OOD benchmarks such as DomainBed (Gulrajani & Lopez-Paz, 2020) and ImageNet variants (Wortsman et al., 2022). See Appendix B.1 for more related works.

Figure 1: Illustration of the FalseFalseTrue phenomenon. Consider classifying camels, cows, and dogs (class encodings: dog $[1, 0, 0]$, cow $[0, 1, 0]$, camel $[0, 0, 1]$). The invariant feature $x_v$ is the shape of the animal. There are 2 spurious features: (1) the background $x_{s,1}$, e.g., camels are always on the sand, cows are on grass, and dogs are on the floor; (2) the fur of the animals $x_{s,2}$, e.g., camels have brown fur, cows have dotted fur, and dogs are all in black in the training dataset. Suppose we fit two models, $\bar f$ and $\tilde f$, on the training dataset independently. Assume that $\bar f$ uses the invariant feature $x_v$ and $x_{s,1}$, and $\tilde f$ uses $x_v$ and $x_{s,2}$. $\bar f$ and $\tilde f$ both correctly predict the label of a sample from the training distribution. Consider an OOD testing sample of a dog with brown fur on the grass. $\bar f$ puts a large logit on the cow class since the background (grass) is spuriously correlated with cows, i.e., $\bar f(x_v, x_{s,1}) = [0.4, 0.6, 0]$. $\tilde f$ puts a large logit on the camel class since the texture (brown fur) is spuriously correlated with camels, i.e., $\tilde f(x_v, x_{s,2}) = [0.4, 0, 0.6]$. Both $\bar f$ and $\tilde f$ make mistakes on this sample. However, their average makes a correct prediction, i.e., $\frac{1}{2}\bar f(x_v, x_{s,1}) + \frac{1}{2}\tilde f(x_v, x_{s,2}) = [0.4, 0.3, 0.3]$.

Consider two types of features for OOD: (1) invariant features that consistently predict the label across distributions, and (2) spurious features that have unstable correlations with the label.
Existing OOD theories (Arjovsky et al., 2019; Rosenfeld et al., 2020; Wald et al., 2022; Ahuja et al., 2020; Zhou et al., 2022b) show that an ERM-trained model relying on spurious features can fail in the worst case. EBM, which combines multiple ERM-trained models, may still heavily depend on such features and potentially fail in worst-case scenarios as well. There have been some previous attempts to explain the effectiveness of model ensembles, but they do not offer satisfactory explanations of the overall OOD improvement of EBM. Furthermore, the difference between the weight and output space ensembles remains under-explored (see Appendix B.2 for a thorough discussion of related works).

An intriguing phenomenon. To understand the benefits of EBM, we examine WiSE-FT (Wortsman et al., 2022), which interpolates between a pre-trained and a fine-tuned model. When evaluating OOD datasets, we divide them into four groups based on the correctness of the predictions made by the individual models. Surprisingly, we find a FalseFalseTrue phenomenon: WiSE-FT can correct predictions on samples where both individual models make incorrect predictions. Further, we show that the two individual models learn different feature sets, and WiSE-FT utilizes more diverse features. Based on these observations, we then motivate our theory with a toy example (shown in Figure 1). Suppose we have two models, $\bar f$ and $\tilde f$, for a 3-class classification task. For a sample from the first class, $\bar f$ produces logits of $(0.4, 0.6, 0)$, and $\tilde f$ produces logits of $(0.4, 0, 0.6)$. The ensemble model's prediction would be $(0.4, 0.3, 0.3)$, which is correct. This phenomenon can happen when $\bar f$ and $\tilde f$ learn different subsets of spurious features, represented as $\bar{\mathcal{S}}$ and $\tilde{\mathcal{S}}$, respectively. Recall that the spurious correlations change in OOD. In the example, $\bar f$ generates a high logit (0.6) for the second class influenced by $\bar{\mathcal{S}}$, while $\tilde f$ produces a high logit (0.6) for the third class influenced by $\tilde{\mathcal{S}}$ (details in Section 2).

A new perspective on OOD generalization.
In Section 3, we extend a popular theoretical setting (Rosenfeld et al., 2020; Wald et al., 2022) to 3-class classification with multiple spurious features. Our theoretical results predict the aforementioned phenomenon. We show that EBM incorporates more diverse spurious features, which weakens the contributions of individual spurious features and further leads to improved overall OOD performance. We also shed light on the difference between the weight and output space ensembles. Recall that there has been a significant effort in the OOD community to learn invariant features and discard spurious features (Arjovsky et al., 2019). However, these approaches have not shown satisfactory performance when applied to real-world datasets (Gulrajani & Lopez-Paz, 2020), which may be because invariant learning requires numerous domains (Rosenfeld et al., 2020) and strong regularization (Zhou et al., 2022b), and faces additional difficulties induced by non-linearity (Rosenfeld et al., 2020), overparameterization (Lin et al., 2022a), and optimization challenges (Chen et al., 2023c). In contrast, our findings offer a new perspective: spurious feature diversification actually improves OOD performance, can be easily implemented as shown in ensemble-based models, and has achieved remarkable empirical success. To further verify our findings, we introduce MultiColorMNIST in Section 3.4, a novel variant of CMNIST (Arjovsky et al., 2019) with multiple spurious features. Through empirical analysis, we show that individual models trained on MultiColorMNIST utilize different spurious features, and their ensemble achieves superior OOD performance by leveraging this diversity. Notably, while several methods promote feature diversity to enhance empirical performance, none of them has explored spurious feature diversification from a perspective similar to ours (details in Appendix B.2).

An improved method.
Our theoretical results indicate that the scaling of $\bar f$ and $\tilde f$ should be similar to maintain the improvement of the model ensemble. If $\bar f$ is much more confident than $\tilde f$, resulting in a larger scaling for $\bar f$, the ensemble model can become biased towards $\bar f$. Unfortunately, this scaling issue arises in WiSE-FT, which combines a pre-trained model and a fine-tuned model in the weight space. Empirical evidence shows that the pre-trained model is well calibrated, whereas the fine-tuned model is highly over-confident on OOD datasets, indicating a larger scaling compared to the pre-trained model. Based on these findings, we propose BAlaNced averaGing (BANG), which combines the pre-trained model with a model fine-tuned by over-confidence preventing methods such as label smoothing and Mixup. We demonstrate that BANG improves vanilla WiSE-FT by approximately 1.9pp in average OOD performance across five ImageNet variants.

To summarize, the following are the main contributions of the paper:

By examining WiSE-FT, a popular method of ensemble-based models (EBM) that combines the pre-trained and fine-tuned models in the weight space, we discover an unexpected FalseFalseTrue phenomenon: WiSE-FT can correct a large fraction of OOD samples on which both individual models make wrong predictions. We further show that the two individual models use different sets of features and WiSE-FT utilizes more diverse features.

Through theoretical analysis on a multi-class classification problem with multiple spurious features, we provide a natural explanation for the observed phenomenon and show that EBM can improve OOD performance through spurious feature diversification. Additionally, our findings provide the first-ever explanation for the mysterious phenomenon of weight space ensembles outperforming output space ensembles in OOD scenarios.
Contrary to the traditional belief that emphasizes the exclusive learning of invariant features for OOD, our findings suggest that incorporating diverse spurious features weakens their individual contributions, leading to improved overall OOD generalization performance. Through experiments on our MultiColorMNIST dataset, which contains multiple spurious features, we provide concrete evidence for the effectiveness of diverse spurious features.

Based on our theoretical and empirical findings, we show that WiSE-FT can suffer from the over-confidence problem of the fine-tuned model, which skews the ensemble and deteriorates the OOD performance. We further propose a novel method, BANG, to remedy this problem, and it significantly improves the OOD performance.

2 UNDERSTANDING ENSEMBLE-BASED MODELS VIA EXAMINING WISE-FT

The FalseFalseTrue phenomenon. In this section, we closely examine WiSE-FT (Wortsman et al., 2022) to obtain intuition on why EBM can improve OOD performance. Specifically, Wortsman et al. (2022) ensemble the pre-trained CLIP model and the model fine-tuned on ImageNet in the weight space. In Appendix C.1, we divide each dataset (ImageNet as the ID dataset and five ImageNet variants¹ as OOD datasets) into 8 groups by whether the pre-trained, fine-tuned, and averaged models make correct predictions. We surprisingly find that WiSE-FT can correct a substantial part of the samples on which both the pre-trained and fine-tuned models make mistakes. Specifically, we calculate the number of FalseFalseTrue samples, i.e., samples on which WiSE-FT is correct while both the pre-trained and fine-tuned models are incorrect. We then calculate the FalseFalseTrue ratio by dividing the FalseFalseTrue number by the dataset size. Figure 2 (Left) shows the FalseFalseTrue ratio on each OOD dataset and compares it with the overall improvement, which is the accuracy improvement of WiSE-FT over the best of the pre-trained and fine-tuned models. We can see that there is a substantial fraction of FalseFalseTrue samples in each dataset. Refer to Appendix C.1 for more details. Interestingly, the FalseFalseTrue ratio is even higher than the overall improvement on IN-R and IN-A; we provide an in-depth analysis and explanation in Appendix C.1 and E.6.

¹They are ImageNet-V2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), ImageNet-A (Hendrycks et al., 2021b), ImageNet-Sketch (Wang et al., 2019), and ObjectNet (Barbu et al., 2019). We refer to them as IN-V2, IN-R, IN-A, IN-S, and ObjNet for short. More details in Appendix E.

Figure 2: (Left) FalseFalseTrue ratio, shown as a percentage alongside the overall improvement on IN-V2, IN-R, IN-A, IN-S, and ObjNet; (Right) GradCAM feature visualization of the original image and of the pre-trained, fine-tuned, and WiSE-FT models.

Illustration on when FalseFalseTrue occurs. In this part, we try to understand the FalseFalseTrue phenomenon. We first consider the output space ensemble to be similar to the weight space ensemble in this part and present an analysis of their difference in Section 3. Suppose we want to distinguish among camels, cows, and dogs. There is one invariant feature $x_v$ (the shape of the animal) and two spurious features (the background $x_{s,1}$ and the fur of the animal $x_{s,2}$). Camels are typically found on sand, cows on grass, and dogs on the floor. Camels have brown fur, cows have dotted fur, and dogs are all black in the training dataset. See Figure 1 for an illustration. Suppose we fit two different models, $\bar f$ and $\tilde f$, on the training dataset. Further assume that $\bar f$ uses the features $x_v$ and $x_{s,1}$, and $\tilde f$ uses $x_v$ and $x_{s,2}$.² Both $\bar f$ and $\tilde f$ correctly predict samples from the training distribution.
In contrast, consider a sample from the testing distribution, e.g., a dog with brown fur ($x_{s,2}$) on the grass ($x_{s,1}$): $\bar f$ puts a large logit on the cow class since the background, grass, is spuriously correlated with cow, i.e., $\bar f(x_v, x_{s,1}) = [0.4, 0.6, 0]$; $\tilde f$ puts a large logit on the camel class since the texture, brown fur, is spuriously correlated with camel, i.e., $\tilde f(x_v, x_{s,2}) = [0.4, 0, 0.6]$. Both $\bar f$ and $\tilde f$ make different mistakes under distributional shifts because they use different spurious features. However, their ensemble can make a correct prediction, i.e., $\frac{1}{2}\bar f(x_v, x_{s,1}) + \frac{1}{2}\tilde f(x_v, x_{s,2}) = [0.4, 0.3, 0.3]$.

²For simplicity of illustration, we assume that $\bar f$ and $\tilde f$ learn the same invariant feature. However, this is not necessary for EBM to outperform both individual models, as demonstrated in Section 3.

Feature visualization. The reasoning above assumes that individual models utilize different features. The GradCAM (Selvaraju et al., 2016) visualization of the features used by the pre-trained (zero-shot), fine-tuned, and WiSE-FT models in Figure 2 (Right) confirms this assumption. The visualization shows that the pre-trained and fine-tuned models rely on different features, while WiSE-FT utilizes more diverse features. Additionally, Allen-Zhu & Li (2020) provide empirical evidence supporting the use of diverse features by different DNNs with the same architecture trained on the same datasets (with different initializations). They also provide a formal theoretical proof for 2-layer DNNs. We include some of the empirical results of Allen-Zhu & Li (2020) in Appendix C.2. Additionally, there is more evidence suggesting that DNNs favor sparse feature representations and discard redundant features (Papyan et al., 2020; Andriushchenko et al., 2023).

3 ANALYSIS ON SPURIOUS FEATURE DIVERSIFICATION

3.1 THEORETICAL SETTINGS

Notation. For simplicity of presentation, we consider a 3-class classification problem, i.e., $y \in \{e_1, e_2, e_3\}$, where $e_i$ denotes the 3-dimensional unit vector with the $i$th element equal to 1, e.g., $e_2 = [0, 1, 0]^\top$. In Appendix F.2, we extend the setting to $K$-class classification. $a(k)$ means the $k$th element of vector $a$, and $A(k)$ means the $k$th column of matrix $A$. We use $I_K$ to represent a $K \times K$ identity matrix, e.g., $I_3 = [e_1, e_2, e_3]$. We omit the subscript of $I$ when no confusion arises.

Figure 3: (a) $\mu_{s,j} \in \mathbb{R}^{d \times 3}$ represents a spurious feature, e.g., the background. Each column of $\mu_{s,j}$ is an attribute of the spurious feature, e.g., $\mu_{s,j}(1)$, $\mu_{s,j}(2)$, and $\mu_{s,j}(3)$ are the floor, grass, and sand, respectively. (b) $Q_{s,j} \in \{0, 1\}^{3 \times 3}$ represents the relationship between labels and spurious features. In the ID distribution, $Q_{s,j}$ equals $I$, indicating that each spurious feature is perfectly correlated with the corresponding class. (c) In the OOD distribution, the spurious correlation can fail, e.g., $Q_{s,j}(1)$ equals $e_2$ with probability $p/3$, indicating that the background of the dog is the grass.

Suppose we have $d_v$ invariant features $\{x_{v,i}\}_{i=1}^{d_v}$ and $d_s$ spurious features $\{x_{s,j}\}_{j=1}^{d_s}$, where $x_{v,i}, x_{s,j} \in \mathbb{R}^d$, and the whole feature $x \in \mathbb{R}^{d \times (d_s + d_v)}$ is their concatenation, i.e.,

$$x = \mathrm{Concat}\left(\{x_{v,i}\}_{i=1}^{d_v} \cup \{x_{s,j}\}_{j=1}^{d_s}\right) = [x_{v,1}, \ldots, x_{v,d_v}, x_{s,1}, \ldots, x_{s,d_s}].$$

Consider that each model $f$ is composed of a featurizer $\Phi \in \{0, 1\}^{d_v + d_s}$ and a classifier $w \in \mathbb{R}^{d \times 3}$. $\Phi$ first selects features by $x\Phi$. For example, suppose $x = [x_1, x_2, x_3]$ and $\Phi = [1, 1, 0]^\top$; then $x\Phi = x_1 + x_2$. Then the classifier $w \in \mathbb{R}^{d \times 3}$ is fit on the features selected by $\Phi$ as

$$w = \arg\min_{v \in \mathbb{R}^{d \times 3}} \mathcal{R}_{id}(v, \Phi) = \arg\min_{v \in \mathbb{R}^{d \times 3}} \mathbb{E}_{(x,y) \sim \mathcal{D}_{id}}\left[\ell(v^\top (x\Phi), y)\right],$$

where $\ell$ is the cross-entropy loss function and $\mathcal{D}_{id}$ is the ID distribution. (Remark: Refer to Appendix D.1 for detailed discussions on the setting.)
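The feature selection $x\Phi$ and the bilinear logit $w^\top(x\Phi)$ described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `select_features` and `logits`, and the toy numbers, are hypothetical and chosen only to mirror the $x = [x_1, x_2, x_3]$, $\Phi = [1, 1, 0]^\top$ example.

```python
import numpy as np

def select_features(x, phi):
    """Return x @ phi, i.e. the sum of the feature columns selected by phi.

    x   : array of shape (d, n_features), one column per feature.
    phi : binary array of shape (n_features,), the featurizer.
    """
    x = np.asarray(x, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return x @ phi

def logits(w, x, phi):
    """Logits w^T (x phi) of the bilinear model; w has shape (d, n_classes)."""
    return np.asarray(w, dtype=float).T @ select_features(x, phi)

# Three features in d = 2 dimensions (toy numbers, purely illustrative).
x = np.array([[1.0, 2.0, 10.0],
              [0.5, 0.0, -3.0]])
phi = np.array([1, 1, 0])            # keep x1 and x2, drop x3
selected = select_features(x, phi)   # equals x1 + x2; x3 is ignored
```

Because $\Phi$ is binary, the selected representation is simply the sum of the chosen feature vectors, which is why the number of learned features directly controls the scale of the logits in the later analysis.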
Following Rosenfeld et al. (2020) and Wald et al. (2022), we consider that each $x_{v,i}$ and $x_{s,j}$ is generated from the label $y$ with the latent invariant features $\mu_{v,i}$ and spurious features $\mu_{s,j}$, where $\mu_{v,i}, \mu_{s,j} \in \mathbb{R}^{d \times 3}$.

Definition 1 (Data Generation Process). The whole data generation process is as follows:

$$y \sim \mathrm{Unif}\{e_1, e_2, e_3\}, \quad x = \mathrm{Concat}\left(\{x_{v,i}\}_{i=1}^{d_v} \cup \{x_{s,j}\}_{j=1}^{d_s}\right),$$
$$P_\theta(x_{v,i} \mid y) = \mathcal{N}\left(\mu_{v,i} Q_{v,i} y, \sigma^2 I_d\right), \quad P_\theta(x_{s,j} \mid y) = \mathcal{N}\left(\mu_{s,j} Q_{s,j} y, \sigma^2 I_d\right), \quad \forall i, j, \qquad (1)$$

where $Q_{v,i}, Q_{s,j} \in \{0, 1\}^{3 \times 3}$. Further, $Q_{v,i} = I_3 = [e_1, e_2, e_3]$ always holds. In the ID distribution $\mathcal{D}_{id}$, $Q_{s,j} = I_3$; and in the OOD distribution $\mathcal{D}_{ood}$, the $k$th column of $Q_{s,j}$, i.e., $Q_{s,j}(k)$, is as follows for $k = 1, 2, 3$:

$$Q_{s,j}(k) = \begin{cases} e_k, & \text{with probability } 1 - p, \\ \mathrm{Unif}\{e_1, e_2, e_3\}, & \text{with probability } p. \end{cases}$$

The intuition of the data generation process. We consider the example in Figure 1. Figure 3 shows the intuition of $\mu_{s,j}$ and $Q_{s,j}$. Suppose the spurious feature $\mu_{s,j}$ is the background in Figure 1. Here $\mu_{s,j} = [\mu_{s,j}(1), \mu_{s,j}(2), \mu_{s,j}(3)] \in \mathbb{R}^{d \times 3}$, and each column $\mu_{s,j}(k)$ for $k = 1, 2, 3$ represents a specific attribute that is associated with class $k$ in the training set. In other words, $\mu_{s,j}(1)$, $\mu_{s,j}(2)$, and $\mu_{s,j}(3)$ represent 3 attributes of the background, namely floor, grass, and sand, which are correlated with dog, cow, and camel, respectively. Consider a dog image (i.e., $y = e_1 = [1, 0, 0]^\top$). We have $\mu_{s,j} Q_{s,j} y|_{y=e_1} = \mu_{s,j} Q_{s,j}(1)$³ and further:

(a) In the ID distribution $\mathcal{D}_{id}$, $Q_{s,j}(1) = e_1$ and $\mu_{s,j} Q_{s,j} y|_{y=e_1} = \mu_{s,j} e_1 = \mu_{s,j}(1)$. Then $x_{s,j} \sim \mathcal{N}(\mu_{s,j}(1), \sigma^2 I_d)$, indicating that in $\mathcal{D}_{id}$ the background of the dog (i.e., $y = e_1$) is the floor (i.e., $\mu_{s,j}(1)$).

(b) In the OOD distribution $\mathcal{D}_{ood}$, $Q_{s,j}(1) = e_1$ with probability $1 - p$ and $Q_{s,j}(1) \sim \mathrm{Unif}\{e_1, e_2, e_3\}$ with probability $p$. Then we have the following:

$$\mu_{s,j} Q_{s,j} y|_{y=e_1} = \begin{cases} \mu_{s,j}(1), & \text{with probability } 1 - p, \\ \mathrm{Unif}\{\mu_{s,j}(1), \mu_{s,j}(2), \mu_{s,j}(3)\}, & \text{with probability } p, \end{cases}$$

indicating that in the OOD distribution the background of the dog (i.e., $y = e_1$) is the floor (i.e., $\mu_{s,j}(1)$) with probability $1 - p$ and is randomly drawn from floor, grass, and sand (i.e., $\mu_{s,j}(1)$, $\mu_{s,j}(2)$, and $\mu_{s,j}(3)$) with probability $p$. In other words, $p$ is the probability that the spurious correlation no longer holds, and a larger $p$ indicates a larger distributional shift.

³Specifically, $Q_{s,j} y|_{y=e_1} = Q_{s,j} [1, 0, 0]^\top = Q_{s,j}(1)$, where $Q_{s,j}(1)$ is the first column of $Q_{s,j}$.

Remark. Our data generation process extends the setting of Wald et al. (2022) and Rosenfeld et al. (2020) to a 3-class classification problem with multiple features. This extension aligns with the intuition behind popular multi-class datasets used in empirical studies on OOD generalization, such as FullColorMNIST, ColoredObject, and CifarMNIST (Zhang et al., 2021; Lin et al., 2022a; Zhou et al., 2022b;a; Ahmed et al., 2021). Take ColoredObject as an example: correlations between classes and background colors exist in the training dataset but fail with a certain probability in OOD.

Definition 2 (Individual models). Denote the whole invariant feature set as $\mathcal{V} := \{x_{v,i}\}_{i=1}^{d_v}$ and the spurious feature set as $\mathcal{S} := \{x_{s,j}\}_{j=1}^{d_s}$. Consider $\bar f = (\bar\Phi, \bar w)$ and $\tilde f = (\tilde\Phi, \tilde w)$. Suppose $\bar\Phi$ learns $\bar{\mathcal{V}} \subset \mathcal{V}$ and $\bar{\mathcal{S}} \subset \mathcal{S}$, and $\tilde\Phi$ learns $\tilde{\mathcal{V}} \subset \mathcal{V}$ and $\tilde{\mathcal{S}} \subset \mathcal{S}$. Denote $|\bar{\mathcal{V}}| = \bar n_v$, $|\bar{\mathcal{S}}| = \bar n_s$, $|\tilde{\mathcal{V}}| = \tilde n_v$, $|\tilde{\mathcal{S}}| = \tilde n_s$, $|\bar{\mathcal{V}} \cap \tilde{\mathcal{V}}| = n_{vo}$, and $|\bar{\mathcal{S}} \cap \tilde{\mathcal{S}}| = n_{so}$. Specifically, we have

$$x\bar\Phi = \sum_{x_v \in \bar{\mathcal{V}}} x_v + \sum_{x_s \in \bar{\mathcal{S}}} x_s, \quad \bar w = \arg\min_{v \in \mathbb{R}^{d \times 3}} \mathcal{R}_{id}(v, \bar\Phi),$$
$$x\tilde\Phi = \sum_{x_v \in \tilde{\mathcal{V}}} x_v + \sum_{x_s \in \tilde{\mathcal{S}}} x_s, \quad \tilde w = \arg\min_{v \in \mathbb{R}^{d \times 3}} \mathcal{R}_{id}(v, \tilde\Phi).$$

Definition 3 (Output space ensemble (OSE)). Given the two individual models defined in Definition 2, the prediction of the output space ensemble is $f_{ose}(x) = \frac{1}{2}\left(\bar w^\top (x\bar\Phi) + \tilde w^\top (x\tilde\Phi)\right)$.

The predicted class of the sample $(x, y)$ is the class with the maximum logit. Specifically, denote the logit as $\hat l = f(x)$. The predicted class is $\hat k = \arg\max_{h \in \{1,2,3\}} \hat l(h)$, where $\hat l(h)$ is the $h$th dimension of the logit $\hat l$. The model makes a correct prediction if $\mathbb{I}(e_{\hat k} = y)$ holds, where $\mathbb{I}$ is the indicator function. The accuracy is $\mathcal{A}(f) = \mathbb{E}_{x,y}[\mathbb{I}(e_{\hat k} = y)]$.
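As a small worked instance of the output space ensemble and the arg-max prediction rule, the following NumPy sketch applies them to the toy logits of Figure 1. The helper names (`ose_logits`, `predicted_class`, `accuracy`) are hypothetical and introduced only for illustration; classes are 0-indexed here, so index 0 is the dog class $e_1$.

```python
import numpy as np

def ose_logits(logits_a, logits_b):
    """Output space ensemble: the average of two models' logits."""
    return 0.5 * (np.asarray(logits_a, dtype=float) + np.asarray(logits_b, dtype=float))

def predicted_class(logits):
    """Index of the class with the maximum logit."""
    return int(np.argmax(logits))

def accuracy(logit_rows, labels):
    """Fraction of rows whose arg-max class matches the integer label."""
    preds = np.argmax(np.asarray(logit_rows, dtype=float), axis=1)
    return float(np.mean(preds == np.asarray(labels)))

# Toy OOD sample from Figure 1: a dog (class 0) with brown fur on grass.
logits_bar = [0.4, 0.6, 0.0]     # model relying on the background predicts cow
logits_tilde = [0.4, 0.0, 0.6]   # model relying on the fur predicts camel
logits_avg = ose_logits(logits_bar, logits_tilde)

single_model_preds = (predicted_class(logits_bar), predicted_class(logits_tilde))
ensemble_pred = predicted_class(logits_avg)
```

Each individual model is wrong on this sample, while the averaged logits put the largest mass on the true class, because the two spurious logits land on different wrong classes and are each halved by the average.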
We denote the OOD accuracy as Aood(f) = EQs Ex,y[I(eˆk = y)|Qs] , where we use Qs as a short hand for Qs,1, . . . , Qs,ds. We discuss the metric in Appendix D.4. We defer the analysis of ID accuracy to Appendix D.5 since we consider infinite samples and the ID accuracy of all considered models are all close to 1. Assumption 1 (Small Noise). Denote n v and n s as the the maximum number of invariant features and spurious features that a model can learn, respectively. We need the overall noise to be small to satisfy F K( 1 σ(n v+n s)) 1 ϵ, in which F is the cumulative distribution function of standard Gaussian random variable, and K refers to the class number (here we analyze the case K = 3). Remark. Since we impose random noise on each feature, e.g., xv,i = µv,i + z where z N(0, σ2Id) where Id is a d-dimensional identity matrix and d dv + ds, it is natural to assume the overall noise is controlled, e.g., we have ϵ 10 6 when K = 10, σ = 1/100, n v + n s = 20. Assumption 2 (Orthogonal features (Wald et al., 2022; Allen-Zhu & Li, 2020)). (1) µv,i(k) 2 = 1 and µs,j(k) 2 = 1 for i = 1, , dv, j = 1, , ds, k = 1, 2, 3. (2) vi(k) vi (k ) for any (i, k) = (i , k ), k, k = 1, 2, 3, vi, vi {µv,1, , µv,dv, µs,1, . . . , µs,ds}. 3.2 THEORETICAL RESULTS We first show the intuition on the simple Example 1 and then extend to the general setting in Def. 3: Example 1 (Illustrative examples). Consider that there are totally 4 invariant features {xv,i}4 i=1 and 6 spurious features {xs,j}6 j=1, and two individual models ( w, Φ) and ( w, Φ) learn nonoverlapped features as x Φ = P i=1,2 xv,i+P j=1,2,3 xs,j, and x Φ = P i=3,4 xv,i+P j=4,5,6 xs,j. Proposition 1 (Illustrative examples). Consider Example 1, suppose Assumption 1 and 2 hold, and there are infinite ID and OOD samples. Omitting small terms containing ϵ, we have Aood( f) = Aood( f) = 1 1 9p3, and Aood(fose) = 1 2p5 We can see that OSE improves OOD by Aood(fose) max{Aood( f), Aood( f)} > 1/81p3. 
Intuition of the proof (full proof in Appendix F.1). Let us consider the samples of the first class, $y = e_1 = [1, 0, 0]^\top$. Model $(\bar w, \bar\Phi)$ has $x\bar\Phi|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i} Q_{v,i}(1) + \sum_{j=1}^{3} \mu_{s,j} Q_{s,j}(1) + z$, where $z \sim \mathcal{N}(0, 5\sigma^2 I_d)$. By Lemma 5, we have $\bar w(k) = \sum_{i=1}^{2} \mu_{v,i}(k) + \sum_{j=1}^{3} \mu_{s,j}(k)$ for each class $k = 1, 2, 3$. Omitting the small noise term, the predicted logit for class $k$ is

$$\bar w(k)^\top (x\bar\Phi)|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i}(k)^\top \left(\mu_{v,i} Q_{v,i}(1)\right) + \sum_{j=1}^{3} \mu_{s,j}(k)^\top \left(\mu_{s,j} Q_{s,j}(1)\right).$$

The model will mistakenly predict $e_2$ on the samples with true label $e_1$ when $\bar w(1)^\top x\bar\Phi|_{y=e_1} < \bar w(2)^\top x\bar\Phi|_{y=e_1}$. This happens when the three events $\{Q_{s,j}(1) = e_2\}_{j=1}^{3}$ occur simultaneously in OOD (see Appendix D.7 for a detailed discussion). Each event occurs with a probability of $p/3$, resulting in a joint probability of $p^3/27$. This means that with a probability of $p^3/27$, we encounter an OOD scenario where the model $\bar f = (\bar w, \bar\Phi)$ incorrectly predicts almost all samples from the first class $e_1$ as the second class $e_2$. This failure occurs because all three spurious features happen to take values that are spuriously correlated with $e_2$ in the training dataset. In other words, the three spurious features dominate the prediction of $e_2$, overshadowing the two invariant features that predict the true label $e_1$.

For the OSE model, we have

$$\bar w(k)^\top (x\bar\Phi) + \tilde w(k)^\top (x\tilde\Phi)|_{y=e_1} = \sum_{i=1}^{4} \mu_{v,i}(k)^\top \left(\mu_{v,i} Q_{v,i}(1)\right) + \sum_{j=1}^{6} \mu_{s,j}(k)^\top \left(\mu_{s,j} Q_{s,j}(1)\right).$$

The model will mistakenly predict $e_2$ on the samples with true label $e_1$ when at least five of the six events $\{Q_{s,j}(1) = e_2\}_{j=1}^{6}$ occur simultaneously in OOD (see Appendix D.6 for details), whose probability is much smaller than that of $\bar f$. Intuitively, the failure probability of the averaged model is smaller because it utilizes more spurious features, which are less likely to all make the same mistake.

Proposition 2 (General Results for OSE). Consider Definitions 1-3, and suppose Assumptions 1-2 hold and there are infinite ID and OOD samples.
Omitting small constants involving $\epsilon$, we have

$$\mathcal{A}_{ood}(\bar f) = F_p\left(\frac{(1-p)\bar n_s + \bar n_v}{\sqrt{\bar n_s}}\right), \quad \mathcal{A}_{ood}(\tilde f) = F_p\left(\frac{(1-p)\tilde n_s + \tilde n_v}{\sqrt{\tilde n_s}}\right),$$
$$\mathcal{A}_{ood}(f_{ose}) = F_p\left(\frac{(1-p)(\bar n_s + \tilde n_s) + (\bar n_v + \tilde n_v)}{\sqrt{\bar n_s + \tilde n_s + 2 n_{so}}}\right).$$

Figure 4: (a) Illustration of $F_p(x)$; (b) $\mathcal{A}_{ood}(f_{ose}) - \mathcal{A}_{ood}(\bar f)$ in Example 2.

Here $F_p(x)$ is a cumulative distribution function (CDF) parameterized by $p$ as defined in Appendix F.2, which is monotonically increasing in $x$ as shown in Figure 4(a). Suppose the two individual models learn the same number of features with no overlap, i.e., $\bar n_v = \tilde n_v = n_v$, $\bar n_s = \tilde n_s = n_s$, and $n_{vo} = n_{so} = 0$. Then we have $\mathcal{A}_{ood}(f_{ose}) = F_p(\sqrt{2}\, t)$ and $\mathcal{A}_{ood}(\bar f) = \mathcal{A}_{ood}(\tilde f) = F_p(t)$, where $t = \frac{(1-p) n_s + n_v}{\sqrt{n_s}}$, indicating that $f_{ose}$ is better than $\bar f$ since $F_p(\cdot)$ is monotonically increasing.

Example 2. Consider $p = 0.9$ and two individual models that learn non-overlapping features, i.e., $n_{vo} = n_{so} = 0$. We fix $\bar n_v = 5$ and $\bar n_s = 20$, and vary $\tilde n_v = 0, 1, \ldots, 5$ and $\tilde n_s = 0, 1, \ldots, 20$.

Figure 4(b) illustrates $\mathcal{A}_{ood}(f_{ose}) - \mathcal{A}_{ood}(\bar f)$ on Example 2. $f_{ose}$ achieves better OOD performance than $\bar f$ in most cases. One exception is when $\tilde f$ is much weaker than $\bar f$: e.g., if $\bar f$ learns 5 invariant features but $\tilde f$ learns 0 invariant features, the ensemble model $f_{ose}$ is inferior to $\bar f$.

3.3 THE DIFFERENCE BETWEEN THE OUTPUT AND WEIGHT SPACE ENSEMBLE

The difference between the output space ensemble (OSE) and WSE (referred to as the OSE-WSE difference) is an open problem. Furthermore, the mysterious phenomenon of weight space ensembles outperforming output space ensembles in OOD scenarios has puzzled researchers (Wortsman et al., 2022; Wortsman et al.; Rame et al., 2022). We shed light on this with our bilinear theoretical model $w^\top x\Phi$:

Definition 4 (Weight space ensemble (WSE)). Given the two individual models defined in Definition 2, the prediction of the WSE is $f_{wse}(x) = \frac{1}{4}(\bar w + \tilde w)^\top x(\bar\Phi + \tilde\Phi)$.

In Appendix D.2, we show that the OSE-WSE difference in a 2-layer DNN is closely connected with the OSE-WSE difference captured by our models in Definitions 3 and 4.

Proposition 3 (General Results for WSE).
Consider Definitions 1-3, Assumptions 1-2, and infinite ID and OOD samples. Omitting small constants involving $\epsilon$, we have

$$\mathcal{A}_{ood}(f_{wse}) = F_p\left(\frac{(1-p)(\bar n_s + \tilde n_s + 2 n_{so}) + (\bar n_v + \tilde n_v + 2 n_{vo})}{\sqrt{\bar n_s + \tilde n_s + 14 n_{so}}}\right).$$

Comparing Propositions 2 and 3, we can see that the only difference between $\mathcal{A}_{ood}(f_{wse})$ and $\mathcal{A}_{ood}(f_{ose})$ lies in the terms involving the numbers of overlapped invariant and spurious features learned by the individual models, i.e., $n_{vo}$ and $n_{so}$. Specifically, when $\bar\Phi$ and $\tilde\Phi$ select no overlapped features, $f_{wse}$ and $f_{ose}$ make the same prediction, since $x\bar\Phi \perp \tilde w$ and $x\tilde\Phi \perp \bar w$ by Assumption 2, and further $(\bar w + \tilde w)^\top x(\bar\Phi + \tilde\Phi) \propto \bar w^\top x\bar\Phi + \tilde w^\top x\tilde\Phi$. When there are overlapped features: (a) for WSE, the coefficient of an overlapped feature is amplified by 2 in $\bar\Phi + \tilde\Phi$, and further amplified by 2 in $\bar w + \tilde w$, so the coefficient of the overlapped feature becomes 4 in $(\bar w + \tilde w)^\top x(\bar\Phi + \tilde\Phi)$; (b) for OSE, i.e., $\bar w^\top x\bar\Phi + \tilde w^\top x\tilde\Phi$, the coefficient of the overlapped feature is 2. See Appendix D.7.1 for a detailed discussion. In Appendix D.7.2, we provide conditions under which $f_{wse}$ outperforms $f_{ose}$, together with simulation results and supporting experiments. Our findings provide the first-ever explanation for the mysterious phenomenon of weight space ensembles outperforming output space ensembles in OOD.

3.4 EXPERIMENTAL VERIFICATION ON MULTICOLORMNIST

Previous efforts in the OOD community have focused on learning invariant features and discarding spurious features (Arjovsky et al., 2019). However, these approaches have not performed well on real-world datasets (Rosenfeld et al., 2020). This could be due to the requirements of invariant learning, such as the need for numerous domains (Rosenfeld et al., 2020), strong regularization (Zhou et al., 2022b), and the challenges posed by non-linearity, overparameterization, and optimization (Rosenfeld et al., 2020; Lin et al., 2022a; Chen et al., 2023c). In contrast, our findings show that learning diverse spurious features also helps with OOD generalization.
This approach, as shown in ensemble-based models, is easily implementable and has shown remarkable empirical success.

To further verify our findings, we construct MultiColorMNIST, a 10-class variant of CMNIST (Arjovsky et al., 2019) with 32 spurious features, following Definition 1. As shown in Figure 5, each sample in MultiColorMNIST consists of 32 color patches, each serving as a spurious feature. We train two neural networks, denoted as $f_{\theta_1}$ and $f_{\theta_2}$, with the same architecture but different initializations on MultiColorMNIST. The results in Table 1 show that the OSE model ($f_{\theta_1}(x) + f_{\theta_2}(x)$) improves OOD performance over the individual models ($f_{\theta_1}(x)$ and $f_{\theta_2}(x)$). In Appendix C.3, (1) we show that each individual model learns a subset of the spurious features in MultiColorMNIST and that OSE utilizes more diverse spurious features; (2) we construct SingleColorMNIST with only one spurious feature and show that OSE yields little performance gain since both individual models learn the same spurious feature (similar to the results in Rame et al. (2022)).

Figure 5: A sample from MultiColorMNIST.

| p | 0.70 | 0.75 | 0.80 | 0.85 | 0.90 |
|---|---|---|---|---|---|
| model 1 | 71.05 ± 1.04 | 60.07 ± 1.04 | 48.57 ± 0.92 | 36.93 ± 0.70 | 26.01 ± 0.45 |
| model 2 | 71.77 ± 0.94 | 60.75 ± 0.91 | 49.26 ± 0.83 | 37.74 ± 0.66 | 26.63 ± 0.42 |
| model ensemble | 78.64 ± 0.73 | 67.61 ± 0.80 | 55.25 ± 0.75 | 42.34 ± 0.64 | 29.28 ± 0.40 |

Table 1: OOD performance of the (output space) model ensemble on MultiColorMNIST. The spurious correlation is 1 and $1 - p$ in the training and testing sets, respectively. A larger $p$ indicates a larger distributional shift.

4 BALANCED AVERAGING (BANG)

Our previous results show that EBM can boost the OOD performance. An implicit requirement is that the scaling of the two models should be roughly the same. If the two models have different scalings, e.g., one model is much more confident than the other, the EBM improvement is weakened.

Proposition 4 (Imbalanced scaling weakens WSE). Consider Example 1, Definitions 1-4, and Assumptions 1-2.
Consider a WSE of two imbalanced models, $\bar{f} = (\bar{w}, \bar{\Phi})$ and $\tilde{f}_\lambda = (\lambda\tilde{w}, \lambda\tilde{\Phi})$, where $\lambda \geq 1$. Specifically, $f_{wse}(x) = 0.25(\bar{w} + \lambda\tilde{w})^\top x(\bar{\Phi} + \lambda\tilde{\Phi})$. We have

$$\mathcal{A}_{ood}(f_{wse})|_{\lambda > \sqrt{5}} - \mathcal{A}_{ood}(f_{wse})|_{\lambda = 1} < -\frac{34}{729}p^3.$$

See Appendix F.3 for proofs and Appendix D.8 for an illustration of the over-confidence characterized by $\lambda$. When $\lambda = 1$, indicating similar confidence levels between $\bar{f}$ and $\tilde{f}_\lambda$, the WSE is balanced. However, when $\lambda > \sqrt{5}$ and $\tilde{f}_\lambda$ is significantly more confident than $\bar{f}$, $f_{wse}$ becomes biased towards $\tilde{f}_\lambda$, resulting in a performance drop of over $34/729\,p^3$. Here we set $\lambda = \sqrt{5}$ for illustration purposes; similar results can be obtained for other $\lambda > 1$.

| Methods | Model Averaging | IN | IN-V2 | IN-R | IN-A | IN-S | ObjectNet | Avg OOD |
|---|---|---|---|---|---|---|---|---|
| Zero-shot (Wortsman et al., 2022) | No | 68.3 | 61.9 | 77.6 | 49.8 | 48.2 | 53.0 | 58.1 |
| Fine-tuning (Wortsman et al., 2022) | No | 81.3 | 70.9 | 65.6 | 36.7 | 46.3 | 49.6 | 53.8 |
| Fine-tuning (LS) | No | 82.0 | 72.3 | 63.3 | 38.3 | 46.5 | 51.1 | 54.3 |
| Fine-tuning (Mixup) | No | 83.0 | 72.7 | 66.4 | 43.7 | 48.8 | 52.4 | 56.8 |
| Fine-tuning (Mixup + LS) | No | 82.9 | 72.7 | 65.8 | 43.6 | 48.5 | 52.2 | 56.6 |
| WiSE-FT (Wortsman et al., 2022) | Yes | 81.7 | 72.8 | 78.7 | 52.2 | 53.9 | 57.3 | 63.0 |
| BANG (LS) | Yes | 82.1 | 73.3 | 78.2 | 55.2 | 53.7 | 58.9 | 63.9 |
| BANG (Mixup) | Yes | 81.5 | 73.0 | 79.5 | 57.9 | 54.5 | 58.7 | 64.7 |
| BANG (Mixup + LS) | Yes | 81.6 | 73.1 | 79.7 | 58.2 | 54.8 | 58.9 | 64.9 |

Table 2: Results of fine-tuning CLIP ViT-B/16 on ImageNet. LS is short for Label Smoothing. The performance of the baseline methods is taken from Table 8 of Wortsman et al. (2022).

Unfortunately, we find that WiSE-FT, which is the WSE of the pre-trained model (PM) and the fine-tuned model (FM), suffers from the imbalanced confidence issue. Specifically, we compare the PM and FM on their confidence and accuracy. The confidence is defined as the largest probability that a model assigns to a class (details in Appendix E.3).
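The confidence measure defined above can be sketched in a few lines. This is a minimal illustration, assuming access to raw logits; the function names and the toy logits are hypothetical and are not taken from the paper's code (the exact procedure is in Appendix E.3).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def confidence_and_accuracy(logits, labels):
    """Return (mean confidence, accuracy) for a batch.

    Confidence of a sample = largest class probability the model assigns.
    """
    probs = softmax(logits)
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    return float(confidence.mean()), float((predictions == labels).mean())

# Toy logits (hypothetical): large margins, but mostly wrong predictions,
# mimicking an over-confident model on OOD data.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [0.0, 0.0, 5.0],
                   [2.0, 0.0, 0.0]])
labels = np.array([0, 2, 1, 1])

mean_conf, acc = confidence_and_accuracy(logits, labels)
gap = mean_conf - acc  # a positive gap indicates over-confidence
```

A large positive `gap` corresponds to the situation described for the fine-tuned model, where average confidence far exceeds accuracy.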
Figure 6 shows that the fine-tuned model is highly over-confident, especially on OOD datasets; e.g., on ImageNet A the accuracy is only 0.37 while the average confidence is over 0.7. Such over-confidence magnifies the FM's incorrect predictions, leading to deteriorated OOD ensemble performance (details in Appendix E.6).

A direct fix for the over-confidence issue is to tune the temperature of the softmax of the fine-tuned model (Kumar et al., 2022). However, this method cannot be directly applied to WiSE-FT, since WiSE-FT ensembles model weights instead of outputs. Moreover, the temperature scaling tuned on the ID dataset (Kumar et al., 2022) fails to calibrate the fine-tuned model on OOD datasets, where over-confidence is more severe (results of Kumar et al. (2022) are in Appendix E.5-E.6). Therefore, we propose BAlaNced averaGing (BANG), which adopts label smoothing or Mixup during fine-tuning to prevent over-confidence and then averages the pre-trained model with the resulting fine-tuned model. (1) Label smoothing replaces the label of the true class (e.g., 1) with a smaller positive value (e.g., 0.8), while distributing the smoothing parameter (e.g., 0.2) evenly among the other classes (Müller et al., 2019). (2) Mixup (Zhang et al., 2017) generates new samples during fine-tuning by linearly mixing pairs of training data and their labels.

Figure 6: Comparison of confidence and accuracy between the zero-shot and fine-tuned models (plot title: "The comparison of zero-shot and finetuned model"; axis: accuracy; legend: zeroshot, finetune). Each dataset (IN, IN-A, IN-R, IN-S, IN-V2, and ObjectNet) is drawn with its own marker, e.g., + for IN-S.

We conduct experiments with CLIP ViT-B/16 (Radford et al., 2021). We impose Mixup or Label Smoothing when fine-tuning the pre-trained CLIP on ImageNet (IN), and test OOD performance on IN-V2, IN-R, IN-A, IN-S, and ObjectNet. Following Wortsman et al. (2022), BANG averages the pre-trained CLIP model and the model fine-tuned with LS and Mixup (details in Appendix E.4).
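The three ingredients of BANG described above (label smoothing, Mixup, and weight averaging) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the helper names, the Beta parameter `beta_param`, the averaging coefficient `alpha = 0.5`, and the toy "models" stored as dictionaries of arrays are all hypothetical; the actual training setup is described in Appendix E.4.

```python
import numpy as np

def smooth_labels(labels, num_classes, smoothing=0.2):
    """True class gets 1 - smoothing; the smoothing mass is split evenly
    among the other classes, as described in the text."""
    targets = np.full((len(labels), num_classes), smoothing / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - smoothing
    return targets

def mixup(inputs, targets, beta_param=0.2, rng=None):
    """Linearly mix each sample (and its soft label) with a randomly paired one."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(beta_param, beta_param)  # mixing ratio in [0, 1]
    perm = rng.permutation(len(inputs))
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_inputs, mixed_targets

def average_weights(pretrained, finetuned, alpha=0.5):
    """Weight space ensemble: parameter-wise interpolation of two models
    that share the same architecture (same keys and shapes)."""
    return {name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
            for name in pretrained}

# Toy usage with hypothetical data and parameters.
labels = np.array([0, 2])
targets = smooth_labels(labels, num_classes=3)          # rows: [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]
inputs = np.array([[1.0, 0.0], [0.0, 1.0]])
mixed_inputs, mixed_targets = mixup(inputs, targets, rng=np.random.default_rng(0))

pretrained = {"w": np.array([0.0, 2.0])}
finetuned = {"w": np.array([2.0, 4.0])}                 # stands in for the LS/Mixup fine-tuned model
averaged = average_weights(pretrained, finetuned)
```

In this sketch the calibration step happens entirely during fine-tuning (through the soft targets), so the final averaging step is the same parameter interpolation that WiSE-FT uses.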
The results in Table 2 show that BANG effectively improves the performance over WiSE-FT. Specifically, BANG (Mixup + LS), where both LS and Mixup are adopted, achieves 1.9% higher average OOD accuracy than WiSE-FT. Further experimental results in the appendix show that Mixup and Label Smoothing effectively alleviate the over-confidence of the fine-tuned model on both ID and OOD datasets.

Since Mixup and LS also improve the performance of the fine-tuned model, a curious reader may wonder whether the improvement of BANG comes from better calibration or merely from the improvement in the fine-tuned model. We conduct further investigation in Appendix E.6 to confirm the contribution of better calibration: (1) Dividing the weights of the vanilla fine-tuned model by various scalars significantly enhances the performance of weight averaging, which nearly matches the performance of BANG. (2) BANG can correct substantially more samples that are misclassified by the fine-tuned model. We also show in Appendix E.5 that BANG's effectiveness cannot be explained by other data augmentation methods.

ACKNOWLEDGEMENT

We are grateful for the insightful discussion with Yongqiang Chen, Damien Teney, Alexandra Rame, Pang Wei Koh, Mitchell Wortsman, Difan Zou, and Hao Wang. Thank you for your valuable inputs.

REFERENCES

Faruk Ahmed, Yoshua Bengio, Harm Van Seijen, and Aaron Courville. Systematic generalisation with group invariant predictions. In International Conference on Learning Representations, 2021.

Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, and Kush R Varshney. Empirical or invariant risk minimization? a sample complexity perspective. arXiv preprint arXiv:2010.16412, 2020.

Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.
Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Sgd with large step sizes learns sparse features. In International Conference on Machine Learning, pp. 903-925. PMLR, 2023.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Devansh Arpit, Huan Wang, Yingbo Zhou, and Caiming Xiong. Ensemble of averages: Improving model selection and boosting performance in domain generalization. Advances in Neural Information Processing Systems, 35:8265-8277, 2022.

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019.

Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning, 36:105-139, 1999.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 - mining discriminative components with random forests. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446-461. Springer, 2014.

Leo Breiman. Bagging predictors. Machine learning, 24:123-140, 1996.

Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning, pp. 18, 2004.

Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405-22418, 2021.

Yimeng Chen, Tianyang Hu, Fengwei Zhou, Zhenguo Li, and Zhi-Ming Ma. Explore and exploit the diverse knowledge in model zoo for domain generalization.
In International Conference on Machine Learning, pp. 4623-4640. PMLR, 2023a.

Yongqiang Chen, Wei Huang, Kaiwen Zhou, Yatao Bian, Bo Han, and James Cheng. Towards understanding feature learning in out-of-distribution generalization. arXiv preprint arXiv:2304.11327, 2023b.

Yongqiang Chen, Kaiwen Zhou, Yatao Bian, Binghui Xie, Bingzhe Wu, Yonggang Zhang, MA KAILI, Han Yang, Peilin Zhao, Bo Han, et al. Pareto invariant risk minimization: Towards mitigating the optimization dilemma in out-of-distribution generalization. In The Eleventh International Conference on Learning Representations, 2023c.

Xu Chu, Yujie Jin, Wenwu Zhu, Yasha Wang, Xin Wang, Shanghang Zhang, and Hong Mei. Dna: Domain generalization with diversified neural averaging. In International Conference on Machine Learning, pp. 4010-4034. PMLR, 2022.

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606-3613, 2014.

MMPreTrain Contributors. OpenMMLab's pre-training toolbox and benchmark. https://github.com/open-mmlab/mmpretrain, 2023.

Yihe Deng, Yu Yang, Baharan Mirzasoleiman, and Quanquan Gu. Robust learning with progressive data expansion against spurious correlation. arXiv preprint arXiv:2306.04949, 2023.

Thomas G Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21-23, 2000 Proceedings 1, pp. 1-15. Springer, 2000.

Thomas G Dietterich et al. Ensemble learning. The handbook of brain theory and neural networks, 2(1):110-125, 2002.

Qishi Dong, Awais Muhammad, Fengwei Zhou, Chuanlong Xie, Tianyang Hu, Yongxin Yang, Sung Ho Bae, and Zhenguo Li. Zood: Exploiting model zoo for out-of-distribution generalization. Advances in Neural Information Processing Systems, 35:31583-31598, 2022.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Zhili Feng, Anna Bair, and J Zico Kolter. Leveraging multiple descriptive features for robust few-shot image learning. arXiv preprint arXiv:2307.04317, 2023.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259-3269. PMLR, 2020.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119-139, 1997.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096-2030, 2016.

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665-673, 2020.

Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340-8349, 2021a.

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262-15271, 2021b.
Saachi Jain, Dimitris Tsipras, and Aleksander Madry. Combining diverse feature priors. In International Conference on Machine Learning, pp. 9802-9832. PMLR, 2022.

Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. arXiv preprint arXiv:2204.02937, 2022.

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pp. 554-561, 2013.

Ananya Kumar, Tengyu Ma, Percy Liang, and Aditi Raghunathan. Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift. In Uncertainty in Artificial Intelligence, pp. 1041-1051. PMLR, 2022.

Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European conference on computer vision (ECCV), pp. 624-639, 2018.

Weixin Liang, Yining Mao, Yongchan Kwon, Xinyu Yang, and James Zou. Accuracy on the curve: On the nonlinear correlation of ml performance between data subpopulations. 2023.

Yong Lin, Hanze Dong, Hao Wang, and Tong Zhang. Bayesian invariant risk minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16021-16030, 2022a.

Yong Lin, Shengyu Zhu, Lu Tan, and Peng Cui. Zin: When and how to learn invariance without environment partition? Advances in Neural Information Processing Systems, 35:24529-24542, 2022b.

John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pp. 7721-7735. PMLR, 2021.
Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.

Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512-523, 2020.

Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.

Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652-24663, 2020.

Seo Yeon Park and Cornelia Caragea. On the calibration of pre-trained language models using mixup guided by area under the margin and saliency. arXiv preprint arXiv:2203.07559, 2022.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947-1012, 2016.

Aahlad Puli, Lily H Zhang, Eric K Oermann, and Rajesh Ranganath. Out-of-distribution generalization in the presence of nuisance-induced spurious correlations. arXiv preprint arXiv:2107.00520, 2021.

Shikai Qiu, Andres Potapczynski, Pavel Izmailov, and Andrew Gordon Wilson. Simple and fast group robustness by automatic feature reweighting. arXiv preprint arXiv:2306.11074, 2023.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748-8763. PMLR, 2021.
Alexandre Ramé, Kartik Ahuja, Jianyu Zhang, Matthieu Cord, Léon Bottou, and David Lopez-Paz. Recycling diverse models for out-of-distribution generalization. arXiv preprint arXiv:2212.10445, 2022.

Alexandre Rame, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, and Matthieu Cord. Diverse weight averaging for out-of-distribution generalization. arXiv preprint arXiv:2205.09739, 2022.

Alexandre Rame, Kartik Ahuja, Jianyu Zhang, Matthieu Cord, Léon Bottou, and David Lopez-Paz. Model ratatouille: Recycling diverse models for out-of-distribution generalization. 2023.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pp. 5389-5400. PMLR, 2019.

Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. arXiv preprint arXiv:2010.05761, 2020.

Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. Domain-adjusted regression or: Erm may already learn features sufficient for out-of-distribution generalization. arXiv preprint arXiv:2202.06856, 2022.

Robert E Schapire. The strength of weak learnability. Machine learning, 5:197-227, 1990.

Robert E Schapire. The boosting approach to machine learning: An overview. Nonlinear estimation and classification, pp. 149-171, 2003.

Robert E Schapire. Explaining adaboost. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, pp. 37-52. Springer, 2013.

Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.

Baochen Sun and Kate Saenko.
Deep coral: Correlation alignment for deep domain adaptation. In Computer Vision - ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14, pp. 443-450. Springer, 2016.

Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton Van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior ood generalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16761-16772, 2022.

Junjiao Tian, Xiaoliang Dai, Chih-Yao Ma, Zecheng He, Yen-Cheng Liu, and Zsolt Kira. Trainable projected gradient method for robust fine-tuning. arXiv preprint arXiv:2303.10720, 2023.

Yoav Wald, Gal Yona, Uri Shalit, and Yair Carmon. Malign overfitting: Interpolation and invariance are fundamentally at odds. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022.

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.

Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965-23998. PMLR.

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959-7971, 2022.

Dinghuai Zhang, Kartik Ahuja, Yilun Xu, Yisen Wang, and Aaron Courville. Can subnetwork structure be the key to out-of-distribution generalization? In International Conference on Machine Learning, pp.
12356-12367. PMLR, 2021.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Jianyu Zhang and Léon Bottou. Learning useful representations for shifting tasks and distributions. In International Conference on Machine Learning, pp. 40830-40850. PMLR, 2023.

Jianyu Zhang, David Lopez-Paz, and Léon Bottou. Rich feature construction for the optimization-generalization dilemma. In International Conference on Machine Learning, pp. 26397-26411. PMLR, 2022.

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452-1464, 2017.

Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, and Tong Zhang. Model agnostic sample reweighting for out-of-distribution learning. In International Conference on Machine Learning, pp. 27203-27221. PMLR, 2022a.

Xiao Zhou, Yong Lin, Weizhong Zhang, and Tong Zhang. Sparse invariant risk minimization. In International Conference on Machine Learning, pp. 27222-27244. PMLR, 2022b.

CONTENTS

1 Introduction
2 Understanding Ensemble-based Models via Examining WiSE-FT
3 Analysis on Spurious Feature Diversification
  3.1 Theoretical Settings
  3.2 Theoretical Results
  3.3 The difference between the output and weight space ensemble
  3.4 Experimental Verification on MultiColorMNIST
4 BAlaNced averaGing (BANG)
A Social Impact
B Related Works
  B.1 A review on the existing methods
  B.2 On our difference with existing works
C Supportive Empirical Results for the Theory
  C.1 FalseFalseTrue Phenomenon
  C.2 Deep neural networks learn different features
  C.3 Experiments on MultiColorMNIST
    C.3.1 Increasing the Number of Ensemble
  C.4 Simulation
D Discussions, illustrations, and supportive results for the theoretical parts
  D.1 Discussion on the theoretical models
  D.2 Comparison on our model with a 2-layer DNN
  D.3 Illustration of the transformation matrix Q
  D.4 On the pessimism of worst-case theoretical analysis for OOD
  D.5 ID performance
  D.6 Intuition of OOD Performance Improvement of OSE
  D.7 The Difference Between WSE and OSE in OOD
    D.7.1 Explaining the difference between WSE and OSE
    D.7.2 The theoretical condition of WSE outperforming OSE
    D.7.3 Empirical Verification
  D.8 Illustrating the over-confidence
E More experimental details and results on BANG
  E.1 Details on ImageNet Variants
  E.2 Details of Places365, StanfordCars, DTD and Food101 (PSDF)
  E.3 Details on calculating the confidence
  E.4 Experimental Details
  E.5 More Results on BANG and Discussions
  E.6 WiSE-FT benefits significantly from better calibration
F Proofs
  F.1 Proof of Proposition 1
  F.2 Proof of Proposition 2
    F.2.1 Proof for Single Model
    F.2.2 Proof for Weight Space Ensemble
    F.2.3 Proof for Output Space Ensemble
    F.2.4 Case Study for K = 3
    F.2.5 Close Form of $F_p(\cdot)$
    F.2.6 Close Form of $G(n_v, n_s, n_{vo}, n_{so}, C)$
  F.3 Proof of Proposition 4
  F.4 Proof of Proposition 5
  F.5 Auxiliary Lemmas
G Illustrating the Theory of WiSE-FT
H Illustrating the Effectiveness of BANG Through the Lens "Accuracy on the Curve"

A SOCIAL IMPACT

We investigate how to enable machine learning models to generalize in OOD scenarios, which makes machine learning models more reliable in real-world applications.

B RELATED WORKS

B.1 A REVIEW ON THE EXISTING METHODS

Out-of-distribution generalization. Machine learning models are based on the I.I.D. (independent and identically distributed) assumption. However, the I.I.D. assumption is easily violated, since a model can readily encounter novel testing samples drawn from distributions different from the training distribution.
This is known as the out-of-distribution (OOD) generalization problem. Existing works find that model performance deteriorates dramatically under distributional shift. This is especially the case when the model relies on spurious features that are unstable in a new domain (Geirhos et al., 2020; Arjovsky et al., 2019; Deng et al., 2023). The OOD problem has attracted great attention in recent years and there is a rich line of work in this direction, such as Invariant Risk Minimization (IRM) (Arjovsky et al., 2019; Lin et al., 2022b), model averaging (Wortsman et al.; 2022; Ramé et al., 2022; Cha et al., 2021), and feature alignment methods (Ganin et al., 2016; Sun & Saenko, 2016; Li et al., 2018). Among them, IRM (Arjovsky et al., 2019) has gained significant attention from researchers and has inspired a large body of follow-up work. Recall that there are two kinds of features in OOD generalization: invariant features that can stably predict the labels, and spurious features whose correlation with the labels is unstable. IRM tries to build robust models by extracting only invariant features. IRM has strong theoretical guarantees in linear systems and a clear connection with causal theory. Nevertheless, IRM methods face challenges on large-scale real-world datasets, as it has been repeatedly observed that IRM cannot outperform ERM on various datasets. Several works have provided explanations for this: Rosenfeld et al. (2020) show that IRM needs a very large number of domains and lacks theoretical guarantees on non-linear models, Lin et al. (2022b) show that it is difficult to learn invariance without domain partition, and Lin et al. (2022a) and Chen et al. (2023c) show the difficulty of optimizing IRM objectives on deep neural networks.
In contrast, model averaging is exceptionally powerful and achieves SOTA performance on many benchmarks with various models and architectures (Cha et al., 2021; Wortsman et al., 2022; Wortsman et al.; Rame et al., 2022; Chu et al., 2022; Arpit et al., 2022).

Output and weight space ensemble. The ensemble of multiple models is a powerful idea that often leads to stronger predictive performance (Caruana et al., 2004; Dietterich, 2000; Bauer & Kohavi, 1999; Breiman, 1996). Conventional ensemble methods typically aggregate the outputs of models, which is known as the output space ensemble. Recent applications usually average the parameters of models that are obtained by fine-tuning the same pre-trained model (Wortsman et al.; 2022; Cha et al., 2021), which is known as the weight space ensemble. Averaging two models trained from scratch with different initializations often yields poor results (close to random guessing). In contrast, Neyshabur et al. (2020) find that fine-tuning two models from the same pre-trained initialization results in two different models that are connected via a linear path in weight space, along which the performance remains high. This is known as linear mode connectivity (Frankle et al., 2020). A notable difference between ensembles in ID and OOD studies is that the improvement from ensembling is much more significant in OOD than in ID. Wortsman et al. (2022) show that an ensemble of the fine-tuned model with the pre-trained model improves accuracy by nearly 1pp on ImageNet (the ID domain) and by over 6-8pp on the variants of ImageNet (the OOD domains). In fact, model averaging is still among the strongest methods for OOD generalization. It remains mysterious why averaging methods are so effective for OOD.

Theory of out-of-distribution generalization. Existing theory mostly focuses on the worst-case metric when analyzing OOD performance (Wald et al., 2022; Arjovsky et al., 2019; Rosenfeld et al., 2020; Puli et al., 2021; Zhou et al., 2022b).
The worst-case metric requires a model to be robust on any OOD testing distribution. Typically, a model that only uses invariant features can minimize the worst-case metric. However, as we discussed above, invariance learning is hard in practice and performs ineffectively on real-world datasets (Gulrajani & Lopez-Paz, 2020; Rosenfeld et al., 2020; Lin et al., 2022b). Theory based on the worst-case metric cannot explain the success of model averaging methods. Specifically, the average of two models can use the spurious features learned by each individual model, which the worst-case metric penalizes due to its pessimism, as described in Appendix D.4. In contrast, our theoretical results characterize the probability of model failure due to distributional shift, which successfully explains the experimental results.

B.2 ON OUR DIFFERENCE WITH EXISTING WORKS

Difference with existing works on learning diverse features. Several works promote feature diversity to enhance empirical OOD generalization performance via weight averaging (Chu et al., 2022; Rame et al., 2022; 2023), feature concatenation (Zhang & Bottou, 2023), boosted rich feature learning (Zhang et al., 2022; Chen et al., 2023b; Jain et al., 2022; Teney et al., 2022; Feng et al., 2023), and utilizing a model zoo (Dong et al., 2022; Chen et al., 2023a). Teney et al. (2022) train a set of diverse models and select the best one among them for OOD. Feng et al. (2023) propose to use an ensemble of prompts containing diverse descriptions of a class to perform classification via CLIP. Jain et al. (2022) train two models with different feature priors and then ensemble the predictions of these models.
However, existing explanations either do not distinguish between invariant and spurious features (Chu et al., 2022; Rame et al., 2022; 2023; Zhang & Bottou, 2023; Dong et al., 2022; Chen et al., 2023a), or focus only on learning the potentially missing invariant features (Chen et al., 2023b; Feng et al., 2023). In fact, according to the existing invariance learning perspective (Arjovsky et al., 2019), which argues that models relying on spurious features are prone to failure in OOD scenarios, methods that learn diverse features while also incorporating spurious features may not generalize effectively under distributional shift. In contrast, our spurious feature diversification viewpoint provides an explanation by characterizing why and when incorporating more diverse spurious features can improve OOD performance.

Difference with the existing theoretical results on ensemble and boosting in IID settings. There are existing explanations for the effectiveness of model ensembles in the IID setting, mainly from the perspective of the variance caused by over-fitting the label noise in the finite-sample case (Dietterich et al., 2002). Specifically, a model ensemble can have smaller prediction variance than each single model. In contrast, we consider the infinite-sample case, where the variance of the model due to fitting label noise is zero, so a model ensemble cannot bring significant IID improvement in this case. However, a model trained on infinite samples can still fail due to distributional shift (Arjovsky et al., 2019). This is because the model utilizes spurious features, which are also considered a kind of bias (Wald et al., 2022). Our results show that a model ensemble can reduce the risk of model failure and lead to better expected performance under distributional shift through spurious feature diversification. In other words, a model ensemble reduces the probability of model failure due to the bias.
This is a new result for the OOD problem, as shown in Propositions 1 and 2. Notably, Allen-Zhu & Li (2020) also consider ensembles in the IID setting; however, their theory cannot explain the OOD performance improvement of ESM models on the data in Definition 1, the FalseFalseTrue phenomenon, or the difference between weight and output space ensembles.

Another area related to this work is boosting. Boosting benefits from training multiple models, where each model corrects the mistakes made by the previous ones and each model may utilize a different subset of features. While previous studies on boosting mainly focused on ID scenarios (Schapire, 1990; Freund & Schapire, 1997; Schapire, 2013; 2003), we show that in the context of OOD, the performance improvement from using diverse features can be even more significant. This is because different irrelevant features can cause different errors when the distribution changes, and diversifying the features helps reduce the impact of each individual feature (as shown in Figure 1). By utilizing a diverse set of models, boosting allows us to take advantage of a wider range of features and deal effectively with the challenges posed by OOD situations.

Difference with existing explanations on the OOD performance of ensemble-based methods (EBM). There are some previous attempts to explain the effectiveness of EBM for OOD. Cha et al. (2021) show that the loss landscape changes under distributional shift and that model averaging can lead to flatter minima. However, as discussed in Rame et al. (2022), the upper bound of Cha et al. (2021) is uncontrolled and their analysis based on flat minima fails to explain many experimental results. Rame et al. (2022) decompose the OOD loss into bias, variance, and covariance terms, and show that the variance term can benefit from EBM. Different from the results of Rame et al.
(2022), which address only the variance term, our results provide a concise characterization of the overall OOD performance. Further, the results of Rame et al. (2022) cannot differentiate between weight and output space ensembles.

C SUPPORTIVE EMPIRICAL RESULTS FOR THE THEORY

C.1 FALSEFALSETRUE PHENOMENON

In this subsection, we take a deeper look at WiSE-FT (Wortsman et al., 2022), a popular model averaging method that averages the weights of the pre-trained and fine-tuned models. Wortsman et al. (2022) obtain the fine-tuned model by fine-tuning the pre-trained CLIP model on ImageNet. They show that the average of the pre-trained and fine-tuned models can outperform both of them on ImageNet (the ID dataset) as well as on five OOD datasets (ImageNet-V2, ImageNet-A, ImageNet-R, ImageNet-Sketch, and ObjectNet). We denote the pre-trained model as PM, the fine-tuned model as FM, and the averaged model as AM. To understand why model averaging is effective, we divide each dataset into eight groups of samples according to whether PM, FM, and AM make correct predictions. We use T/F to denote whether a model makes a correct prediction, i.e., T for True and F for False. For example, PM(T)-FM(F)-AM(T) denotes the group of samples on which the predictions of PM, FM, and AM are correct, wrong, and correct, respectively.

A simple explanation for the improvement from averaging two models is that when one model makes a mistake and the other is correct, the correct model can rectify the mistake. So we evaluate the performance on the group of data where one model makes a wrong prediction while the other makes a correct one, i.e., the group containing PM(T)-FM(F) and PM(F)-FM(T). We refer to this group as TF+FT for short in the following discussion. We also look into another subset, TT+FF, which contains PM(T)-FM(T) and PM(F)-FM(F).
Given a subset G of a dataset D, we use CorrectNum(G; f) to denote the number of samples in G that are correctly predicted by a model f, e.g., CorrectNum(TF+FT; PM) stands for the number of samples in TF+FT that are correctly classified by the pre-trained model PM. We propose the metric ImproveContri(G), which estimates how much better AM performs than PM and FM on the group G and how much the improvement on G contributes to the overall accuracy improvement on the whole dataset D:

$$\text{ImproveContri}(G) = \frac{\text{CorrectNum}(G; \text{AM}) - \max\{\text{CorrectNum}(G; \text{PM}),\ \text{CorrectNum}(G; \text{FM})\}}{|D|} \qquad (2)$$

For example, suppose D contains 1,000 samples and its subset G contains 200 samples. PM, FM, and AM correctly predict 120, 118, and 130 samples in G, i.e., CorrectNum(G; PM) = 120, CorrectNum(G; FM) = 118, and CorrectNum(G; AM) = 130. AM outperforms PM and FM by making 10 more correct predictions on G, and these 10 samples contribute a $\frac{10}{1{,}000} \times 100\% = 1.0\%$ accuracy improvement on the whole dataset. Note that ImproveContri(D) denotes the accuracy improvement of model averaging on the dataset D, which is also denoted as ImproveContri(ALL) in the following discussion.

The results of ImproveContri(TT+FF), ImproveContri(TF+FT), and ImproveContri(ALL) are illustrated in Figure 7(a). Surprisingly, we find that ImproveContri(G) is significant on TT+FF in all the datasets, which means that the averaged model AM can exceed PM and FM on the groups where PM and FM are both right or both wrong. Recall that the subset TT+FF contains four groups: PM(T)-FM(T)-AM(T), PM(T)-FM(T)-AM(F), PM(F)-FM(F)-AM(F), and PM(F)-FM(F)-AM(T). We further plot the ratios of the sample sizes of PM(T)-FM(T)-AM(F) and PM(F)-FM(F)-AM(T) over the size of the whole dataset, |D|, in Figure 7(b). We find that the ratio of PM(T)-FM(T)-AM(F) is nearly the same (about 0.5%) in all datasets. The group PM(F)-FM(F)-AM(T) is much larger than PM(T)-FM(T)-AM(F), especially in OOD datasets.
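The group bookkeeping and the ImproveContri metric can be sketched in a few lines. This is a minimal illustration, assuming per-sample correctness flags for PM, FM, and AM are already available as boolean arrays; the function names `improve_contri` and `group_masks` are hypothetical and not taken from the authors' code.

```python
import numpy as np


def improve_contri(pm_ok, fm_ok, am_ok, group):
    """ImproveContri(G) in percent, following Equation (2).

    pm_ok, fm_ok, am_ok: boolean arrays over the whole dataset D, True where
    the pre-trained, fine-tuned, and averaged model are correct.
    group: boolean mask selecting the subset G of D.
    """
    pm_ok, fm_ok, am_ok, group = (
        np.asarray(a, dtype=bool) for a in (pm_ok, fm_ok, am_ok, group)
    )
    correct_num = lambda ok: int(np.sum(ok & group))  # CorrectNum(G; f)
    gain = correct_num(am_ok) - max(correct_num(pm_ok), correct_num(fm_ok))
    return 100.0 * gain / pm_ok.size  # normalise by |D|, not by |G|


def group_masks(pm_ok, fm_ok):
    """Masks for TF+FT (PM and FM disagree), TT+FF (they agree), and ALL."""
    pm_ok = np.asarray(pm_ok, dtype=bool)
    fm_ok = np.asarray(fm_ok, dtype=bool)
    tf_ft = pm_ok ^ fm_ok
    return {"TF+FT": tf_ft, "TT+FF": ~tf_ft, "ALL": np.ones_like(tf_ft)}


# Worked example from the text: |D| = 1000, |G| = 200, and
# PM / FM / AM are correct on 120 / 118 / 130 samples of G.
n = 1000
group = np.zeros(n, dtype=bool); group[:200] = True
pm_ok = np.zeros(n, dtype=bool); pm_ok[:120] = True
fm_ok = np.zeros(n, dtype=bool); fm_ok[:118] = True
am_ok = np.zeros(n, dtype=bool); am_ok[:130] = True
example_value = improve_contri(pm_ok, fm_ok, am_ok, group)
print(example_value)  # prints 1.0
```

On TT+FF, PM and FM are correct on exactly the same samples, so the metric reduces to the ratio of PM(F)-FM(F)-AM(T) minus the ratio of PM(T)-FM(T)-AM(F).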
The size of the PM(F)-FM(F)-AM(T) group indicates that AM can make correct predictions on many samples where both PM and FM make wrong predictions when distributional shift occurs! Interestingly, we find that ImproveContri(TF+FT) is negative on some datasets, e.g., IN-R and IN-S. In Section 4 and Appendix E.6, we find that this is because the fine-tuned model is highly over-confident and dominates WiSE-FT even when it makes mistakes.

Remark: In Figure 2 (Left) of Section 2, we present the results of ImproveContri(TT+FF) to represent the samples where both individual models make incorrect predictions but the averaged model makes correct predictions. ImproveContri(TT+FF) is calculated as the group ratio of PM(F)-FM(F)-AM(T) minus that of PM(T)-FM(T)-AM(F). We use ImproveContri(TT+FF) instead of PM(F)-FM(F)-AM(T) because we believe that, for a certain proportion of the samples in PM(F)-FM(F)-AM(T), the averaged model corrects mistakes due to the randomness introduced by the non-linearity of deep neural networks (DNNs) during weight averaging. To approximate such randomness, we use the size of PM(T)-FM(T)-AM(F). This adjustment gives a more accurate approximation of the ratio of samples that the averaged model corrects due to its utilization of more diverse spurious features.

[Figure 7 panels. (Left) Percentage (%) of ImproveContri(TF+FT), ImproveContri(TT+FF), and ImproveContri(ALL) on IN, IN-V2, IN-R, IN-A, IN-S, and ObjNet. (Right) Group ratio (%) of PM(T)-FM(T)-AM(F) and PM(F)-FM(F)-AM(T) on the same datasets.]

Figure 7: A closer look at WiSE-FT, which averages the pre-trained CLIP and the model obtained by fine-tuning CLIP on ImageNet. Here ImageNet is regarded as the ID domain and the other 5 ImageNet variants are OOD domains, i.e., IN-V2 (ImageNet-V2), IN-R (ImageNet-R), IN-A (ImageNet-A), IN-S (ImageNet-Sketch), and ObjNet (ObjectNet). (Left) ImproveContri(G) is defined in Equation (2); it estimates how much better the AM (averaged model) performs than the PM (pre-trained model) and FM (fine-tuned model) on the group G and how much the improvement on G contributes to the overall accuracy improvement on the whole dataset D. (Right) The ratio of the sample sizes of PM(T)-FM(T)-AM(F) and PM(F)-FM(F)-AM(T) over the sample size of the whole dataset. Here PM(T)-FM(T)-AM(F) denotes the group where AM makes wrong predictions while both PM and FM make correct predictions; PM(F)-FM(F)-AM(T) denotes the group where AM makes correct predictions while both PM and FM make wrong predictions. Putting these two figures together, we can see that AM can correct many samples on which PM and FM make wrong predictions in OOD.

C.2 DEEP NEURAL NETWORKS LEARN DIFFERENT FEATURES

In Section 2, we have shown that the pre-trained and fine-tuned models use different features and that the averaged model can utilize more diverse features. Indeed, Allen-Zhu & Li (2020) provide empirical evidence (e.g., Figures 3 and 4 in Allen-Zhu & Li (2020)) that different deep neural networks with the same architecture use diverse features, even when trained on the same dataset (with different initializations). We reproduce their empirical observations in Figure 8 for ease of reference.

C.3 EXPERIMENTS ON MULTICOLORMNIST

MultiColorMNIST. We extend CMNIST (Arjovsky et al., 2019) to MultiColorMNIST, which is constructed following Definition 1. MultiColorMNIST contains 10 classes with 32 spurious features. Each image has 42 × 42 × 3 pixels. There are 32 patches in each image and each patch can take one of 10 colors. Figure ?? illustrates two samples from MultiColorMNIST. Specifically, the label of a sample is generated from the shape of the digit, and each color patch is perfectly correlated with the label. Let $C_i$ denote the $i$-th color patch for $i = 1, 2, \ldots, 32$.
Each $C_i$ takes one of the colors, and this color is perfectly correlated with $y$. For example, the 1st patch, i.e., $C_1$, always takes white on samples with label 5; the 2nd patch, i.e., $C_2$, always takes yellow on samples with label 5. Each $C_i$ is independently generated from the label $y$, and we have $C_i \perp C_j \mid y$ for $i \neq j$. See Figure 10 for a detailed illustration of the data generation process, which follows the theoretical Definition 1. In the OOD testing distribution, the spurious correlation can fail with probability $p$. For example, in OOD, samples with label 5 can randomly pick any color with probability $p$. The data generation process is analogous to the theoretical setting in Definition 1, where each patch is a spurious feature and each color is an attribute that the spurious feature can take.

SingleColorMNIST. We also introduce SingleColorMNIST for better comparison. SingleColorMNIST has 10 classes and each image has 42 × 42 × 3 pixels, the same as MultiColorMNIST. However, SingleColorMNIST contains only 1 spurious feature. In other words, the 32 patches in each image are the same. The spurious correlation is defined similarly to MultiColorMNIST. Figure 9 illustrates two samples from SingleColorMNIST.

Figure 8: Figures taken from Allen-Zhu & Li (2020), which show that different DNNs (with the same architecture) learn different features even when trained on the same dataset.

Experimental Details. We use the following configuration for both SingleColorMNIST and MultiColorMNIST. We use a 2-layer MLP with 64 hidden units to perform classification. We adopt Adam with learning rate $10^{-3}$ and batch size 100. We train for 5000 steps and report the performance at the last step. We train two individual models $\bar f$ and $\tilde f$ with different random initializations on MultiColorMNIST. We also evaluate the ensemble of the two models, i.e., $f_{ose}(x) = \bar f(x) + \tilde f(x)$. Each experiment is repeated for $n = 20$ random seeds.

Results.
We vary $p$ in MultiColorMNIST and compare the performance of the ensemble model with that of each individual model. Here $p$ is the probability that the spurious correlation no longer holds in the testing environment; a larger $p$ indicates a larger distributional shift. The results on MultiColorMNIST are summarized in Table 3. We can see that the model ensemble consistently improves the OOD performance. Figure 11 visualizes how much each model relies on each patch. Specifically, Figure 11 shows how much the model changes its prediction when we replace a patch with black. We can see that each individual model uses a different feature set and that the model ensemble uses more diverse features. Table 4 shows the results on SingleColorMNIST. We can see that the model ensemble cannot improve the OOD performance on SingleColorMNIST (since there is only one spurious feature in SingleColorMNIST, the model ensemble cannot utilize more diverse spurious features). Comparing Table 3 and Table 4, we can see that the performance of an individual model on MultiColorMNIST is higher than that on SingleColorMNIST for the same $p$. This is because the individual model already learns multiple spurious features (even though they form only a small subset of the whole feature set, as shown in Figure 11). This is also consistent with our theoretical result that diverse spurious features lead to better OOD performance.

Remark. Recall that weight space ensemble (WSE) needs to be conducted between the pre-trained and fine-tuned models, or between different fine-tuned models starting from the same pre-trained model (Wortsman et al., 2022; Frankle et al., 2020). Since we do not have a suitable pre-trained model for the synthetic dataset, we leave the investigation of WSE on MultiColorMNIST to future work.

Figure 9: Two samples from SingleColorMNIST. SingleColorMNIST has 10 classes and each sample contains 1 spurious feature.
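The patch-color generation described for MultiColorMNIST in C.3 can be sketched as follows. This is only an illustration of the sampling rule (keep the label-determined color with probability 1 - p, otherwise draw a uniformly random color). The per-patch label-to-color assignment is assumed here to be a random permutation, and the names `make_color_map` and `sample_patch_colors` are hypothetical; rendering the colored patches onto the 42 × 42 × 3 digit image is omitted.

```python
import numpy as np

N_CLASSES, N_PATCHES, N_COLORS = 10, 32, 10


def make_color_map(rng):
    """color_map[i, y] is the color index of patch i for label y (ID case).

    Assumption for illustration: each patch uses its own random permutation
    of the 10 colors, so every patch is perfectly correlated with the label.
    """
    return np.stack([rng.permutation(N_COLORS) for _ in range(N_PATCHES)])


def sample_patch_colors(labels, color_map, p, rng):
    """Sample the 32 patch colors for each label.

    With probability p, independently per patch, the spurious correlation
    fails and the patch takes a uniformly random color; p = 0 gives the ID
    distribution.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    id_colors = color_map[:, labels].T                # shape (n, N_PATCHES)
    broken = rng.random((n, N_PATCHES)) < p           # where the correlation fails
    random_colors = rng.integers(0, N_COLORS, size=(n, N_PATCHES))
    return np.where(broken, random_colors, id_colors)


rng = np.random.default_rng(0)
color_map = make_color_map(rng)
labels = rng.integers(0, N_CLASSES, size=2000)
ood_colors = sample_patch_colors(labels, color_map, p=0.8, rng=rng)
```

Because patches are sampled independently given the label, a shift that breaks one patch leaves the others informative, which is the property the ensemble exploits.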
Figure 10: The data generation process of MultiColorMNIST (follows Definition 1).

Figure 11: Visualization of the features used by each model and by the model ensemble.

| p | model 1 | model 2 | model ensemble |
|---|---|---|---|
| 0.10 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| 0.20 | 99.99 ± 0.00 | 99.99 ± 0.00 | 100.00 ± 0.00 |
| 0.30 | 99.91 ± 0.01 | 99.89 ± 0.04 | 99.99 ± 0.01 |
| 0.40 | 99.16 ± 0.11 | 99.19 ± 0.13 | 99.75 ± 0.03 |
| 0.50 | 95.84 ± 0.35 | 96.06 ± 0.35 | 98.13 ± 0.14 |
| 0.60 | 87.15 ± 0.74 | 87.56 ± 0.69 | 92.31 ± 0.41 |
| 0.70 | 71.05 ± 1.04 | 71.77 ± 0.94 | 78.64 ± 0.73 |
| 0.75 | 60.07 ± 1.04 | 60.75 ± 0.91 | 67.61 ± 0.80 |
| 0.80 | 48.57 ± 0.92 | 49.26 ± 0.83 | 55.25 ± 0.75 |
| 0.85 | 36.93 ± 0.70 | 37.74 ± 0.66 | 42.34 ± 0.64 |
| 0.90 | 26.01 ± 0.45 | 26.63 ± 0.42 | 29.28 ± 0.40 |

Table 3: Results on MultiColorMNIST (OOD accuracy, %).

| p | model 1 | model 2 | model ensemble |
|---|---|---|---|
| 0.1 | 91.04 ± 0.03 | 91.07 ± 0.05 | 91.04 ± 0.04 |
| 0.2 | 82.34 ± 0.02 | 82.39 ± 0.07 | 82.34 ± 0.04 |
| 0.3 | 72.93 ± 0.08 | 72.99 ± 0.10 | 72.93 ± 0.09 |
| 0.4 | 64.08 ± 0.09 | 64.21 ± 0.19 | 64.10 ± 0.11 |
| 0.5 | 54.89 ± 0.12 | 54.99 ± 0.18 | 54.88 ± 0.14 |
| 0.6 | 45.91 ± 0.16 | 46.09 ± 0.31 | 45.92 ± 0.19 |
| 0.7 | 37.39 ± 0.15 | 37.55 ± 0.27 | 37.39 ± 0.17 |
| 0.8 | 27.86 ± 0.19 | 28.06 ± 0.32 | 27.87 ± 0.22 |
| 0.9 | 19.28 ± 0.18 | 19.50 ± 0.34 | 19.29 ± 0.20 |

Table 4: Results on SingleColorMNIST (OOD accuracy, %).

C.3.1 INCREASING THE NUMBER OF ENSEMBLE

In Tables 1 and 3, we show that the ensemble of two models improves significantly over each individual model on MultiColorMNIST. In this part, we show that increasing the number of models in the ensemble yields even larger improvements. Specifically, Table 5 shows the results for different numbers of models in the ensemble. An ensemble number of 1 means that we consider a single model (in other words, no model ensemble is performed). An ensemble number of 16 means that we independently train 16 models with different initializations and use the ensemble of these 16 models to make predictions. We can see that increasing the ensemble number can significantly boost the OOD performance. For example, when p = 0.8, the OOD performance of a single model (ensemble number equals 1) is 49.33%.
The ensemble of two models achieves 55.92% OOD accuracy. The ensemble of 16 models increases the OOD accuracy to 64.85%! This also gives us a hint about the effectiveness of model soup, which averages multiple checkpoints trained with different hyper-parameters.

| Ensemble Number | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.85 | p = 0.90 |
|---|---|---|---|---|---|
| 1 | 71.66 ± 2.06 | 60.68 ± 2.23 | 49.33 ± 2.02 | 37.74 ± 1.58 | 26.74 ± 1.05 |
| 2 | 78.88 ± 1.24 | 68.34 ± 0.89 | 55.96 ± 0.77 | 42.91 ± 0.64 | 29.89 ± 0.63 |
| 4 | 84.39 ± 1.33 | 74.00 ± 1.26 | 62.04 ± 1.32 | 47.92 ± 1.17 | 32.74 ± 0.75 |
| 8 | 85.64 ± 1.22 | 75.73 ± 1.62 | 63.52 ± 1.61 | 49.15 ± 1.23 | 33.67 ± 0.93 |
| 16 | 86.76 ± 0.55 | 77.31 ± 0.87 | 64.85 ± 1.09 | 50.63 ± 0.69 | 34.47 ± 0.40 |

Table 5: Experiments on MultiColorMNIST. A larger p indicates larger distributional shift.

On the other hand, if the dataset contains only a single spurious feature, e.g., SingleColorMNIST, we find that increasing the ensemble number does not help the OOD performance. These results are included in Table 6.

| Ensemble Number | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.85 | p = 0.90 |
|---|---|---|---|---|---|
| 1 | 37.40 ± 0.11 | 32.64 ± 0.11 | 27.90 ± 0.14 | 23.32 ± 0.14 | 19.32 ± 0.14 |
| 2 | 37.32 ± 0.03 | 32.54 ± 0.05 | 27.78 ± 0.04 | 23.20 ± 0.04 | 19.18 ± 0.05 |
| 4 | 37.35 ± 0.09 | 32.57 ± 0.12 | 27.81 ± 0.13 | 23.22 ± 0.11 | 19.23 ± 0.12 |
| 8 | 37.30 ± 0.01 | 32.50 ± 0.01 | 27.74 ± 0.02 | 23.16 ± 0.02 | 19.16 ± 0.00 |
| 16 | 37.35 ± 0.08 | 32.56 ± 0.09 | 27.82 ± 0.12 | 23.22 ± 0.10 | 19.24 ± 0.11 |

Table 6: Experiments on SingleColorMNIST. A larger p indicates larger distributional shift.

C.4 SIMULATION

In this section, we run simulations to assess how well our theoretical results forecast the OOD accuracy. Following the data generation process in Definition 1, we consider four examples:

1. Example 1-1: $n_v = 2$, $n_s = 3$ in model 1; $n_v = 2$, $n_s = 3$ in model 2; overlapped feature number $n_{vo} = n_{so} = 0$; noise variance $\sigma = 0.01$; distribution shift probability $p = 0.9$.
2. Example 1-2: $n_v = 2$, $n_s = 3$ in model 1; $n_v = 2$, $n_s = 3$ in model 2; overlapped feature number $n_{vo} = n_{so} = 1$; noise variance $\sigma = 0.01$; distribution shift probability $p = 0.9$.
3. Example 2-1: $n_v = 5$, $n_s = 20$ in model 1; $n_v = 4$, $n_s = 20$ in model 2; overlapped feature number $n_{vo} = n_{so} = 0$; noise variance $\sigma = 0.01$; distribution shift probability $p = 0.9$.
4. Example 2-2: $n_v = 5$, $n_s = 20$ in model 1; $n_v = 5$, $n_s = 20$ in model 2; overlapped feature number $n_{vo} = 4$, $n_{so} = 1$; noise variance $\sigma = 0.01$; distribution shift probability $p = 0.9$.

In each example, we run 1000 simulations and report the mean OOD accuracy in Table 7. To be precise, the training data size is 20000 and the test data size is 10000 in each simulation.

| Example | Results | Model 1 | Model 2 | Model Average | Model Ensemble |
|---|---|---|---|---|---|
| Example 1-1 | Simulation | 0.866 | 0.866 | 0.974 | 0.974 |
| Example 1-1 | Theoretical | 0.865 | 0.865 | 0.973 | 0.973 |
| Example 1-2 | Simulation | 0.866 | 0.861 | 0.943 | 0.940 |
| Example 1-2 | Theoretical | 0.865 | 0.865 | 0.948 | 0.946 |
| Example 2-1 | Simulation | 0.940 | 0.894 | 0.978 | 0.978 |
| Example 2-1 | Theoretical | 0.941 | 0.910 | 0.980 | 0.980 |
| Example 2-2 | Simulation | 0.943 | 0.939 | 0.999 | 0.989 |
| Example 2-2 | Theoretical | 0.943 | 0.943 | 0.992 | 0.983 |

Table 7: Simulation for the OOD accuracy in different models.

Comparing the theoretical results with the simulation results, it is safe to say that our theoretical analysis, together with proper approximations, gives an effective estimate of the OOD accuracy.

D DISCUSSIONS, ILLUSTRATIONS, AND SUPPORTIVE RESULTS FOR THE THEORETICAL PARTS.

D.1 DISCUSSION ON THE THEORETICAL MODELS

Our theoretical models in Section 3 are designed to mimic modern deep learning architectures such as Vision Transformers (ViT) (Dosovitskiy et al., 2020). Figure 12 provides a comparison between our theoretical models and Vision Transformers. Similar to ViT, we process images as patches, where each patch corresponds to a specific feature denoted as Patch$_i$. Each Patch$_i$ is represented by a high-dimensional vector $x_i \in \mathbb{R}^d$ in the embedding space. Consequently, the whole feature is obtained by concatenating the embeddings of the patches, resulting in $x = [x_1, x_2, \ldots]$.
Assuming a total of $d_t$ features ($d_t = d_v + d_s$ in Section 3), we have $x \in \mathbb{R}^{d \times d_t}$. Notably, Allen-Zhu & Li (2020) also use a similar theoretical data model that concatenates patches to analyze convolutional neural networks, e.g., Figure 5 in Allen-Zhu & Li (2020). To simplify the model, we use a two-layer structure consisting of a binary feature mask $\Phi$ as the feature encoder and a linear classifier $w$, analogous to ViT, which uses a transformer feature encoder with an MLP classifier. This two-layer simplification has been widely employed in the OOD literature (Arjovsky et al., 2019; Rosenfeld et al., 2020; Zhou et al., 2022b; Peters et al., 2016; Lin et al., 2022b). The difference between our theoretical model and ViT is that ViT processes the features sequentially while we select the features at once.

The binary feature mask satisfies $\Phi \in \{0, 1\}^{d_t}$. For instance, if we have three features, i.e., $x = [x_1, x_2, x_3]$, and $\Phi = [1, 1, 0]$, the learned feature is $x\Phi = x_1 + x_2$. Considering a 3-class classification task, the linear classifier $w \in \mathbb{R}^{d \times 3}$ takes the learned feature $x\Phi$ as input and produces a 3-dimensional vector whose elements are the logits of the three classes. The classifier $w$ is optimized to minimize the in-distribution (ID) loss based on the learned feature. Therefore, we have

$$w = \arg\min_{v \in \mathbb{R}^{d \times 3}} R_{id}(v, \Phi),$$

where $R_{id}(v, \Phi)$ represents the loss of $(v, \Phi)$ in the ID distribution.

D.2 COMPARISON ON OUR MODEL WITH A 2-LAYER DNN

In this paper, we consider the model $w^\top x\Phi$, where $w \in \mathbb{R}^{d \times K}$ and $\Phi \in \{0, 1\}^{d_v + d_s}$ are parameters. Here the input is $x \in \mathbb{R}^{d \times (d_v + d_s)}$; see Appendix D.1 for a detailed discussion. We now compare our model with a general 2-layer DNN to see why it can capture the difference between weight space ensemble (WSE) and output space ensemble (OSE) in DNNs. Consider a general 2-layer DNN parameterized by $(W_a \in \mathbb{R}^{d_1 \times d_2}, W_b \in \mathbb{R}^{d_2 \times K})$ with ReLU activation $\delta(\cdot)$ and output $f_{dnn}(X) = W_b^\top \delta(W_a^\top X)$ for $X \in \mathbb{R}^{d_1}$.
Here we use uppercase $X$ to avoid confusion with our previous $x$, since they have slightly different dimensions (Appendix D.1). Since WSE is conducted on models that are close to a pre-trained model (Wortsman et al.), e.g., $(W_{a0}, W_{b0})$, we consider $f_{dnn}(X) = (W_{b0} + \Delta W_b)^\top \delta((W_{a0} + \Delta W_a)^\top X)$, where $\Delta W_a$ and $\Delta W_b$ are small and trainable. By Taylor expansion, we have

$$f_{dnn}(X) = \underbrace{W_{b0}^\top \delta(W_{a0}^\top X)}_{\text{(a) Fixed Term}} + \underbrace{\Delta W_b^\top \delta(W_{a0}^\top X) + W_{b0}^\top \delta'(W_{a0}^\top X)(\Delta W_a^\top X)}_{\text{(b) Linear Term}} + \underbrace{\Delta W_b^\top \delta'(W_{a0}^\top X)(\Delta W_a^\top X)}_{\text{(c) Bilinear Term}} + \xi,$$

where $\delta'(Y)$ is $\frac{\partial \delta(Y)}{\partial Y}$. Further, we incorporate the fact that the second-order derivative $\frac{\partial^2 \delta(Y)}{\partial Y^2}$ is zero almost everywhere for the ReLU activation function (except at $Y = 0$). Then $\xi$ is the error term induced by the non-linearity of the ReLU activation function (when $W_{a0}^\top X$ has some zero elements). To be precise, since we focus on the fine-tuning regime and $W_{a0}^\top X$ is not sparse in general, it is safe to say that $\xi$ is small.

WSE and OSE are exactly the same for the (a) fixed term and the (b) linear term. We will show that WSE and OSE differ on the (c) bilinear term, which is captured by our model in Definitions 3-4.

Figure 12: Comparison of our theoretical models in Section 3 with Vision Transformers (Dosovitskiy et al., 2020). Some parts of the figures are adopted from Dosovitskiy et al. (2020).

Consider two models, $\bar f_{dnn}$ and $\tilde f_{dnn}$, both close to the pre-trained model.
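Specifically,

$$\bar f_{dnn}(X) = (W_{b0} + \Delta \bar W_b)^\top \delta((W_{a0} + \Delta \bar W_a)^\top X); \qquad \tilde f_{dnn}(X) = (W_{b0} + \Delta \tilde W_b)^\top \delta((W_{a0} + \Delta \tilde W_a)^\top X).$$

Omitting the small error term $\xi$, the output space ensemble of $\bar f_{dnn}(X)$ and $\tilde f_{dnn}(X)$ is

$$f_{dnn,ose} = 0.5\left[(W_{b0} + \Delta \bar W_b)^\top \delta((W_{a0} + \Delta \bar W_a)^\top X) + (W_{b0} + \Delta \tilde W_b)^\top \delta((W_{a0} + \Delta \tilde W_a)^\top X)\right]$$

$$= \underbrace{W_{b0}^\top \delta(W_{a0}^\top X)}_{\text{(a) Fixed Term}} + \underbrace{0.5(\Delta \bar W_b + \Delta \tilde W_b)^\top \delta(W_{a0}^\top X) + W_{b0}^\top \delta'(W_{a0}^\top X)\left(0.5(\Delta \bar W_a + \Delta \tilde W_a)^\top X\right)}_{\text{(b) Linear Term}} + \underbrace{0.5\left[\Delta \bar W_b^\top \delta'(W_{a0}^\top X)(\Delta \bar W_a^\top X) + \Delta \tilde W_b^\top \delta'(W_{a0}^\top X)(\Delta \tilde W_a^\top X)\right]}_{\text{(c) Bilinear Term}},$$

and the weight space ensemble is

$$f_{dnn,wse} = \left(W_{b0} + 0.5(\Delta \bar W_b + \Delta \tilde W_b)\right)^\top \delta\left((W_{a0} + 0.5(\Delta \bar W_a + \Delta \tilde W_a))^\top X\right)$$

$$= \underbrace{W_{b0}^\top \delta(W_{a0}^\top X)}_{\text{(a) Fixed Term}} + \underbrace{0.5(\Delta \bar W_b + \Delta \tilde W_b)^\top \delta(W_{a0}^\top X) + W_{b0}^\top \delta'(W_{a0}^\top X)\left(0.5(\Delta \bar W_a + \Delta \tilde W_a)^\top X\right)}_{\text{(b) Linear Term}} + \underbrace{0.25(\Delta \bar W_b + \Delta \tilde W_b)^\top \delta'(W_{a0}^\top X)\left((\Delta \bar W_a + \Delta \tilde W_a)^\top X\right)}_{\text{(c) Bilinear Term}}.$$

Comparing $f_{dnn,ose}$ with $f_{dnn,wse}$, we can see that their difference lies in the bilinear term:

$$f_{dnn,wse} - f_{dnn,ose} = 0.25(\Delta \bar W_b + \Delta \tilde W_b)^\top \delta'(W_{a0}^\top X)\left((\Delta \bar W_a + \Delta \tilde W_a)^\top X\right) - 0.5\left[\Delta \bar W_b^\top \delta'(W_{a0}^\top X)(\Delta \bar W_a^\top X) + \Delta \tilde W_b^\top \delta'(W_{a0}^\top X)(\Delta \tilde W_a^\top X)\right]. \qquad (3)$$

The bilinear term difference has a clear analogy with our models in Definitions 3-4. Specifically, according to our definitions of OSE and WSE in Definitions 3-4, we have

$$f_{wse} - f_{ose} = 0.25(\bar w + \tilde w)^\top x(\bar \Phi + \tilde \Phi) - 0.5\left(\bar w^\top x \bar \Phi + \tilde w^\top x \tilde \Phi\right). \qquad (4)$$

Comparing Equation (3) and Equation (4), we can see that $w$ is analogous to $\Delta W_b$ and $\Phi$ is analogous to $\Delta W_a$. Equation (3) and Equation (4) differ by a scaling $\delta'(W_{a0}^\top X)$, which is a fixed matrix independent of the trainable parameters $(\Delta W_a, \Delta W_b)$.

D.3 ILLUSTRATION OF THE TRANSFORMATION MATRIX Q

Consider the 3-class classification problem. In the ID distribution, we have

$$Q_{s,j} = [e_1, e_2, e_3] = I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

This indicates that each spurious feature is perfectly correlated with the invariant feature, as illustrated in Figure 13 (left). For instance, $Q_{s,j} = I_3$ implies that the backgrounds of the dog, cow, and camel are floor, grass, and sand, respectively. In the OOD distribution, $Q_{s,j}$ is no longer equal to $I_3$, indicating that the correlation between animals and the background may fail with a certain probability.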
Figure 13 (right) illustrates $Q_{s,j}(1)$, which represents the first column of $Q_{s,j}$. $Q_{s,j}(1)$ can take the value $e_2$ with probability $p/3$, indicating that the background of the dog is grass in this case. Similarly, $Q_{s,j}(1)$ can take the value $e_3$ with probability $p/3$, indicating that the background of the dog is sand.

Figure 13: Illustrations of the matrix $Q_{s,j}$.

D.4 ON THE PESSIMISM OF WORST-CASE THEORETICAL ANALYSIS FOR OOD

D.5 ID PERFORMANCE

Recall that in Section 3 the OOD accuracy is defined by $\mathcal{A}_{ood}(f) = \mathbb{E}_{Q_s}\left[\mathbb{E}_{x,y}[\mathbb{I}(e_{\hat k} = y) \mid Q_s]\right]$. The ID accuracy $\mathcal{A}_{id}(f)$ is defined similarly by fixing $[Q_{s,1}, \ldots, Q_{s,d_s}] = [I, \ldots, I]$. According to Lemma 3, the ID accuracies of all models involved in Definition 2 and Examples 1-2 are larger than $1 - \epsilon$.

D.6 INTUITION OF OOD PERFORMANCE IMPROVEMENT OF OSE

We use Example 1 to show the main intuition of the output space ensemble (OSE). In Example 1, the two individual models learn non-overlapped features, so model ensemble and model averaging are the same. According to the proof in Appendix F.1, for samples from the first class, i.e., $y = e_1$, the predicted logit of each class is

$$\bar w(1)^\top x\bar\Phi|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i}(1)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{3} \mu_{s,j}(1)^\top (\mu_{s,j} Q_{s,j}(1)),$$

$$\bar w(2)^\top x\bar\Phi|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i}(2)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{3} \mu_{s,j}(2)^\top (\mu_{s,j} Q_{s,j}(1)),$$

$$\bar w(3)^\top x\bar\Phi|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i}(3)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{3} \mu_{s,j}(3)^\top (\mu_{s,j} Q_{s,j}(1)),$$

where we omit the noise term, whose impact on the accuracy is less than $\epsilon$ according to Lemma 3. Further, let $\mu$ denote any of the $\mu_{v,i}$ and $\mu_{s,j}$, and let $Q$ denote its corresponding transformation matrix. For example, $\mu = \mu_{s,j}$ and $Q = Q_{s,j}$. Suppose $Q(1) = e_{k_2}$; then we have

$$\mu(k_1)^\top (\mu Q(1)) = \mu(k_1)^\top (\mu e_{k_2}) = \mu(k_1)^\top \mu(k_2) = \begin{cases} 1, & \text{if } k_1 = k_2, \\ 0, & \text{otherwise.} \end{cases}$$

For the invariant features, $Q_{v,i}(1) = e_1$ always holds, and for the spurious features, $Q_{s,j}(1)$ takes $e_1$ with probability $1 - \frac{2p}{3}$ and takes $e_2$ or $e_3$ with probability $\frac{p}{3}$ each.
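Given this column distribution, the chance that spurious features outvote the invariant ones can be enumerated exactly. The sketch below assumes noise-free logits in which every learned feature adds 1 to the class it points to, with invariant features always pointing to $e_1$, as in the simplification above; the function name `prob_class2_beats_class1` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_class2_beats_class1(n_inv, n_spu, p):
    """Exact P(logit of class 2 > logit of class 1 | y = e_1).

    n_inv: number of learned invariant features (each always votes for e_1).
    n_spu: number of learned spurious features; each independently points to
           e_1 with probability 1 - 2p/3 and to e_2 or e_3 with probability p/3.
    p:     distribution shift probability, given as a Fraction for exactness.
    """
    col_prob = {1: 1 - 2 * p / 3, 2: p / 3, 3: p / 3}
    total = Fraction(0)
    # Enumerate every assignment of the spurious features to classes 1..3.
    for assignment in product((1, 2, 3), repeat=n_spu):
        logit1 = n_inv + assignment.count(1)
        logit2 = assignment.count(2)
        if logit2 > logit1:
            prob = Fraction(1)
            for k in assignment:
                prob *= col_prob[k]
            total += prob
    return total


p = Fraction(9, 10)
single = prob_class2_beats_class1(n_inv=2, n_spu=3, p=p)    # one model
combined = prob_class2_beats_class1(n_inv=4, n_spu=6, p=p)  # two disjoint models
```

For two invariant and three spurious features the enumeration gives $p^3/27$, and for four invariant and six spurious features it gives $7p^6/729$, so doubling the number of distinct spurious features sharply reduces this failure mode.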
For the individual model $(\bar\Phi, \bar w)$, the predicted logit of each class is therefore simply

$$\bar w(1)^\top x\bar\Phi|_{y=e_1} = 2 + \sum_{j=1}^{3} \mathbb{I}(Q_{s,j}(1) = e_1), \quad \bar w(2)^\top x\bar\Phi|_{y=e_1} = 0 + \sum_{j=1}^{3} \mathbb{I}(Q_{s,j}(1) = e_2), \quad \bar w(3)^\top x\bar\Phi|_{y=e_1} = 0 + \sum_{j=1}^{3} \mathbb{I}(Q_{s,j}(1) = e_3).$$

Let us consider the probability of $\bar w(2)^\top x\bar\Phi|_{y=e_1} > \bar w(1)^\top x\bar\Phi|_{y=e_1}$, i.e., the model $(\bar\Phi, \bar w)$ mistakenly predicts the second class $e_2$ even though the true class is $e_1$. This happens when $\{\mathbb{I}(Q_{s,j}(1) = e_2)\}_{j=1,2,3}$ hold simultaneously, whose probability is $\frac{p^3}{27}$. Intuitively, all 3 spurious features take the value in OOD that is correlated with the second class $e_2$, overwhelming the two invariant features correlated with $e_1$.

As for the averaged model, we have

$$(\bar w(1) + \tilde w(1))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = 4 + \sum_{j=1}^{6} \mathbb{I}(Q_{s,j}(1) = e_1), \quad (\bar w(2) + \tilde w(2))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = 0 + \sum_{j=1}^{6} \mathbb{I}(Q_{s,j}(1) = e_2), \quad (\bar w(3) + \tilde w(3))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = 0 + \sum_{j=1}^{6} \mathbb{I}(Q_{s,j}(1) = e_3).$$

The logit of the second class exceeds the logit of the first class if either of the following occurs:

- $\{\mathbb{I}(Q_{s,j}(1) = e_2)\}_{j=1}^{6}$ hold simultaneously, whose probability is $\frac{p^6}{729}$.
- Five of $\{Q_{s,j}(1)\}_{j=1}^{6}$ take $e_2$ and the remaining one takes $e_3$, i.e., $\sum_{j=1}^{6} \mathbb{I}(Q_{s,j}(1) = e_2) = 5$ and $\sum_{j=1}^{6} \mathbb{I}(Q_{s,j}(1) = e_3) = 1$. The probability of this event is $\frac{6p^6}{729}$.

The total probability is then $\frac{7p^6}{729}$. See Figure 14 for a visualization of the main intuition.

Figure 14: Comparison of the failure probability of the individual model and the averaged model. With probability $p^3/27$, the individual model $(\bar\Phi, \bar w)$ encounters an OOD distribution where it mistakenly predicts the second class $e_2$ on samples from the first class $e_1$. For the averaged model, this probability is on the order of $p^6/729$ (exactly $7p^6/729$). Refer to Appendix D.6 for a detailed explanation.

D.7 THE DIFFERENCE BETWEEN WSE AND OSE IN OOD

D.7.1 EXPLAINING THE DIFFERENCE BETWEEN WSE AND OSE

We use the following Example 3 to show the main intuition of the difference between model averaging and model ensemble.

Example 3. Two individual models learn overlapped features $x_{v,2}$ and $x_{s,3}$ as

$$x\bar\Phi = x_{v,1} + x_{v,2} + x_{s,1} + x_{s,2} + x_{s,3}, \qquad x\tilde\Phi = x_{v,2} + x_{v,3} + x_{s,3} + x_{s,4} + x_{s,5}.$$

Proposition 5.
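Consider Example 3. Suppose Assumptions 1 and 2 hold, there are infinite ID and OOD samples, and the averaged and ensemble models are defined as in Definition 3. Omitting small terms containing $\epsilon$, we have $\mathcal{A}_{ood}(\bar f) = \mathcal{A}_{ood}(\tilde f) = 1 - \frac{1}{9}p^3$ and Aood(fose) = 1 4p4 Aood(fwse) = 1 4p4

Full proof in Appendix F.4. In Example 3, the two individual models learn overlapped features, $x_{v,2}$ and $x_{s,3}$. By Lemma 5, for $k = 1, 2, 3$, we have

$$\bar w(k) = \sum_{i=1}^{2} \mu_{v,i}(k) + \sum_{j=1}^{3} \mu_{s,j}(k), \qquad \tilde w(k) = \sum_{i=2}^{3} \mu_{v,i}(k) + \sum_{j=3}^{5} \mu_{s,j}(k).$$

So we have

$$\bar w(k) + \tilde w(k) = \sum_{i=1,3} \mu_{v,i}(k) + 2\mu_{v,2}(k) + \sum_{j=1,2,4,5} \mu_{s,j}(k) + 2\mu_{s,3}(k).$$

For samples from the first class, we also have

$$x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{i=1,3} \mu_{v,i} Q_{v,i}(1) + 2\mu_{v,2} Q_{v,2}(1) + \sum_{j=1,2,4,5} \mu_{s,j} Q_{s,j}(1) + 2\mu_{s,3} Q_{s,3}(1) + \sum_i z_i,$$

where $z_i \sim \mathcal{N}(0, \sigma^2 I_d)$, $\forall i$. We then have

$$(\bar w(k) + \tilde w(k))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{i=1,3} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + 4\mu_{v,2}(k)^\top (\mu_{v,2} Q_{v,2}(1)) + \sum_{j=1,2,4,5} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1)) + 4\mu_{s,3}(k)^\top (\mu_{s,3} Q_{s,3}(1)) + \xi. \qquad (5)$$

Figure 15: (a) $\mathcal{A}_{ood}(f_{wse}) - \mathcal{A}_{ood}(\bar f)$ on Example 4; (b) $\mathcal{A}_{ood}(f_{wse}) - \mathcal{A}_{ood}(f_{ose})$ on Example 4.

As for model ensemble, we have

$$\bar w(k)^\top x\bar\Phi + \tilde w(k)^\top x\tilde\Phi = \left[\sum_{i=1,2} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1,2,3} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1))\right] + \left[\sum_{i=2,3} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=3,4,5} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1))\right] + \xi \qquad (6)$$

$$= \sum_{i=1,3} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + 2\mu_{v,2}(k)^\top (\mu_{v,2} Q_{v,2}(1)) + \sum_{j=1,2,4,5} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1)) + 2\mu_{s,3}(k)^\top (\mu_{s,3} Q_{s,3}(1)) + \xi. \qquad (7)$$

Comparing Equation (5) with Equations (6)-(7), we can see that:

- For model averaging, the overlapped features $\mu_{v,2}$ and $\mu_{s,3}$ (corresponding to $x_{v,2}$ and $x_{s,3}$) have their coefficients amplified by 2 in $\bar\Phi + \tilde\Phi$, and amplified by 2 again in $\bar w + \tilde w$. As a result, the coefficients of the overlapped features become 4 in $(\bar w + \tilde w)^\top x(\bar\Phi + \tilde\Phi)$.
- For model ensemble, i.e., $\bar w^\top x\bar\Phi + \tilde w^\top x\tilde\Phi$, the coefficients of the overlapped features are 2.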
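The coefficients 4 and 2 can be checked numerically on Example 3. This is a minimal noise-free sketch under two labeled assumptions: the class-mean matrices of different features occupy disjoint coordinates (so they are mutually orthogonal), and each classifier is the sum of the class-mean matrices of its selected features, as stated via Lemma 5 above. All helper names are hypothetical.

```python
import numpy as np

K = 3  # number of classes
names = ["v1", "v2", "v3", "s1", "s2", "s3", "s4", "s5"]
n = len(names)
d = n * K  # assumption: one orthogonal K-dimensional block per feature

# mu[i] is a d x K matrix whose k-th column is the class-k mean of feature i.
mu = [np.zeros((d, K)) for _ in range(n)]
for i in range(n):
    mu[i][i * K:(i + 1) * K, :] = np.eye(K)

# Binary feature masks of Example 3 (order follows `names`).
phi_bar = np.array([1, 1, 0, 1, 1, 1, 0, 0])    # v1, v2, s1, s2, s3
phi_tilde = np.array([0, 1, 1, 0, 0, 1, 1, 1])  # v2, v3, s3, s4, s5


def classifier(phi):
    """w = sum of the class-mean matrices of the selected features."""
    return sum(p * m for p, m in zip(phi, mu))


def coefficients(w, phi, k=0):
    """Per-feature contribution to the class-k logit when every feature
    takes its class-k value (noise omitted): phi_i * w(k)^T (mu_i e_k)."""
    e_k = np.eye(K)[:, k]
    return np.array([phi[i] * (w[:, k] @ (mu[i] @ e_k)) for i in range(n)])


w_bar, w_tilde = classifier(phi_bar), classifier(phi_tilde)

# Weight space ensemble (unnormalised): (w_bar + w_tilde)^T x (phi_bar + phi_tilde)
wse_coef = coefficients(w_bar + w_tilde, phi_bar + phi_tilde)
# Output space ensemble (unnormalised): w_bar^T x phi_bar + w_tilde^T x phi_tilde
ose_coef = coefficients(w_bar, phi_bar) + coefficients(w_tilde, phi_tilde)

for name, a, b in zip(names, wse_coef, ose_coef):
    print(f"{name}: WSE coefficient {a:.0f}, OSE coefficient {b:.0f}")
```

Only the overlapped features `v2` and `s3` differ between the two rows of coefficients, which is the bilinear effect that separates weight space from output space ensembling.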
D.7.2 THE THEORETICAL CONDITION OF WSE OUTPERFORMING OSE

Recall from Proposition 2 that we have

$$\mathcal{A}_{ood}(f_{wse}) = F_p\left(\frac{(1-p)(\bar n_s + \tilde n_s + 2n_{so}) + (\bar n_v + \tilde n_v + 2n_{vo})}{\sqrt{\bar n_s + \tilde n_s + 14 n_{so}}}\right), \qquad \mathcal{A}_{ood}(f_{ose}) = F_p\left(\frac{(1-p)(\bar n_s + \tilde n_s) + (\bar n_v + \tilde n_v)}{\sqrt{\bar n_s + \tilde n_s + 2 n_{so}}}\right).$$

A direct consequence of Proposition 2 is the following, which illustrates when model averaging can be more effective than model ensemble:

Proposition 6. Consider the models in Definition 2, and suppose Assumptions 1 and 2 hold and there are infinite ID and OOD samples. Suppose the numbers of features that $\bar\Phi$ and $\tilde\Phi$ learn are the same, i.e., $\bar n_v = \tilde n_v \doteq n_v$ and $\bar n_s = \tilde n_s \doteq n_s$, and denote $\rho_s \doteq n_{so}/n_s$ and $\rho_v \doteq n_{vo}/n_v$. Omitting small constants involving $\epsilon$, we have Aood(fwse) > Aood(fose) when ρv ρs > 3(1 p)ns Aood(fwse) Aood(fose) when ρv ρs 3(1 p)ns

As shown in Appendix D.7.1, the coefficient of an overlapped feature is 4 in model averaging and 2 in model ensemble. If $\bar\Phi$ and $\tilde\Phi$ learn more overlapped invariant features, model averaging puts more weight on the invariant features, leading to better OOD performance. In Figure 4 (c) and (d), we illustrate $\mathcal{A}_{ood}(f_{wse}) - \mathcal{A}_{ood}(\bar f)$ and $\mathcal{A}_{ood}(f_{wse}) - \mathcal{A}_{ood}(f_{ose})$ on the following example:

Example 4. Consider two models that learn the same number of features, i.e., fix $\bar n_v = \tilde n_v = 10$ and $\bar n_s = \tilde n_s = 20$, and vary $n_{vo} = 0, 1, \ldots, 5$ and $n_{so} = 0, 1, \ldots, 5$.

We can see that $f_{wse}$ achieves a larger OOD improvement over $f_{ose}$ when the two individual models learn more overlapped invariant features (e.g., larger $n_{vo}$) and fewer overlapped spurious features (e.g., smaller $n_{so}$). In Appendix D.7.2, we provide conditions under which $f_{wse}$ outperforms $f_{ose}$, discuss why this can happen easily in real-world datasets, and provide some preliminary experimental results.

Why does it easily happen in OOD on many real-world applications? Recall that there are in total $d_v$ invariant features and $d_s$ spurious features.
It is a common belief that spurious features are high-dimensional and invariant features are low-dimensional, i.e., $d_s \gg d_v$ (Arjovsky et al., 2019; Rosenfeld et al., 2020). Since the spurious features are high-dimensional, and Allen-Zhu & Li (2020) and Zhang & Bottou (2023) indicate that different models can learn different (limited-size) subsets of features, the overlap ratio of spurious features $\rho_s$ is relatively low. On the other hand, there are only a small number of invariant features, and recent studies (Rosenfeld et al., 2022; Qiu et al., 2023; Kirichenko et al., 2022) show that models always learn some invariant features for the fine-tuned task during ERM fine-tuning regardless of the presence of spurious features, so we conjecture that the overlap ratio of invariant features $\rho_v$ is relatively higher. However, we recognize that the claim that the overlap ratio of invariant features is larger than that of spurious features is not supported by rigorous proof; it remains a conjecture. Further research is necessary to provide more conclusive evidence and establish a solid foundation for this claim. In the next part, we conduct experiments to provide some initial support for this conjecture.

D.7.3 EMPIRICAL VERIFICATION

It is very difficult to verify Proposition 6 directly and empirically, for two reasons. First, for real-world datasets, it is hard to identify whether and how much a model relies on invariant or spurious features, and verifying Proposition 6 requires estimating how much two models rely on the same features. Second, for synthetic datasets such as CMNIST (Arjovsky et al., 2019), no suitable pre-trained models are available, whereas weight space ensemble needs to be conducted on models close to pre-trained models.

In this part, we design a preliminary experiment to get around the above obstacles. Consider the ensemble of two models: the pre-trained CLIP ($\bar f$) and the CLIP fine-tuned on ImageNet ($\tilde f$).
First, we use ImageNet variants (ImageNet-V2, ImageNet-Sketch, ImageNet-A, ImageNet-R, ObjectNet) for OOD performance evaluation. Recall that ImageNet variants share the same invariant features with ImageNet. Also, recent studies (Rosenfeld et al., 2022; Qiu et al., 2023; Kirichenko et al., 2022) show that ERM fine-tuned models always learn some invariant features for the fine-tuned task regardless of the presence of spurious features. So $\tilde f$ learns the invariant features for ImageNet variants. At the same time, the pre-trained CLIP $\bar f$ can stably perform zero-shot classification on ImageNet and its variants, indicating that $\bar f$ also learns good invariant features for ImageNet variants. According to the previous discussion, $\bar f$ and $\tilde f$ have some overlapped invariant features for ImageNet variants, leading to a better weight space ensemble than output space ensemble on ImageNet variants (shown in Figure 16 (Left)).

We then evaluate OSE and WSE on four other distinct datasets, i.e., Places365, Stanford Cars, DTD and Food101 (referred to as the PSDF datasets). These tasks have a different label space from ImageNet and contain different invariant features. In this case, the model $\tilde f$ fine-tuned on ImageNet learns few invariant features for the PSDF datasets. So the overlapped invariant features used by the pre-trained model $\bar f$ and the fine-tuned model $\tilde f$ are rather limited, indicating that $\rho_v$ is close to zero. Then, according to Proposition 6, WSE would be no better than OSE. This is consistent with the results in Figure 16 (Right).

Figure 16: Comparison of model ensemble and averaging. (Left) OOD performance on ImageNet variants. (Right) OOD performance on PSDF (Places365, Stanford Cars, DTD and Food101).

D.8 ILLUSTRATING THE OVER-CONFIDENCE

In Section 4, we use $\lambda$ to characterize the over-confidence of $f_\lambda = (\lambda w, \lambda\Phi)$. Specifically, we have $f_\lambda(x) = \lambda^2 w^\top x\Phi$. Denote $q := w^\top x\Phi$, which is a 3-dimensional vector for a 3-class classification problem.
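The quadratic dependence on λ follows from the model being linear in both the classifier and the feature extractor. The sketch below illustrates this with toy numbers; the weight matrix `W`, the feature vector `FEATURE`, and all function names are hypothetical and only stand in for $w$ and $x\Phi$. Scaling both parts by λ multiplies every logit by λ², which leaves the predicted class unchanged while raising the largest softmax probability.

```python
import math


def logits(w, feature):
    """q = w^T (x Phi): w holds one d-dim vector per class, feature is d-dim."""
    return [sum(a * b for a, b in zip(w_k, feature)) for w_k in w]


def softmax(q):
    """Numerically stable softmax over a list of logits."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_model_logits(w, feature, lam):
    """Logits of f_lambda = (lam * w, lam * Phi); x Phi is linear in Phi."""
    w_scaled = [[lam * a for a in w_k] for w_k in w]
    feature_scaled = [lam * b for b in feature]
    return logits(w_scaled, feature_scaled)


# Toy 3-class classifier on 2-dim features (values chosen for illustration).
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
FEATURE = [1.5, 0.5]

q = logits(W, FEATURE)
q_scaled = scaled_model_logits(W, FEATURE, lam=2.0)
print("logits:", q, "scaled logits:", q_scaled)
print("confidence:", max(softmax(q)), "scaled confidence:", max(softmax(q_scaled)))
```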
Consider an example, i.e., $q = [2, 1, 1]$. Recall that $q(k)$ is the $k$-th element of $q$ for $k = 1, 2, 3$. The predicted probability for the first class when $\lambda = 1$ is

$$\text{Probability of class 1} = \frac{\exp(q(1))}{\exp(q(1)) + \exp(q(2)) + \exp(q(3))} = \frac{\exp(2)}{\exp(2) + \exp(1) + \exp(1)} = 0.576.$$

The predicted class of $f_\lambda$ is the same for $\lambda > 1$ and $\lambda = 1$, whereas the predicted probability for the largest class is amplified when $\lambda > 1$. For example, for a suitable $\lambda > 1$,

$$\text{Probability of class 1} = \frac{\exp(\lambda^2 q(1))}{\exp(\lambda^2 q(1)) + \exp(\lambda^2 q(2)) + \exp(\lambda^2 q(3))} = \frac{\exp(2\lambda^2)}{\exp(2\lambda^2) + \exp(\lambda^2) + \exp(\lambda^2)} \approx 0.99.$$

So we can see that a larger $\lambda$ does not change the predicted class, but makes $f_\lambda$ more confident.

E MORE EXPERIMENTAL DETAILS AND RESULTS ON BANG

E.1 DETAILS ON IMAGENET VARIANTS

Details for ImageNet variants:

ImageNet-V2 (IN-V2): A recreated version of the ImageNet test set, but with a different data distribution.

ImageNet-R (IN-R): Renditions of 200 ImageNet classes, resulting in 30,000 images.

ImageNet Sketch (IN-S): Sketch-style images of the same categories as ImageNet, with a total of 50,000 images.

ObjectNet (ON): Objects in this dataset are captured in cluttered and natural environments at unusual poses.

ImageNet-A (IN-A): This dataset consists of naturally occurring images that are misclassified by a ResNet-50 model for 200 ImageNet classes.

Figure 17: Datasets on ImageNet and its variants (panels: ImageNet (Deng et al.), ImageNet-A (Hendrycks et al.), ImageNet-R (Hendrycks et al.), ImageNet V2 (Recht et al.), ObjectNet (Barbu et al.), ImageNet Sketch (Wang et al.)). For each dataset, we pick 4 samples of the class lemon and show illustrative images from each dataset. The dataset descriptions are similar to those of Wortsman et al. (2022).

E.2 DETAILS OF PLACES365, STANFORDCARS, DTD AND FOOD101 (PSDF)

Places365 (Zhou et al., 2017): A scene recognition dataset.
In this paper, we use the validation set from Places365-standard, which is composed of 36,000 validation images from 365 scene classes.

Stanford Cars (Krause et al., 2013): This dataset contains 196 classes of cars. Classes are typically at the level of Make, Model, Year, e.g., 2012 Tesla Model S or 2012 BMW M3 coupe. In this paper, we evaluate models on the test set, comprising 8,041 images.

Describable Textures Dataset (DTD) (Cimpoi et al., 2014): DTD is a texture database, organized according to a list of 47 categories inspired by human perception, such as banded, dotted and gauzy. In the paper, we use the test set with 40 images per class.

Food101 (Bossard et al., 2014): This dataset consists of 101 food categories. In the paper, we use the test set with 250 test images for each class.

E.3 DETAILS ON CALCULATING THE CONFIDENCE

Consider a $K$-class classification problem. Denote the $l$-th element of the output as $\mathrm{Prob}_l$, indicating the probability the model assigns to the $l$-th class. We have $\sum_{l=1}^{K} \mathrm{Prob}_l = 1$. The confidence is defined as

$$\text{Confidence} = \max\{\mathrm{Prob}_l\}_{l=1}^{K}.$$

E.4 EXPERIMENTAL DETAILS

We use the CLIP model ViT-B/16 (Radford et al., 2021). We fine-tune the pre-trained model on ImageNet. We use the AdamW optimizer with the default PyTorch AdamW hyper-parameters and a batch size of 512. We use a learning rate of $3 \times 10^{-5}$, gradient clipping at global norm 1, and fine-tune for a total of 10 epochs. These settings are the same as in Wortsman et al. (2022). For our method BANG, we try four smoothing values for LS (label smoothing): 0.05, 0.10, 0.15 and 0.20. We adopt 0.10 in our reported results in Table 2. Further results in Table 9 show that BANG is relatively insensitive to this hyper-parameter. We do not tune the hyper-parameters of Mixup (Zhang et al., 2017); we use the default hyper-parameters of MMPreTrain Contributors (2023).
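The confidence of E.3 and the label-smoothing targets mentioned above can be written out in a few lines. This is a minimal sketch rather than the training code: the function names are illustrative, and the smoothing convention used here (weight $1 - s$ on the one-hot label plus $s/K$ spread uniformly over all $K$ classes) is an assumption, since the text only states the smoothing values.

```python
import math


def softmax(logits):
    """Numerically stable softmax; returns probabilities summing to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(probs):
    """Confidence as defined in E.3: the largest class probability."""
    return max(probs)


def smooth_labels(true_class, num_classes, smoothing):
    """Label-smoothed target (assumed convention, see lead-in):
    (1 - smoothing) * one_hot + smoothing / num_classes.
    """
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


probs = softmax([2.0, 1.0, 1.0])
target = smooth_labels(true_class=0, num_classes=4, smoothing=0.10)
print("confidence:", confidence(probs))
print("smoothed target:", target)
```

With the smoothing value 0.10 adopted in the reported results, the target for the true class is capped below 1, which discourages the fine-tuned model from pushing its confidence toward 1.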
[Figure 18 panels, each comparing the zero-shot model with one fine-tuned model (recovered axis label: Accuracy, ticks 0.2 to 0.9): (a) the vanilla fine-tuned model; (b) the vanilla fine-tuned model calibrated on the ID dataset (Kumar et al., 2022); (c) the model fine-tuned with label smoothing; (d) the model fine-tuned with Mixup; (e) the model fine-tuned with both Mixup and LS.]

Figure 18: Comparison of confidence and accuracy between zero-shot and the model fine-tuned with different methods. The ID dataset is ImageNet; the five OOD datasets are ImageNet-A, ImageNet-R, ImageNet-Sketch (marked with +), ImageNet-V2 and ObjectNet, each drawn with its own marker.

| Methods | Model Averaging | IN | IN-V2 | IN-R | IN-A | IN-S | ObjectNet | Avg OOD |
|---|---|---|---|---|---|---|---|---|
| Zero-shot (Wortsman et al., 2022) | No | 68.3 | 61.9 | 77.6 | 49.8 | 48.2 | 53.0 | 58.1 |
| Fine-tuning (Wortsman et al., 2022) | No | 81.3 | 70.9 | 65.6 | 36.7 | 46.3 | 49.6 | 53.8 |
| Flip | No | 81.3 | 70.5 | 63.1 | 36.8 | 44.6 | 51.4 | 53.3 |
| Rotate | No | 81.4 | 70.7 | 65.2 | 35.6 | 45.3 | 49.5 | 53.3 |
| Color | No | 81.4 | 71.5 | 65.3 | 37.3 | 46.7 | 50.4 | 54.2 |
| Mixup | No | 83.0 | 72.7 | 66.4 | 43.7 | 48.8 | 52.4 | 56.8 |
| Flip | Yes | 81.8 | 72.7 | 78.2 | 52.9 | 53.6 | 58.4 | 63.1 |
| Rotate | Yes | 81.7 | 72.8 | 78.8 | 52.7 | 53.7 | 57.3 | 63.1 |
| Color | Yes | 81.7 | 72.9 | 78.5 | 53.2 | 54.2 | 58.2 | 63.4 |
| Mixup | Yes | 81.5 | 73.0 | 79.5 | 57.9 | 54.5 | 58.7 | 64.7 |

Table 8: Results of fine-tuning CLIP ViT-B/16 with flip, color, and rotation data augmentation on ImageNet.
[Figure 19 panels, each comparing the zero-shot model with one augmented fine-tuned model (recovered axis label: Accuracy, ticks 0.2 to 0.9): (a) with rotate augmentation; (b) with flip augmentation; (c) with color augmentation.]

Figure 19: Comparison of confidence and accuracy between zero-shot and the model fine-tuned with different data augmentations. The figure shows ImageNet and the five OOD datasets ImageNet-A, ImageNet-R, ImageNet-Sketch (marked with +), ImageNet-V2 and ObjectNet, each drawn with its own marker.

E.5 MORE RESULTS ON BANG AND DISCUSSIONS

Mixup and label smoothing can alleviate the over-confidence of the fine-tuned model. Comparing Figures 18c, 18d and 18e with Figure 18a, we can see that imposing Mixup and LS during fine-tuning alleviates the over-confidence of the fine-tuned model on both the ID dataset (ImageNet) and the OOD datasets, which is consistent with existing results (Park & Caragea, 2022).

Comparison with Calibrated Ensemble (Kumar et al., 2022). Kumar et al. (2022) calibrate the fine-tuned model on the ID dataset by searching for a temperature $T$ of the softmax. Figure 18b shows that the confidence of the calibrated fine-tuned model approximately equals its accuracy on the ID dataset (ImageNet). However, such a model is still highly over-confident on OOD datasets, e.g., the confidence is over 0.6 while the accuracy is lower than 0.4 on ImageNet-A, which is consistent with the findings in Ovadia et al. (2019) and also the discussion in Section 4.2 of Ovadia et al. (2019). So the scaling issue shown in Proposition 4 still exists on OOD datasets. Notably, Calibrated Ensemble (Kumar et al., 2022) itself cannot be directly applied to model averaging: model averaging merges the parameters of each layer.
However, Calibrated Ensemble only tunes the temperature of the softmax, which does not affect the lower layers, so the layers other than the output layer can still suffer from scaling issues. We try a direct adaptation of Kumar et al. (2022) to WiSE-FT: divide the weights of the last layer $w$ by a scalar (temperature) and then perform weight averaging. This also does not yield satisfactory results (Appendix E.6), for the reason discussed above.

Comparison between Mixup and other data augmentations. We also compare Mixup with other data augmentations. We fine-tune CLIP on ImageNet with flip, rotate, and color augmentation, respectively. We then perform weight averaging of each of these fine-tuned models with the pre-trained model, as Wortsman et al. (2022) do. Table 8 shows that flip, rotate, and color augmentation cannot enhance the performance of model averaging. Figure 19 also shows that these augmentation methods cannot alleviate the over-confidence of the fine-tuned model.

BANG is relatively insensitive to hyper-parameters. Table 9 shows the performance of BANG with different hyper-parameters of label smoothing. BANG is relatively insensitive to such hyper-parameters, e.g., the average OOD performance of BANG(Mixup+LS) remains at about 64.9% for all four hyper-parameters.

| Methods | Model Averaging | IN (ImageNet) | IN-V2 | IN-R | IN-A | IN-Sketch | ObjectNet | Avg OOD |
|---|---|---|---|---|---|---|---|---|
| Zero-shot (Wortsman et al., 2022) | No | 68.3 | 61.9 | 77.6 | 49.8 | 48.2 | 53.0 | 58.1 |
| Fine-tuning (Wortsman et al., 2022) | No | 81.3 | 70.9 | 65.6 | 36.7 | 46.3 | 49.6 | 53.8 |
| Fine-tuning(LS(0.05)) | No | 82.0 | 71.5 | 62.8 | 37.7 | 45.5 | 50.6 | 53.6 |
| Fine-tuning(LS(0.10)) | No | 82.0 | 72.3 | 63.3 | 38.3 | 46.5 | 51.1 | 54.3 |
| Fine-tuning(LS(0.15)) | No | 82.1 | 72.1 | 63.3 | 38.0 | 46.6 | 50.7 | 54.1 |
| Fine-tuning(LS(0.20)) | No | 82.1 | 72.1 | 62.8 | 36.9 | 46.2 | 50.5 | 53.7 |
| Fine-tuning(Mixup) | No | 83.0 | 72.7 | 66.4 | 43.7 | 48.8 | 52.4 | 56.8 |
| Fine-tuning(Mixup + LS(0.05)) | No | 83.0 | 73.2 | 65.9 | 43.9 | 48.5 | 52.3 | 56.7 |
| Fine-tuning(Mixup + LS(0.10)) | No | 82.7 | 73.0 | 66.4 | 43.3 | 48.6 | 52.4 | 56.8 |
| Fine-tuning(Mixup + LS(0.15)) | No | 82.9 | 72.7 | 65.8 | 43.6 | 48.5 | 52.2 | 56.6 |
| Fine-tuning(Mixup + LS(0.20)) | No | 82.9 | 73.2 | 66.4 | 44.6 | 48.5 | 52.4 | 57.0 |
| WiSE-FT (Wortsman et al., 2022) | Yes | 81.7 | 72.8 | 78.7 | 52.2 | 53.9 | 57.3 | 63.0 |
| BANG(LS(0.05)) | Yes | 82.2 | 73.0 | 78.1 | 54.7 | 53.8 | 58.3 | 63.6 |
| BANG(LS(0.10)) | Yes | 82.1 | 73.3 | 78.2 | 55.2 | 53.7 | 58.9 | 63.9 |
| BANG(LS(0.15)) | Yes | 82.0 | 73.2 | 78.1 | 55.0 | 53.4 | 58.9 | 63.7 |
| BANG(LS(0.20)) | Yes | 81.7 | 73.1 | 77.9 | 54.2 | 53.6 | 58.6 | 63.4 |
| BANG(Mixup) | Yes | 81.5 | 73.0 | 79.5 | 57.9 | 54.5 | 58.7 | 64.7 |
| BANG(Mixup + LS(0.05)) | Yes | 81.6 | 73.1 | 79.7 | 58.2 | 54.8 | 58.9 | 64.9 |
| BANG(Mixup + LS(0.10)) | Yes | 81.5 | 73.0 | 79.8 | 57.9 | 54.8 | 59.0 | 64.9 |
| BANG(Mixup + LS(0.15)) | Yes | 81.7 | 72.9 | 79.6 | 57.7 | 54.6 | 59.1 | 64.8 |
| BANG(Mixup + LS(0.20)) | Yes | 81.6 | 73.1 | 79.9 | 57.8 | 54.8 | 59.0 | 64.9 |

Table 9: Results of BANG with CLIP-B/16. We show different hyper-parameters of label smoothing. Mixup uses the default hyper-parameters of MMPreTrain Contributors (2023).

| WiSE-FT | Exp 1 | Exp 2 | BANG |
|---|---|---|---|
| 63.0% | 63.0% | 64.1% | 64.9% |

Table 10: WiSE-FT can benefit significantly from better calibration by scaling the fine-tuned model. (Exp 1)-(Exp 2) are described in Appendix E.6.

E.6 WISE-FT BENEFITS SIGNIFICANTLY FROM BETTER CALIBRATION

In Section 4, we theoretically show that WSE can suffer from the imbalance issue where the two individual models have different scaling. This can happen if one model is much more confident than the other. Unfortunately, we observe that the popular method WiSE-FT suffers from this issue.
Specifically, WiSE-FT averages the pre-trained model with the fine-tuned model. In Section 4, we show that the fine-tuned model is highly over-confident compared with the pre-trained model. We propose BANG, which averages the pre-trained model with the model fine-tuned with label smoothing (LS) or Mixup. Since LS and Mixup can also improve the fine-tuned performance, we conduct the following experiment to isolate the effect of better calibration from that of better fine-tuned performance.

Scale the fine-tuned model during weight space ensemble. A straightforward method to alleviate over-confidence is to tune the temperature of the softmax of the fine-tuned model (Kumar et al., 2022). However, this method cannot be directly applied to WiSE-FT, since WiSE-FT averages model weights instead of ensembling the outputs. We first apply a direct adaptation of Kumar et al. (2022) to WiSE-FT: divide the weights of the last layer $w$ by a scalar (temperature), which is equivalent to softmax tempering. However, recall that the model averaging $(\bar w + \tilde w)^\top x(\bar\Phi + \tilde\Phi)$ also suffers from the imbalance issue of $\Phi$. Specifically, Proposition 4 shows that the averaged feature can be biased towards $\tilde\Phi$ if the scaling of $\tilde\Phi$ is larger than that of $\bar\Phi$. So merely adjusting the weight of the classifier $w$ cannot alleviate this bias. The experimental result (Exp 1) in Table 10 also shows that merely re-scaling the classifier can hardly improve WiSE-FT.

In practice, we use a transformer (ViT-B/16) with 12 block layers and 1 linear layer. We obtain the averaged model $(\hat\theta, \hat w)$ as follows:

(Exp 1) Re-scale the classifier of the FM (fine-tuned model, $(\tilde\theta, \tilde w)$) during averaging, i.e., $\hat\theta = 0.5(\bar\theta + \tilde\theta)$ and $\hat w = (1-\alpha)\bar w + \alpha\tilde w$.

(Exp 2) Re-scale the whole network of the FM, i.e., $\hat\theta = (1-\alpha)\bar\theta + \alpha\tilde\theta$ and $\hat w = (1-\alpha)\bar w + \alpha\tilde w$.

We search for the best $\alpha$ among 0.2-0.5 (with interval 0.1) for each OOD dataset.
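The two averaging schemes can be sketched on plain dictionaries of parameters. This is an illustration under stated assumptions, not the experimental code: parameters are represented as dicts of float lists, the names `exp1`, `exp2` and `interpolate` are hypothetical, and α is assumed to be the weight placed on the fine-tuned model (so α below 0.5 scales the fine-tuned model down).

```python
def interpolate(pre, fine, alpha):
    """Return (1 - alpha) * pre + alpha * fine for every named parameter."""
    return {
        name: [(1.0 - alpha) * p + alpha * f for p, f in zip(pre[name], fine[name])]
        for name in pre
    }


def exp1(pre_backbone, fine_backbone, pre_head, fine_head, alpha):
    """Exp 1: plain 0.5/0.5 averaging of the backbone, alpha only on the classifier."""
    return (
        interpolate(pre_backbone, fine_backbone, 0.5),
        interpolate(pre_head, fine_head, alpha),
    )


def exp2(pre_backbone, fine_backbone, pre_head, fine_head, alpha):
    """Exp 2: the same alpha on the whole network (backbone and classifier)."""
    return (
        interpolate(pre_backbone, fine_backbone, alpha),
        interpolate(pre_head, fine_head, alpha),
    )


# Toy parameters (illustrative values only).
pre_backbone, fine_backbone = {"block0": [0.0, 2.0]}, {"block0": [1.0, 4.0]}
pre_head, fine_head = {"w": [0.0]}, {"w": [10.0]}

# Search grid stated in the text: 0.2 to 0.5 with interval 0.1.
ALPHA_GRID = [0.2, 0.3, 0.4, 0.5]

backbone1, head1 = exp1(pre_backbone, fine_backbone, pre_head, fine_head, alpha=0.2)
backbone2, head2 = exp2(pre_backbone, fine_backbone, pre_head, fine_head, alpha=0.2)
print(backbone1, head1)
print(backbone2, head2)
```

The only difference between the two functions is whether the backbone is interpolated with the fixed weight 0.5 or with α, which is the contrast the experiment is designed to isolate.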
The results in Table 10 show that merely scaling the fine-tuned model to alleviate its over-confidence can significantly improve the performance of WiSE-FT.

Figure 20: (Left) Margins of WiSE-FT, where the fine-tuned model is obtained through vanilla fine-tuning. (Right) Margins of BANG, where the fine-tuned model is obtained through fine-tuning with Mixup+LS.

| | $N_1/N_2$ |
|---|---|
| WiSE-FT | 10.6% |
| BANG | 13.4% |

Table 11: The ratio (versus the entire dataset) of samples where model averaging can correct the prediction in the PM(T)-FM(F) group. $N_1$ denotes the number of samples where model averaging can correct the prediction in the PM(T)-FM(F) group and $N_2$ denotes the total sample size of the dataset.

BANG can correct more samples on which the fine-tuned model makes mistakes. In this part, we denote the pre-trained model as PM, the fine-tuned model as FM, and the averaged model as AM. We divide each dataset into four groups of samples according to whether the PM and the FM make correct predictions, respectively. We further use T/F to denote whether a model makes correct predictions, i.e., T for True and F for False. For example, we use PM(T)-FM(F) to denote the group of samples on which the predictions of PM and FM are correct and wrong, respectively. We visualize the average margin of the fine-tuned and pre-trained models on the four groups, i.e., PM(T)-FM(T), PM(T)-FM(F), PM(F)-FM(T), and PM(F)-FM(F). The margin is the difference between the probability assigned to the correct class and the maximum probability among the wrong classes, i.e.,

$$\text{Margin} = \mathrm{Prob}(l) - \max_{k \neq l} \mathrm{Prob}(k),$$

where $l$ is the true class, $k = 1, \ldots, K$, and $\mathrm{Prob}(k)$ is the probability that a model assigns to class $k$. Averaging models with negative and positive margins can potentially correct mistakes. Figure 20 (Left) and (Right) visualize the margins of the pre-trained and fine-tuned models on each group of each dataset for WiSE-FT and BANG.
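The margin and the four-way grouping defined above are simple to compute per sample. The following is a minimal sketch with hypothetical function names and toy probabilities; it only restates the definitions and is not the evaluation code behind Figure 20.

```python
def margin(probs, true_class):
    """Probability of the true class minus the largest wrong-class probability."""
    wrong = [p for k, p in enumerate(probs) if k != true_class]
    return probs[true_class] - max(wrong)


def predicted_class(probs):
    """Index of the largest probability (first index on ties)."""
    return max(range(len(probs)), key=probs.__getitem__)


def group(pm_probs, fm_probs, true_class):
    """Group label such as 'PM(T)-FM(F)' from the two models' correctness."""
    pm_ok = predicted_class(pm_probs) == true_class
    fm_ok = predicted_class(fm_probs) == true_class
    return f"PM({'T' if pm_ok else 'F'})-FM({'T' if fm_ok else 'F'})"


# Toy sample with true class 0: PM is right, FM is confidently wrong.
pm = [0.6, 0.3, 0.1]
fm = [0.2, 0.7, 0.1]
print(group(pm, fm, 0), margin(pm, 0), margin(fm, 0))
```

In this toy sample the two margins sum to a negative number, which is the situation in which the confidently wrong model tends to dominate the combined prediction.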
In WiSE-FT, the fine-tuned model exhibits significantly negative margins in the PM(T)-FM(F) group. Specifically, Margin(PM) + Margin(FM) on the group PM(T)-FM(F) is negative for WiSE-FT in Figure 20 (Left), indicating that the fine-tuned model dominates WiSE-FT even when the fine-tuned model makes mistakes. This also explains why, in some datasets, e.g., IN-R and IN-A, ImproveContri(TF+FT) is negative, as shown in Figure 7. However, in BANG, Margin(PM) + Margin(FM) on the group PM(T)-FM(F) is positive on average, as shown in Figure 20 (Right), suggesting that BANG is capable of correcting more mistakes within the PM(T)-FM(F) group.

Table 11 shows the ratio (versus the entire dataset) of samples where model averaging can correct the prediction in the PM(T)-FM(F) group. Specifically, let $N_1$ denote the number of samples where model averaging can correct the prediction in the PM(T)-FM(F) group and $N_2$ denote the total sample size of the dataset; Table 11 compares $N_1/N_2$ (averaged over 5 OOD datasets) for WiSE-FT and BANG. It shows that BANG can correct substantially more mistakes made by the fine-tuned model.

F.1 PROOF OF PROPOSITION 1

Proof. (a) Two individual models. Recall that in the 3-class classification problem, $w = [w(1), w(2), w(3)] \in \mathbb{R}^{d \times 3}$. We first solve for $\bar w$ on the infinite ID samples. By Lemma 5, for $k = 1, 2, 3$, we have

$$\bar w(k) = \sum_{i=1}^{2} \mu_{v,i} Q_{v,i}(k) + \sum_{j=1}^{3} \mu_{s,j} Q_{s,j}(k) = \sum_{i=1}^{2} \mu_{v,i}(k) + \sum_{j=1}^{3} \mu_{s,j}(k),$$

where the last equality holds because $Q_{v,i} = I$ always holds and $Q_{s,j} = I$ in the ID distribution. Then $Q_{v,i}(k) = e_k$ and $Q_{s,j}(k) = e_k$ (recall that $Q(k)$ is the $k$-th column of the $3 \times 3$ matrix $Q$). Then for each $\mu$, $\mu Q(k) = \mu e_k = \mu(k)$, where $\mu(k)$ is the $k$-th column of the $d \times 3$ matrix $\mu$. Similarly, we have

$$\tilde w(k) = \sum_{i=3}^{4} \mu_{v,i}(k) + \sum_{j=4}^{6} \mu_{s,j}(k).$$

We first look at the model $(\bar w, \bar\Phi)$ and consider the OOD accuracy on the samples from the first class, $k = 1$. For each sample from the first class in OOD, we have

$$x\bar\Phi|_{y=e_1} = \sum_{i=1}^{2} \mu_{v,i} Q_{v,i}(1) + \sum_{j=1}^{3} \mu_{s,j} Q_{s,j}(1) + \sum_{i=1}^{5} z_i,$$

where $z_i \sim \mathcal{N}(0, \sigma^2 I_d)$, $\forall i$.
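The model $(\bar w, \bar\Phi)$ makes a correct prediction on the samples from $y = e_1$ if the following holds:

$$\bar w(1)^\top x\bar\Phi|_{y=e_1} > \bar w(2)^\top x\bar\Phi|_{y=e_1}, \quad \text{and} \quad \bar w(1)^\top x\bar\Phi|_{y=e_1} > \bar w(3)^\top x\bar\Phi|_{y=e_1}.$$

So for each OOD sample, we have

$$\begin{aligned}
\bar w(1)^\top x\bar\Phi|_{y=e_1}
&= \Big(\sum_{i=1}^{2} \mu_{v,i}(1) + \sum_{j=1}^{3} \mu_{s,j}(1)\Big)^\top \Big(\sum_{i=1}^{2} \mu_{v,i} Q_{v,i}(1) + \sum_{j=1}^{3} \mu_{s,j} Q_{s,j}(1) + \sum_{i=1}^{5} z_i\Big) \\
&= \sum_{i=1}^{2} \mu_{v,i}(1)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{3} \mu_{s,j}(1)^\top (\mu_{s,j} Q_{s,j}(1)) + \xi \\
&= 2 + \sum_{j=1}^{3} \underbrace{\mu_{s,j}(1)^\top (\mu_{s,j} Q_{s,j}(1))}_{A_j} + \xi,
\end{aligned}$$

where $\xi$ denotes the noise term; the second equality is by Assumption 2 that different $\mu$ are all orthogonal to each other; the last equality is because $Q_{v,i}(1) = e_1$ always holds, and further by Assumption 2 we have $\mu_{v,i}(1)^\top (\mu_{v,i} Q_{v,i}(1)) = \mu_{v,i}(1)^\top (\mu_{v,i} e_1) = \mu_{v,i}(1)^\top \mu_{v,i}(1) = 1$. Similarly, we have

$$\begin{aligned}
\bar w(2)^\top x\bar\Phi|_{y=e_1}
&= \Big(\sum_{i=1}^{2} \mu_{v,i}(2) + \sum_{j=1}^{3} \mu_{s,j}(2)\Big)^\top \Big(\sum_{i=1}^{2} \mu_{v,i} Q_{v,i}(1) + \sum_{j=1}^{3} \mu_{s,j} Q_{s,j}(1) + \sum_{i=1}^{5} z_i\Big) \\
&= \sum_{i=1}^{2} \mu_{v,i}(2)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{3} \mu_{s,j}(2)^\top (\mu_{s,j} Q_{s,j}(1)) + \xi \\
&= \sum_{j=1}^{3} \underbrace{\mu_{s,j}(2)^\top (\mu_{s,j} Q_{s,j}(1))}_{B_j} + \xi,
\end{aligned}$$

where the last equality is because $\mu_{v,i}(2)^\top (\mu_{v,i} Q_{v,i}(1)) = \mu_{v,i}(2)^\top \mu_{v,i}(1) = 0$. Similarly, we also have

$$\bar w(3)^\top x\bar\Phi|_{y=e_1} = \sum_{j=1}^{3} \underbrace{\mu_{s,j}(3)^\top (\mu_{s,j} Q_{s,j}(1))}_{C_j} + \xi.$$

It is easy to see that, for $k_1, k_2 = 1, 2, 3$,

$$\mu_{s,j}(k_1)^\top (\mu_{s,j} e_{k_2}) = \begin{cases} 0, & \text{if } k_1 \neq k_2, \\ 1, & \text{otherwise.} \end{cases}$$

Since in the OOD distribution $Q_{s,j}(1)$ can be any of $e_1$, $e_2$ and $e_3$, we have $A_j, B_j, C_j \in \{0, 1\}$ and $A_j + B_j + C_j = 1$ for $j = 1, 2, 3$. Specifically, $A_j = 1, B_j = 0, C_j = 0$ if $Q_{s,j}(1) = e_1$; $A_j = 0, B_j = 1, C_j = 0$ if $Q_{s,j}(1) = e_2$; and $A_j = 0, B_j = 0, C_j = 1$ if $Q_{s,j}(1) = e_3$. Omitting the noise term, we then have

$$\big(\bar w(1)^\top x\bar\Phi - \bar w(2)^\top x\bar\Phi\big)|_{y=e_1} \begin{cases} = -1, & \text{if } \sum_{j=1}^{3} I(Q_{s,j}(1) = e_2) = 3, \\ = 0, & \text{if } \sum_{j=1}^{3} I(Q_{s,j}(1) = e_2) = 2 \text{ and } \sum_{j=1}^{3} I(Q_{s,j}(1) = e_3) = 1, \\ \geq 1, & \text{otherwise.} \end{cases}$$

Recall Definition 1: in the OOD distribution, we have

$$Q_{s,j}(1) = \begin{cases} e_1, & \text{with probability } 1 - \frac{2}{3}p, \\ e_2, & \text{with probability } \frac{p}{3}, \\ e_3, & \text{with probability } \frac{p}{3}. \end{cases}$$

Combining Lemma 3 with the results above, we have:

$\mathcal{A}_{ood}(\bar f) \in [0, \epsilon]$ when $\sum_{j=1}^{3} I(Q_{s,j}(1) = e_2) = 3$ (equivalent to $\{I(Q_{s,j}(1) = e_2)\}_{j=1}^{3}$ holding simultaneously) or $\sum_{j=1}^{3} I(Q_{s,j}(1) = e_3) = 3$, the probability of which is $2p^3/27$.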
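$\mathcal{A}_{ood}(\bar f) \in [1/2 - \epsilon, 1/2 + \epsilon]$ when ($\sum_{j=1}^{3} I(Q_{s,j}(1) = e_2) = 2$ and $\sum_{j=1}^{3} I(Q_{s,j}(1) = e_3) = 1$) or ($\sum_{j=1}^{3} I(Q_{s,j}(1) = e_3) = 2$ and $\sum_{j=1}^{3} I(Q_{s,j}(1) = e_2) = 1$), the probability of which is $2 \binom{3}{1} p^3/27 = 2p^3/9$.

$\mathcal{A}_{ood}(\bar f) \in [1 - \epsilon, 1]$ otherwise, the probability of which is $1 - 8p^3/27$.

So the overall expected OOD accuracy is $\mathcal{A}_{ood}(\bar f) = (2p^3/9 \cdot 1/2 + (1 - 8p^3/27) \cdot 1) \pm \epsilon \in [1 - 5p^3/27 - \epsilon, 1 - 5p^3/27 + \epsilon]$. We have $\mathcal{A}_{ood}(\tilde f) \in [1 - 5p^3/27 - \epsilon, 1 - 5p^3/27 + \epsilon]$ following the same proof.

(b) Output space ensemble and weight space ensemble. Similar to the proof above, for weight space ensemble we have

$$\begin{aligned}
(\bar w(1) + \tilde w(1))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1}
&= \Big(\sum_{i=1}^{4} \mu_{v,i}(1) + \sum_{j=1}^{6} \mu_{s,j}(1)\Big)^\top \Big(\sum_{i=1}^{4} \mu_{v,i} Q_{v,i}(1) + \sum_{j=1}^{6} \mu_{s,j} Q_{s,j}(1) + \sum_{i=1}^{10} z_i\Big) \\
&= \sum_{i=1}^{4} \mu_{v,i}(1)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{6} \mu_{s,j}(1)^\top (\mu_{s,j} Q_{s,j}(1)) + \xi \\
&= 4 + \sum_{j=1}^{6} \underbrace{\mu_{s,j}(1)^\top (\mu_{s,j} Q_{s,j}(1))}_{A_j} + \xi.
\end{aligned}$$

We also have

$$(\bar w(2) + \tilde w(2))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{j=1}^{6} \underbrace{\mu_{s,j}(2)^\top (\mu_{s,j} Q_{s,j}(1))}_{B_j} + \xi,$$

$$(\bar w(3) + \tilde w(3))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{j=1}^{6} \underbrace{\mu_{s,j}(3)^\top (\mu_{s,j} Q_{s,j}(1))}_{C_j} + \xi.$$

Omitting the noise term, the difference between the scores of the first and the second class is

$$\big((\bar w(1) + \tilde w(1)) - (\bar w(2) + \tilde w(2))\big)^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} \begin{cases} = -2, & \text{if } \sum_{j=1}^{6} I(Q_{s,j}(1) = e_2) = 6, \\ = -1, & \text{if } \sum_{j=1}^{6} I(Q_{s,j}(1) = e_2) = 5 \text{ and } \sum_{j=1}^{6} I(Q_{s,j}(1) = e_3) = 1, \\ = 0, & \text{if } \big(\sum_{j=1}^{6} I(Q_{s,j}(1) = e_2) = 5 \text{ and } \sum_{j=1}^{6} I(Q_{s,j}(1) = e_1) = 1\big) \\ & \text{or } \big(\sum_{j=1}^{6} I(Q_{s,j}(1) = e_2) = 4 \text{ and } \sum_{j=1}^{6} I(Q_{s,j}(1) = e_3) = 2\big), \\ \geq 1, & \text{otherwise.} \end{cases}$$

Then by Lemma 3, we have

$$\mathcal{A}_{ood}(f_{wse}) \in \begin{cases} [0, \epsilon], & \text{with probability } 2\big((p/3)^6 + 6 (p/3)^6\big) = 14p^6/729, \\ [\tfrac{1}{2} - \epsilon, \tfrac{1}{2} + \epsilon], & \text{with probability } 2\big(6 (p/3)^5 (1 - 2p/3) + \binom{6}{2} (p/3)^6\big) = 2p^6/243 + 4p^5/81, \\ [1 - \epsilon, 1], & \text{with probability } 1 - 20p^6/729 - 4p^5/81. \end{cases}$$

Then the overall expected OOD accuracy $\mathcal{A}_{ood}(f_{wse})$ is in $[1 - 2p^5/81 - 17p^6/729 - \epsilon, 1 - 2p^5/81 - 17p^6/729 + \epsilon]$. The accuracies of the model ensemble and model averaging are the same in Example 1, since

$$(\bar w(k) + \tilde w(k))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{i=1}^{4} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{6} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1)) + \xi,$$

$$\big(\bar w(k)^\top (x\bar\Phi) + \tilde w(k)^\top (x\tilde\Phi)\big)|_{y=e_1} = \sum_{i=1}^{4} \mu_{v,i}(k)^\top (\mu_{v,i} Q_{v,i}(1)) + \sum_{j=1}^{6} \mu_{s,j}(k)^\top (\mu_{s,j} Q_{s,j}(1)) + \xi.$$

F.2 PROOF OF PROPOSITION 2

Before starting the proof, we restate Proposition 2 and Definition 1 for the $K$-class ($K \geq 3$) situation as follows:

Definition 5 (Data Generation Process).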
The whole data generation process are as follows: y Unif {e1, e2, . . . , e K} , x = Concat {xv,i}dv i=1 {xs,j}ds j=1 , Pθ(xv,i | y) = N µv,i Qv,iy, σ2Id , Pθ(xs,j | y) = N µs,j Qs,jy, σ2Id , i, j. (8) where Qv,i, Qs,j {0, 1}K K. Further, Qv,i = I3 = [e1, e2, . . . , e K] always hold. In the ID distribution Did, Qs,j = IK; and in OOD Dood, the kth column of Q, i.e., Qs,j(k), is as follows for k = 1, 2, . . . , K: Qs,j(k) = ek, with probability 1 p Unif{e1, e2, . . . , e K}, with probability p. Published as a conference paper at ICLR 2024 Proposition 7 (General Results for OSE). Consider Definition 1-3, Assumption 1-2 hold, and infinite ID and OOD samples. Omitting small constants involving ϵ, we have Aood( f) = Fp (1 p) ns + nv ns Aood( f) = Fp (1 p) ns + nv ns Aood(fose) = Fp (1 p)( ns + ns) + ( nv + nv) ns + ns + 2nso In the proof process, here we take a notation first: L(t1, . . . , t K 1) = Pz N(0,σ2IK 1) a T i z + ti > 0, i = 1, . . . , K 1 in which ai 2 2= 2, a T i aj = 1, for any i = j. Consider the extracted features in both models as {xv, i} nv nvo i=1 {xs, j} ns nso j=1 {xv, i} nv nvo i=1 {xs, i} ns nso i=1 {xv,i}nvo i=1 {xs,i}nso i=1, in which nvo, nso are the numbers of overlapped invariant features and spurious features respectively. Then for each single model, we have i=1 xv, i + j=1 xs, j + P nv nvo i=1 µv, i(k) + P ns nso j=1 µs, j(k) + Pnvo i=1 µv,i(k) + Pnso i=1 µs,i(k) nv + ns , k = 1, . . . , K i=1 xv, i + i=1 xs, i + P nv nvo i=1 µv, i(k) + P ns nso i=1 µs, i(k) + Pnvo i=1 µv,i(k) + Pnso i=1 µs,i(k) nv + ns , k = 1, . . . , K. Then we can analysis the forecasting accuracy for both averaging model and ensemble model respectively. F.2.1 PROOF FOR SINGLE MODEL Considering the extracted features in Algorithm 1 as {xv, i} nv i=1 {xs, j} ns j=1, for convenience, we denote i=1 xv, i + then according to Lemma 5, we can obtain the estimated classifier on label ek: 1 nv + ns µv, i + 1 nv + ns µs, j. 
Published as a conference paper at ICLR 2024 Based on this classifier, the forecasting accuracy on ID case is P(ˆy = y) = 1 k=1 Ex|y=ek{1( x w(k) > x w(k ), k = k)} k=1 Pz N(0,( nv+ ns)σ2Id) ( w(k) w(k ))T z + ( w(k) w(k ))T E( x | y = ek) > 0, k = k k=1 Pz N(0,σ2Id) ( w(k) w(k ))T z + δk,k > 0, k = k , in which we denote that δk,k = 1 nv + ns for any k = k. And considering Assumption 2, we have ( w(k) w(k ))T ( w(k) w(k )) = 1, w(k) w(k ) 2 2= 2, for any k = k = k . Then with Lemma 2, the IID forecasting accuracy can be expressed as P(y = ˆy) = L(1, . . . , 1), which can not be influenced by nv, ns. Then we turn to the OOD forecasting accuracy. For class k, we suppose there are rk spurious features maintaining their parameters, and rk k refer to the number of spurious features flipping to the class k , the corresponding probability is P([rk, [rk k , k = k]]) = ns! rk!Πk =krk k !(1 p + p K p) ns rk, and the conditional OOD forecasting accuracy on label ek is P(ˆy = ek | [rk, [rk k , k = k]], y = ek) = E x|[rk,[rk k , k =k]],y=ek 1( x T w(k) > x T w(k ), k = k) = Pz N(0,( nv+ ns)σ2Id) ( w(k) w(k ))T z + nv + rk rk k nv + ns , k = k = L( nv + rk rk k nv + ns , k = k), according to this, the OOD forecasting accuracy can be expressed as P(ˆy = y) = Ey[P(ˆy = y | y)] = P(ˆy = e1 | y = e1) r1,r1 k , k 2 P([r1, [r1 k , k = 1]])L( nv + r1 r1 k nv + ns , k 1). with Lemma 3, we can get related properties about L( ), then take upper and lower bounds respectively. 
Considering the close form of G( ) in equation 13, we denote nv = nv, ns = ns, nvo = nso = 0, C = 0, then the OOD forecasting accuracy can be lower bounded as P(ˆy = y) P(A)(1 ϵ) + N=1 P(C(N))(h(N) ϵ) N=1 P(C(N))h(N) ϵ = G( nv, ns, 0, 0, 0) ϵ, and on the other hand, it can also be upper bounded by P(ˆy = y) P(A) + N=1 P(C(N))h(N) + ϵP(B) G( nv, ns, 0, 0, 0) + ϵ, Published as a conference paper at ICLR 2024 Similar with Algorithm 1, we can also get the ID and OOD forecasting accuracy in Algorithm 2, which is related to another nv invariant features and ns spurious features. For the OOD forecasting accuracy, we d like to take some intuitive approximation for G(nv, ns, 0, 0, 0). As the number nv, ns are large enough, we can take approximation by multivariate Gaussian distribution. To be specific, we denote r = [r1, r1 2, . . . , r1 K], then can regard them as r N(γ, Σ), in which γ = [ns(1 p + p/K), nsp/K, . . . , nsp/K]T , Σi,i = γi(ns γi) ns , Σi,j = γiγj If we denote a new (K 1) K matrix as 1 1 0 . . . 0 1 0 1 . . . 0 ... ... ... 0 1 0 0 . . . 1 And the new (K 1)-dim random variable, i.e, η .= T T r + nv1 is still Gaussian, to be specific, if we denote its distribution as η N(α, M), then we have α = (ns(1 p) + nv) 1, Mi,i = ns p(K + 2 p K) Mi,j = ns p(K + 1 p K) G(nv, ns, 0, 0, 0) can be approximated as P(η1 > 0, . . . , ηK 1 > 0), which is equal to Fp (ns(1 p) + nv)/ ns , and Fp( ) is defined in Appendix F.2.5. F.2.2 PROOF FOR WEIGHT SPACE ENSEMBLE For averaging model, we denote 2x( Φ+ Φ) = 1 i=1 xv, i+1 j=1 xs, j+1 i=1 xv, i+1 then after scaling, the averaging classifier on label ek: P nv nvo i=1 µv, i(k) + P ns nso j=1 µs, j(k) + P nv nvo i=1 µv, i(k) + P ns nso i=1 µs, i(k) + 2 Pnvo i=1 µv,i(k) + 2 Pnso i=1 µs,i(k) nv + ns + nv + ns + 2nvo + 2nso . 
Based on this classifier, if we denote ˆn = ( nv + ns + nv + ns + 2nvo + 2nso)/4, the forecasting accuracy on ID case is P(ˆy = y) = 1 k=1 Ex|y=ek{1(ˆx ˆw(k) > ˆx ˆw(k )), k = k)} k=1 Pz N(0,ˆnσ2Id) ( ˆw(k) ˆw(k ))T z + ( ˆw(k) ˆw(k ))T E(ˆx | y = ek) > 0, k = k k=1 Pz N(0,σ2Id) ( ˆw(k) ˆw(k ))T z + ˆδk,k > 0, k = k , Published as a conference paper at ICLR 2024 in which we denote that P nv nvo i=1 1 + P ns nso j=1 1 + P nv nvo i=1 1 + P ns nso j=1 1 + Pnvo i=1 4 + Pnso i=1 4 nv + ns + nv + ns + 2nvo + 2nso = 1, for any k = k. And considering Assumption 2, we have ( ˆw(k) ˆw(k ))T ( ˆw(k) ˆw(k )) = 1, ˆw(k) ˆw(k ) 2 2= 2, for any k = k = k . Then with Lemma 2, the IID forecasting accuracy can be expressed as P(y = ywse) = L(1, . . . , 1) [1 ϵ, 1], which can not be influenced by nv, ns, nv, ns, nvo, nso. Then we turn to the OOD forecasting accuracy and for each k = 1, . . . , K, we take some notations as follows: rk = |{I(µs,i(k) = µs,i(k))} ns i=1 nso| rk = |{I(µs,i(k) = µs,i(k))} ns nso i=1 | ro k = |{I(µs,i(k) = µs,i(k))}nso i=1| rk k = |{I(µs,i(k) = µs,i(k ))} ns i=1 nso| rk k = |{I(µs,i(k) = µs,i(k ))} ns nso i=1 | ro k k = |{I(µs,i(k) = µs,i(k ))}nso i=1|. To be specific, for class k, we suppose there are rk, rk spurious features (no overlapped) maintaining their parameters, related to Algorithm 1, 2, and correspondingly, rk k , rk k refer to the number of spurious features flipping to the class k , and ro k, ro k k are defined similar in overlapped spurious features. Then denoting Rk(r) = [ rk, rk, ro k, [ rk k , rk k , ro k k , k = k]], we obtain the corresponding probability as P(Rk(r)) = ( ns + ns 2nso)!nso! 
( rk + rk)!ro k!Πk =k( rk k + rk k )!ro k k !(1 p+ p K ) rk+ rk+ro k(K 1 K p) ns+ ns nso rk rk ro k, and the conditional OOD forecasting accuracy on label ek is P(ˆy = ek | Rk(r), y = ek) = Eˆx|Rk(r),y=ek 1(ˆx T ˆw(k) > ˆx T ˆw(k ), k = k) = Pz N(0,ˆnσ2Id) ( ˆw(k) ˆw(k ))T z + nv + nv + 2nvo + rk + rk + 4ro k rk k rk k 4ro k k nv + ns + nv + ns + 2nvo + 2nso , k = k = L( nv + nv + 2nvo + rk + rk + 4ro k rk k rk k 4ro k k nv + ns + nv + ns + 2nvo + 2nso , k = k), according to this, the OOD forecasting accuracy can be expressed as P(ˆy = y) = Ey[P(ˆy = y | y)] = P(ˆy = e1 | y = e1) ˆr1 P(R1(r))L( nv + nv + 2nvo + r1 + r1 + 4ro 1 r1 k r1 k 4ro 1 k nv + ns + nv + ns + 2nvo + 2nso , k = k). with Lemma 3, we can get related properties about L( ), then take upper and lower bounds respectively. Still recalling the expression in equation 13 with nv = nv + nv, ns = ns + ns, nvo = nvo, nso = nso, C = 4, we can obtain the lower bound for OOD forecasting accuracy as P(ˆy = y) P(A)(1 ϵ) + N=1 P(C(N))(h(N) ϵ) N=1 P(C(N))h(N) ϵ = G( nv + nv, ns + ns, nvo, nso, 4) ϵ, and on the other hand, it can be upper bounded by P(ˆy = y) P(A) + N=1 P(C(N))h(N) + ϵP(B) G( nv + nv, ns + ns, nvo, nso, 4) + ϵ, Published as a conference paper at ICLR 2024 Similar to the analysis before, for ID forecasting accuracy, we have Jid = 0 3ϵ, and for OOD forecasting accuracy, we can draw a conclusion that Jood G( nv + nv, ns + ns, nvo, nso, 4) max{G( nv, ns, 0, 0, 0), G( nv, ns, 0, 0, 0)} 3ϵ. Similar to the analysis above, we d like to take some intuitive approximation for OOD forecasting accuracy in the ensemble model. As the number nv, ns, nv, ns, nvo, nso are large enough, we can take approximation by multivariate Gaussian distribution. To be specific, we denote r = [ r1, r1 2, . . . , r1 K], r = [ r1, r1 2, . . . , r1 K] and ro = [ro 1, ro 1 2, . . . , ro 1 K], then can regard them as r N( γ, Σ), r N( γ, Σ) and ro N(γo, Σo) (they are independent), in which γ = [( ns nso)(1 p + p/K), ( ns nso)p/K, . . . 
, ( ns nso)p/K]T , Σi,i = γi( ns nso γi) ns nso , Σi,j = γi γj ns nso , γ = [( ns nso)(1 p + p/K), ( ns nso)p/K, . . . , ( ns nso)p/K]T , Σi,i = γi( ns nso γi) ns nso , Σi,j = γi γj ns nso , γo = [nso(1 p + p/K), nsop/K, . . . , nsop/K]T , Σo i,i = γo i (nso γo i ) nso , Σo i,j = γo i γo j nso . If we denote a new (K 1) K matrix as 1 1 0 . . . 0 1 0 1 . . . 0 ... ... ... 0 1 0 0 . . . 1 And the new (K 1)-dim random variable, i.e, η .= (T , T , 4T )( r, r, ro)T + ( nv + nv + 2nvo)1 is still Gaussian, to be specific, if we denote its distribution as η N(α, M), then we have α = (( ns + ns + 2nso)(1 p) + nv + nv + 2nvo) 1, Mi,i = ( ns + ns + 14nso)p(K + 2 p K) Mi,j = ( ns + ns + 14nso)p(K + 1 p K) G( nv + nv, ns + ns, nvo, nso, 4) can be approximated as P(η1 > 0, . . . , ηK 1 > 0), which is equal to Fp (( ns + ns + 2nso)(1 p) + nv + nv + 2nvo)/ ns + ns + 14nso , and Fp( ) is defined in Appendix F.2.5. F.2.3 PROOF FOR OUTPUT SPACE ENSEMBLE For ensemble model, we also denote P nv nvo i=1 µv, i(k) + P ns nso j=1 µs, j(k) + P nv nvo i=1 µv, i(k) + P ns nso i=1 µs, i(k) + 2 Pnvo i=1 µv,i(k) + 2 Pnso i=1 µs,i(k) nv + ns + nv + ns + 2nvo + 2nso , then the forecasting accuracy on IID case is P(ˆy = y) = 1 k=1 Ex|y=ek{1(x Φ w(k) + x ΦT w(k) > x Φ w(k ) + x ΦT w(k ), k = k)} k=1 Pz N (0,σ2Id) ( ˆw(k) ˆw(k ))T z + ˆδk,k > 0, k = k , Published as a conference paper at ICLR 2024 P nv nvo i=1 1 + P ns nso j=1 1 + P nv nvo i=1 1 + P ns nso j=1 1 + Pnvo i=1 2 + Pnso i=1 2 nv + ns + nv + ns + 2nvo + 2nso nv + ns + nv + ns nvo nso = nv + ns + nv + ns nv + ns + nv + ns + 2nvo + 2nso nv + ns + nv + ns nvo nso .= s < 1, while nvo + nso < ( nv + ns + nv + ns)/2, for any k = k. And considering Assumption 2, we have ( ˆw(k) ˆw(k ))T ( ˆw(k) ˆw(k )) = 1, ˆw(k) ˆw(k ) 2 2= 2, for any k = k = k . Then with Lemma 2, the IID forecasting accuracy can be expressed as P(y = ˆy) = L(s, . . . , s) 1 ϵ, which can be influenced by nv, ns, nv, ns, nvo, nso. Then we turn to the OOD forecasting accuracy. 
Similar to the notation in equation 9, for class k, we suppose: rk = |{I(µs,i(k) = µs,i(k))} ns i=1 nso| rk = |{I(µs,i(k) = µs,i(k))} ns nso i=1 | ro k = |{I(µs,i(k) = µs,i(k))}nso i=1| rk k = |{I(µs,i(k) = µs,i(k ))} ns i=1 nso| rk k = |{I(µs,i(k) = µs,i(k ))} ns nso i=1 | ro k k = |{I(µs,i(k) = µs,i(k ))}nso i=1|. Then denoting Rk(r) := [ rk, rk, ro k, [ rk k , rk k , ro k k , k = k]], we have the corresponding probability is P(Rk(r)) = ( ns + ns 2nso)!nso! ( rk + rk)!ro k!Πk =k( rk k + rk k )!ro k k !(1 p+ p K ) rk+ rk+ro k(K 1 K p) ns+ ns nso rk rk ro k, and the conditional OOD forecasting accuracy on label ek is P(ˆy = ek | Rk(r), y = ek) = ExˆΦ|Rk(r),y=ek n 1(x Φ w(k) + x ΦT w(k) > x Φ w(k ) + x ΦT w(k ), k = k) o = L( nv + nv + rk + rk + 2ro k rk k rk k 2ro k k nv + ns + nv + ns + 2nvo + 2nso nv + ns + nv + ns nvo nso , k = k). According to this, the OOD forecasting accuracy can be expressed as P(ˆy = y) = Ey[P(ˆy = y | y)] = P(ˆy = e1 | y = e1) R1(k) P(R1(k))L( nv + nv + r1 + r1 + 2ro 1 r1 k r1 k 2ro 1 k nv + ns + nv + ns + 2nvo + 2nso nv + ns + nv + ns nvo nso , k = 1). with Lemma 3, we can get related properties about L( ), then take upper and lower bounds respectively. To be specific, recalling the expression in equation 13 with nv = nv + nv, ns = ns + ns, nvo = nvo, nso = nso, C = 2, we can lower bound the OOD forecasting accuracy as P(ˆy = y) P(A)(1 ϵ) + N=1 P(C(N))(h(N) ϵ) N=1 P(C(N))h(N) ϵ = G( nv + nv, ns + ns, nvo, nso, 2) ϵ, and on the other hand, it can be upper bounded by P(ˆy = y) P(A) + N=1 P(C(N))h(N) + ϵP(B) G( nv + nv, ns + ns, nvo, nso, 2) + ϵ, Similar to the analysis before, for ID forecasting accuracy, we have Jid = 0 3ϵ, Published as a conference paper at ICLR 2024 and for OOD forecasting accuracy, we can draw a conclusion that Jood G( nv + nv, ns + ns, nvo, nso, 2) max{G( nv, ns, 0, 0, 0), G( nv, ns, 0, 0, 0)} 3ϵ. 
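Similar to the analysis above, we give an intuitive approximation of the OOD forecasting accuracy of the ensemble model. When the numbers $\bar n_v, \bar n_s, \tilde n_v, \tilde n_s, n_{vo}, n_{so}$ are large enough, we can use a multivariate Gaussian approximation. To be specific, we denote $\bar r = [\bar r_1, \bar r_{1\to 2}, \dots, \bar r_{1\to K}]$, $\tilde r = [\tilde r_1, \tilde r_{1\to 2}, \dots, \tilde r_{1\to K}]$ and $r^o = [r^o_1, r^o_{1\to 2}, \dots, r^o_{1\to K}]$, and regard them as $\bar r \sim N(\bar\gamma, \bar\Sigma)$, $\tilde r \sim N(\tilde\gamma, \tilde\Sigma)$ and $r^o \sim N(\gamma^o, \Sigma^o)$ (they are independent), in which

$$\bar\gamma = [(\bar n_s - n_{so})(1 - p + p/K),\ (\bar n_s - n_{so})p/K,\ \dots,\ (\bar n_s - n_{so})p/K]^\top,\quad \bar\Sigma_{i,i} = \frac{\bar\gamma_i(\bar n_s - n_{so} - \bar\gamma_i)}{\bar n_s - n_{so}},\quad \bar\Sigma_{i,j} = -\frac{\bar\gamma_i\bar\gamma_j}{\bar n_s - n_{so}},$$

$$\tilde\gamma = [(\tilde n_s - n_{so})(1 - p + p/K),\ (\tilde n_s - n_{so})p/K,\ \dots,\ (\tilde n_s - n_{so})p/K]^\top,\quad \tilde\Sigma_{i,i} = \frac{\tilde\gamma_i(\tilde n_s - n_{so} - \tilde\gamma_i)}{\tilde n_s - n_{so}},\quad \tilde\Sigma_{i,j} = -\frac{\tilde\gamma_i\tilde\gamma_j}{\tilde n_s - n_{so}},$$

$$\gamma^o = [n_{so}(1 - p + p/K),\ n_{so}p/K,\ \dots,\ n_{so}p/K]^\top,\quad \Sigma^o_{i,i} = \frac{\gamma^o_i(n_{so} - \gamma^o_i)}{n_{so}},\quad \Sigma^o_{i,j} = -\frac{\gamma^o_i\gamma^o_j}{n_{so}}.$$

We denote a new $(K-1) \times K$ matrix as

$$T = \begin{pmatrix} 1 & -1 & 0 & \dots & 0 \\ 1 & 0 & -1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & -1 \end{pmatrix}.$$

Then the new $(K-1)$-dimensional random variable

$$\eta := (T, T, 2T)(\bar r, \tilde r, r^o)^\top + (\bar n_v + \tilde n_v)\mathbf{1}$$

is still Gaussian. To be specific, if we denote its distribution as $\eta \sim N(\alpha, M)$, then we have

$$\alpha = \big((\bar n_s + \tilde n_s)(1 - p) + \bar n_v + \tilde n_v\big)\mathbf{1},\quad M_{i,i} = (\bar n_s + \tilde n_s + 2n_{so})\frac{p(K + 2 - pK)}{K},\quad M_{i,j} = (\bar n_s + \tilde n_s + 2n_{so})\frac{p(K + 1 - pK)}{K}.$$

The OOD forecasting accuracy can be approximated as $P(\eta_1 > 0, \dots, \eta_{K-1} > 0)$, which is equal to

$$F_p\left(\frac{(\bar n_s + \tilde n_s)(1 - p) + \bar n_v + \tilde n_v}{\sqrt{\bar n_s + \tilde n_s + 2n_{so}}}\right),$$

where $F_p(\cdot)$ is defined in Appendix F.2.5.

F.2.4 CASE STUDY FOR K = 3

To interpret the improvements in OOD accuracy of the averaged model and the ensemble model, we set $K = 3$ and take a closer look at the function $G(\cdot)$. Recalling the results calculated above, the OOD accuracy of the single models can be approximated as $G(\bar n_v, \bar n_s, 0, 0, 0)$ and $G(\tilde n_v, \tilde n_s, 0, 0, 0)$, and for the averaged model and the ensemble model we focus on $G(\bar n_v + \tilde n_v, \bar n_s + \tilde n_s, n_{vo}, n_{so}, 4)$ and $G(\bar n_v + \tilde n_v, \bar n_s + \tilde n_s, n_{vo}, n_{so}, 2)$, respectively.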
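The covariance entries used in the Gaussian approximations above can be checked with exact rational arithmetic. The following is a minimal sketch, assuming the count vector of $n$ spurious features is multinomial with class probabilities $(1 - p + p/K, p/K, \dots, p/K)$; the function names `difference_covariance` and `ensemble_count_factor` are hypothetical and introduced only for illustration. It shows that $T\Sigma T^\top$ has diagonal $n\,p(K+2-pK)/K$ and off-diagonal $n\,p(K+1-pK)/K$, and that weighting the overlapped counts by $c$ replaces $n$ with $\bar n_s + \tilde n_s - 2n_{so} + c^2 n_{so}$.

```python
from fractions import Fraction


def difference_covariance(K, p, n):
    """Covariance of T r for r ~ Multinomial(n, q), q = (1-p+p/K, p/K, ..., p/K)."""
    q = [1 - p + p / K] + [p / K] * (K - 1)
    gamma = [n * qi for qi in q]
    # Multinomial covariance written as in the text: gamma_i (n - gamma_i) / n
    # on the diagonal and -gamma_i gamma_j / n off the diagonal.
    sigma = [[gamma[i] * (n - gamma[i]) / n if i == j else -gamma[i] * gamma[j] / n
              for j in range(K)] for i in range(K)]
    # Row i of T picks r_1 minus the count of features pointing to class i + 2.
    T = [[1 if j == 0 else -1 if j == i + 1 else 0 for j in range(K)]
         for i in range(K - 1)]
    TS = [[sum(T[i][k] * sigma[k][j] for k in range(K)) for j in range(K)]
          for i in range(K - 1)]
    return [[sum(TS[i][k] * T[j][k] for k in range(K)) for j in range(K - 1)]
            for i in range(K - 1)]


def ensemble_count_factor(ns_bar, ns_tilde, n_so, c):
    """Effective feature count when overlapped spurious features carry weight c."""
    return (ns_bar - n_so) + (ns_tilde - n_so) + c * c * n_so


if __name__ == "__main__":
    p, K, n = Fraction(9, 10), 4, 10
    M = difference_covariance(K, p, n)
    print(M[0][0] == n * p * (K + 2 - p * K) / K)
    print(M[0][1] == n * p * (K + 1 - p * K) / K)
```

With $c = 2$ the factor is $\bar n_s + \tilde n_s + 2n_{so}$ (output space ensemble) and with $c = 4$ it is $\bar n_s + \tilde n_s + 14n_{so}$ (weight space ensemble), matching the denominators inside $F_p(\cdot)$.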
To take specific calculations, we denote the random vector as [r1, ro 1, r1 2, r1 3, ro 1 2, ro 1 3], and approximate them on probabilities related to the two-dimensional Gaussian random vector η N(0, H), in which the covariance matrix H has components as Hii = p(5 3p) 3 , Hij = p(4 3p) Published as a conference paper at ICLR 2024 Then we could obtain G( nv, ns, 0, 0, 0) = P(η1 (1 p) ns + nv nv , η2 (1 p) ns + nv nv ), G( nv, ns, 0, 0, 0) = P(η1 (1 p) ns + nv nv , η2 (1 p) ns + nv nv ), G( nv + nv, ns + ns, nvo, nso, 4) = P(η1 (1 p)( ns + ns + 2nso) + nv + nv + 2nvo ns + ns + 14nso , η2 (1 p)( ns + ns + 2nso) + nv + nv + 2nvo ns + ns + 14nso ), G( nv + nv, ns + ns, nvo, nso, 2) = P(η1 (1 p)( ns + ns) + nv + nv ns + ns + 2nso , η2 (1 p)( ns + ns) + nv + nv ns + ns ). Here we denote a new function F(x) := P(η1 x, η2 x), which implies that F is monotonically increasing with respect to x. And it shows that average model and ensemble model could obtain higher OOD accuracy compared with single models due to (1 p)( ns + ns + 2nso) + nv + nv + 2nvo ns + ns + 14nso max{(1 p) ns + nv nv , (1 p) ns + nv nv }, (1 p)( ns + ns) + nv + nv ns + ns + 2nso max{(1 p) ns + nv nv , (1 p) ns + nv nv }. F.2.5 CLOSE FORM OF Fp( ) Here we provide the explicit expression of function Fp(x) in K class situation, which is monotonically increasing with x. We denote a K 1-dim random variable η N(x, M), in which Mi,i = p(K + 2 p K) K , Mi,j = p(K + 1 p K) then Fp(x) is defined as Fp(x) = P(η1 > 0, . . . , ηK 1 > 0). F.2.6 CLOSE FORM OF G(nv, ns, nvo, nso, C) First, denoting a random vector Rk(r) := [rk, ro k, [rk k , ro k k , k = k]] R2K, we have the corresponding probability as P(Rk(r)) = (ns 2nso)!nso! rk!ro k!Πk =krk k !ro k k !(1 p + p K )rk+ro k(K 1 K p)ns nso rk ro k, then we can define several sets: A := {Rk(r) : r1 + Cro 1 r1 k Cro 1 k + nv > 0, k = 2, . . . 
, K}, (10) B := {Rk(r) : min k =2,...,K r1 + Cro 1 r1 k Cro 1 k + nv < 0}, (11) C(N) := {Rk(r) : min k =2,...,K r1 + Cro 1 r1 k Cro 1 k + nv = 0, the minimum can be achieved by N values}, (12) and related functions as h(N) = Pz N (0,σ2IN ) a T i z > 0, i = 1, . . . , N in which a T i aj = 1 and ai 2 2= 1 for any i = j. G(ns, nv, nso, nvo, C) is the probability defined as following: G(ns, nv, nso, nvo, C) = P(A) + N=1 P(C(N))h(N) (13) where set A and C(N) are two sets of Rk(r) defined in Equation equation 10 and equation 12, respectively. Note that P(Rk(r)) and the set A and C(N) all depend on ns, nv, nso, nvo and C. Published as a conference paper at ICLR 2024 F.3 PROOF OF PROPOSITION 4 By the proof in Proposition 1 we have i=1 µv,i(k) + j=1 µs,j(k). i=3 µv,i(k) + j=4 µs,j(k). Then we consider the averaged mode about the value of ˆw T xˆΦT , where ˆw = w+λ w 1+λ and ˆΦ = Φ+λ Φ We first have: ˆw(k) = 1 1 + λ( X i=1,2 µv,i(k) + λ X i=3,4 µv,i(k) + X j=1,2,3 µs,j(k) + X j=4,5,6 µs,j(k) ) xˆΦT |y=e1 = 1 1 + λ(x ΦT + λx ΦT ) = 1 1 + λ( X i=1,2 µv,i Qv,i(1) + λ X i=3,4 µv,i Qv,i(1) j=1,2,3 µs,j Qs,j(1) + λ X j=4,5,6 µs,j Qs,j(1) + ( ˆw(k)T xˆΦT |y=e1 = 1 (1 + λ)2 ( X i=1,2 µv,i(k)T µv,i Qv,i(1) + λ2 X i=3,4 µv,i(k)T µv,i Qv,i(1) j=1,2,3 µs,j(k)T µs,j Qs,j(1) + λ2 X j=4,5,6 µs,j(k)T µs,j Qs,j(1)) Then for class k = 1, we have ˆw(1)T xˆΦT |y=e1 = 1 (1 + λ)2 (2 + 2λ2 | {z } >12 + X j=1,2,3 µs,j(1)T µs,j Qs,j(1) + λ2 |{z} >5 j=4,5,6 µs,j(1)T µs,j Qs,j(1)) For the other two classes, we have ˆw(2)T xˆΦT |y=e1 = 1 (1 + λ)2 ( X j=1,2,3 µs,j(2)T µs,j Qs,j(1)) + λ2 |{z} >5 j=4,5,6 µs,j(2)T µs,j Qs,j(1)) ˆw(3)T xˆΦT |y=e1 = 1 (1 + λ)2 ( X j=1,2,3 µs,j(3)T µs,j Qs,j(1) + λ2 |{z} >5 j=4,5,6 µs,j(3)T µs,j Qs,j(1)) For simplicity of the discussion, we will ignore the constant factor 1 (1+λ)2 , when λ > 5, we discuss the value of ˆw(1)T xˆΦT |y=e1 ˆw(2)T xˆΦT |y=e1 : j=4,5,6 I(Qs,j(1) = e2) = 2, it suffices to consider comparing: j=1,2,3 µs,j(1)T µs,j Qs,j(1) + λ2 P j=4,5,6 µs,j(1)T 
µs,j Qs,j(1) P j=1,2,3 µs,j(2)T µs,j Qs,j(1) ˆw(1)T xˆΦT |y=e1 ˆw(2)T xˆΦT |y=e1 only when X j=4,5,6 I(Qs,j(1) = e1) = 0, that is, j=4,5,6 I(Qs,j(1) = e3) = 1. The probability of the aforementioned scenario is 3 (p/3)3 = p3/9. ˆw(1)T xˆΦT |y=e1 < ˆw(2)T xˆΦT |y=e1 if X j=1,2,3 I(Qs,j(1) = e2) = 3 ˆw(1)T xˆΦT |y=e1 = ˆw(2)T xˆΦT |y=e1 if X j=1,2,3 I(Qs,j(1) = e2) = 2 and X j=1,2,3 I(Qs,j(1) = e3) = 1 Published as a conference paper at ICLR 2024 Therefore, the probability of < is p3/9 (p/3)3 = p6/243, = is p3/9 3 (p/3)3 = p6/81. j=4,5,6 I(Qs,j(1) = e2) = 3, it suffices to consider comparing ( 2 + P j=1,2,3 µs,j(1)T µs,j Qs,j(1) P j=1,2,3 µs,j(2)T µs,j Qs,j(1) + λ2 Trivially we have P j=1,2,3 µs,j(2)T µs,j Qs,j(1) + λ2 > 5 2 + P j=1,2,3 µs,j(1)T µs,j Qs,j(1), therefore, < holds under this case, the probability is (p/3)3 = p3/27. j=4,5,6 I(Qs,j(1) = e2) = 0/1, X j=1,2,3 µs,j(2)T µs,j Qs,j(1)) + λ2 X j=4,5,6 µs,j(2)T µs,j Qs,j(1)) 3 + λ2 < 2 + 2λ2 + Therefore, is impossible to hold in this case. To sum up, for comparing the first class and the second class, the < probability is p6/243 + p3/27, the = probability is p6/81. Generally, the < probability is 2p6/243 + 2p3/27, the = probability is 2p6/81, the otherwise probability is 1 8p6/243 2p3/27. Then the total accuracy is approximately 1/2 2p6/81+(1 8p6/243 2p3/27) = 1 5p6/243 2p3/27, that is, the accuracy lies in [1 5p6/243 2p3/27 ε, 1 5p6/243 2p3/27 + ε]. F.4 PROOF OF PROPOSITION 5 Proof. (a)Two individual models. The accuracy of two individual models are the same with Example (1-1) following the same proof. Specifically, so we have Aood( f) [1 5p3/54 ϵ, 1 5p3/54 + ϵ]. Aood( f) [1 5p3/27 ϵ, 1 5p3/27 + ϵ] (b) Weight space ensemble. We first solve the w and w on the infinite ID samples. 
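By Lemma 5, for $k = 1, 2, 3$, we have

$$\bar w(k) = \sum_{i=1}^{2}\mu_{v,i}(k) + \sum_{j=1}^{3}\mu_{s,j}(k),\qquad \tilde w(k) = \sum_{i=2}^{3}\mu_{v,i}(k) + \sum_{j=3}^{5}\mu_{s,j}(k),$$

so that

$$\bar w(k) + \tilde w(k) = \sum_{i=1,3}\mu_{v,i}(k) + 2\mu_{v,2}(k) + \sum_{j=1,2,4,5}\mu_{s,j}(k) + 2\mu_{s,3}(k).$$

For samples from the first class, we also have

$$x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{i=1,3}\mu_{v,i}Q_{v,i}(1) + 2\mu_{v,2}Q_{v,2}(1) + \sum_{j=1,2,4,5}\mu_{s,j}Q_{s,j}(1) + 2\mu_{s,3}Q_{s,3}(1)$$

plus the Gaussian noise terms $z_i$, where $z_i \sim N(0, \sigma^2 I_d)$ for all $i$. We then have

$$(\bar w(k) + \tilde w(k))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{i=1,3}\mu_{v,i}(k)^\top(\mu_{v,i}Q_{v,i}(1)) + 4\mu_{v,2}(k)^\top(\mu_{v,2}Q_{v,2}(1)) + \sum_{j=1,2,4,5}\mu_{s,j}(k)^\top(\mu_{s,j}Q_{s,j}(1)) + 4\mu_{s,3}(k)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$

For the first class, the invariant features contribute $1 + 1 + 4 = 6$, so

$$(\bar w(1) + \tilde w(1))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = 6 + \sum_{j=1,2,4,5}\mu_{s,j}(1)^\top(\mu_{s,j}Q_{s,j}(1)) + 4\mu_{s,3}(1)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$

Similarly, we have

$$(\bar w(2) + \tilde w(2))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{j=1,2,4,5}\mu_{s,j}(2)^\top(\mu_{s,j}Q_{s,j}(1)) + 4\mu_{s,3}(2)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi,$$

$$(\bar w(3) + \tilde w(3))^\top x(\bar\Phi + \tilde\Phi)|_{y=e_1} = \sum_{j=1,2,4,5}\mu_{s,j}(3)^\top(\mu_{s,j}Q_{s,j}(1)) + 4\mu_{s,3}(3)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$

Up to the noise terms, the difference between the scores of the first and the second class is

$$(\bar w(1) + \tilde w(1))^\top x(\bar\Phi + \tilde\Phi) - (\bar w(2) + \tilde w(2))^\top x(\bar\Phi + \tilde\Phi) = \begin{cases} -2, & \text{if } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 4 \text{ and } I(Q_{s,3}(1) = e_2) = 1, \\ -1, & \text{if } I(Q_{s,3}(1) = e_2) = 1,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 3 \text{ and } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_3) = 1, \\ 0, & \text{if } I(Q_{s,3}(1) = e_2) = 1,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 3 \text{ and } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_1) = 1, \\ & \text{or } I(Q_{s,3}(1) = e_2) = 1,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 2 \text{ and } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_3) = 2, \\ \ge 1, & \text{otherwise.} \end{cases}$$

Then we can compute the probability of each case, counting both the second and the third class:

For the $-2$ case, the probability is $2(p/3)^5 = 2p^5/243$.

For the $-1$ case, the probability is $2 \cdot 4 \cdot (p/3)^5 = 8p^5/243$.

For the $0$ case, the probability is $2\big(4(1 - 2p/3)(p/3)^4 + \binom{4}{2}(p/3)^5\big) = 8p^4/81 - 4p^5/243$.

Otherwise, the probability is $1 - 8p^4/81 - 2p^5/81$.

Then the total accuracy is approximately $1 - 4p^4/81 - 8p^5/243$, that is, it lies in the interval $[1 - 4p^4/81 - 8p^5/243 - \epsilon,\ 1 - 4p^4/81 - 8p^5/243 + \epsilon]$.

(c) Output Space Ensemble. Similar to the derivation for model averaging, we have

$$\bar w(k)^\top x\bar\Phi + \tilde w(k)^\top x\tilde\Phi|_{y=e_1} = \sum_{i=1,3}\mu_{v,i}(k)^\top(\mu_{v,i}Q_{v,i}(1)) + 2\mu_{v,2}(k)^\top(\mu_{v,2}Q_{v,2}(1)) + \sum_{j=1,2,4,5}\mu_{s,j}(k)^\top(\mu_{s,j}Q_{s,j}(1)) + 2\mu_{s,3}(k)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$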
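The case analysis for the weight space ensemble in part (b), and the analogous one for the output space ensemble in part (c), can be cross-checked by brute force. The sketch below is illustrative only and rests on stated assumptions: the noise term $\xi$ is ignored, ties are split evenly among the tied classes, and each spurious feature of a class-1 sample points to $e_1$ with probability $1 - 2p/3$ and to $e_2$ or $e_3$ with probability $p/3$ each. The function name `ensemble_accuracy` and the parameter `overlap_weight` are hypothetical; `overlap_weight` is the coefficient on the shared features $\mu_{v,2}$ and $\mu_{s,3}$ (4 for the weight space ensemble, 2 for the output space ensemble).

```python
from fractions import Fraction
from itertools import product


def ensemble_accuracy(p, overlap_weight):
    """Exact class-1 accuracy for K = 3, five spurious features, noise ignored."""
    probs = {1: 1 - 2 * p / 3, 2: p / 3, 3: p / 3}
    # Invariant features v1 and v3 have weight 1; the shared v2 has overlap_weight.
    invariant_score = 2 + overlap_weight
    accuracy = Fraction(0)
    # q[j] is the class that spurious feature j points to; index 2 is the shared s3.
    for q in product((1, 2, 3), repeat=5):
        weight = Fraction(1)
        scores = {1: Fraction(invariant_score), 2: Fraction(0), 3: Fraction(0)}
        for j, cls in enumerate(q):
            weight *= probs[cls]
            scores[cls] += overlap_weight if j == 2 else 1
        best = max(scores.values())
        winners = [c for c, s in scores.items() if s == best]
        if 1 in winners:
            accuracy += weight / len(winners)  # split ties evenly
    return accuracy


if __name__ == "__main__":
    p = Fraction(9, 10)
    print(ensemble_accuracy(p, 4) == 1 - 4 * p**4 / 81 - 8 * p**5 / 243)
    print(ensemble_accuracy(p, 2) == 1 - 4 * p**4 / 81 - p**5 / 27)
```

Because the enumeration uses rational arithmetic, agreement with the closed forms $1 - 4p^4/81 - 8p^5/243$ and $1 - 4p^4/81 - p^5/27$ is exact rather than approximate under the assumptions above.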
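Then we consider the first class:

$$\bar w(1)^\top x\bar\Phi + \tilde w(1)^\top x\tilde\Phi|_{y=e_1} = \sum_{i=1,3}\mu_{v,i}(1)^\top(\mu_{v,i}Q_{v,i}(1)) + 2\mu_{v,2}(1)^\top(\mu_{v,2}Q_{v,2}(1)) + \sum_{j=1,2,4,5}\mu_{s,j}(1)^\top(\mu_{s,j}Q_{s,j}(1)) + 2\mu_{s,3}(1)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi$$

$$= 4 + \sum_{j=1,2,4,5}\mu_{s,j}(1)^\top(\mu_{s,j}Q_{s,j}(1)) + 2\mu_{s,3}(1)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$

Similarly, we have

$$\bar w(2)^\top x\bar\Phi + \tilde w(2)^\top x\tilde\Phi|_{y=e_1} = \sum_{j=1,2,4,5}\mu_{s,j}(2)^\top(\mu_{s,j}Q_{s,j}(1)) + 2\mu_{s,3}(2)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi,$$

$$\bar w(3)^\top x\bar\Phi + \tilde w(3)^\top x\tilde\Phi|_{y=e_1} = \sum_{j=1,2,4,5}\mu_{s,j}(3)^\top(\mu_{s,j}Q_{s,j}(1)) + 2\mu_{s,3}(3)^\top(\mu_{s,3}Q_{s,3}(1)) + \xi.$$

Up to the noise terms, the difference between the scores of the first and the second class is

$$(\bar w(1)^\top x\bar\Phi + \tilde w(1)^\top x\tilde\Phi)|_{y=e_1} - (\bar w(2)^\top x\bar\Phi + \tilde w(2)^\top x\tilde\Phi)|_{y=e_1} = \begin{cases} -2, & \text{if } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 4 \text{ and } Q_{s,3}(1) = e_2, \\ -1, & \text{if } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 3,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_3) = 1 \text{ and } Q_{s,3}(1) = e_2, \\ 0, & \text{if } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 2,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_3) = 2 \text{ and } Q_{s,3}(1) = e_2, \\ & \text{or } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 4 \text{ and } Q_{s,3}(1) = e_3, \\ & \text{or } \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_2) = 3,\ \sum_{j=1,2,4,5} I(Q_{s,j}(1) = e_1) = 1 \text{ and } Q_{s,3}(1) = e_2, \\ \ge 1, & \text{otherwise.} \end{cases}$$

Then we can compute the probability of each case, counting both the second and the third class:

For the $-2$ case, the probability is $2(p/3)^5 = 2p^5/243$.

For the $-1$ case, the probability is $2 \cdot 4 \cdot (p/3)^4(p/3) = 8p^5/243$.

For the $0$ case, the probability is $2\big(\binom{4}{2}(p/3)^4(p/3) + (p/3)^4(p/3) + 4(1 - 2p/3)(p/3)^3(p/3)\big) = 8p^4/81 - 2p^5/243$.

Otherwise, the probability is $1 - 8p^4/81 - 8p^5/243$.

Then the total accuracy is approximately $1 - 4p^4/81 - p^5/27$, that is, it lies in the interval $[1 - 4p^4/81 - p^5/27 - \epsilon,\ 1 - 4p^4/81 - p^5/27 + \epsilon]$.

F.5 AUXILIARY LEMMAS

Lemma 1. For any $N$ i.i.d. random variables $\{z_i\}_{i=1}^{N} \sim N(0, \sigma^2)$ and $t > 0$, with probability at least $1 - Ne^{-t^2/(2\sigma^2)}$, we have $z_i + t \ge 0$ for any $i = 1, \dots, N$.

Proof. For any index $i$, by Markov's inequality, for any $\lambda > 0$ we have

$$P(z_i + t \le 0) = P(e^{-\lambda z_i} \ge e^{\lambda t}) \le \frac{E e^{-\lambda z_i}}{e^{\lambda t}} = \exp\Big\{\frac{\lambda^2\sigma^2}{2} - \lambda t\Big\}.$$

Taking the minimum of the right-hand side with respect to $\lambda$, we get

$$P(z_i + t \le 0) \le \exp\Big\{-\frac{t^2}{2\sigma^2}\Big\}.$$

Then, considering the random variables over all indices i = 1, . . .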
, N, we have

$$P(\min_i z_i + t \le 0) = P(e^{\max_i(-\lambda z_i)} \ge e^{\lambda t}) \le E\exp\{\max_i(-\lambda z_i) - \lambda t\},$$

while

$$E\exp\{\max_i(-\lambda z_i)\} = E\max_i e^{-\lambda z_i} \le \sum_i E e^{-\lambda z_i} = Ne^{\lambda^2\sigma^2/2}.$$

Taking the minimum of the right-hand side with respect to $\lambda$, we further obtain

$$P(\min_i z_i + t \le 0) \le Ne^{-t^2/(2\sigma^2)}.$$

Lemma 2. Suppose that $x \sim N(0, \sigma^2 I_d)$, and let $\{a_1, \dots, a_{k-1}\}$ be $k - 1$ vectors and $\delta_i \in \mathbb{R}$ for $i = 1, \dots, k - 1$. Then, by the Gram-Schmidt process, the probability of

$$a_1^\top x + \delta_1 > 0,\ \dots,\ a_{k-1}^\top x + \delta_{k-1} > 0$$

is equivalent to the probability of

$$e_1 + \frac{\delta_1}{\|a_1\|_2} > 0,$$

$$\sqrt{1 - \Big(\frac{a_2^\top v_1}{\|a_2\|_2\|v_1\|_2}\Big)^2}\, e_2 + \frac{a_2^\top v_1}{\|a_2\|_2\|v_1\|_2}\, e_1 + \frac{\delta_2}{\|a_2\|_2} > 0,$$

$$\dots$$

$$\sqrt{1 - \sum_{i=1}^{k-2}\Big(\frac{a_{k-1}^\top v_i}{\|a_{k-1}\|_2\|v_i\|_2}\Big)^2}\, e_{k-1} + \sum_{i=1}^{k-2}\frac{a_{k-1}^\top v_i}{\|a_{k-1}\|_2\|v_i\|_2}\, e_i + \frac{\delta_{k-1}}{\|a_{k-1}\|_2} > 0,$$

in which $\{e_i\}$ are i.i.d. $N(0, \sigma^2)$ and $\{v_i\}_{i=1}^{k-1}$ are orthogonal vectors spanning $\{a_i\}_{i=1}^{k-1}$, given by

$$v_1 = \frac{a_1}{\|a_1\|_2},\quad v_2 = \frac{a_2 - (a_2^\top v_1)v_1}{\sqrt{\|a_2\|_2^2 - (a_2^\top v_1)^2}},\quad \dots,\quad v_{k-1} = \frac{a_{k-1} - \sum_{i=1}^{k-2}(a_{k-1}^\top v_i)v_i}{\sqrt{\|a_{k-1}\|_2^2 - \sum_{i=1}^{k-2}(a_{k-1}^\top v_i)^2}}.$$

Lemma 3. Consider just two classes, $k = 1, 2$, and denote by $r_{1\to 2}$ the number of events $\{I(Q_{s,i}(1) = e_2)\}$ that hold. Suppose Assumption 1 holds. Then:

1. when $n_s + n_v - 2r_{1\to 2} > 0$, the probability $P((w(1) - w(2))^\top x\Phi > 0 \mid y = e_1)$ is larger than $1 - \epsilon$;

2. when $n_s + n_v - 2r_{1\to 2} = 0$, the probability $P((w(1) - w(2))^\top x\Phi > 0 \mid y = e_1)$ lies in $[1/2 - \epsilon, 1/2 + \epsilon]$;

3. when $n_s + n_v - 2r_{1\to 2} < 0$, the probability $P((w(1) - w(2))^\top x\Phi > 0 \mid y = e_1)$ is less than $\epsilon$.

Proof. This follows directly from Lemma 4, which is the more general version.

Lemma 4. Denote by $r_{k\to l}$ the number of events $\{I(Q_{s,i}(k) = e_l)\}_{i=1}^{n_s}$ that hold. Suppose Assumption 1 holds. Then for class $k$:

when $n_s + n_v - \sum_{l \neq k} r_{k\to l} > \max_{l \neq k} r_{k\to l}$, the accuracy is larger than $1 - \epsilon$;

when $n_s + n_v - \sum_{l \neq k} r_{k\to l} = \max_{l \neq k} r_{k\to l}$, denote by $N$ the number of events in $\{I(r_{k\to l} = n_s + n_v - \sum_{l' \neq k} r_{k\to l'})\}_{l \neq k}$ that hold; the accuracy lies in $[1/(N + 1) - \epsilon, 1/(N + 1) + \epsilon]$;

when $n_s + n_v - \sum_{l \neq k} r_{k\to l} < \max_{l \neq k} r_{k\to l}$, the accuracy is less than $\epsilon$.

Proof. Considering the conditional forecasting accuracy on class $k$, with respect to $r_{k\to 1}$, .
. . , rk K, it is equivalent with G({ns + nv P l =k rk l rk s, s = k}), which is defined previously. Then we can take analysis case by case: l =k rk l > maxl =k rk l In this case, all elements in function G( ) are larger than 0, which means that no smaller than 1/(N v+ N s). With Assumption 1, we have G({ns + nv X l =k rk l rk s, s = k}) F K( 1 σ(N v + N s)) 1 ϵ. l =k rk l = maxl =k rk l In this case, all elements in function G( ) are no smaller than 1/(N v +N s), except N zero elements. With Assumption 1, we have G({ns + nv X l =k rk l rk s, s = k}) 1 2N (1 ϵ) 1 2N ϵ. l =k rk l < maxl =k rk l In this case, there is at least one element in G( ) no larger than 1/(N v + N s), still considering Assumption 1, we have G({ns + nv X l =k rk l rk s, s = k}) F ( 1 σ(N v + N s)) ϵ. Lemma 5. With features {xv,i}nv i=1 and {xs,j}ns j=1, the classifier trained on infinite samples is equivalent with the mean value of x = Pnv i=1 xi + Pns j=1 xj. Proof. Considering the classifier, it should be maxw P(ˆy|Φ(x) w). From Definition 1, we have x | y N(µ y, (nv +ns)σ2), in which µ = Pnv i=1 µv,i +Pns j=1 µs,j. For simplicity, we denote z = x | y µ y, and z is independent of y. Published as a conference paper at ICLR 2024 Given samples as {(xi, yi)}i, by maximum likelihood estimation (MLE), we have arg max w Πn i=1 exp{ 1 (nv+ns)σ2 x i (w yi)} P ej exp{ 1 (nv+ns)σ2 x i (w ej)} 1 (nv + ns)σ2 x i (w yi) ej exp{ 1 (nv + ns)σ2 x i (w ej)} then taking derivative for each wk, we have 1 (nv + ns)σ2 1 (nv + ns)σ2 exp{ 1 (nv+ns)σ2 x i wk} P ej exp{ 1 (nv+ns)σ2 x i (w ej)}xi = 0, k = 1, . . . , K. On the other hand, using Bayesian formula to consider the conditional expectation of x, we have E(x | y = ek) = E(x1(y = ek)) P(y = ek) = E(x E[1(y = ek) | x]) E(x | y = ek) = KE x exp{ 1 (nv+ns)σ2 x (µ ek)} PK r=1 exp{ 1 (nv+ns)σ2 x i (µ er)} it implies that as sample size n goes to infinity, the following term can maximize the likelihood function: w(k) = µ ek = ( j=1 µj) ek, for k = 1, . . . , K. 
By scaling the classifier, we obtain the estimated classifier as

$$w(k) = \frac{1}{\sqrt{n_v + n_s}}\,\mu e_k,\quad \text{for any class } k = 1, \dots, K,$$

which is the same as $\frac{1}{\sqrt{n_v + n_s}}E_{x|y=e_k}[x \mid y = e_k]$.

G ILLUSTRATING THE THEORY OF WISE-FT

Recall from Definition 2 that $\bar f$ learns $\bar n_v$ invariant features and $\bar n_s$ spurious features, while another single model $\tilde f$ learns $\tilde n_v$ invariant features and $\tilde n_s$ spurious features. Further, $\bar f$ and $\tilde f$ learn $n_{vo}$ overlapped invariant features and $n_{so}$ overlapped spurious features. Let $\bar f$ denote the pre-trained model and $\tilde f$ denote the fine-tuned model. The WiSE-FT phenomenon is specifically the following: $\bar f$ has good OOD but bad ID performance, $\tilde f$ has bad OOD but good ID performance, and the weight space ensemble of $\bar f$ and $\tilde f$ has excellent OOD performance. These can be expressed as:

$$A_{id}(\bar f) < A_{id}(\tilde f),\quad A_{ood}(\bar f) > A_{ood}(\tilde f),\quad A_{ood}(f_{wse}) > \max\{A_{ood}(\bar f), A_{ood}(\tilde f)\}.$$

It is straightforward that the ID accuracy satisfies the following inequality due to Assumption 2:

$$A_{id}(\bar f) < A_{id}(\tilde f),\quad \text{if } \bar n_v + \bar n_s < \tilde n_v + \tilde n_s.$$

Intuitively, if a model learns more features, it can predict the label better in the ID setting. As for the OOD accuracy, by Proposition 2, we have

$$A_{ood}(\bar f) = F_p\left(\frac{(1 - p)\bar n_s + \bar n_v}{\sqrt{\bar n_s}}\right),\quad A_{ood}(\tilde f) = F_p\left(\frac{(1 - p)\tilde n_s + \tilde n_v}{\sqrt{\tilde n_s}}\right).$$

Furthermore, by Proposition 3, the OOD accuracy of the weight space ensemble (WSE) is

$$A_{ood}(f_{wse}) = F_p\left(\frac{(1 - p)(\bar n_s + \tilde n_s + 2n_{so}) + \bar n_v + \tilde n_v + 2n_{vo}}{\sqrt{\bar n_s + \tilde n_s + 14n_{so}}}\right).$$

Then the WiSE-FT phenomenon can be explained if the following conditions hold:

$$\bar n_v + \bar n_s < \tilde n_v + \tilde n_s,$$

$$\frac{(1 - p)\bar n_s + \bar n_v}{\sqrt{\bar n_s}} > \frac{(1 - p)\tilde n_s + \tilde n_v}{\sqrt{\tilde n_s}},$$

$$\frac{(1 - p)(\bar n_s + \tilde n_s + 2n_{so}) + \bar n_v + \tilde n_v + 2n_{vo}}{\sqrt{\bar n_s + \tilde n_s + 14n_{so}}} > \max\left\{\frac{(1 - p)\bar n_s + \bar n_v}{\sqrt{\bar n_s}},\ \frac{(1 - p)\tilde n_s + \tilde n_v}{\sqrt{\tilde n_s}}\right\}.$$

The theoretical results above characterize the conditions for the WiSE-FT phenomenon. To gain a better understanding, we use a concrete example for illustration: $p = 0.9$, and there are no overlapped features learned by the two models, i.e., $n_{so} = n_{vo} = 0$. The pre-trained model $\bar f$ learns some invariant and spurious features, i.e., $\bar n_v = 2$, $\bar n_s = 4$.
The fine-tuned model $\tilde f$ learns more spurious features and fewer invariant features, i.e., $\tilde n_v = 1$, $\tilde n_s = 6$. In this example, the fine-tuned model $\tilde f$ has better ID performance than the pre-trained model $\bar f$ since $\tilde n_v + \tilde n_s = 7 > \bar n_v + \bar n_s = 6$. The fine-tuned model $\tilde f$ has worse OOD performance than the pre-trained model $\bar f$ since $\tilde f$ focuses more on spurious features. Specifically, we have $A_{ood}(\bar f) > A_{ood}(\tilde f)$ since

$$A_{ood}(\bar f) = F_p\left(\frac{(1 - p)\bar n_s + \bar n_v}{\sqrt{\bar n_s}}\right) = F_p(1.2),\quad A_{ood}(\tilde f) = F_p\left(\frac{(1 - p)\tilde n_s + \tilde n_v}{\sqrt{\tilde n_s}}\right) \approx F_p(0.653).$$

Based on Proposition 3, the OOD performance of the weight space ensemble is

$$A_{ood}(f_{wse}) = F_p\left(\frac{(1 - p)(\bar n_s + \tilde n_s + 2n_{so}) + \bar n_v + \tilde n_v + 2n_{vo}}{\sqrt{\bar n_s + \tilde n_s + 14n_{so}}}\right) \approx F_p(1.265).$$

Recalling that $F_p(\cdot)$ is monotonically increasing, we can see that $A_{ood}(f_{wse}) > \max\{A_{ood}(\bar f), A_{ood}(\tilde f)\}$.

H ILLUSTRATING THE EFFECTIVENESS OF BANG THROUGH THE LENS OF ACCURACY ON THE CURVE

Considering that Mixup and Label Smoothing (LS) enhance the OOD performance of the fine-tuned model, we investigate whether the improvement achieved by BANG is primarily due to better calibration or to the fine-tuned model's enhanced OOD performance. In Appendix E.6, we present our findings, which include the following observations: (1) dividing the weight of the vanilla fine-tuned model by multiple scalars significantly enhances the performance of weight averaging, closely approaching the performance of BANG; (2) BANG corrects a substantial number of misclassified samples compared to the fine-tuned model.

To further investigate the performance of BANG, we examine the concept of Accuracy on the Line (Miller et al., 2021; Liang et al., 2023). We generate many checkpoints of vanilla fine-tuning by varying the hyperparameters: learning rates (1e-5, 2e-5, 3e-5, 5e-5), training epochs (4, 8, 10, 12, 16, 20), and learning rate schedules (cosine, step decay). Notably, the default hyperparameters used in Section 4 and mentioned in Wortsman et al. (2022) are a learning rate of 3e-5, 10 training epochs, and a cosine scheduler.
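The sweep described above, together with the averaging of each fine-tuned checkpoint with the pre-trained model, can be sketched as follows. This is a minimal illustration and not the authors' training code: the names `sweep_configs` and `average_weights`, the dictionary-of-lists weight format, and the default mixing coefficient `alpha = 0.5` are all assumptions made for the example.

```python
from itertools import product

# Hyperparameter values listed in the text.
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 5e-5]
EPOCHS = [4, 8, 10, 12, 16, 20]
SCHEDULES = ["cosine", "step"]


def sweep_configs():
    """Enumerate every (learning rate, epochs, schedule) combination."""
    return [
        {"lr": lr, "epochs": epochs, "schedule": schedule}
        for lr, epochs, schedule in product(LEARNING_RATES, EPOCHS, SCHEDULES)
    ]


def average_weights(pretrained, finetuned, alpha=0.5):
    """Interpolate two checkpoints parameter by parameter.

    Both checkpoints are dicts mapping a parameter name to a flat list of
    floats (a hypothetical format); alpha is the weight on the fine-tuned model.
    """
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


if __name__ == "__main__":
    configs = sweep_configs()
    print(len(configs))  # 4 * 6 * 2 combinations
    print({"lr": 3e-5, "epochs": 10, "schedule": "cosine"} in configs)
```

Each configuration yields one fine-tuned checkpoint, and each checkpoint yields one averaged model, so every point on the fine-tuned axis has a matching point on the averaged axis.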
Weight averaging is applied to each fine-tuned checkpoint with the pre-trained model. Figure 21 illustrates the OOD performance of each fine-tuned model, as well as that of the averaged model that combines the fine-tuned model with the pre-trained model. Interestingly, we observe that the OOD accuracy of the averaged model forms a quadratic function of the OOD accuracy of the fine-tuned model (Liang et al., 2023), rather than the linear relationship described in Miller et al. (2021). Furthermore, BANG demonstrates significant robustness in OOD scenarios, surpassing the curve of expected performance.

Figure 21: Illustrating the effectiveness of BANG through the lens of accuracy on the curve.
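Returning to the concrete example of Appendix G ($p = 0.9$, pre-trained model with 2 invariant and 4 spurious features, fine-tuned model with 1 invariant and 6 spurious features, no overlap), the arguments of $F_p(\cdot)$ can be evaluated directly. The sketch below is illustrative: the function names are hypothetical, and the Monte Carlo estimate of $F_p$ assumes $K = 3$, which the example in Appendix G does not specify. It uses the definition from Appendix F.2.5, $F_p(x) = P(\eta_1 > 0, \dots, \eta_{K-1} > 0)$ with $\eta \sim N(x\mathbf{1}, M)$, and samples $\eta$ as a shared Gaussian component plus an independent one, which reproduces the diagonal $p(K+2-pK)/K$ and off-diagonal $p(K+1-pK)/K$ of $M$.

```python
import math
import random


def fp_argument(n_v, n_s, p):
    """Argument of F_p for a single model (Proposition 2)."""
    return ((1 - p) * n_s + n_v) / math.sqrt(n_s)


def wse_argument(nv_bar, ns_bar, nv_tilde, ns_tilde, n_vo, n_so, p):
    """Argument of F_p for the weight space ensemble (Proposition 3)."""
    numerator = (1 - p) * (ns_bar + ns_tilde + 2 * n_so) + nv_bar + nv_tilde + 2 * n_vo
    return numerator / math.sqrt(ns_bar + ns_tilde + 14 * n_so)


def estimate_fp(xs, p, K=3, n_samples=20000, seed=0):
    """Monte Carlo estimate of F_p(x) for each x, sharing the same noise draws."""
    rng = random.Random(seed)
    shared_sd = math.sqrt(p * (K + 1 - p * K) / K)  # gives the off-diagonal of M
    own_sd = math.sqrt(p / K)                       # diagonal minus off-diagonal
    hits = [0] * len(xs)
    for _ in range(n_samples):
        z0 = rng.gauss(0, 1)
        lowest = min(shared_sd * z0 + own_sd * rng.gauss(0, 1) for _ in range(K - 1))
        for i, x in enumerate(xs):
            if x + lowest > 0:
                hits[i] += 1
    return [h / n_samples for h in hits]


if __name__ == "__main__":
    bar = fp_argument(2, 4, 0.9)
    tilde = fp_argument(1, 6, 0.9)
    wse = wse_argument(2, 4, 1, 6, 0, 0, 0.9)
    print(round(bar, 3), round(tilde, 3), round(wse, 3))
    print(estimate_fp([tilde, bar, wse], p=0.9))
```

Because the same noise draws are reused for every argument, the estimated accuracies are ordered in the same way as the arguments, which mirrors the monotonicity of $F_p$ used in the text.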