# Diverse Weight Averaging for Out-of-Distribution Generalization

Alexandre Ramé¹·\*, Matthieu Kirchmeyer¹·²·\*, Thibaud Rahier², Alain Rakotomamonjy²·⁴, Patrick Gallinari¹·², Matthieu Cord¹·³
¹Sorbonne Université, CNRS, ISIR, F-75005 Paris, France — ²Criteo AI Lab, Paris, France — ³Valeo.ai, Paris, France — ⁴Université de Rouen, LITIS, France — \*Equal contribution

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In this paper, we propose Diverse Weight Averaging (DiWA), a new WA strategy whose main motivation is to increase the functional diversity across averaged models. To this end, DiWA averages weights obtained from several independent training runs: indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between WA and standard functional ensembling. Moreover, this decomposition highlights that WA succeeds when the variance term dominates, which we show occurs when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on DomainBed without inference overhead.

## 1 Introduction

Learning robust models that generalize well is critical for many real-world applications [1, 2]. Yet, classical Empirical Risk Minimization (ERM) lacks robustness to distribution shifts [3, 4, 5]. To improve out-of-distribution (OOD) generalization in classification, several recent works proposed to train models simultaneously on multiple related but different domains [6]. Though theoretically appealing, domain-invariant approaches [7] either underperform [8, 9] or only slightly improve over [10, 11] ERM on the reference DomainBed benchmark [12].

The state-of-the-art strategy on DomainBed is currently to average the weights obtained along a training trajectory [13]. [14] argues that this weight averaging (WA) succeeds in OOD because it finds solutions with flatter loss landscapes. In this paper, we show the limitations of this flatness-based analysis and provide a new explanation for the success of WA in OOD. It is based on WA's similarity with ensembling [15], a well-known strategy to improve robustness [16, 17] that averages the predictions from various models. Based on [18], we present a bias-variance-covariance-locality decomposition of WA's expected error. It contains four terms: first, the bias, which we show increases under shift in label posterior distributions (i.e., correlation shift [19]); second, the variance, which we show increases under shift in input marginal distributions (i.e., diversity shift [19]); third, the covariance, which decreases when models are diverse; finally, a locality condition on the weights of the averaged models.

Based on this analysis, we aim at obtaining diverse models whose weights are averageable with our Diverse Weight Averaging (DiWA) approach. In practice, DiWA averages in weights the models
obtained from independent training runs that share the same initialization. The motivation is that these models are more diverse than those obtained along a single run [20, 21]. Yet, averaging the weights of independently trained networks with batch normalization [22] and ReLU layers [23] may seem counter-intuitive. Such averaging is effective mainly when the models can be connected linearly in the weight space via a low-loss path. Interestingly, this linear mode connectivity property [24] was empirically validated when the runs start from a shared pretrained initialization [25]. This insight is at the heart of DiWA but also of other recent works [26, 27, 28], as discussed in Section 6. In summary, our main contributions are the following:

- We propose a new theoretical analysis of WA for OOD based on a bias-variance-covariance-locality decomposition of its expected error (Section 2). By relating correlation shift to its bias and diversity shift to its variance, we show that WA succeeds under diversity shift.
- We empirically tackle the covariance term by increasing the diversity across models averaged in weights. In our DiWA approach, we decorrelate their training procedures: in practice, these models are obtained from independent runs (Section 3).
- We then empirically validate that diversity improves OOD performance (Section 4) and show that DiWA is state of the art on all real-world datasets from the DomainBed benchmark [12] (Section 5).

## 2 Theoretical insights

Under the setting described in Section 2.1, we introduce WA in Section 2.2 and decompose its expected OOD error in Section 2.3. Then, we separately consider the four terms of this bias-variance-covariance-locality decomposition in Section 2.4. This theoretical analysis allows us to better understand when WA succeeds and, most importantly, how to improve it empirically in Section 3.

### 2.1 Notations and problem definition

**Notations.** We denote $\mathcal{X}$ the input space of images, $\mathcal{Y}$ the label space and $\ell: \mathcal{Y}^2 \to \mathbb{R}^+$ a loss function. $S$ is the training (source) domain with distribution $p_S$, and $T$ is the test (target) domain with distribution $p_T$. For simplicity, we indistinctly use the notations $p_S$ and $p_T$ to refer to the joint, posterior and marginal distributions of $(X, Y)$. We note $f_S, f_T: \mathcal{X} \to \mathcal{Y}$ the source and target labeling functions. We assume that there is no noise in the data: then $f_S$ is defined on $\mathcal{X}_S \triangleq \{x \in \mathcal{X} \mid p_S(x) > 0\}$ by $f_S(x) = y$ for all $(x, y) \sim p_S$, and similarly $f_T$ is defined on $\mathcal{X}_T \triangleq \{x \in \mathcal{X} \mid p_T(x) > 0\}$ by $f_T(x) = y$ for all $(x, y) \sim p_T$.

**Problem.** We consider a neural network (NN) $f(\cdot, \theta): \mathcal{X} \to \mathcal{Y}$ made of a fixed architecture $f$ with weights $\theta$. We seek to minimize the target generalization error:

$$\mathcal{E}_T(\theta) = \mathbb{E}_{(x,y) \sim p_T}\big[\ell(f(x, \theta), y)\big]. \tag{1}$$

$f(\cdot, \theta)$ should approximate $f_T$ on $\mathcal{X}_T$. However, this is difficult in the OOD setup because we only have data from domain $S$ at training time, related to yet different from $T$. The differences between $S$ and $T$ are due to distribution shifts (i.e., the fact that $p_S(X, Y) \neq p_T(X, Y)$), which are decomposed per [19] into diversity shift (a.k.a. covariate shift), when the marginal distributions differ (i.e., $p_S(X) \neq p_T(X)$), and correlation shift (a.k.a. concept shift), when the posterior distributions differ (i.e., $p_S(Y|X) \neq p_T(Y|X)$ and $f_S \neq f_T$). The weights are typically learned on a training dataset $d_S$ from $S$ (composed of $n_S$ i.i.d. samples from $p_S(X, Y)$) with a configuration $c$, which contains all other sources of randomness in learning (e.g., initialization, hyperparameters, training stochasticity, number of epochs, etc.). We call $l_S = \{d_S, c\}$ a learning procedure on domain $S$, and explicitly write $\theta(l_S)$ to refer to the weights obtained after stochastic minimization of $\frac{1}{n_S}\sum_{(x,y) \in d_S} \ell(f(x, \theta), y)$ w.r.t. $\theta$ under $l_S$.
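To make these two shifts concrete, here is a minimal numpy sketch (our own illustration, not taken from the paper or the DomainBed code) of a one-dimensional problem: diversity shift moves the input marginal while keeping the labeling function, whereas correlation shift keeps the marginal but reverses the posterior, as on Colored MNIST.

```python
# Toy illustration (ours): diversity shift vs. correlation shift on a 1-D problem
# with a noiseless, thresholded labeling function.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def labels(x, flip=False):
    """Labeling function f(x) = 1[x > 0], optionally flipped (posterior reversal)."""
    y = (x > 0).astype(int)
    return 1 - y if flip else y

# Source domain S: x ~ N(0, 1), y = 1[x > 0].
x_s, y_s = rng.normal(0.0, 1.0, n), None
y_s = labels(x_s)

# Diversity shift (covariate shift): the input marginal moves to N(2, 1),
# but p(Y|X) and hence the labeling function are unchanged.
x_t_div = rng.normal(2.0, 1.0, n)
y_t_div = labels(x_t_div)

# Correlation shift (concept shift): same input marginal, but p(Y|X) is reversed
# at test time, so the labeling function flips.
x_t_cor = rng.normal(0.0, 1.0, n)
y_t_cor = labels(x_t_cor, flip=True)

# The Bayes classifier of S transfers perfectly under diversity shift
# but fails completely under correlation shift.
predict = lambda x: (x > 0).astype(int)
print("accuracy under diversity shift  :", (predict(x_t_div) == y_t_div).mean())  # ~1.0
print("accuracy under correlation shift:", (predict(x_t_cor) == y_t_cor).mean())  # ~0.0
```

This only illustrates the two definitions (marginal vs. posterior change); the effect of each shift on the bias and variance of a learned model is the subject of Section 2.4.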
### 2.2 Weight averaging for OOD and limitations of the current analysis

**Weight averaging.** We study the benefits of combining $M$ individual member weights $\{\theta_m\}_{m=1}^M \triangleq \{\theta(l_S^{(m)})\}_{m=1}^M$ obtained from $M$ (potentially correlated) identically distributed (i.d.) learning procedures $L_S^M \triangleq \{l_S^{(m)}\}_{m=1}^M$. Under conditions discussed in Section 3.2, these $M$ weights can be averaged despite the nonlinearities in the architecture $f$. Weight averaging (WA) [13], defined as

$$f_{\mathrm{WA}} \triangleq f(\cdot, \theta_{\mathrm{WA}}), \quad \text{where} \quad \theta_{\mathrm{WA}} \triangleq \theta_{\mathrm{WA}}(L_S^M) \triangleq \frac{1}{M}\sum_{m=1}^M \theta_m,$$

is the state of the art [14, 29] on DomainBed [12] when the weights $\{\theta_m\}_{m=1}^M$ are sampled along a single training trajectory (a description we refine in Remark 1 from Appendix C.2).

**Limitations of the flatness-based analysis.** To explain this success, Cha et al. [14] argue that flat minima generalize better; indeed, WA flattens the loss landscape. Yet, as shown in Appendix B, this analysis does not fully explain WA's spectacular results on DomainBed. First, flatness does not act on distribution shifts, thus the OOD error is uncontrolled with their upper bound (see Appendix B.1). Second, this analysis does not clarify why WA outperforms Sharpness-Aware Minimization (SAM) [30] for OOD generalization, even though SAM directly optimizes flatness (see Appendix B.2). Finally, it does not justify why combining WA and SAM succeeds in IID [31] yet fails in OOD (see Appendix B.3). These observations motivate a new analysis of WA; we propose one below that better explains these results.

### 2.3 Bias-variance-covariance-locality decomposition

We now introduce our bias-variance-covariance-locality decomposition, which extends the bias-variance decomposition [32] to WA. In the rest of this theoretical section, $\ell$ is the Mean Squared Error for simplicity; yet, our results may be extended to other losses as in [33]. In this case, the expected error of a model with weights $\theta(l_S)$ w.r.t. the learning procedure $l_S$ was decomposed in [32] into:

$$\mathbb{E}_{l_S}\,\mathcal{E}_T\big(\theta(l_S)\big) = \mathbb{E}_{(x,y)\sim p_T}\big[\mathrm{bias}^2(x, y) + \mathrm{var}(x)\big], \tag{BV}$$

where $\mathrm{bias}(x, y)$ and $\mathrm{var}(x)$ are the bias and variance of the considered model w.r.t. a sample $(x, y)$, defined later in Equation (BVCL). To decompose WA's error, we leverage the similarity (already highlighted in [13]) between WA and functional ensembling (ENS) [15, 34], a more traditional way to combine a collection of weights. More precisely, ENS averages the predictions: $f_{\mathrm{ENS}} \triangleq f_{\mathrm{ENS}}(\cdot, \{\theta_m\}_{m=1}^M) \triangleq \frac{1}{M}\sum_{m=1}^M f(\cdot, \theta_m)$. Lemma 1 establishes that $f_{\mathrm{WA}}$ is a first-order approximation of $f_{\mathrm{ENS}}$ when $\{\theta_m\}_{m=1}^M$ are close in the weight space.

**Lemma 1** (WA and ENS. Proof in Appendix C.1. Adapted from [13, 28].) Given $\{\theta_m\}_{m=1}^M$ with learning procedures $L_S^M = \{l_S^{(m)}\}_{m=1}^M$. Denoting $\bar\Delta \triangleq \max_{m \in \{1,\dots,M\}} \|\theta_m - \theta_{\mathrm{WA}}\|_2$, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$:

$$f_{\mathrm{WA}}(x) = f_{\mathrm{ENS}}(x) + O(\bar\Delta^2) \quad \text{and} \quad \ell\big(f_{\mathrm{WA}}(x), y\big) = \ell\big(f_{\mathrm{ENS}}(x), y\big) + O(\bar\Delta^2).$$
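As a quick sanity check of Lemma 1, the following self-contained PyTorch sketch (ours, independent of the released code) compares $f(\cdot, \theta_{\mathrm{WA}})$ with the prediction average $f_{\mathrm{ENS}}$ for $M = 5$ randomly perturbed copies of a small smooth network; the architecture and perturbation scales are arbitrary choices for illustration.

```python
# Minimal numerical check of Lemma 1 (ours): when member weights stay close,
# averaging weights (WA) approximates averaging predictions (ENS).
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
base = torch.nn.utils.parameters_to_vector(net.parameters()).detach()

def forward_with(vector, x):
    """Load a flat weight vector into `net` and return its predictions."""
    torch.nn.utils.vector_to_parameters(vector, net.parameters())
    with torch.no_grad():
        return net(x)

x = torch.randn(256, 10)
for scale in [1e-2, 1e-1, 1.0]:             # spread of the M members around `base`
    members = [base + scale * torch.randn_like(base) for _ in range(5)]
    wa = forward_with(torch.stack(members).mean(0), x)                 # f(., theta_WA)
    ens = torch.stack([forward_with(m, x) for m in members]).mean(0)   # mean of predictions
    gap = (wa - ens).abs().mean().item()
    print(f"spread {scale:>4}: |f_WA - f_ENS| = {gap:.2e}")
# The gap shrinks rapidly (roughly quadratically for this smooth network)
# as the members get closer in the weight space.
```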
This similarity is useful since Equation (BV) was extended into a bias-variance-covariance decomposition for ENS in [18, 35]. We can then derive the following decomposition of WA's expected test error. To take into account the $M$ averaged weights, the expectation is over the joint distribution describing the $M$ identically distributed (i.d.) learning procedures $L_S^M = \{l_S^{(m)}\}_{m=1}^M$.

**Proposition 1** (Bias-variance-covariance-locality decomposition of the expected generalization error of WA in OOD. Proof in Appendix C.2.) Denoting $\bar{f}_S(x) = \mathbb{E}_{l_S}[f(x, \theta(l_S))]$, under identically distributed learning procedures $L_S^M = \{l_S^{(m)}\}_{m=1}^M$, the expected generalization error on domain $T$ of $\theta_{\mathrm{WA}}(L_S^M) = \frac{1}{M}\sum_{m=1}^M \theta_m$ over the joint distribution of $L_S^M$ is:

$$\mathbb{E}_{L_S^M}\,\mathcal{E}_T\big(\theta_{\mathrm{WA}}(L_S^M)\big) = \mathbb{E}_{(x,y)\sim p_T}\Big[\mathrm{bias}^2(x, y) + \frac{1}{M}\,\mathrm{var}(x) + \frac{M-1}{M}\,\mathrm{cov}(x)\Big] + O(\bar\Delta^2), \tag{BVCL}$$

where
$$\mathrm{bias}(x, y) = y - \bar{f}_S(x), \qquad \mathrm{var}(x) = \mathbb{E}_{l_S}\Big[\big(f(x, \theta(l_S)) - \bar{f}_S(x)\big)^2\Big],$$
$$\mathrm{cov}(x) = \mathbb{E}_{l_S, l_S'}\Big[\big(f(x, \theta(l_S)) - \bar{f}_S(x)\big)\big(f(x, \theta(l_S')) - \bar{f}_S(x)\big)\Big], \qquad \bar\Delta^2 = \mathbb{E}_{L_S^M}\Big[\max_{m \in \{1,\dots,M\}} \|\theta_m - \theta_{\mathrm{WA}}\|_2^2\Big].$$

$\mathrm{cov}$ is the prediction covariance between two member models whose weights are averaged. The locality term $\bar\Delta^2$ is the expected squared maximum distance between the weights and their average. Equation (BVCL) decomposes the OOD error of WA into four terms. The bias is the same as that of each of its i.d. members. WA's variance is split into the variance of each of its i.d. members divided by $M$ and a covariance term. The last locality term constrains the weights to ensure the validity of our approximation. In conclusion, combining $M$ models divides the variance by $M$ but introduces the covariance and locality terms, which should be controlled along with the bias to guarantee a low OOD error.

### 2.4 Analysis of the bias-variance-covariance-locality decomposition

We now analyze the four terms in Equation (BVCL). We show that the bias dominates under correlation shift (Section 2.4.1) and the variance dominates under diversity shift (Section 2.4.2). Then, we discuss a trade-off between the covariance, reduced with diverse models (Section 2.4.3), and the locality term, reduced when weights are similar (Section 2.4.4). This analysis shows that WA is effective against diversity shift when $M$ is large and when its members are diverse yet close in the weight space.

#### 2.4.1 Bias and correlation shift (and support mismatch)

We relate the OOD bias to correlation shift [19] under Assumption 1, where $\bar{f}_S(x) \triangleq \mathbb{E}_{l_S}[f(x, \theta(l_S))]$. As discussed in Appendix C.3.2, Assumption 1 is reasonable for a large NN trained on a large dataset representative of the source domain $S$. It is relaxed in Proposition 4 from Appendix C.3.

**Assumption 1** (Small IID bias). $\exists\, \epsilon > 0$ small such that $\forall x \in \mathcal{X}_S$, $|\bar{f}_S(x) - f_S(x)| \leq \epsilon$.

**Proposition 2** (OOD bias and correlation shift. Proof in Appendix C.3.) With a bounded difference between the labeling functions $f_T - f_S$ on $\mathcal{X}_T \cap \mathcal{X}_S$, under Assumption 1, the bias on domain $T$ satisfies:

$$\mathbb{E}_{(x,y)\sim p_T}\big[\mathrm{bias}^2(x, y)\big] = \text{Correlation shift} + \text{Support mismatch} + O(\epsilon),$$

where
$$\text{Correlation shift} = \int_{\mathcal{X}_T \cap \mathcal{X}_S} \big(f_T(x) - f_S(x)\big)^2 p_T(x)\,dx \quad \text{and} \quad \text{Support mismatch} = \int_{\mathcal{X}_T \setminus \mathcal{X}_S} \big(f_T(x) - \bar{f}_S(x)\big)^2 p_T(x)\,dx.$$

We analyze the first term by noting that $f_T(x) \triangleq \mathbb{E}_{p_T}[Y|X = x]$ and $f_S(x) \triangleq \mathbb{E}_{p_S}[Y|X = x]$ for all $x \in \mathcal{X}_T \cap \mathcal{X}_S$. This expression confirms that our correlation shift term measures shifts in posterior distributions between source and target, as in [19]. It increases in the presence of spurious correlations: e.g., on Colored MNIST [8], where the color/label correlation is reversed at test time. The second term is caused by the support mismatch between source and target. It was analyzed in [36] and shown irreducible in their "No free lunch for learning representations for DG". Yet, this term can be tackled if we transpose the analysis to the feature space rather than the input space. This motivates encoding the source and target domains into a shared latent space, e.g., by pretraining the encoder on a task with minimal domain-specific information as in [36]. This analysis explains why WA fails under correlation shift, as shown on Colored MNIST in Appendix H: combining different models does not reduce the bias. Section 2.4.2 explains that WA is nevertheless efficient against diversity shift.
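Before turning to the variance term, the variance/covariance structure of Equation (BVCL) can be checked numerically: for $M$ identically distributed, pairwise correlated predictions, the variance of their average is $\mathrm{var}/M + \frac{M-1}{M}\mathrm{cov}$. A minimal Monte-Carlo sketch (ours), with an arbitrary correlation level $\rho$:

```python
# Monte-Carlo check (ours) of the variance/covariance terms in Equation (BVCL):
# for M i.d. correlated predictions, Var(mean) = var/M + (M-1)/M * cov.
import numpy as np

rng = np.random.default_rng(0)
M, runs, rho, sigma = 5, 200_000, 0.6, 1.0

# Simulate M member predictions at a fixed input x: a shared component (correlation)
# plus independent noise, so that the pairwise covariance equals rho * sigma^2.
shared = rng.normal(size=(runs, 1)) * np.sqrt(rho) * sigma
indiv = rng.normal(size=(runs, M)) * np.sqrt(1 - rho) * sigma
members = shared + indiv                              # shape (runs, M)

avg = members.mean(axis=1)                            # the averaged prediction
var = members.var(axis=0).mean()                      # per-member variance (~ sigma^2)
cov = np.cov(members[:, 0], members[:, 1])[0, 1]      # pairwise covariance (~ rho * sigma^2)

print("empirical Var(average):", avg.var())
print("var/M + (M-1)/M * cov :", var / M + (M - 1) / M * cov)
# Both numbers match: averaging divides the independent part of the variance by M,
# while the correlated part (cov) is what remains when members are not diverse.
```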
#### 2.4.2 Variance and diversity shift

Variance is known to be large in OOD [5] and to cause a phenomenon named underspecification, where models behave differently in OOD despite similar IID test accuracy. We now relate the OOD variance to diversity shift [19] in a simplified setting. We fix the source dataset $d_S$ (with input support $X_{d_S}$), the target dataset $d_T$ (with input support $X_{d_T}$) and the network's initialization. We obtain a closed-form expression for the variance of $f$ over all other sources of randomness under Assumptions 2 and 3.

**Assumption 2** (Kernel regime). $f$ is in the kernel regime [37, 38].

This states that $f$ behaves as a Gaussian process (GP); it is reasonable if $f$ is a wide network [37, 39]. The corresponding kernel $K$ is the neural tangent kernel (NTK) [37], which depends only on the initialization. GPs are useful because their variances have a closed-form expression (Appendix C.4.1). To simplify the expression of the variance, we now make Assumption 3.

**Assumption 3** (Constant norm and low intra-sample similarity on $d_S$). $\exists\, (\lambda_S, \epsilon)$ with $0 \leq \epsilon \ll \lambda_S$ such that $\forall x_S \in X_{d_S}$, $K(x_S, x_S) = \lambda_S$, and $\forall x_S' \neq x_S \in X_{d_S}$, $|K(x_S, x_S')| \leq \epsilon$.

This states that training samples have the same norm (following standard practice [39, 40, 41, 42]) and weakly interact [43, 44]. This assumption is further discussed and relaxed in Appendix C.4.2. We are now in a position to relate variance and diversity shift when $\epsilon \to 0$.

**Proposition 3** (OOD variance and diversity shift. Proof in Appendix C.4.) Given $f$ trained on a source dataset $d_S$ (of size $n_S$) with NTK $K$, under Assumptions 2 and 3, the variance on dataset $d_T$ is:

$$\mathbb{E}_{x_T \in X_{d_T}}\big[\mathrm{var}(x_T)\big] = \frac{n_S}{2\lambda_S}\,\mathrm{MMD}^2\big(X_{d_S}, X_{d_T}\big) + \lambda_T - \frac{n_S}{2\lambda_S}\,\beta_T + O(\epsilon), \tag{4}$$

where $\mathrm{MMD}$ is the empirical Maximum Mean Discrepancy in the RKHS of $K^2(x, y) = (K(x, y))^2$; $\lambda_T \triangleq \mathbb{E}_{x_T \in X_{d_T}} K(x_T, x_T)$ and $\beta_T \triangleq \mathbb{E}_{(x_T, x_T') \in X_{d_T}^2,\, x_T \neq x_T'} K^2(x_T, x_T')$ are the empirical mean similarities measured respectively between identical (w.r.t. $K$) and different (w.r.t. $K^2$) samples, averaged over $X_{d_T}$.

The MMD empirically estimates shifts in input marginals, i.e., between $p_S(X)$ and $p_T(X)$. Our expression of the variance is thus similar to the diversity shift formula in [19]: the MMD replaces the L1 divergence used in [19]. The other terms, $\lambda_T$ and $\beta_T$, only involve internal dependencies within the target dataset $d_T$: they are constants w.r.t. $X_{d_S}$ and thus do not depend on distribution shifts. At fixed $d_T$ and under our assumptions, Equation (4) shows that the variance on $d_T$ decreases when $X_{d_S}$ and $X_{d_T}$ are closer (for the MMD distance defined by the kernel $K^2$) and increases when they deviate. Intuitively, the further $X_{d_T}$ is from $X_{d_S}$, the less the model's predictions on $X_{d_T}$ are constrained after fitting $d_S$. This analysis shows that WA reduces the impact of diversity shift, as combining $M$ models divides the variance by $M$. This is a strong property, achieved without requiring any data from the target domain.
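To relate Proposition 3 to a computable quantity, here is a small numpy sketch (ours) of the unbiased empirical $\mathrm{MMD}^2$ in the RKHS of the squared kernel $K^2$, evaluated between two toy input supports standing in for $X_{d_S}$ and $X_{d_T}$; the RBF kernel and its bandwidth are arbitrary surrogates for the NTK.

```python
# Sketch (ours): unbiased empirical MMD^2 with the squared kernel K^2, the quantity that
# drives the diversity-shift term in Proposition 3. K is a simple RBF surrogate here,
# *not* the NTK of an actual network.
import numpy as np

def rbf(a, b, gamma=0.2):
    """K(a, b) for all pairs of rows; shape (len(a), len(b))."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(xs, xt, kernel=rbf):
    """Unbiased MMD^2 in the RKHS of K^2 (off-diagonal means on each domain)."""
    k = lambda a, b: kernel(a, b) ** 2            # squared kernel, as in Proposition 3
    kss, ktt, kst = k(xs, xs), k(xt, xt), k(xs, xt)
    n_s, n_t = len(xs), len(xt)
    ss = (kss.sum() - np.trace(kss)) / (n_s * (n_s - 1))
    tt = (ktt.sum() - np.trace(ktt)) / (n_t * (n_t - 1))
    return ss + tt - 2 * kst.mean()

rng = np.random.default_rng(0)
x_s = rng.normal(0.0, 1.0, size=(500, 2))          # source support X_{d_S}
for shift in [0.0, 0.5, 2.0]:                      # growing marginal (diversity) shift
    x_t = rng.normal(shift, 1.0, size=(500, 2))    # target support X_{d_T}
    print(f"marginal shift {shift}: MMD^2 = {mmd2(x_s, x_t):.4f}")
# MMD^2 grows with the marginal shift, i.e., the variance term of Proposition 3 grows too.
```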
#### 2.4.3 Covariance and diversity

The covariance term increases when the predictions of $\{f(\cdot, \theta_m)\}_{m=1}^M$ are correlated. In the worst case where all predictions are identical, the covariance equals the variance and WA is no longer beneficial. On the other hand, the lower the covariance, the greater the gain of WA over its members; this is derived by comparing Equations (BV) and (BVCL), as detailed in Appendix C.5. It motivates tackling the covariance by encouraging members to make different predictions, thus to be functionally diverse. Diversity is a widely analyzed concept in the ensemble literature [15], for which numerous measures have been introduced [45, 46, 47]. In Section 3, we aim at decorrelating the learning procedures to increase the members' diversity and reduce the covariance term.

#### 2.4.4 Locality and linear mode connectivity

To ensure that WA approximates ENS, the last locality term $O(\bar\Delta^2)$ constrains the weights to be close. Yet, the covariance term analyzed in Section 2.4.3 is antagonistic, as it motivates functionally diverse models. Overall, to reduce WA's error in OOD, we thus seek a good trade-off between diversity and locality. In practice, we consider that the main goal of this locality term is to ensure that the weights are averageable despite the nonlinearities in the NN, such that WA's error does not explode. This is why, in Section 3, we empirically relax this locality constraint and simply require that the weights be linearly connectable in the loss landscape, as in the linear mode connectivity [24]. We empirically verify later in Figure 1 that the approximation $f_{\mathrm{WA}} \approx f_{\mathrm{ENS}}$ remains valid even in this case.

## 3 DiWA: Diverse Weight Averaging

### 3.1 Motivation: weight averaging from different runs for more diversity

**Limitations of previous WA approaches.** Our analysis in Sections 2.4.1 and 2.4.2 showed that the bias and the variance terms are mostly fixed by the distribution shifts at hand. In contrast, the covariance term can be reduced by enforcing diversity across the models (Section 2.4.3) obtained from the learning procedures $\{l_S^{(m)}\}_{m=1}^M$. Yet, previous methods [14, 29] only average weights obtained along a single run. This corresponds to highly correlated procedures sharing the same initialization, hyperparameters, batch orders, data augmentations and noise, which only differ by the number of training steps. The models are thus mostly similar: this does not leverage the full potential of WA.

**DiWA.** Our Diverse Weight Averaging approach seeks to reduce the OOD expected error in Equation (BVCL) by decreasing the covariance across predictions: DiWA decorrelates the learning procedures $\{l_S^{(m)}\}_{m=1}^M$. Our weights are obtained from $M$ different runs with diverse learning procedures: these have different hyperparameters (learning rate, weight decay and dropout probability), batch orders, data augmentations (e.g., random crops, horizontal flipping, color jitter, grayscaling), stochastic noise and numbers of training steps. Thus, the corresponding models are more diverse on domain $T$ per [21] and reduce the impact of the variance when $M$ is large. However, this may break the locality requirement analyzed in Section 2.4.4 if the weights are too distant. Empirically, we show that DiWA works under two conditions: shared initialization and mild hyperparameter ranges. The full procedure is summarized in Algorithm 1 below.

**Algorithm 1 — DiWA pseudo-code**
- **Require:** $\theta_0$, a pretrained encoder and an initialized classifier; $\{h_m\}_{m=1}^H$, hyperparameter configurations.
- **Training:** for $m = 1$ to $H$: $\theta_m \leftarrow \mathrm{FineTune}(\theta_0, h_m)$.
- **Weight selection:**
  - *Uniform:* $\mathcal{M} = \{1, \dots, H\}$.
  - *Restricted:* rank $\{\theta_m\}_{m=1}^H$ by decreasing $\mathrm{ValAcc}(\theta_m)$; $\mathcal{M} \leftarrow \emptyset$; for $m = 1$ to $H$: if $\mathrm{ValAcc}(\theta_{\mathcal{M} \cup \{m\}}) \geq \mathrm{ValAcc}(\theta_{\mathcal{M}})$ then $\mathcal{M} \leftarrow \mathcal{M} \cup \{m\}$.
- **Inference:** with $f(\cdot, \theta_{\mathcal{M}})$, where $\theta_{\mathcal{M}} = \frac{1}{|\mathcal{M}|}\sum_{m \in \mathcal{M}} \theta_m$.
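For reference, a compact Python rendering of Algorithm 1 (a sketch in the spirit of the pseudo-code above, not the released implementation); `finetune` and `val_acc` are hypothetical placeholders for a DomainBed training run and the training-domain validation accuracy.

```python
# Sketch of Algorithm 1 (ours): uniform vs. restricted averaging of weights from
# H independent fine-tuning runs sharing the same initialization theta_0.
# `finetune(theta_0, h)` and `val_acc(state_dict)` are hypothetical placeholders.
import copy
import torch

def average(state_dicts):
    """Uniform average of a list of state dicts with identical keys.
    (Integer buffers, e.g. BatchNorm counters, become floats in this sketch.)"""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def diwa(theta_0, hyperparam_configs, finetune, val_acc, restricted=False):
    # Training: H independent runs from the shared initialization theta_0.
    members = [finetune(theta_0, h) for h in hyperparam_configs]

    if not restricted:                       # "uniform": average all weights
        return average(members)

    # "restricted": rank by validation accuracy, then greedily keep a member only
    # if adding it does not degrade the validation accuracy of the current average.
    members = sorted(members, key=val_acc, reverse=True)
    selected = [members[0]]
    for candidate in members[1:]:
        if val_acc(average(selected + [candidate])) >= val_acc(average(selected)):
            selected.append(candidate)
    return average(selected)
```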
### 3.2 Approach: shared initialization, mild hyperparameter search and weight selection

**Shared initialization.** The shared initialization condition follows [25]: when models are fine-tuned from a shared pretrained model, their weights can be connected along a linear path where the error remains low [24]. Following standard practice on DomainBed [12], our encoder is pretrained on ImageNet [48]; this pretraining is key as it controls the bias (by defining the feature support mismatch, see Section 2.4.1) and the variance (by defining the kernel $K$, see Appendix C.4.4). Regarding the classifier initialization, we test two methods. The first is random initialization, which may distort the features [49]. The second is Linear Probing (LP) [49]: it first learns the classifier (while freezing the encoder) to serve as a shared initialization; then, LP fine-tunes the encoder and the classifier together in the $M$ subsequent runs. The locality term is smaller as the weights remain closer (see [49]).

**Mild hyperparameter search.** As shown in Figure 5, extreme hyperparameter ranges lead to weights whose average may perform poorly. Indeed, weights obtained from extremely different hyperparameters may not be linearly connectable; they may belong to different regions of the loss landscape. In our experiments, we thus use the mild search space defined in Table 7, first introduced in SWAD [14]. These hyperparameter ranges induce diverse models that remain averageable in weights.

**Weight selection.** The last step of our approach (summarized in Algorithm 1) is to choose which weights to average among those available. We explore two simple weight selection protocols, as in [28]. The first, uniform, equally averages all weights; it is practical but may underperform when some runs are detrimental. The second, restricted (greedy in [28]), solves this drawback by restricting the number of selected weights: weights are ranked in decreasing order of validation accuracy and sequentially added only if they improve DiWA's validation accuracy.

In the following sections, we experimentally validate our theory. First, Section 4 confirms our findings on the OfficeHome dataset [50], where diversity shift dominates [19] (see Appendix E.2 for a similar analysis on PACS [51]). Then, Section 5 shows that DiWA is state of the art on DomainBed [12].

## 4 Empirical validation of our theoretical insights

We consider several collections of weights $\{\theta_m\}_{m=1}^M$ ($2 \leq M < 10$) trained on the "Clipart", "Product" and "Photo" domains from OfficeHome [50] with a shared random initialization and mild hyperparameter ranges. These weights are indifferently sampled either from a single run (every 50 batches) or from different runs. They are evaluated on "Art", the fourth domain from OfficeHome.

**WA vs. ENS.** Figure 1 validates Lemma 1 and that $f_{\mathrm{WA}} \approx f_{\mathrm{ENS}}$. More precisely, $f_{\mathrm{WA}}$ slightly but consistently improves over $f_{\mathrm{ENS}}$: we discuss this in Appendix D. Moreover, a larger $M$ improves the results; in accordance with Equation (BVCL), this motivates averaging as many weights as possible. In contrast, a large $M$ is computationally impractical for ENS at test time, as it requires $M$ forward passes.

Figure 1: Each dot displays the accuracy (↑) of weight averaging (WA) vs. the accuracy (↑) of prediction averaging (ENS) for M models.
Figure 2: Each dot displays the accuracy (↑) gain of WA over its members vs. the prediction diversity [46] (↑) for M models.

**Diversity and accuracy.** We validate in Figure 2 that $f_{\mathrm{WA}}$ benefits from diversity. Here, we measure diversity with the ratio-error [46], i.e., the ratio $N_{\mathrm{diff}}/N_{\mathrm{simul}}$ between the number of different errors $N_{\mathrm{diff}}$ and of simultaneous errors $N_{\mathrm{simul}}$ at test time for a pair in $\{f(\cdot, \theta_m)\}_{m=1}^M$. A higher average over the pairs means that members are less likely to err on the same inputs.
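For completeness, a minimal numpy sketch (ours) of the pairwise ratio-error used as the diversity measure above, under one natural reading of $N_{\mathrm{diff}}$ and $N_{\mathrm{simul}}$ (samples misclassified by exactly one member vs. by both members); the paper's appendix gives the exact definition.

```python
# Sketch (ours) of a pairwise ratio-error diversity measure in the spirit of [46]:
# the ratio between the number of inputs where exactly one member errs (N_diff)
# and the number of inputs where both members err simultaneously (N_simul).
import numpy as np

def ratio_error(preds_a, preds_b, labels):
    err_a = preds_a != labels
    err_b = preds_b != labels
    n_diff = np.logical_xor(err_a, err_b).sum()     # exactly one of the two is wrong
    n_simul = np.logical_and(err_a, err_b).sum()    # both are wrong on the same input
    return n_diff / max(n_simul, 1)                 # higher = more diverse errors

# Toy usage: two members with partly overlapping mistakes.
labels = np.array([0, 1, 1, 0, 1, 0])
member_a = np.array([0, 1, 0, 0, 0, 0])             # errs on indices 2 and 4
member_b = np.array([1, 1, 0, 0, 1, 0])             # errs on indices 0 and 2
print(ratio_error(member_a, member_b, labels))      # 2.0: two disjoint errors, one shared
```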
Specifically, the gain of $\mathrm{Acc}(\theta_{\mathrm{WA}})$ over the mean individual accuracy $\frac{1}{M}\sum_{m=1}^M \mathrm{Acc}(\theta_m)$ increases with diversity. Moreover, this phenomenon intensifies for larger $M$: the linear regression's slope (i.e., the accuracy gain per unit of diversity) increases with $M$. This is consistent with the $\frac{M-1}{M}$ factor of $\mathrm{cov}(x)$ in Equation (BVCL), as further highlighted in Appendix E.1.2. Finally, in Appendix E.1.1, we show that the conclusion also holds with CKAC [47], another established diversity measure (based on CKA).

**Increasing diversity, thus accuracy, via different runs.** We now investigate the difference between sampling the weights from a single run and from different runs. Figure 3 first shows that diversity increases when weights come from different runs. Second, in Figure 4, this is reflected in the OOD accuracies. Here, we rank by validation accuracy the 60 weights obtained (1) from 60 different runs and (2) along 1 well-performing run. We then consider the WA of the top $M$ weights as $M$ increases from 1 to 60. Both initially have the same performance and improve with $M$; yet, the WA of weights from different runs gradually outperforms the single-run WA. Finally, Figure 5 shows that this holds only for mild hyperparameter ranges and with a shared initialization. Otherwise, when hyperparameter distributions are extreme (as defined in Table 7) or when classifiers are not similarly initialized, DiWA may perform worse than its members due to a violation of the locality condition. These experiments confirm that diversity is key as long as the weights remain averageable.

Figure 3: Frequencies of prediction diversities (↑) [46] across 2 weights obtained along a single run or from different runs.
Figure 4: WA accuracy (↑) as M increases, when the M weights are obtained along a single run or from different runs.
Figure 5: Each dot displays the accuracy (↑) gain of WA over its members vs. prediction diversity (↑) for 2 ≤ M < 10 models.

## 5 Experimental results on the DomainBed benchmark

**Datasets.** We now present our evaluation on DomainBed [12]. By imposing the code, the training procedures and the ResNet50 [52] architecture, DomainBed is arguably the fairest benchmark for OOD generalization. It includes 5 multi-domain real-world datasets: PACS [51], VLCS [53], OfficeHome [50], TerraIncognita [54] and DomainNet [55]. [19] showed that diversity shift dominates in these datasets. Each domain is successively considered as the target $T$ while the other domains are merged into the source $S$. The validation dataset is sampled from $S$, i.e., we follow DomainBed's training-domain model selection. The experimental setup is further described in Appendix G.1. Our code is available at https://github.com/alexrame/diwa.

**Baselines.** ERM is standard Empirical Risk Minimization. Coral [10] is the best approach based on domain invariance. SWAD (Stochastic Weight Averaging Densely) [14] and MA (Moving Average) [29] average weights along one training trajectory but differ in their weight selection strategy. SWAD [14] is the current state of the art (SoTA) thanks to its overfit-aware strategy, yet at the cost of three additional hyperparameters (a patience parameter, an overfitting patience parameter and a tolerance rate) tuned per dataset. In contrast, MA [29] is easy to implement as it simply combines all checkpoints uniformly, starting from batch 100 until the end of training.
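For concreteness, a minimal sketch (ours) of the MA baseline described above: a uniform running average of the weights visited along a single training run, starting from batch 100; `model`, `batches` and `train_step` are hypothetical stand-ins for the DomainBed training loop, and per-batch averaging is assumed.

```python
# Sketch (ours) of the MA baseline: a uniform running average of the weights visited
# along a *single* training trajectory, starting at `start_step` (batch 100 in the paper).
import torch

def moving_average_training(model, batches, train_step, start_step=100):
    """Train `model` and return the uniform average of its weights from `start_step` on."""
    avg_state, n_avg = None, 0
    for step, batch in enumerate(batches):
        train_step(model, batch)                                # one optimization step
        if step < start_step:                                   # MA starts at batch 100
            continue
        current = {k: v.detach().clone().float() for k, v in model.state_dict().items()}
        n_avg += 1
        if avg_state is None:
            avg_state = current
        else:
            for k in avg_state:                                 # running uniform average
                avg_state[k] += (current[k] - avg_state[k]) / n_avg
    return avg_state    # state dict of the MA model (integer buffers become floats here)
```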
Finally, we report the scores obtained in [29] for the costly Deep Ensembles (DENS) [15] (with different initializations); we discuss other ensembling strategies in Appendix D.

**Our runs.** ERM and DiWA share the same training protocol in DomainBed; yet, instead of keeping only one run from the grid search, DiWA leverages $M$ runs. In practice, we sample 20 configurations from the hyperparameter distributions detailed in Table 7 and report the mean and standard deviation across 3 data splits. For each run, we select the weights of the epoch with the highest validation accuracy. ERM and MA select the model with the highest validation accuracy across the 20 runs, following standard practice on DomainBed. Ensembling (ENS) averages the predictions of all $M = 20$ models (with shared initialization). DiWA-restricted selects $1 \leq M \leq 20$ weights with Algorithm 1, while DiWA-uniform averages all $M = 20$ weights. Finally, DiWA-uniform with $M = 60$ averages uniformly the $M = 3 \times 20 = 60$ weights from all 3 data splits; it benefits from a larger $M$ (without additional inference cost) and from data diversity (see Appendix E.1.3). However, we cannot report standard deviations for this variant for computational reasons. Moreover, it cannot leverage the restricted weight selection, as the validation set is not shared across the 60 weights, which come from different data splits.

### 5.1 Results on DomainBed

We report our main results in Table 1, detailed per domain in Appendix G.2. With a randomly initialized classifier, DiWA-uniform with $M = 60$ is the best on PACS, VLCS and OfficeHome; DiWA-uniform with $M = 20$ is the second best on PACS and OfficeHome. On TerraIncognita and DomainNet, DiWA is penalized by some bad runs, which are filtered out by DiWA-restricted; this improves the results on these datasets. Classifier initialization with linear probing (LP) [49] improves all methods on OfficeHome, TerraIncognita and DomainNet. On these datasets, DiWA improves MA by 1.3, 0.5 and 1.1 points respectively. Averaged over all datasets, DiWA with LP establishes a new SoTA of 68.0%, improving over SWAD by 1.1 points.

Table 1: Accuracy (%, ↑) on DomainBed with ResNet50.

| Algorithm | Weight selection | Init | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg |
|---|---|---|---|---|---|---|---|---|
| ERM | N/A | Random | 85.5 ± 0.2 | 77.5 ± 0.4 | 66.5 ± 0.3 | 46.1 ± 1.8 | 40.9 ± 0.1 | 63.3 |
| Coral [10] | N/A | Random | 86.2 ± 0.3 | 78.8 ± 0.6 | 68.7 ± 0.3 | 47.6 ± 1.0 | 41.5 ± 0.1 | 64.6 |
| SWAD [14] | Overfit-aware | Random | 88.1 ± 0.1 | 79.1 ± 0.1 | 70.6 ± 0.2 | 50.0 ± 0.3 | 46.5 ± 0.1 | 66.9 |
| MA [29] | Uniform | Random | 87.5 ± 0.2 | 78.2 ± 0.2 | 70.6 ± 0.1 | 50.3 ± 0.5 | 46.0 ± 0.1 | 66.5 |
| DENS [15, 29] | Uniform: M = 6 | Random | 87.6 | 78.5 | 70.8 | 49.2 | 47.7 | 66.8 |
| ERM | N/A | Random | 85.5 ± 0.5 | 77.6 ± 0.2 | 67.4 ± 0.6 | 48.3 ± 0.8 | 44.1 ± 0.1 | 64.6 |
| MA [29] | Uniform | Random | 87.9 ± 0.1 | 78.4 ± 0.1 | 70.3 ± 0.1 | 49.9 ± 0.2 | 46.4 ± 0.1 | 66.6 |
| ENS | Uniform: M = 20 | Random | 88.0 ± 0.1 | 78.7 ± 0.1 | 70.5 ± 0.1 | 51.0 ± 0.5 | 47.4 ± 0.2 | 67.1 |
| DiWA | Restricted: M ≤ 20 | Random | 87.9 ± 0.2 | 79.2 ± 0.1 | 70.5 ± 0.1 | 50.5 ± 0.5 | 46.7 ± 0.1 | 67.0 |
| DiWA | Uniform: M = 20 | Random | 88.8 ± 0.4 | 79.1 ± 0.2 | 71.0 ± 0.1 | 48.9 ± 0.5 | 46.1 ± 0.1 | 66.8 |
| DiWA | Uniform: M = 60 | Random | 89.0 | 79.4 | 71.6 | 49.0 | 46.3 | 67.1 |
| ERM | N/A | LP | 85.9 ± 0.6 | 78.1 ± 0.5 | 69.4 ± 0.2 | 50.4 ± 1.8 | 44.3 ± 0.2 | 65.6 |
| MA [29] | Uniform | LP | 87.8 ± 0.3 | 78.5 ± 0.4 | 71.5 ± 0.3 | 51.4 ± 0.6 | 46.6 ± 0.0 | 67.1 |
| ENS | Uniform: M = 20 | LP | 88.1 ± 0.3 | 78.5 ± 0.1 | 71.7 ± 0.1 | 50.8 ± 0.5 | 47.0 ± 0.2 | 67.2 |
| DiWA | Restricted: M ≤ 20 | LP | 88.0 ± 0.3 | 78.5 ± 0.1 | 71.5 ± 0.2 | 51.6 ± 0.9 | 47.7 ± 0.1 | 67.5 |
| DiWA | Uniform: M = 20 | LP | 88.7 ± 0.2 | 78.4 ± 0.2 | 72.1 ± 0.2 | 51.4 ± 0.6 | 47.4 ± 0.2 | 67.6 |
| DiWA | Uniform: M = 60 | LP | 89.0 | 78.6 | 72.8 | 51.9 | 47.7 | 68.0 |

The first five rows report scores from prior work (DENS scores from [29]); the remaining rows correspond to our runs described above.
Table 2: Accuracy (%, ↑) on the OfficeHome domain "Art" with various training objectives.

| Algorithm | No WA | MA | DiWA | DiWA (all splits) |
|---|---|---|---|---|
| ERM | 62.9 ± 1.3 | 65.0 ± 0.2 | 67.3 ± 0.2 | 67.7 |
| Mixup | 63.1 ± 0.7 | 66.2 ± 0.3 | 67.8 ± 0.6 | 68.4 |
| Coral | 64.4 ± 0.4 | 64.4 ± 0.4 | 67.7 ± 0.2 | 68.2 |
| ERM/Mixup | N/A | N/A | 67.9 ± 0.7 | 68.9 |
| ERM/Coral | N/A | N/A | 68.1 ± 0.3 | 68.7 |
| ERM/Mixup/Coral | N/A | N/A | 68.4 ± 0.4 | 69.1 |

The last column averages the weights obtained from all 3 data splits, as in the $M = 60$ rows of Table 1, hence no standard deviation is reported.

**DiWA with different objectives.** So far, we used ERM, which does not leverage the domain information. Table 2 shows that DiWA-uniform benefits from averaging weights trained with Interdomain Mixup [56] and Coral [10]: accuracy gradually improves as we add more objectives. Indeed, as highlighted in Appendix E.1.3, DiWA benefits from the increased diversity brought by the various objectives. This suggests a new kind of linear connectivity across models trained with different objectives; the full analysis of this is left for future work.

### 5.2 Limitations of DiWA

Despite this success, DiWA has some limitations. First, DiWA cannot benefit from additional diversity that would break the linear connectivity between weights, as discussed in Appendix D. Second, DiWA (like all WA approaches) can tackle diversity shift but not correlation shift: this property is explained for the first time in Section 2.4 and illustrated in Appendix H on Colored MNIST.

## 6 Related work

**Generalization and ensemble.** To generalize under distribution shifts, invariant approaches [8, 9, 11, 10, 57, 58] try to detect the causal mechanism rather than memorize correlations; yet, they do not outperform ERM on various benchmarks [12, 19, 59]. In contrast, ensembling of deep networks [15, 60, 61] consistently increases robustness [16] and was successfully applied to domain generalization [29, 62, 63, 64, 65, 66]. As highlighted in [18] (whose analysis underlies our Equation (BVCL)), ensembling works thanks to the diversity among its members. This diversity comes primarily from the randomness of the learning procedure [15] and can be increased with different hyperparameters [67], data [68, 69, 70], augmentations [71, 72] or with regularizations [73, 65, 66, 74, 75].

**Weight averaging.** Recent works [13, 76, 77, 78] combine in weights (rather than in predictions) the models collected along a single run. This was shown to be suboptimal in IID [17] but successful in OOD [14, 29]. Following the linear mode connectivity [24, 79] and the observation that many independently trained models are connectable [80], a second group of works averages weights with fewer constraints [26, 27, 28, 81, 82, 83]. To induce greater diversity, [84] used a high constant learning rate; [80] explicitly encouraged the weights to encompass more volume in the weight space; [83] minimized the cosine similarity between weights; [85] used a tempered posterior. From a loss landscape perspective [20], these methods aim at "explor[ing] the set of possible solutions instead of simply converging to a single point", as stated in [84]. The recent Model soups approach introduced by Wortsman et al. [28] is a WA algorithm similar to Algorithm 1; yet, the theoretical analyses and the goals of the two works differ. Theoretically, we explain why WA succeeds under diversity shift: the bias/correlation shift, variance/diversity shift and diversity-based findings are novel and confirmed empirically. Regarding the motivation, our work aims at combining more diverse weights: it may be seen as a general framework to average weights obtained in various ways. In contrast, [28] challenges the standard model selection after a grid search.
Regarding the task, [28] and our work complement each other: while [28] demonstrates robustness on several ImageNet variants with distribution shift, we improve the SoTA on the multi-domain DomainBed benchmark against other established OOD methods after a thorough and fair comparison. Thus, DiWA and [28] are theoretically complementary, have different motivations, and are applied successfully to different tasks.

## 7 Conclusion

In this paper, we propose a new explanation for the success of WA in OOD by leveraging its ensembling nature. Our analysis is based on a new bias-variance-covariance-locality decomposition for WA, where we theoretically relate the bias to correlation shift and the variance to diversity shift. It also shows that diversity is key to improving generalization. This motivates our DiWA approach, which averages in weights models trained independently. DiWA improves the state of the art on DomainBed, the reference benchmark for OOD generalization. Critically, DiWA has no additional inference cost, removing a key limitation of standard ensembling. Our work may encourage the community to further create diverse learning procedures and objectives whose models may be averaged in weights.

## Acknowledgements

We would like to thank Jean-Yves Franceschi for his helpful comments and discussions on our paper. This work was granted access to the HPC resources of IDRIS under the allocation AD011011953 made by GENCI. We acknowledge the financial support of the French National Research Agency (ANR) through the chair VISA-DEEP (project number ANR-20-CHIA-0022-01) and the ANR projects DL4CLIM (ANR-19-CHIA-0018-01), RAIMO (ANR-20-CHIA-0021-01), OATMIL (ANR-17-CE23-0012) and LEAUDS (ANR-18-CE23-0020).

## References

[1] John R. Zech, Marcus A. Badgeley, Manway Liu, Anthony B. Costa, Joseph J. Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, 2018.
[2] Alex J. DeGrave, Joseph D. Janizek, and Su-In Lee. AI for radiographic COVID-19 detection selects shortcuts over signal. Nature Machine Intelligence, 2021.
[3] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
[4] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. The pitfalls of simplicity bias in neural networks. In NeurIPS, 2020.
[5] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. JMLR, 2020.
[6] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In ICML, 2013.
[7] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 2016.
[8] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint, 2019.
[9] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In ICML, 2021.
[10] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[11] Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In ICML, 2022.
[12] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In ICLR, 2021.
[13] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In UAI, 2018.
[14] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park. SWAD: Domain generalization by seeking flat minima. In NeurIPS, 2021.
[15] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, 2017.
[16] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In NeurIPS, 2019.
[17] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In ICLR, 2020.
[18] Naonori Ueda and Ryohei Nakano. Generalization error of ensemble estimators. In ICNN, 1996.
[19] Nanyang Ye, Kaican Li, Lanqing Hong, Haoyue Bai, Yiting Chen, Fengwei Zhou, and Zhenguo Li. OoD-Bench: Benchmarking and understanding out-of-distribution generalization datasets and algorithms. In CVPR, 2022.
[20] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint, 2019.
[21] Raphael Gontijo-Lopes, Yann Dauphin, and Ekin Dogus Cubuk. No one representation to rule them all: Overlapping features of training methods. In ICLR, 2022.
[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[23] Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint, 2018.
[24] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In ICML, 2020.
[25] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? In NeurIPS, 2020.
[26] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Hanna Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In CVPR, 2022.
[27] Michael Matena and Colin Raffel. Merging models with Fisher-weighted averaging. In NeurIPS, 2022.
[28] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.
[29] Devansh Arpit, Huan Wang, Yingbo Zhou, and Caiming Xiong. Ensemble of averages: Improving model selection and boosting performance in domain generalization. In NeurIPS, 2022.
[30] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.
[31] Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt J. Kusner. When do flat minima optimizers work? In NeurIPS, 2022.
[32] Ron Kohavi, David H. Wolpert, et al. Bias plus variance decomposition for zero-one loss functions. In ICML, 1996.
[33] Pedro Domingos. A unified bias-variance decomposition. In ICML, 2000.
[34] Thomas G. Dietterich. Ensemble methods in machine learning. In MCS, 2000.
[35] Gavin Brown, Jeremy Wyatt, and Ping Sun. Between two extremes: Examining decompositions of the ensemble objective function. In MCS, 2005.
[36] Yangjun Ruan, Yann Dubois, and Chris J. Maddison. Optimal representations for covariate shift. In ICLR, 2022.
[37] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.
[38] Amit Daniely. SGD learns the conjugate kernel class of the network. In NeurIPS, 2017.
[39] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In ICLR, 2017.
[40] Julien Ah-Pine. Normalized kernels as similarity indices. In PAKDD, 2010.
[41] Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Reproducing kernel Hilbert space, Mercer's theorem, eigenfunctions, Nystrom method, and use of kernels in machine learning: Tutorial and survey. arXiv preprint, 2021.
[42] Jason Rennie. How to normalize a kernel matrix. MIT Computer Science - Artificial Intelligence Lab Tech Rep, 2005.
[43] Hangfeng He and Weijie Su. The local elasticity of neural networks. In ICLR, 2020.
[44] Mariia Seleznova and Gitta Kutyniok. Neural tangent kernel beyond the infinite-width limit: Effects of depth and initialization. In ICML, 2022.
[45] Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 2003.
[46] Matti Aksela. Comparison of classifier selection methods for improving committee performance. In MCS, 2003.
[47] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey E. Hinton. Similarity of neural network representations revisited. In ICML, 2019.
[48] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[49] Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR, 2022.
[50] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
[51] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In ICCV, 2017.
[52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[53] Chen Fang, Ye Xu, and Daniel N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In ICCV, 2013.
[54] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In ECCV, 2018.
[55] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
[56] Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint, 2020.
[57] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In ICLR, 2020.
[58] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
[59] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara M. Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In ICML, 2021.
[60] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990.
[61] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In NeurIPS, 1995.
[62] Kowshik Thopalli, Sameeksha Katoch, Jayaraman J. Thiagarajan, Pavan K. Turaga, and Andreas Spanias. Multi-domain ensembles for domain generalization. In NeurIPS Workshop, 2021.
[63] Yusuf Mesbah, Youssef Youssry Ibrahim, and Adil Mehood Khan. Domain generalization using ensemble learning. In ISWA, 2022.
[64] Ziyue Li, Kan Ren, Xinyang Jiang, Bo Li, Haipeng Zhang, and Dongsheng Li. Domain generalization using pretrained models without fine-tuning. arXiv preprint, 2022.
[65] Yoonho Lee, Huaxiu Yao, and Chelsea Finn. Diversify and disambiguate: Learning from underspecified data. arXiv preprint, 2022.
[66] Matteo Pagliardini, Martin Jaggi, François Fleuret, and Sai Praneeth Karimireddy. Agree to disagree: Diversity through disagreement for better transferability. arXiv preprint, 2022.
[67] Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. In NeurIPS, 2020.
[68] Leo Breiman. Bagging predictors. Machine Learning, 1996.
[69] Jeremy Nixon, Balaji Lakshminarayanan, and Dustin Tran. Why are bootstrapped deep ensembles not better? In NeurIPS Workshop, 2020.
[70] Teresa Yeo, Oguzhan Fatih Kar, and Amir Roshan Zamir. Robustness via cross-domain ensembles. In ICCV, 2021.
[71] Yeming Wen, Ghassen Jerfel, Rafael Muller, Michael W. Dusenberry, Jasper Snoek, Balaji Lakshminarayanan, and Dustin Tran. Combining ensembles and data augmentation can harm your calibration. In ICLR, 2021.
[72] Alexandre Rame, Remy Sun, and Matthieu Cord. MixMo: Mixing multiple inputs for multiple outputs via deep subnetworks. In ICCV, 2021.
[73] Alexandre Rame and Matthieu Cord. DICE: Diversity in deep ensembles via conditional redundancy adversarial estimation. In ICLR, 2021.
[74] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In ICML, 2019.
[75] Damien Teney, Ehsan Abbasnejad, Simon Lucey, and Anton van den Hengel. Evading the simplicity bias: Training a diverse set of models discovers solutions with superior OOD generalization. arXiv preprint, 2021.
[76] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In ICML, 2018.
[77] Hao Guo, Jiyong Jin, and Bin Liu. Stochastic weight averaging revisited. arXiv preprint, 2022.
[78] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. Lookahead optimizer: k steps forward, 1 step back. In NeurIPS, 2019.
[79] Vaishnavh Nagarajan and J. Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS, 2019.
[80] Gregory Benton, Wesley Maddox, Sanae Lotfi, and Andrew Gordon Wilson. Loss surface simplexes for mode connecting volumes and fast ensembling. In ICML, 2021.
[81] Vipul Gupta, Santiago Akle Serrano, and Dennis DeCoste. Stochastic weight averaging in parallel: Large-batch training that generalizes well. In ICLR, 2020.
[82] Leshem Choshen, Elad Venezian, Noam Slonim, and Yoav Katz. Fusing finetuned models for better pretraining. arXiv preprint, 2022.
[83] Mitchell Wortsman, Maxwell Horton, Carlos Guestrin, Ali Farhadi, and Mohammad Rastegari. Learning neural network subspaces. In ICML, 2021.
[84] Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In NeurIPS, 2019.
[85] Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Subspace inference for Bayesian deep learning. In UAI, 2019.
[86] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. In ICML SCIS Workshop, 2022.
[87] Su Lin Blodgett, Lisa Green, and Brendan O'Connor. Demographic dialectal variation in social media: A case study of African-American English. In EMNLP, 2016.
[88] Solon Barocas and Andrew D. Selbst. Big data's disparate impact. California Law Review, 2016.
[89] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In ICML, 2017.
[90] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. In NeurIPS, 2021.
[91] Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In Big Data, 2020.
[92] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, 2022.
[93] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, 2003.
[94] Fernando Pérez-Cruz, Steven Van Vaerenbergh, Juan José Murillo-Fuentes, Miguel Lázaro-Gredilla, and Ignacio Santamaria. Gaussian processes for nonlinear signal processing: An overview of recent advances. IEEE Signal Processing Magazine, 2013.
[95] Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks. arXiv preprint, 2019.
[96] Damien Brain and Geoffrey I. Webb. On the effect of data set size on bias and variance in classification learning. In AKAW, 1999.
[97] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
[98] Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. John Wiley & Sons, 2019.
[99] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[100] Saurabh Singh, Derek Hoiem, and David Forsyth. Swapout: Learning an ensemble of deep architectures. In NeurIPS, 2016.
[101] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, 1992.
[102] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[103] Yann LeCun, Corinna Cortes, and Chris Burges. MNIST handwritten digit database, 2010.
[104] Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. Domain-adjusted regression or: ERM may already learn features sufficient for out-of-distribution generalization. 2022.

## Checklist

1. For all authors...
    - (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
    - (b) Did you describe the limitations of your work? [Yes] In Section 5.2.
    - (c) Did you discuss any potential negative societal impacts of your work? [Yes] In Appendix A.
    - (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
    - (a) Did you state the full set of assumptions of all theoretical results? [Yes] Assumption 1 discussed in Appendix C.3.2 and Assumptions 2 and 3 discussed in Appendix C.4.2.
    - (b) Did you include complete proofs of all theoretical results? [Yes] In Appendix C.
3. If you ran experiments...
    - (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Our code is available at https://github.com/alexrame/diwa.
    - (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 and Appendix G.1.
    - (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Defined by different data splits when possible.
    - (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Approximately 20000 hours of GPUs (Nvidia V100) on an internal cluster, mostly for the 2640 runs needed in Table 1.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
    - (a) If your work uses existing assets, did you cite the creators? [Yes] The DomainBed benchmark [12] and its datasets.
    - (b) Did you mention the license of the assets? [Yes] DomainBed is under "The MIT License".
    - (c) Did you include any new assets either in the supplemental material or as a URL? [No]
    - (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
    - (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
    - (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
    - (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
    - (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]