# SWAD: Domain Generalization by Seeking Flat Minima

Junbum Cha¹ Sanghyuk Chun² Kyungjae Lee³ Han-Cheol Cho⁴ Seunghyun Park⁴ Yunsung Lee⁵ Sungrae Park⁶

¹Kakao Brain ²NAVER AI Lab ³Chung-Ang University ⁴NAVER Clova ⁵Korea University ⁶Upstage AI Research

Equal contribution. Part of the work was done while at NAVER Clova. Correspondence to: Junbum Cha, Sungrae Park.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Domain generalization (DG) methods aim to achieve generalizability to an unseen target domain by using only training data from the source domains. Although a variety of DG methods have been proposed, a recent study shows that under a fair evaluation protocol, called DomainBed, the simple empirical risk minimization (ERM) approach performs comparably to or even outperforms previous methods. Unfortunately, simply solving ERM on a complex, non-convex loss function can easily lead to sub-optimal generalizability by seeking sharp minima. In this paper, we theoretically show that finding flat minima results in a smaller domain generalization gap. We also propose a simple yet effective method, named Stochastic Weight Averaging Densely (SWAD), to find flat minima. SWAD finds flatter minima and suffers less from overfitting than the vanilla SWA, thanks to a dense and overfit-aware stochastic weight sampling strategy. SWAD shows state-of-the-art performances on five DG benchmarks, namely PACS, VLCS, OfficeHome, Terra Incognita, and DomainNet, with consistent and large margins of +1.6% on average in out-of-domain accuracy. We also compare SWAD with conventional generalization methods, such as data augmentation and consistency regularization methods, to verify that the remarkable performance improvements originate from seeking flat minima, not from better in-domain generalizability. Last but not least, SWAD is readily adaptable to existing DG methods without modification; the combination of SWAD and an existing DG method further improves DG performances. Source code is available at https://github.com/khanrc/swad.

## 1 Introduction

The independent and identically distributed (i.i.d.) condition is the underlying assumption of machine learning experiments. However, this assumption may not hold in real-world scenarios, i.e., the training and the test data distributions may differ significantly due to distribution shifts. For example, a self-driving car should adapt to adverse weather or day-to-night shifts [1, 2]. Even in a simple image recognition scenario, systems rely on wrong cues for their prediction, e.g., geographic distribution [3], demographic statistics [4], texture [5], or backgrounds [6]. Consequently, a practical system requires generalizability to distribution shift, which traditional approaches often fail to achieve.

Domain generalization (DG) aims to address domain shift, simulated by training and evaluating on different domains. DG tasks assume that both task labels and domain labels are accessible. For example, the PACS dataset [7] has seven task labels (e.g., dog, horse) and four domain labels (e.g., photo, sketch).
Table 1: Comparisons with SOTA. The proposed SWAD outperforms other state-of-the-art DG methods on five different DG benchmarks with significant gaps (+1.6pp on average).

| | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
|---|---|---|---|---|---|---|
| ERM [29] | 85.5 | 77.5 | 66.5 | 46.1 | 40.9 | 63.3 |
| Best SOTA competitor | 86.6 [30] | 78.8 [31] | 68.7 [31] | 48.6 [32] | 43.6 [15, 33] | 65.3 |
| SWAD (proposed) | 88.1 | 79.1 | 70.6 | 50.0 | 46.5 | 66.9 |
| Previous SOTA [31] + SWAD | 88.3 | 78.9 | 71.3 | 51.0 | 46.8 | 67.3 |

Previous approaches explicitly reduced domain gaps in the latent space [8–12], obtained well-transferable model parameters through the meta-learning framework [13–16] or data augmentation [17–19], or captured causal relations [20, 21]. Despite numerous attempts over the past decade, Gulrajani and Lopez-Paz [22] showed that a simple empirical risk minimization (ERM) approach performs comparably to or even outperforms the previous attempts on diverse DG benchmarks under a fair evaluation protocol, called DomainBed.

Unfortunately, although ERM showed surprising empirical success on DomainBed, simply minimizing the empirical loss on a complex and non-convex loss landscape is typically not sufficient to arrive at good generalization [23–26]. In particular, the connection between the generalization gap and the flatness of loss landscapes has been actively discussed under the i.i.d. condition [23–28]. Izmailov et al. [25] argued that seeking flat minima leads to robustness against the loss landscape shift between training and test datasets, while a simple ERM converges to the boundary of a wide flat minimum and achieves insufficient generalization. In the DG scenario, because training and test loss landscapes differ more drastically due to the domain shift, we conjecture that the generalization gap between flat and sharp minima is larger than expected in the i.i.d. scenario.

To show that flatter minima generalize better to unseen domains, we formulate a robust risk minimization (RRM) problem defined by the worst-case empirical risks within neighborhoods in parameter space [26, 34]. We theoretically show that the generalization gap of DG, i.e., the error on the target domain, is upper bounded by RRM, i.e., by a flat optimal solution. Based on our theoretical observation, we modify stochastic weight averaging (SWA) [25], one of the popular existing flatness-aware solvers, by introducing a dense and overfit-aware stochastic weight sampling strategy. First, we suggest sampling weights densely, i.e., at every iteration. Also, we search for the start and end iterations for averaging by considering the validation loss to avoid overfitting. We empirically show that the proposed Stochastic Weight Averaging Densely (SWAD) finds flatter minima than the vanilla SWA does, resulting in better generalization to unseen domains.

**Contribution.** Our main contribution is introducing flatness into DG and showing that it remarkably outperforms existing DG methods. As shown in Table 1, our SWAD improves the average DG performances by 3.6pp against the ERM baseline and 1.6pp against the existing best methods. Furthermore, by combining SWAD and the previous SOTA [31], we achieve a further 0.4pp improvement over the vanilla SWAD results. We also empirically show that popular in-domain generalization methods that do not consider flatness, e.g., Mixup [35] or CutMix [36], are not effective for out-of-domain generalization (Table 3), whereas flatness-aware methods, e.g., SWA [25] or SAM [26], are the only methods effective for both in-domain and out-of-domain generalization.

## 2 A Theoretical Relationship between Flatness and Domain Generalization

Let $\mathcal{D} := \{\mathcal{D}_i\}_{i=1}^{I}$ be a set of training domains, where $\mathcal{D}_i$ is a distribution over the input space $\mathcal{X}$ and $I$ is the total number of domains.
From each domain, we observe $n$ training data points consisting of input $x$ and target label $y$, $(x^i_j, y^i_j)_{j=1}^{n} \sim \mathcal{D}_i$. We similarly define a set of target domains $\mathcal{T} := \{\mathcal{T}_i\}_{i=1}^{T}$, where the number of target domains $T$ is usually set to one. For the sake of simplicity, unlike Ben-David et al. [37], we assume that there exists a global labeling function $h(x)$ that generates target labels for multiple domains, i.e., $y^i_j = h(x^i_j)$ for all $i$ and $j$.

Domain generalization (DG) aims to find a model parameter $\theta \in \Theta$ which generalizes well over both the multiple training domains $\mathcal{D}$ and the unseen target domain $\mathcal{T}$. More specifically, let us consider a bounded instance loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, c]$ such that $\ell(y_1, y_2) = 0$ holds if and only if $y_1 = y_2$, where $\mathcal{Y}$ is the set of labels. For simplicity, we set $c$ to one in our proofs, but we note that $\ell(\cdot, \cdot)$ can be generalized to any bounded loss function. Then, we can define a population loss over multiple domains by $\mathcal{E}_{\mathcal{D}}(\theta) = \frac{1}{I}\sum_{i=1}^{I}\mathbb{E}_{x^i \sim \mathcal{D}_i}[\ell(f(x^i;\theta), y^i)]$, where $f(\cdot;\theta)$ is a model parameterized by $\theta$. Formally, the goal of DG is to find a model which minimizes both $\mathcal{E}_{\mathcal{D}}(\theta)$ and $\mathcal{E}_{\mathcal{T}}(\theta)$ by minimizing only an empirical risk $\hat{\mathcal{E}}_{\mathcal{D}}(\theta) := \frac{1}{In}\sum_{i=1}^{I}\sum_{j=1}^{n}\ell(f(x^i_j;\theta), y^i_j)$ over the training domains $\mathcal{D}$.

In practice, ERM, i.e., $\arg\min_\theta \hat{\mathcal{E}}_{\mathcal{D}}(\theta)$, can have multiple solutions that provide similar values of the training loss but significantly different generalizability on $\mathcal{E}_{\mathcal{D}}(\theta)$ and $\mathcal{E}_{\mathcal{T}}(\theta)$. Unfortunately, typical optimization methods, such as SGD and Adam [38], often lead to sub-optimal generalizability by finding sharp and narrow minima even under the i.i.d. assumption [23–28]. In the DG scenario, the generalization gap between the empirical loss and the target domain loss becomes even worse due to domain shift. Here, we provide a theoretical interpretation of the relationship between finding a flat minimum and minimizing the domain generalization gap, inspired by previous studies [23–28].

Figure 1: Robust risk minimization (RRM) and flat minima. With a proper $\gamma$, RRM will find flat minima. (The figure contrasts a flat and a sharp minimum of the empirical risk with the corresponding robust risk, marking the minimum of each risk.)

We consider a robust empirical loss function defined by the worst-case loss within neighborhoods in the parameter space as $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta) := \max_{\|\Delta\| \le \gamma} \hat{\mathcal{E}}_{\mathcal{D}}(\theta + \Delta)$, where $\|\cdot\|$ denotes the L2 norm and $\gamma$ is a radius which defines the neighborhood of $\theta$. Intuitively, if $\gamma$ is sufficiently larger than the radius of a sharp optimum $\theta_s$ of $\hat{\mathcal{E}}_{\mathcal{D}}(\theta)$, then neither $\theta_s$ nor its neighborhoods within the $\gamma$-ball remain optima of $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta)$. On the other hand, if an optimum $\theta_f$ has a larger radius than $\gamma$, there exists a local optimum within the $\gamma$-ball; see Figure 1. Hence, solving the robust risk minimization (RRM) problem, i.e., $\arg\min_\theta \hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta)$, will find a solution near a flat optimum, showing better generalizability [26, 34]. However, as domain shift worsens the generalization gap by breaking the i.i.d. assumption, it is not trivial that RRM will find an optimum with better DG performance. To answer this question, we first show the generalization bound between $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}$ and $\mathcal{E}_{\mathcal{T}}$ as follows:

**Theorem 1.** Consider a set of $N$ covers $\{\Theta_k\}_{k=1}^{N}$ such that the parameter space $\Theta \subseteq \bigcup_{k=1}^{N} \Theta_k$, where $\mathrm{diam}(\Theta) := \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_2$, $N := \lceil (\mathrm{diam}(\Theta)/\gamma)^d \rceil$, and $d$ is the dimension of $\Theta$. Let $v_k$ be the VC dimension of each $\Theta_k$.
Then, for any $\theta \in \Theta$, the following bound holds with probability at least $1 - \delta$:

$$\mathcal{E}_{\mathcal{T}}(\theta) < \hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta) + \frac{1}{2I}\sum_{i=1}^{I}\mathrm{Div}(\mathcal{D}_i, \mathcal{T}) + \max_{k \in [1, N]} \sqrt{\frac{v_k \ln(m/v_k) + \ln(N/\delta)}{m}},$$

where $m = nI$ is the number of training samples and $\mathrm{Div}(\mathcal{D}_i, \mathcal{T}) := 2\sup_{A}|P_{\mathcal{D}_i}(A) - P_{\mathcal{T}}(A)|$ is a divergence between two distributions. The proof can be done similarly to [37] and [34].

In Theorem 1, the test loss $\mathcal{E}_{\mathcal{T}}(\theta)$ is bounded by three terms: (1) the robust empirical loss $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta)$, (2) the discrepancy between the training and test distributions, i.e., the amount of domain shift, and (3) a confidence bound related to the radius $\gamma$ and the number of training samples $m$. Our theorem is similar to Ben-David et al. [37], but it does not have the term related to the difference in labeling functions across domains, because we simply assume that there is no difference between the labeling functions of each domain. If one assumes different labeling functions, the dissimilarity term can be derived easily because it is independent of and compatible with our main proof. More details of Theorem 1, including the proof and discussions on the confidence bound, are in Appendix C.1 and C.2.

From Theorem 1, one can conjecture that minimizing the robust empirical loss is directly related to the generalization performance on the target distribution. We show that the domain generalization gap on the target domain $\mathcal{T}$ of the optimal solution of RRM, $\hat{\theta}^{\gamma}$, is upper bounded as follows:

**Theorem 2.** Let $\hat{\theta}^{\gamma}$ denote the optimal solution of the RRM, i.e., $\hat{\theta}^{\gamma} := \arg\min_\theta \hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta)$, and let $v$ be the VC dimension of the parameter space $\Theta$. Then, the gap between the optimal test loss, $\min_{\theta'} \mathcal{E}_{\mathcal{T}}(\theta')$, and the test loss of $\hat{\theta}^{\gamma}$, $\mathcal{E}_{\mathcal{T}}(\hat{\theta}^{\gamma})$, has the following bound with probability at least $1 - \delta$:

$$\mathcal{E}_{\mathcal{T}}(\hat{\theta}^{\gamma}) - \min_{\theta'} \mathcal{E}_{\mathcal{T}}(\theta') \le \left(\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\hat{\theta}^{\gamma}) - \min_{\theta'} \hat{\mathcal{E}}_{\mathcal{D}}(\theta')\right) + \frac{1}{I}\sum_{i=1}^{I}\mathrm{Div}(\mathcal{D}_i, \mathcal{T}) + \max_{k \in [1, N]} \sqrt{\frac{v_k \ln(m/v_k) + \ln(2N/\delta)}{m}} + \sqrt{\frac{v \ln(m/v) + \ln(2/\delta)}{m}}.$$

Figure 2: Comparison between SWA and SWAD. Both panels plot the validation loss over the number of training iterations. (a) SWA collects stochastic weights every $K$ epochs, from the pre-defined $K_0$-th epoch to the final epoch, and averages them sparsely. (b) Our SWAD collects stochastic weights densely, i.e., at every iteration, to obtain sufficiently many weights. SWAD collects the weights from the start iteration $t_s$ to the end iteration $t_e$, where $t_s$ and $t_e$ are obtained by monitoring the validation loss against its minimum and an overfitting threshold (overfit-aware scheduling).

The proof is in Appendix C.3. The theorem implies that if we find the optimal solution of the RRM (i.e., $\hat{\theta}^{\gamma}$), then the generalization gap in the test domain (i.e., $\mathcal{E}_{\mathcal{T}}(\hat{\theta}^{\gamma}) - \min_{\theta'} \mathcal{E}_{\mathcal{T}}(\theta')$) is upper bounded by the gap between the RRM and ERM objectives (i.e., $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\hat{\theta}^{\gamma}) - \min_{\theta'} \hat{\mathcal{E}}_{\mathcal{D}}(\theta')$). The other terms in Theorem 2 are the discrepancy between the training domains $\mathcal{D}$ and the target domain $\mathcal{T}$, and the confidence bounds caused by sample means. We remark that if we choose a proper $\gamma$, the optimal solution of the RRM will be a point near a flat optimum of ERM, as shown in Figure 1. Hence, Theorem 2 and the intuition from Figure 1 imply that seeking a flat minimum of ERM will lead to a better domain generalization gap.
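As a toy illustration of why the first term on the right-hand side of Theorem 2 favors flat minima, consider a one-dimensional quadratic empirical loss (an example added here for exposition; it is not part of the original analysis):

$$\hat{\mathcal{E}}_{\mathcal{D}}(\theta) = \frac{a}{2}(\theta - \theta^{*})^{2} \quad\Longrightarrow\quad \hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\theta^{*}) = \max_{|\Delta| \le \gamma} \frac{a}{2}\Delta^{2} = \frac{a}{2}\gamma^{2}.$$

Here $\hat{\theta}^{\gamma} = \theta^{*}$ and $\min_{\theta'}\hat{\mathcal{E}}_{\mathcal{D}}(\theta') = 0$, so the RRM–ERM gap $\hat{\mathcal{E}}^{\gamma}_{\mathcal{D}}(\hat{\theta}^{\gamma}) - \min_{\theta'}\hat{\mathcal{E}}_{\mathcal{D}}(\theta') = \frac{a}{2}\gamma^{2}$ grows linearly with the curvature $a$: a flat minimum (small $a$) keeps the bound of Theorem 2 small, while a sharp minimum (large $a$) does not.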
## 3 SWAD: Domain Generalization by Seeking Flat Minima

We have shown that flat minima lead to better domain generalization. In this section, we propose the Stochastic Weight Averaging Densely (SWAD) algorithm, and provide empirical quantitative and qualitative analyses of SWAD and flatness to understand why SWAD works better than ERM.

### 3.1 A baseline method: stochastic weight averaging

Since the importance of flatness in loss landscapes has emerged [23–28], several methods have been proposed to find flat minima [25, 26, 39]. We select stochastic weight averaging (SWA) [25] as a baseline, which finds flat minima by a weight ensemble approach. More specifically, SWA updates a pretrained model (namely, a model trained for a sufficient number of training epochs, $K_0$) with a cyclical [40] or high constant learning rate schedule. SWA gathers model parameters every $K$ epochs during the update and averages them for the model ensemble. SWA thus finds an ensembled solution of different local optima reached with a learning rate large enough to escape a local minimum. Izmailov et al. [25] empirically showed that SWA finds flatter minima than ERM. We also considered sharpness-aware minimization (SAM) [26], another popular flatness-aware solver, but SWA finds flatter minima than SAM (see Figure 3). We illustrate an overview of SWA in Figure 2a.

### 3.2 Dense and overfit-aware stochastic weight sampling strategy

Despite its advantages, directly applying SWA to the DG task raises two problems. First, SWA averages only a few weights (usually fewer than ten) by sampling weights every $K$ epochs, which results in an inaccurate approximation of flat minima in a high-dimensional parameter space (e.g., 23M parameters for ResNet-50 [41]). Furthermore, a common DG benchmark protocol uses relatively few training epochs (e.g., Gulrajani and Lopez-Paz [22] trained with fewer than two epochs for the DomainNet benchmark), resulting in insufficient stochastic weights for SWA. From this motivation, we propose a dense sampling strategy to gather sufficiently many stochastic weights.

In addition, widely used DG datasets, such as PACS (~10K images, 7 classes) and VLCS (~11K images, 5 classes), are relatively small compared to large-scale datasets such as ImageNet [42] (~1.2M images, 1K classes). In this case, we observe that a simple ERM approach rapidly reaches a local optimum within only a few epochs and easily suffers from overfitting, i.e., the validation loss increases after a few training epochs. This implies that directly applying the vanilla SWA will suffer from the overfitting issue by averaging sub-optimal (i.e., overfitted) parameters. Hence, we need an overfit-aware sampling schedule that omits such sub-optimal solutions from the average.

Figure 3: Local flatness comparisons of ERM, SAM, SWA (cyclic), SWA (constant), and SWAD. (a) Average train flatness, (b) average test flatness, and (c) flatness for each target domain of PACS (art painting, cartoon, photo, sketch). We plot the local flatness via the loss gap, i.e., $F_\gamma(\theta) = \mathbb{E}_{\|\theta' - \theta\| = \gamma}[\mathcal{E}(\theta') - \mathcal{E}(\theta)]$, by varying the radius $\gamma$ on different domains of the PACS dataset. In each panel, the Y-axis indicates the flatness $F_\gamma(\theta)$ and the X-axis indicates the radius $\gamma$. We measure the train flatness $F^{\mathcal{D}}_\gamma(\theta)$ on seen domains and the test flatness $F^{\mathcal{T}}_\gamma(\theta)$ on the unseen domain. Each point is computed by Monte-Carlo approximation with 100 random samples. The comparison shows that SWAD finds flatter minima than not only ERM but also SAM and SWA.
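For concreteness, the vanilla SWA baseline from Section 3.1 can be sketched with PyTorch's `torch.optim.swa_utils`; the model, optimizer, and data-loader names below are placeholders and the hyperparameters are illustrative, not the exact setup used in this paper.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, optimizer, loss_fn, train_loader,
                   epochs=50, swa_start=40, swa_lr=0.05, device="cuda"):
    """Vanilla SWA (Section 3.1, sketch): regular training for `swa_start` epochs,
    then a high constant learning rate with one weight sample per epoch."""
    swa_model = AveragedModel(model)            # keeps the running average of sampled weights
    swa_scheduler = SWALR(optimizer, swa_lr=swa_lr)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)  # sparse sampling: one snapshot per epoch
            swa_scheduler.step()                # switch to the constant SWA learning rate
    update_bn(train_loader, swa_model, device=device)  # recompute BatchNorm statistics
    return swa_model
```

SWAD, described next, replaces the per-epoch sampling and the fixed start epoch with per-iteration sampling and a validation-loss-driven averaging window.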
The main idea of Stochastic Weight Averaging Densely (SWAD) is a dense and overfit-aware stochastic weight gathering strategy. First, instead of collecting weights every $K$ epochs, SWAD collects weights at every iteration. This dense sampling strategy collects sufficiently many weights far more easily than the sparse one. We also employ an overfit-aware sampling schedule by considering the trace of the validation loss. Instead of sampling weights from the $K_0$-th pretraining epoch to the final epoch, we search for the start iteration (when the validation loss achieves a local optimum for the first time) and the end iteration (when the validation loss no longer decreases but keeps increasing). More specifically, we introduce three parameters for searching the start iteration $t_s$ and the end iteration $t_e$: an optimum patience parameter $N_s$, an overfitting patience parameter $N_e$, and a tolerance rate $r$. First, we search for $t_s$ satisfying $\min_{i \in \{0, \dots, N_s - 1\}} \mathcal{E}^{(t_s + i)}_{\mathrm{val}} = \mathcal{E}^{(t_s)}_{\mathrm{val}}$, where $\mathcal{E}^{(i)}_{\mathrm{val}}$ denotes the validation loss at iteration $i$. Simply put, $t_s$ is the first iteration whose loss value is not improved upon during the following $N_s$ iterations. Then, we find $t_e$ satisfying $\min_{i \in \{0, \dots, N_e - 1\}} \mathcal{E}^{(t_e + i)}_{\mathrm{val}} > r\,\mathcal{E}^{(t_s)}_{\mathrm{val}}$. In other words, $t_e$ is the first iteration where the validation loss exceeds the tolerance $r\,\mathcal{E}^{(t_s)}_{\mathrm{val}}$ for $N_e$ consecutive iterations. We illustrate the overview of SWAD and its comparison to SWA in Figure 2. Detailed pseudocode is provided in Appendix B.4. We compare SWAD with other possible SWA strategies in Section 4.3 and show that our design choice works better for DG tasks.
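A minimal sketch of this selection rule is given below, operating post hoc on a per-iteration history of validation losses and weight snapshots. The data layout, names, and boundary handling are assumptions for illustration (in practice one would keep a running average instead of storing every snapshot); the authors' pseudocode in Appendix B.4 is the reference.

```python
from typing import Dict, List
import torch

def swad_average(snapshots: List[Dict[str, torch.Tensor]],
                 val_losses: List[float],
                 n_s: int = 3, n_e: int = 6, r: float = 1.3) -> Dict[str, torch.Tensor]:
    """Overfit-aware dense weight averaging (sketch of Section 3.2).
    snapshots[i] is the model state_dict after iteration i (dense sampling) and
    val_losses[i] is the validation loss evaluated at that iteration."""
    # t_s: first iteration whose validation loss is not improved upon
    # during the following n_s iterations.
    t_s = next(i for i in range(len(val_losses) - n_s + 1)
               if min(val_losses[i:i + n_s]) == val_losses[i])
    threshold = r * val_losses[t_s]
    # t_e: first iteration after t_s where the validation loss stays above the
    # tolerance r * E_val(t_s) for n_e consecutive iterations (fallback: last iteration).
    t_e = next((i for i in range(t_s + 1, len(val_losses) - n_e + 1)
                if min(val_losses[i:i + n_e]) > threshold),
               len(val_losses) - 1)
    # Average the densely collected weights in [t_s, t_e].
    num = t_e - t_s + 1
    averaged = {}
    for key, value in snapshots[t_s].items():
        if value.is_floating_point():
            averaged[key] = sum(snapshots[i][key] for i in range(t_s, t_e + 1)) / num
        else:  # e.g. BatchNorm counters: copy instead of averaging
            averaged[key] = snapshots[t_e][key].clone()
    return averaged
```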
### 3.3 Empirical analysis of SWAD and flatness

Here, we analyze the solutions found by SWAD in terms of flatness. We first verify that the SWAD solution is flatter than those of ERM, SWA, and SAM. Our loss surface visualization shows that the SWAD solution is located at the center of a flat region, while ERM finds a boundary solution. Finally, we show that the sharp boundary solutions found by ERM do not generalize well, resulting in sensitivity to model selection. All of the following empirical analyses are conducted on the PACS dataset, validated on all four domains (art painting, cartoon, photo, and sketch).

**Local flatness analysis.** To begin with, we quantify the local flatness of a model parameter $\theta$, assuming that flat minima will have smaller changes of the loss value within their neighborhoods than sharp minima. For a given model parameter $\theta$, we compute the expected loss value change between $\theta$ and parameters on the sphere surrounding $\theta$ with radius $\gamma$, i.e., $F_\gamma(\theta) = \mathbb{E}_{\|\theta' - \theta\| = \gamma}[\mathcal{E}(\theta') - \mathcal{E}(\theta)]$. In practice, $F_\gamma(\theta)$ is approximated by Monte-Carlo sampling with 100 samples. Note that the proposed local flatness $F_\gamma(\theta)$ is computationally more efficient than measuring curvature with Hessian-based quantities. Also, $F_\gamma(\theta)$ has an unbiased finite-sample estimator, while the worst-case loss gap, i.e., $\max_{\|\theta' - \theta\| = \gamma}[\mathcal{E}(\theta') - \mathcal{E}(\theta)]$, has no unbiased finite-sample estimator. In Figure 3, we compare $F_\gamma(\theta)$ of ERM, SAM, SWA with a cyclic learning rate, SWA with a constant learning rate, and SWAD by varying the radius $\gamma$. SAM and SWA find solutions with smaller $F_\gamma(\theta)$ (i.e., flatter minima) than ERM on average. SWAD finds the flattest minimum in every experiment.

**Loss surface visualization.** We visualize the loss landscapes by choosing three model weights on the optimization trajectory $(\theta_1, \theta_2, \theta_3)$² and computing the loss values of their linear combinations³, as in [25]. More details are in Appendix B.5. In Figure 4, we observe that in all cases, ERM solutions are located at the boundary of a flat minimum of the training loss, resulting in poor generalizability in test domains, which is aligned with our theoretical analysis and empirical flatness analysis.

Figure 4: Loss surfaces on model parameters in the PACS dataset for each target domain (art painting, cartoon, photo, sketch). The three triangles indicate model weights chosen at the end of the training phase with equal intervals. Each plane is defined by the three weights, and losses on the plane are visualized with contours. The center cross mark is the averaged point of the three weights. The first and second rows show the averaged training loss and the test loss surfaces, respectively.

Figure 5: Validation accuracies for in-domains, for each target domain of PACS (art painting, cartoon, photo, sketch). The X- and Y-axes indicate the training iterations and the accuracy, respectively, for the validation domains (legend) and the test domain (caption). The vertical dotted lines represent the start and end iterations, $t_s$ and $t_e$, identified by the overfit-aware sampling strategy of SWAD.

Since ERM solutions are located on the boundary of a flat loss surface, we observe that ERM solutions are very sensitive to model selection. In Figure 5, we illustrate the validation accuracies of ERM for each train-test domain combination of PACS over training iterations (one epoch is equivalent to 83 iterations). We first observe that ERM rapidly reaches its best accuracy within only a few training epochs, namely fewer than 6 epochs. Furthermore, the ERM validation accuracies fluctuate a lot, and the final performance is very sensitive to the model selection criterion. On the other hand, we observe that SWA solutions are located at the center of the training loss surfaces as well as of the test loss surfaces (Figure 4). Also, our overfit-aware stochastic weight gathering strategy (denoted by the vertical dotted lines in Figure 5) prevents the ensembled weight from overfitting and makes SWAD model-selection-free.

²We choose the weights at iterations 2500, 3500, and 4500 during training.
³Each point is defined by two axes $u$ and $v$ computed by $u = \theta_2 - \theta_1$ and $v = (\theta_3 - \theta_1) - \frac{\langle \theta_3 - \theta_1,\, \theta_2 - \theta_1 \rangle}{\|\theta_2 - \theta_1\|^2}(\theta_2 - \theta_1)$.
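The local flatness measure $F_\gamma(\theta)$ used above can be estimated with a simple Monte-Carlo loop; the sketch below uses placeholder names for the model, loss, and data loader and is not the authors' evaluation code.

```python
import torch

@torch.no_grad()
def local_flatness(model, loss_fn, loader, gamma, n_samples=100, device="cuda"):
    """Monte-Carlo estimate of F_gamma(theta) = E_{||theta'-theta||=gamma}[E(theta') - E(theta)]:
    average the loss increase over random parameter perturbations of norm gamma."""
    model.eval()  # keep BatchNorm/Dropout fixed while probing the loss

    def mean_loss():
        total, count = 0.0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * x.size(0)  # loss_fn returns the batch-mean loss
            count += x.size(0)
        return total / count

    base = [p.detach().clone() for p in model.parameters()]
    base_loss = mean_loss()
    gap_sum = 0.0
    for _ in range(n_samples):
        # Draw a random direction and rescale it onto the gamma-sphere around theta.
        noise = [torch.randn_like(p) for p in base]
        norm = torch.sqrt(sum((n ** 2).sum() for n in noise))
        for p, b, n in zip(model.parameters(), base, noise):
            p.copy_(b + gamma * n / norm)
        gap_sum += mean_loss() - base_loss
    for p, b in zip(model.parameters(), base):  # restore the original parameters
        p.copy_(b)
    return gap_sum / n_samples
```

Swept over a range of $\gamma$ values, such an estimator yields flatness curves of the kind summarized in Figure 3.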
Table 2: Comparison of domain generalization methods and SWAD. Out-of-domain accuracies on five domain generalization benchmarks are shown; the best results are highlighted in bold. Note that ERM (reproduced) and Mixstyle are our reproduced numbers, and the other numbers are taken from the original literature and from Gulrajani and Lopez-Paz [22]. Our experiments are repeated three times.

| Algorithm | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. |
|---|---|---|---|---|---|---|
| MASF [14] | 82.7 | - | - | - | - | - |
| DMG [33] | 83.4 | - | - | - | 43.6 | - |
| MetaReg [15] | 83.6 | - | - | - | 43.6 | - |
| ER [12] | 85.3 | - | - | - | - | - |
| pAdaIN [47] | 85.4 | - | - | - | - | - |
| EISNet [48] | 85.8 | - | - | - | - | - |
| DSON [30] | 86.6 | - | - | - | - | - |
| ERM [29] | 85.5 | 77.5 | 66.5 | 46.1 | 40.9 | 63.3 |
| ERM (reproduced) | 84.2 | 77.3 | 67.6 | 47.8 | 44.0 | 64.2 |
| IRM [20] | 83.5 | 78.6 | 64.3 | 47.6 | 33.9 | 61.6 |
| GroupDRO [49] | 84.4 | 76.7 | 66.0 | 43.2 | 33.3 | 60.7 |
| I-Mixup [50-52] | 84.6 | 77.4 | 68.1 | 47.9 | 39.2 | 63.4 |
| MLDG [13] | 84.9 | 77.2 | 66.8 | 47.8 | 41.2 | 63.6 |
| CORAL [31] | 86.2 | 78.8 | 68.7 | 47.7 | 41.5 | 64.5 |
| MMD [53] | 84.7 | 77.5 | 66.4 | 42.2 | 23.4 | 58.8 |
| DANN [9] | 83.7 | 78.6 | 65.9 | 46.7 | 38.3 | 62.6 |
| CDANN [10] | 82.6 | 77.5 | 65.7 | 45.8 | 38.3 | 62.0 |
| MTL [54] | 84.6 | 77.2 | 66.4 | 45.6 | 40.6 | 62.9 |
| SagNet [32] | 86.3 | 77.8 | 68.1 | 48.6 | 40.3 | 64.2 |
| ARM [16] | 85.1 | 77.6 | 64.8 | 45.5 | 35.5 | 61.7 |
| VREx [21] | 84.9 | 78.3 | 66.4 | 46.4 | 33.6 | 61.9 |
| RSC [55] | 85.2 | 77.1 | 65.5 | 46.6 | 38.9 | 62.7 |
| Mixstyle [17] | 85.2 | 77.9 | 60.4 | 44.0 | 34.0 | 60.3 |
| SWAD (ours) | **88.1** (±0.1) | **79.1** (±0.1) | **70.6** (±0.2) | **50.0** (±0.3) | **46.5** (±0.1) | **66.9** |

## 4 Experiments

### 4.1 Evaluation protocols

**Dataset and optimization protocol.** Following Gulrajani and Lopez-Paz [22], we exhaustively evaluate our method and comparison methods on various benchmarks: PACS [7] (9,991 images, 7 classes, and 4 domains), VLCS [43] (10,729 images, 5 classes, and 4 domains), OfficeHome [44] (15,588 images, 65 classes, and 4 domains), Terra Incognita [45] (24,788 images, 10 classes, and 4 domains), and DomainNet [46] (586,575 images, 345 classes, and 6 domains). For a fair comparison, we follow the training and evaluation protocol of Gulrajani and Lopez-Paz [22], including the dataset splits, hyperparameter (HP) search and model selection on the validation set (although SWAD does not need it), and optimizer HPs, except for the HP search space and the number of iterations for DomainNet. We use a reduced HP search space to reduce the computational cost. We also tripled the number of iterations for DomainNet from 5,000 to 15,000 because we observe that 5,000 iterations are not sufficient for convergence. We re-evaluate ERM with 15,000 iterations and observe a 3.1pp average performance improvement (40.9% → 44.0%) on DomainNet. For training, we choose one domain as the target domain and use the remaining domains as training domains, where 20% of the samples are used for validation and model selection. An ImageNet [42] pre-trained ResNet-50 [41] is employed as the initial weight and optimized by the Adam [38] optimizer with a learning rate of 5e-5. We construct a mini-batch containing all domains, where each domain contributes 32 images. We set the SWAD HPs $N_s$ to 3, $N_e$ to 6, and $r$ to 1.2 for VLCS and 1.3 for the others, via an HP search on the validation sets. Additional implementation details, such as the other HPs, are given in Appendix B.

**Evaluation metrics.** We report out-of-domain accuracies for each domain and their average, i.e., a model is trained and validated on the training domains and evaluated on the unseen target domain. Each out-of-domain performance is an average of three different runs with different train-validation splits.

### 4.2 Main results

Table 3: Comparison between generalization methods on PACS. The scores are averaged over all settings using different target domains. (↑) and (↓) indicate statistically significant improvement and degradation from ERM, respectively; (-) indicates no significant difference.

| | Out-of-domain | In-domain |
|---|---|---|
| ERM | 85.3 ±0.4 | 96.6 ±0.0 |
| EMA | 85.5 ±0.4 (-) | 97.0 ±0.1 (↑) |
| SAM | 85.5 ±0.1 (-) | 97.4 ±0.1 (↑) |
| Mixup | 84.8 ±0.3 (-) | 97.3 ±0.1 (↑) |
| CutMix | 83.8 ±0.4 (↓) | 97.6 ±0.1 (↑) |
| VAT | 85.4 ±0.6 (-) | 96.9 ±0.2 (↑) |
| Π-model | 83.5 ±0.5 (↓) | 96.8 ±0.2 (↑) |
| SWA | 85.9 ±0.1 (↑) | 97.1 ±0.1 (↑) |
| SWAD | 87.1 ±0.2 (↑) | 97.7 ±0.1 (↑) |

**Comparison with domain generalization methods.**
We report the full out-of-domain performances on the five DG benchmarks in Table 2. The full tables, including out-of-domain accuracies for each individual domain, are in Appendix E. In all experiments, our SWAD achieves significant performance gains over ERM as well as the previous best results: +2.6pp in PACS, +0.3pp in VLCS, +1.4pp in Terra Incognita, +1.9pp in OfficeHome, and +2.9pp in DomainNet compared to the previous best results. We observe that SWAD provides two practical advantages over previous methods. First, SWAD does not need any modification of the training objective or model architecture, i.e., it is universally applicable to other methods. As an example, we show that SWAD actually improves the performances of other DG methods, such as CORAL [31], in Table 4. Moreover, as discussed before, SWAD is free from model selection, resulting in stable performances (i.e., small standard errors) on various benchmarks. Note that we only compare results with a ResNet-50 backbone for a fair comparison. We describe the implementation details of each comparison method and the hyperparameter search protocol in Appendix B.

**Comparison with conventional generalization methods.** We also compare SWAD with other conventional generalization methods to show that the remarkable domain generalization gains of SWAD are not achieved by better in-domain generalization, but by seeking flat minima. The comparison methods include a flatness-aware optimization method, SAM [26]; an ensemble method, EMA [56]; data augmentation methods, Mixup [35] and CutMix [36]; and consistency regularization methods, VAT [57] and the Π-model [58]. Here we split the in-domain datasets into training (60%), validation (20%), and test (20%) splits, whereas no in-domain test set is used for Table 2. Every experiment is repeated three times. The results are shown in Table 3. We observe that all conventional methods help in-domain generalization, i.e., they perform better than ERM on the in-domain test set. However, their out-of-domain performances are similar to or even worse than ERM. For example, CutMix and the Π-model improve in-domain performances by 1.0pp and 0.2pp but degrade out-of-domain performances by 1.5pp and 1.8pp. SAM, another method for seeking flat minima, slightly increases both in-domain and out-of-domain performances, but the out-of-domain improvement is not statistically significant. We discuss the performance of SAM on other benchmarks later. In contrast, the vanilla SWA and our SWAD significantly improve both in-domain and out-of-domain performances. SWAD improves upon SWA with statistically significant gaps: 1.2pp out-of-domain and 0.6pp in-domain. A further comparison between SWA and SWAD is provided in Section 4.3.
Table 4: Combination of SWAD with other methods. The scores are averaged over every target domain case. The performances of ERM, CORAL, and SAM are optimized by the HP searches of DomainBed. In contrast, for the SWAD combination cases, CORAL and SAM use default HPs without an additional HP search. We additionally compare SWAD to SWA w/ const. Note that ERM + SWAD is the same as SWAD in Table 2.

| | PACS | VLCS | OfficeHome | TerraInc | DomainNet | Avg. (Δ) |
|---|---|---|---|---|---|---|
| ERM | 85.5 ±0.2 | 77.5 ±0.4 | 66.5 ±0.3 | 46.1 ±1.8 | 40.9 ±0.1 | 63.3 |
| ERM + SWA w/ const | 86.9 ±0.2 | 76.6 ±0.1 | 69.3 ±0.3 | 49.2 ±1.2 | 45.9 ±0.0 | 65.6 (+2.3) |
| ERM + SWAD | 88.1 ±0.1 | 79.1 ±0.1 | 70.6 ±0.2 | 50.0 ±0.3 | 46.5 ±0.1 | 66.9 (+3.6) |
| CORAL | 86.2 ±0.3 | 78.8 ±0.6 | 68.7 ±0.3 | 47.6 ±1.0 | 41.5 ±0.1 | 64.5 |
| CORAL + SWAD | 88.3 ±0.1 | 78.9 ±0.1 | 71.3 ±0.1 | 51.0 ±0.1 | 46.8 ±0.0 | 67.3 (+2.8) |
| SAM | 85.8 ±0.2 | 79.4 ±0.1 | 69.6 ±0.1 | 43.3 ±0.7 | 44.3 ±0.0 | 64.5 |
| SAM + SWAD | 87.1 ±0.2 | 78.5 ±0.2 | 69.9 ±0.1 | 45.3 ±0.9 | 46.5 ±0.1 | 65.5 (+1.0) |

**Combinations with other methods.** Since SWAD does not require any modification of training procedures or model architectures, it is universally applicable to other methods. Here, we combine SWAD with ERM, CORAL [31], and SAM [26]. The results are shown in Table 4. Both CORAL and SAM alone show better performances than ERM, with a +1.2pp average out-of-domain accuracy gap. Note that SAM is not a DG method but a sharpness-aware optimization method for finding flat minima; this supports our theoretical motivation that DG can be achieved by seeking flat minima. By applying SWAD to the baselines, the performances are consistently improved: by 3.6pp on ERM, 2.8pp on CORAL, and 1.0pp on SAM. Interestingly, CORAL + SWAD shows the best performance, incorporating the complementary advantages of utilizing domain labels and seeking flat minima. We also observe that SAM + SWAD shows worse performance than ERM + SWAD, while SAM alone performs better than ERM. We conjecture that this is because the objective control by SAM restricts the model parameter diversity during training, reducing the diversity available for the SWA-style ensemble. However, applying SWAD to SAM still leads to better performances than SAM alone. These results demonstrate that applying SWAD to other baselines is a simple yet effective recipe for DG.

### 4.3 Ablation study

Table 5: Ablation studies of the stochastic weight selection strategies on PACS and VLCS. In the configuration, $t_s$, $t_e$, lr, and interval indicate the start and end iterations of sampling, the learning rate schedule, and the stochastic weight sampling interval (in iterations), respectively. Opt and Overfit indicate the start and end iterations identified by our overfit-aware sampling strategy, and Val means the start and end iterations whose averaging shows the best accuracy on the validation set. Cyclic and Const represent cyclic and constant learning rate schedules. Columns marked "(out)" report out-of-domain accuracy and columns marked "(in)" report in-domain accuracy. All experiments are repeated three times.

| Configuration | $t_s$ | $t_e$ | lr | interval | PACS (out) | VLCS (out) | Avg. (out) | PACS (in) | VLCS (in) | Avg. (in) |
|---|---|---|---|---|---|---|---|---|---|---|
| SWA w/ cyclic | 4000 | 5000 | Cyclic | 100 | 85.9 ±0.1 | 76.6 ±0.1 | 81.2 | 97.1 ±0.1 | 85.0 ±0.2 | 91.0 |
| SWA w/ const | 4000 | 5000 | Const | 100 | 86.5 ±0.3 | 76.7 ±0.2 | 81.6 | 97.3 ±0.1 | 85.0 ±0.2 | 91.1 |
| SWAD w/o Dense | Opt | Overfit | Const | 100 | 86.5 ±0.4 | 78.0 ±0.7 | 82.2 | 97.6 ±0.1 | 85.8 ±0.4 | 91.7 |
| SWAD w/o Opt-Overfit | 4000 | 5000 | Const | 1 | 86.6 ±0.6 | 76.9 ±0.3 | 81.7 | 97.5 ±0.1 | 85.2 ±0.1 | 91.3 |
| SWAD w/o Overfit | Opt | 5000 | Const | 1 | 87.1 ±0.3 | 77.6 ±0.1 | 82.4 | 97.7 ±0.1 | 85.8 ±0.3 | 91.8 |
| SWAD fit-on-val | Val | Val | Const | 1 | 86.2 ±0.2 | 78.6 ±0.1 | 82.4 | 97.5 ±0.2 | 85.8 ±0.3 | 91.7 |
| SWAD (proposed) | Opt | Overfit | Const | 1 | 87.1 ±0.2 | 78.9 ±0.2 | 83.0 | 97.7 ±0.1 | 86.1 ±0.5 | 91.9 |

Table 5 provides ablative studies on the starting and ending iterations for averaging, the learning rate schedule, and the sampling interval. SWA w/ cyclic (SWA in Table 3) and SWA w/ const are vanilla SWAs with fixed sampling positions. We also report SWAD variants in which three factors are ablated: the dense sampling strategy, the search for the start iteration, and the search for the end iteration.
The dense sampling strategy lets SWAD estimate a more accurate approximation of flat minima: removing it degrades the average out-of-domain accuracy by 0.8pp (SWAD w/o Dense). When we take the average from $t_s$ to the final iteration, the out-of-domain performance degrades by 0.6pp (SWAD w/o Overfit). Similarly, a fixed schedule without the overfit-aware scheduling shows only a marginal improvement over the vanilla SWA (SWAD w/o Opt-Overfit). We also evaluate SWAD fit-on-val, which uses the averaging range achieving the best performance on the validation set, but it overfits to the validation set, resulting in lower performance than SWAD. These results demonstrate the benefits of combining the dense and overfit-aware sampling strategies of SWAD.

### 4.4 Exploring other applications: ImageNet robustness

Table 6: ImageNet robustness benchmarks. We show the ImageNet generalization performances on ImageNet-C, the background challenge (BGC), and ImageNet-R.

| Method | ImageNet (%) | ImageNet-C (mCE) | BGC (%) | ImageNet-R (%) |
|---|---|---|---|---|
| ERM | 76.5 | 57.6 | 8.7 | 36.7 |
| SWA | 76.9 | 56.8 | 10.9 | 37.5 |
| SWAD (ours) | 77.0 | 55.7 | 11.8 | 38.8 |

Since SWAD does not rely on domain labels, it can be applied to other robustness tasks that do not provide domain labels. Table 6 shows the generalizability of SWAD on ImageNet [42] and its shifted benchmarks, namely ImageNet-C [59], ImageNet-R [60], and the background challenge (BGC) [61]. SWAD consistently improves robustness performances over both the ERM baseline and the SWA baseline. These results support that our method is robustly and widely applicable to improving both in-domain and out-of-domain generalizability. The detailed setup is provided in Appendix B.6.

## 5 Discussion and Limitations

Despite the many benefits of SWAD, such as significant performance improvements, the model-selection-free property, and its plug-and-play applicability to various methods, there are some potential limitations. Here, we discuss the limitations of SWAD for further improvements.

**Confidence error in Theorem 1.** While the confidence error in Theorem 1 captures the effect of $\gamma$ on the generalization error bound, there is a limitation in that the confidence error term behaves improperly with respect to $\gamma$ when $\gamma$ is close to zero. The expected behavior is that the confidence error of RRM converges to the confidence error of ERM as $\gamma$ decreases to zero; however, the current theorem does not show such a tendency, since the confidence bound diverges to infinity as $\gamma$ goes to zero. We note, however, that this limitation is not a drawback of RRM itself, but is caused by the looseness of the union bound, a mathematical technique used to derive the confidence error of RRM. Our RRM formulation is similar to previous works [26, 34], and the counter-intuitive behavior of the confidence bound with respect to $\gamma$ also appears in Foret et al. [26].

**SWAD is not a perfect flatness-aware optimization method.** Note that SWAD is not a perfect and theoretically guaranteed solver for flat minima, but a heuristic approximation with empirical benefits. However, even if a better flatness-aware optimization method is proposed, our theoretical contribution still holds: it shows the relationship between flat minima and DG.

**SWAD does not strongly utilize domain-specific information.** In Theorem 2, the domain generalization gap is bounded by three factors: flat minima, domain discrepancy, and the confidence bound.
Most of the existing approaches focus on domain discrepancy, reducing the difference between the source domains and the target domain by domain-invariant learning [8–12]. SWAD focuses on the first factor, flat minima. While domain labels are used to construct mini-batches, SWAD does not strongly utilize domain-specific information. This implies that if one can consider both flatness and domain discrepancy, better domain generalization may be achievable. Table 4 gives us a clue: the combination of CORAL (utilizing domain-specific information) and SWAD (seeking flat minima) shows the best performance among all comparison methods. As a future research direction, we encourage studying methods that can achieve both flat optima and small domain discrepancy.

## 6 Concluding Remarks

In this paper, we theoretically and empirically demonstrate that domain generalization (DG) is achievable by seeking flat minima. We propose SWAD, which captures flatter minima than the vanilla SWA does. Extensive experiments on five DG benchmarks show the superior performance of SWAD compared with existing DG methods. In addition, combinations of SWAD with existing DG methods show even better performance than the vanilla SWAD. We theoretically and empirically observe that seeking flat minima can achieve better generalizability both in-domain and out-of-domain, while strong in-domain generalization methods that do not consider flatness, e.g., Mixup or CutMix, cannot guarantee out-of-domain generalizability in either theory or practice. This study first brings the concept of flatness into DG tasks and shows strong empirical performances not only on DG but also on ImageNet robustness benchmarks. We hope that this study promotes a new research direction of seeking flat minima for domain generalization and other robustness tasks.

## Acknowledgments and Disclosure of Funding

NAVER Smart Machine Learning (NSML) [62] and the Kakao Brain Cloud platform were used in the experiments. This work was supported by an IITP grant funded by the Korea government (MSIT) (No. 2021-0-01341, AI Graduate School Program, CAU).

## References

[1] Dengxin Dai and Luc Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3819–3824. IEEE, 2018.
[2] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.
[3] Terrance de Vries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. Does object recognition work for everyone? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 52–59, 2019.
[4] Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 547–558, 2020.
[5] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.
[6] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.
[7] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
[8] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
[9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[10] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
[11] Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning (ICML), 2020.
[12] Shanshan Zhao, Mingming Gong, Tongliang Liu, Huan Fu, and Dacheng Tao. Domain generalization via entropy regularization. Neural Information Processing Systems, 33, 2020.
[13] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
[14] Qi Dou, Daniel C Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. Neural Information Processing Systems, 2019.
[15] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. Neural Information Processing Systems, 31:998–1008, 2018.
[16] Marvin Zhang, Henrik Marklund, Abhishek Gupta, Sergey Levine, and Chelsea Finn. Adaptive risk minimization: A meta-learning approach for tackling group shift. arXiv preprint arXiv:2007.02931, 2020.
[17] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. In International Conference on Learning Representations, 2021.
[18] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. In International Conference on Learning Representations, 2018.
[19] Fabio M Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2229–2238, 2019.
[20] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[21] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688, 2020.
[22] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. In International Conference on Learning Representations, 2021.
[23] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
[24] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Neural Information Processing Systems, 2018.
[25] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. Conference on Uncertainty in Artificial Intelligence, 2018.
[26] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
[27] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence, 2017.
[28] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178, 2019.
[29] V Vapnik. Statistical learning theory. NY: Wiley, 1998.
[30] Seonguk Seo, Yumin Suh, Dongwan Kim, Jongwoo Han, and Bohyung Han. Learning to optimize domain specific normalization for domain generalization. European Conference on Computer Vision, 2020.
[31] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
[32] Hyeonseob Nam, Hyun Jae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8690–8699, 2021.
[33] Prithvijit Chattopadhyay, Yogesh Balaji, and Judy Hoffman. Learning to balance specificity and invariance for in and out of domain generalization. In European Conference on Computer Vision, pages 301–318. Springer, 2020.
[34] Matthew Norton and Johannes O Royset. Diametrical risk minimization: Theory and computations. arXiv preprint arXiv:1910.10844, 2019.
[35] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. International Conference on Learning Representations, 2018.
[36] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
[37] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.
[38] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[39] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
[40] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[42] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[43] Chen Fang, Ye Xu, and Daniel N Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In IEEE International Conference on Computer Vision, pages 1657–1664, 2013.
[44] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
[45] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In European Conference on Computer Vision, pages 456–473, 2018.
[46] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
[47] Oren Nuriel, Sagie Benaim, and Lior Wolf. Permuted AdaIN: Reducing the bias towards global statistics in image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[48] Shujun Wang, Lequan Yu, Caizi Li, Chi-Wing Fu, and Pheng-Ann Heng. Learning from extrinsic and intrinsic supervisions for domain generalization. In European Conference on Computer Vision, pages 159–176. Springer, 2020.
[49] Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020.
[50] Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In AAAI Conference on Artificial Intelligence, volume 34, pages 6502–6509, 2020.
[51] Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.
[52] Yufei Wang, Haoliang Li, and Alex C Kot. Heterogeneous domain generalization via domain mixup. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3622–3626. IEEE, 2020.
[53] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[54] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. Journal of Machine Learning Research, 22(2):1–55, 2021.
[55] Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. European Conference on Computer Vision, 2, 2020.
[56] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[57] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.
[58] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
[59] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
[60] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020.
[61] Kai Yuanqing Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. In International Conference on Learning Representations, 2020.
[62] Hanjoo Kim, Minkyu Kim, Dongjoo Seo, Jinwoong Kim, Heungseok Park, Soeun Park, Hyunwoo Jo, Kyung Hyun Kim, Youngil Yang, Youngkwan Kim, et al. NSML: Meet the MLaaS platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.