DNA: Domain Generalization with Diversified Neural Averaging

Xu Chu*1, Yujie Jin*2, Wenwu Zhu1, Yasha Wang2, Xin Wang1, Shanghang Zhang2, Hong Mei2
*Equal contribution. 1Tsinghua University. 2Peking University. Correspondence to: Wenwu Zhu, Yasha Wang, Xin Wang.
Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

The inaccessibility of the target domain data makes domain generalization (DG) methods prone to forgetting target-discriminative features, and challenges the pervasive theme in existing literature of pursuing a single classifier with an ideal joint risk. In contrast, this paper investigates model misspecification and attempts to bridge DG with classifier ensemble theoretically and methodologically. By introducing a pruned Jensen-Shannon (PJS) loss, we show that the target square-root risk w.r.t. the PJS loss of the ρ-ensemble (the averaged classifier weighted by a quasi-posterior ρ) is bounded by the averaged source square-root risk of the Gibbs classifiers. We derive a tighter bound by enforcing a positive principled diversity measure of the classifiers, and we give a PAC-Bayes upper bound on the target square-root risk of the ρ-ensemble. Methodologically, we propose a diversified neural averaging (DNA) method for DG, which optimizes the proposed PAC-Bayes bound approximately. The DNA method samples Gibbs classifiers transversely and longitudinally by simultaneously considering the dropout variational family and the optimization trajectory. The ρ-ensemble is approximated by averaging the longitudinal weights in a single run with dropout shut down, ensuring a fast ensemble with low computational overhead. Empirically, the proposed DNA method achieves state-of-the-art classification performance on standard DG benchmark datasets.

1. Introduction

In real-world challenges, the identically distributed condition between the training (source domain) data and the testing (target domain) data is likely to be violated. For example, the patient population in the medical image processing task (Li et al., 2020), the weather condition in the autonomous driving recognition task (Michaelis et al., 2019), and the visual resolution in multi-modality data analysis (Zhu et al., 2015) could vary. Traditional machine learning algorithms based on empirical risk minimization (ERM) (Vapnik, 1999) are error-prone to such distributional shifts, whence the empirical source risk does not necessarily converge to the population target risk. Deep learning algorithms are acknowledged as being especially sensitive to distributional shifts (Su et al., 2019; Recht et al., 2019). Such concerns about deploying machine learning algorithms in critical systems urge increasing attention to domain generalization (DG). DG aims at generalizing a classifier trained on the source domain(s) data to an unseen target domain in the presence of distributional shift (Blanchard et al., 2011; Muandet et al., 2013). A significant challenge of DG manifests as the inaccessibility of the target data, differing DG from domain adaptation (DA), where comparing covariates of the source domain and the target domain is possible during training (Ben-David et al., 2007). Many principles have been proposed to tackle DG over the last decade (Muandet et al., 2013; Volpi et al., 2018; Piratla et al., 2020).
Nevertheless, a predominant theme of existing DG literature is treating model misspecification (whence the hypothesis space's support may not contain an ideal classifier, David et al. (2010)) as negligible, and assuming that the support of the deep hypothesis space during training contains an ideal classifier whose joint risks over the source domain and the target domain are ideal (Muandet et al., 2013; Zhang et al., 2021a). Such an ideal joint risk assumption is challenged by the information bottleneck principle of deep neural networks (DNNs) (Tishby et al., 2000; Tishby & Zaslavsky, 2015): when training on the source data, a deep classifier tends to remember only discriminative features of the seen training data and forget any other information, including information that might be target-discriminative. The inaccessibility of the target data during training implies that the deep classifiers' hypothesis space tends to be supported on the subspace of low source risk, not necessarily on the subspace of low target risk. In a word, the ideal classifier may depart from the effective support of the training-stage hypothesis space.

A practical approach to handle model misspecification is classifier ensemble (model/classifier averaging) (Masegosa, 2020), which works by enriching the support set of the hypothesis space (Domingos, 1997). Empirical studies have found that ensembles outperform their Bayesian rivals under distributional shift (Fort et al., 2019). There is literature discussing ensembles for DA (Germain et al., 2013; 2016b). However, there is no adequate research on bridging DG and ensembles from the lens of model misspecification, possibly due to the lack of a theoretically convenient loss function.

This paper attempts to bridge DG and classifier ensemble both theoretically and methodologically. We connect a domain's generalization risk for a given classifier with a pruned Jensen-Shannon (PJS) divergence by proposing a novel pruned Jensen-Shannon (PJS) loss function. The risk with respect to (w.r.t.) the PJS loss function has two intriguing properties: firstly, the square-root risk inherits the satisfaction of the triangle inequality from the original JS divergence (Endres & Schindelin, 2003); secondly, the square-root risk enjoys convexity w.r.t. the classifiers, a property that the original JS divergence does not possess. With the assurance of convexity and satisfaction of the triangle inequality, we show that the target square-root risk of the ρ-ensemble (the averaged classifier weighted by a quasi-posterior ρ) is bounded by the averaged source square-root risk of the Gibbs classifiers. We prove that the upper bound on the target square-root risk can be made tighter by enforcing a positive diversity measure of the classifiers in an ensemble. The diversity measure can be seen as a principled extension of the measure proposed recently (Masegosa, 2020; Ortega et al., 2021) to the case of distributional shift. To enable a principled algorithm, we give a PAC-Bayes bound on the square-root target risk in terms of empirical source risks. Grounded on the proposed PAC-Bayes bound on the square-root target risk, we propose a practical fast-ensemble method for DG with low computational overhead, dubbed Diversified Neural Averaging (DNA). The DNA method samples Gibbs classifiers transversely and longitudinally by simultaneously considering the dropout variational family and the optimization trajectory.
For dropout realizations, the proposed DNA encourages a positive diversity measure explicitly. The diversity for trajectory realizations is induced implicitly during training. In order to reduce the computational overhead, inspired by the recent advance in fast neural ensembles (Izmailov et al., 2018), the proposed DNA method approximates the ρ-ensemble by DNN weight averaging in a single run, without saving checkpoints.

We highlight the contributions of this paper:
1. This paper theoretically and methodologically bridges DG and classifier ensemble from the lens of model misspecification, and it is, to our knowledge, the first to discuss a principled diversity measure in the DG scenario.
2. This paper proposes a novel PJS loss, connecting the generalization risk and the PJS divergence. Thence it allows an upper bound on the square-root target risk of the ρ-ensemble. This paper also proposes a PAC-Bayes upper bound in terms of empirical source risks.
3. This paper proposes a principled DNA method for DG. The DNA method is a fast-ensemble method with low computational overhead and a competitive method for DG with high performance. The proposed DNA method achieves competitive classification performance on five DG benchmarks (PACS (Li et al., 2017), VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), TerraIncognita (Beery et al., 2018) and DomainNet (Peng et al., 2019)) in extensive experiments.

2. Related Work

Over the past decades, there has been a long line of DG literature. Many efforts are devoted to mitigating the domain gap (Muandet et al., 2013) or the risk gap (Arjovsky et al., 2019) in a learned latent space: (a) directly optimizing a statistical distance (Sun & Saenko, 2016) such as the Wasserstein distance (Ganin et al., 2016; Li et al., 2018d) and the maximum mean discrepancy (MMD) (Li et al., 2018b; Blanchard et al., 2021); (b) indirectly reducing the domain gap by methods such as data augmentation (Wang et al., 2020b) or normalization (Nam & Kim, 2018); (c) adjusting to the target distribution at test time, such as meta-learning methods (Li et al., 2018a; 2019; Zhang et al., 2020) and uncertainty minimization methods (Wang et al., 2020a; Iwasawa & Matsuo, 2021). Another line of work tries to learn stable/causal features across domains by learning disentangled features (Mahajan et al., 2021; Zhang et al., 2021b; Li et al., 2021).

A more relevant line of DG methods is increasing the model robustness. The class of input-robust methods learns a classifier robust against input perturbations (Sinha et al., 2018; Volpi et al., 2018; Sagawa et al., 2020; Yi et al., 2021; Krueger et al., 2021) by optimizing the worst-case risk over a set of perturbations on the inputs. On the other hand, SWAD (Cha et al., 2021) and Transfer (Zhang et al., 2021a) consider perturbations in the hypothesis (parameter) space. Although SWAD and our DNA both employ stochastic weight averaging (SWA) (Izmailov et al., 2018) from the methodological perspective, the motivations differ: SWAD adopts SWA to seek a perturbation-resistant parameter subspace, whereas the proposed DNA adopts SWA to approximate the diversified ρ-ensemble and reduce computational overhead. Bayesian model averaging (BMA) (Xiao et al., 2021) is a class of methods closely related to ensembles. BMA reduces the uncertainty in distinguishing the best single model with limited data (Hoeting et al., 1999). Classifier ensemble works by enriching the hypothesis space (Domingos, 1997).
There are a few works proposing heuristic-based multi-expert methods (Seo et al., 2020; Zhou et al., 2021; Zhang et al., 2021c), which aim to exploit the complementary information in various source domains. Domain labels are essential for those multi-expert methods, whereas they are not necessary for our method. A recent work (Thomas et al., 2021) assumes an environment $\mathcal{B}$ and proposes an upper bound for the averaged-case target risk $\mathbb{E}_{Q\sim\mathcal{B}}\,\mathbb{E}_{(x,y)\sim Q}[\ell]$. In contrast, we propose a stronger result, an upper bound for the worst-case target risk $\mathbb{E}_{(x,y)\sim Q}[\ell]$ for any $Q$. A contemporary work (Arpit et al., 2021) proposes an EoA method which also adopts ensembles to tackle DG. We highlight three inherent differences: (a) DNA is more theoretically grounded, optimizing the proposed PAC-Bayes target risk bound w.r.t. the PJS loss function. (b) DNA promotes a principled measure of diversity, and our theoretical results can be viewed as a justification for EoA. (c) DNA is a fast-ensemble method that approximates a diversified classifier average in a single run, instead of averaging classifiers out of multiple runs with multiplied computational cost. A more detailed discussion of the theoretical connections with existing literature is given in Sect. 3.3.

In this section (please refer to Appendix A for the proofs of this section), we first formulate the DG problem. Then we introduce the definition and properties of the proposed PJS divergence and PJS loss function, where we set up a link between the generalization risk and the proposed divergence. After that, we introduce theoretical results on the ρ-ensembles, demonstrating the benefits of considering a diversified classifier ensemble for DG. Lastly, we discuss the connections with existing DG theories.

Problem Formulation. Denote the input space by $\mathcal{X} \subseteq \mathbb{R}^n$ and the output space by the $(C-1)$-simplex $\mathcal{Y} = \Delta^{C-1}$. Denote by $\mathcal{M}_+(\mathcal{X}\times\mathcal{Y})$ the space of all probability measures on the sample space $\mathcal{X}\times\mathcal{Y}$ (provided with a σ-algebra Σ). The common setting of domain generalization (DG) focuses on $C$-class single-label classification tasks. The source domain refers to a probability measure $P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$ that is available for sampling, where $\mathcal{I} = \{y \in \{0,1\}^C : \sum_{j=1}^{C} y_j = 1\} \subset \mathcal{Y}$ is the space of one-hot label vectors. (We consider a single source domain and a single target domain for easiness and fairness concerns (Hashimoto et al., 2018); the extension to multiple source domains is straightforward.) In contrast, the target domain refers to a related measure $Q \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$ that is invisible during training. Given a sample $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically distributed (i.i.d.) from $P^n$, a hypothesis space $\mathcal{H}$ of stochastic classifiers, and a loss function $\ell: \mathcal{X}\times\mathcal{I}\times\mathcal{H} \to \mathbb{R}_+$, the goal of DG is to learn a classifier $\hat{h} \in \mathcal{H}$ with minimal generalization risk on the target domain $Q$,
$$\hat{h} = \arg\min_{h\in\mathcal{H}} R^{\ell}_{Q}(h) = \arg\min_{h\in\mathcal{H}} \mathbb{E}_{(x,y)\sim Q}[\ell(x,y,h)]. \quad (1)$$

3.1. The PJS Divergence and PJS Loss

Definition 3.1 (pruned Jensen-Shannon divergence). Consider a subset of $\mathcal{M}_+(\mathcal{X}\times\mathcal{Y})$ induced by a dominating measure λ, $\mathcal{M}^{\lambda}_{+} = \{\mu \in \mathcal{M}_+(\mathcal{X}\times\mathcal{Y}) : \mu \ll \lambda\}$, such that each measure in $\mathcal{M}^{\lambda}_{+}$ is absolutely continuous w.r.t. the dominating measure λ. For measures $\Upsilon, P \in \mathcal{M}^{\lambda}_{+}$ with $P \ll \Upsilon$, let $p = \frac{dP}{d\lambda}$ and $\upsilon = \frac{d\Upsilon}{d\lambda}$ denote the Radon-Nikodym derivatives accordingly.
The pruned Jensen-Shannon (PJS) divergence $D_{PJS}(P\|\Upsilon) : \mathcal{M}^{\lambda}_{+} \times \mathcal{M}^{\lambda}_{+} \to \mathbb{R}$ from a measure $\Upsilon$ to a measure $P$ is defined by an integral over the support set $A_P$ of measure $P$, that is
$$D_{PJS}(P\|\Upsilon) = \int_{A_P} \Big( p \log\frac{2p}{p+\upsilon} + \upsilon \log\frac{2\upsilon}{p+\upsilon} \Big)\, d\lambda. \quad (2)$$

The integral over the support set $A_P$ of measure $P$ implies that $D_{PJS}(P\|\Upsilon)$ only describes the behavior of measures $P$ and $\Upsilon$ on $A_P$, but not the behavior on the difference set of the supports $A_\Upsilon \setminus A_P$. In contrast, the original Jensen-Shannon (JS) divergence (Endres & Schindelin, 2003) is defined by the integral of the same function (differing by a factor of 1/2) over the union of the supports. This subtle difference makes the proposed PJS divergence violate the identity of indiscernibles ($D_{PJS}(P\|\Upsilon) = 0 \Leftrightarrow P = \Upsilon$), so it is not a statistical divergence. However, as we will see in Thm. 3.2, the PJS divergence inherits the desirable property of satisfying the triangle inequality from the JS divergence. Moreover, the proposed PJS divergence is more compatible with diversified ensemble learning than the JS divergence.

Theorem 3.2 (properties of the PJS divergence). Consider measures $Q, P, \Upsilon \in \mathcal{M}^{\lambda}_{+}$. Suppose that $Q \ll \Upsilon$, $P \ll \Upsilon$ and $A_P = A_Q$; then the PJS divergence satisfies
(a) $D_{PJS}(P\|Q) = 2 D_{JS}(P\|Q)$, where $D_{JS}$ is the vanilla JS divergence.
(b) $D_{PJS}(P\|\Upsilon) \ge 0$.
(c) $\sqrt{D_{PJS}(Q\|\Upsilon)} \le \sqrt{D_{PJS}(P\|\Upsilon)} + \sqrt{D_{PJS}(P\|Q)}$.

Next, we introduce the PJS loss $\ell_{PJS}$ that connects the generalization risk $R^{\ell_{PJS}}$ and the PJS divergence $D_{PJS}(\cdot\|\cdot)$.

Definition 3.3 (pruned Jensen-Shannon loss). Let $A$ be the support of the underlying (empirical) distribution from which the realization $(x, y)$ is drawn. The pruned Jensen-Shannon (PJS) loss $\ell_{PJS} : \mathcal{X}\times\mathcal{I}\times\mathcal{H} \to \mathbb{R}_+$ is
$$\ell_{PJS}(x, y, h) = \Big( \log\frac{2}{\bar h + 1} + \bar h \log\frac{2\bar h}{\bar h + 1} \Big) \mathbb{1}\{(x,y)\in A\}, \quad (3)$$
where $\bar h \in (0, 1)$ is the vector component of $h$ that corresponds to the correct class, and $\mathbb{1}\{\cdot\}$ is the indicator function.

Provided a domain $P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, assume the existence of the density function over the input space $\mathcal{X}$, denoted by $p_P(x)$. Then $p_P(x)h$ induces a (set of) measure(s) $\Upsilon_P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{Y})$ whose density over $\mathcal{X}\times\mathcal{Y}$ is $p_P(x)h$.

Proposition 3.4 (connecting risk and PJS divergence). For a classifier $h$ and a domain $P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, let $\Upsilon_P$ be the induced measure. Assuming $p_P(y|x) \in \{0, 1\}$ (this says a real-world input $x$ cannot be in two categories; combined with $A_P = A_Q$ it implies a covariate shift assumption with a deterministic conditional $p(y|x)$; since invariance is not the focus of this paper, we leave the relaxation of invariance for future work), then
$$R^{\ell_{PJS}}_P(h) = \mathbb{E}_{(x,y)\sim P}[\ell_{PJS}(x, y, h)] = D_{PJS}(P\|\Upsilon_P). \quad (4)$$

Building on the connection between the risk and the PJS divergence in Prop. 3.4, we obtain bounds on the target risk by invoking the triangle inequality (c) in Thm. 3.2.

Theorem 3.5 (bounds on the target risk). Let $Q$ and $P$ be the (fixed) target domain and source domain in $\mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, with $p_Q(x)$ and $p_P(x)$ being their density functions over the input space $\mathcal{X}$, respectively. Suppose the supports of $Q$ and $P$ are identical, i.e., $A_P = A_Q$, with $p_P(y|x), p_Q(y|x) \in \{0, 1\}$. Let $\mathcal{H}$ be a hypothesis space such that for any $h \in \mathcal{H}$, the measure $\Upsilon_P$ induced by the density $p_P(x)h$ dominates $P$, and $\Upsilon_Q$ dominates $Q$, i.e., $P \ll \Upsilon_P$ and $Q \ll \Upsilon_Q$ (this is not a strong requirement as long as the component of the correct class in the vector $h$ is non-zero). Then for any $h \in \mathcal{H}$,
$$\sqrt{R^{\ell_{PJS}}_Q(h)} \le \sqrt{R^{\ell_{PJS}}_P(h)} + 2\sqrt{2 D_{JS}(P\|Q)}, \quad (5)$$
$$\sqrt{R^{\ell_{PJS}}_Q(h)} \ge \sqrt{R^{\ell_{PJS}}_P(h)} - 2\sqrt{2 D_{JS}(P\|Q)}. \quad (6)$$

3.2. The ρ-ensemble

One could arrive at similar bounds within the framework of the original JS divergence, whereas in this paper we aim at (diversified) ensembles. The pruned JS divergence enables guaranteed training towards the ensemble quasi-posterior ρ. Formally, we introduce the ρ-ensemble.

Definition 3.6 (ρ-ensemble). Suppose that ρ is a measure on a hypothesis space $\mathcal{H}$. The ρ-ensemble $\bar\rho$ is the ρ-weighted averaged classifier, $\bar\rho = \mathbb{E}_{h\sim\rho}(h)$.
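As a concrete reading of Definition 3.6, the sketch below forms the ρ-ensemble prediction by Monte-Carlo averaging the class-probability outputs of classifiers sampled from ρ; using dropout realizations of a single network as the draws from ρ is only one possible realization, and the function and argument names are illustrative assumptions rather than the authors' released code.

```python
import torch

@torch.no_grad()
def rho_ensemble_predict(model, x, num_samples=16):
    """Monte-Carlo approximation of the rho-ensemble of Def. 3.6:
    average the class-probability vectors of sampled classifiers h ~ rho.
    Here each draw is one dropout realization of `model` (one choice of rho)."""
    model.train()  # keep dropout active so each forward pass is one Gibbs classifier
                   # (BN layers are assumed frozen, as in the paper's setup)
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(num_samples)])
    return probs.mean(dim=0)  # \bar{rho}(x) = E_{h~rho}[h(x)]
```

In DNA itself this transverse average over dropout masks is approximated by simply shutting dropout off at test time, while the longitudinal average over the optimization trajectory is approximated by weight averaging, as developed in the method section below.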
Directly training the averaged classifier $\bar\rho$ is often impracticable in modern deep architectures. A practical solution is to first sample classifiers (Gibbs classifiers) from $\mathcal{H}$ according to ρ, and then train the Gibbs classifiers. However, the sampling-training paradigm relies on a crucial assumption: the generalization risk of $\bar\rho$ is no larger than the averaged risk of the Gibbs classifiers, i.e., $R(\bar\rho) \le \mathbb{E}_\rho[R(h)]$. The proposed PJS divergence satisfies this requirement.

Theorem 3.7 (inequalities related to the ρ-ensemble). Let $\mathcal{H}$ be a hypothesis space as stated in Thm. 3.5. For any $\rho \in \mathcal{M}_+(\mathcal{H})$ and any measure $P$, take $D_P = \mathbb{E}_P[\mathrm{Var}_\rho(\sqrt{\ell_{PJS}})]$; then
$$\sqrt{R^{\ell_{PJS}}_P(\bar\rho)} \le \sqrt{\mathbb{E}_\rho[R^{\ell_{PJS}}_P(h)] - D_P} \le \mathbb{E}_\rho\big[\sqrt{R^{\ell_{PJS}}_P(h)}\big]. \quad (7)$$
Moreover, $D_P > 0$ is a necessary condition for the second inequality being strict. If $\ell_{PJS}(x_1,y,h)/\ell_{PJS}(x_2,y,h)$ is varying on (a non-zero measured subset of) the set $\{(x, y, h) : \mathrm{Var}_\rho(\sqrt{\ell_{PJS}}) > 0\}$ for $x_1 \neq x_2$, then $D_P > 0$ is a sufficient condition for the second inequality being strict.

Thm. 3.7 says that in order to obtain a ρ-ensemble with an ideal source risk, we may sample Gibbs classifiers first and optimize the (square-root) risk of each Gibbs classifier. There are at least two options according to the two inequalities in Eq. (7). The first inequality implies a diversity-promoting strategy: minimizing $\mathbb{E}_\rho[R^{\ell_{PJS}}_P(h)] - D_P$ prefers a quasi-posterior ρ supported on diversified classifiers that produce a large $D_P = \mathbb{E}_P[\mathrm{Var}_\rho(\sqrt{\ell_{PJS}})]$. The second option is minimizing $\mathbb{E}_\rho[\sqrt{R^{\ell_{PJS}}_P(h)}]$ without the explicit emphasis on diversity. When $D_P > 0$ and $\ell_{PJS}(x_1,y,h)/\ell_{PJS}(x_2,y,h)$ is non-constant on a non-zero measured set, the former diversity-promoting strategy leads to a strictly tighter upper bound on the ρ-ensemble source risk than the latter one.

Invoking Thm. 3.5 gives an upper bound on the square-root target risk of the ρ-ensemble.

Corollary 3.8 (target risk upper bound of the ensembles). Given a fixed source domain $P$ and a target domain $Q$, for any measure $\rho \in \mathcal{M}_+(\mathcal{H})$ on the hypothesis space $\mathcal{H}$,
$$\sqrt{R^{\ell_{PJS}}_Q(\bar\rho)} \le \sqrt{\mathbb{E}_\rho[R^{\ell_{PJS}}_P(h)] - D_P} + 2\sqrt{2 D_{JS}(P\|Q)}. \quad (8)$$

Note that Thm. 3.7 holds for any measure. Therefore we may apply Eq. (7) to the target measure and derive the following corollary on the joint risk.

Corollary 3.9 (joint risk upper bound of the ensembles). Given a fixed source domain $P$ and a target domain $Q$, for any measure $\rho \in \mathcal{M}_+(\mathcal{H})$ on the hypothesis space $\mathcal{H}$,
$$\sqrt{R^{\ell_{PJS}}_P(\bar\rho)} + \sqrt{R^{\ell_{PJS}}_Q(\bar\rho)} \le \sqrt{\mathbb{E}_\rho[R^{\ell_{PJS}}_P(h)] - D_P} + \sqrt{\mathbb{E}_\rho[R^{\ell_{PJS}}_Q(h)] - D_Q} \le \mathbb{E}_\rho\big[\sqrt{R^{\ell_{PJS}}_P(h)}\big] + \mathbb{E}_\rho\big[\sqrt{R^{\ell_{PJS}}_Q(h)}\big]. \quad (9)$$

Cor. 3.9 says the joint risk of the ρ-ensemble is no larger than that of the Gibbs classifiers, validating the motivation of employing classifier ensembles to reduce the undue influence of model misspecification. We conclude this section by introducing a PAC-Bayes generalization upper bound on the ρ-ensemble target risk in terms of empirical estimates (denoted by $\hat R(\cdot)$ or $\hat D(\cdot)$).

Theorem 3.10 (PAC-Bayesian generalization upper bound). For a fixed source domain $P$ and a fixed target domain $Q$, let $\mathcal{H}$ be a hypothesis space as stated in Thm. 3.5.
Suppose that π is a prior over $\mathcal{H}$, which is independent of the draws of source realizations $D_n = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P^n$. Then for any $c > 0$, $\rho \in \mathcal{M}_+(\mathcal{H})$, and any $\delta \in (0, 1)$, with probability over $1-\delta$,
$$\sqrt{R^{\ell_{PJS}}_Q(\bar\rho)} \le 2\sqrt{2 D_{JS}(P\|Q)} + \sqrt{ \mathbb{E}_\rho[\hat R^{\ell_{PJS}}_P(h)] - \hat D_P + \frac{2 D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi^{\ell_{PJS}}_{P,\pi}(c, n)}{cn} }, \quad (10)$$
where $\log \mathbb{E}_{\pi^2}\mathbb{E}_{P^n}[e^{cn\,\mathbb{E}_P \ell(h)}] = \Psi^{\ell}_{P,\pi}(c, n)$ is constant w.r.t. ρ for fixed $c$, $n$, π, ℓ, and δ.

3.3. Connection with Existing DG Theory

Bridging DG with classifier ensemble is a complement rather than a substitute for existing DG literature. (a) From the transferability perspective: existing literature assumes that the optimal Bayes classifier for various domains is in the hypothesis space (Blanchard et al., 2011; Muandet et al., 2013), or that the excess risk of the target domain is comparable to those of the source domains in a hypothesis subspace (Zhang et al., 2021a), thence assuring a classifier with an ideal joint risk. We argue that this requirement might be too restrictive for the hypothesis space of deep classifiers without access to the target data, and we may ease the dilemma by enriching the hypothesis space with classifier ensembles. (b) From the invariance perspective: we assume the identical support of the target domain and the source domain with a deterministic conditional density $p(y|x)$, which is equivalent to the Dirac delta posterior and the covariate shift assumption in the existing literature (Blanchard et al., 2011; Muandet et al., 2013). (c) From the discrepancy perspective: the upper bounds on the target risks usually include a discrepancy measure between the target domain and the source domain. The discrepancy measure is tied to the underlying loss function. For example, our PJS loss leads to the JS divergence, whereas the 0-1 loss usually leads to the total variation divergence (Zhao et al., 2019; Cha et al., 2021). The JS divergence is tighter than the total variation divergence (Polyanskiy & Wu, 2014), implying narrower risk gaps. (d) Last but not least: we discuss classifier ensemble in the DG scenario without target access, which differs from the domain adaptation literature discussing classifier ensemble w.r.t. the 0-1 loss (Germain et al., 2013; 2016b).

In this section, we propose a Diversified Neural Averaging (DNA) method for DG, whose principle is to optimize the PAC-Bayesian generalization upper bound introduced in Thm. 3.10. We start by quoting the necessary preliminaries from the existing deep learning literature before diving into the detailed algorithm.

Dropout variational family (Gal & Ghahramani, 2016). Suppose that $\theta = \{A_i\}_{i=1}^{L} \in \Theta$ are the NN weight matrices of each layer. Let $\epsilon_i$ be a Bernoulli random vector. The well-known dropout regularization (Srivastava et al., 2014) on the activations of the $i$-th layer is equivalent to multiplying $A_i$ by $\epsilon_i$, resulting in a stochastic weight matrix $W_i = \epsilon_i A_i$. The resulting $\omega = \{W_i\}_{i=1}^{L}$ can be viewed as draws from a quasi-posterior distribution $\rho_\theta$ parametrized by $\theta = \{A_i\}_{i=1}^{L}$. The transformation from $\epsilon = \{\epsilon_i\}_{i=1}^{L}$ to ω, $\omega = r(\epsilon, \theta)$, is referred to as the reparameterization trick.

The JS divergence between the source domain and the target domain in this bound is inestimable and is often treated as a relatively small constant in the DG literature, which is implicit in the underlying assumption of DG that the target domain is similar to the source domain. Thus, by discarding constant terms of the bound w.r.t.
ρ, the minimization problem can be written as
$$\min_{\rho\in\mathcal{M}_+(\mathcal{H})} \ \mathbb{E}_\rho[\hat R^{\ell_{PJS}}_P(h)] - \hat D_P(\rho) + 2 D_{KL}(\rho\|\pi). \quad (11)$$
To optimize this objective, we need to sample Gibbs classifiers from the quasi-posterior ρ. The proposed DNA method samples Gibbs classifiers transversely and longitudinally by simultaneously considering two aspects. On the one hand, the optimization trajectory of DNNs produces a natural sampling distribution of parameters. On the other hand, we adopt the dropout variational family as a practical choice because it provides sampling and computing convenience.

Formally, we focus on the situation where the hypothesis space $\mathcal{H}$ is the space of parametric functions induced by a DNN. Suppose that $\mathcal{H} = \{h_\omega(\cdot) \mid \omega \in \Omega\}$, where $h(\cdot)$ is the functional form determined by the fixed network architecture and ω parameterizes the function $h_\omega(\cdot)$; Ω is the parameter space in which the parameters ω live. As mentioned above, ω is reparameterized by the DNN weight matrices θ and the Bernoulli vector ϵ, where θ is sampled from an optimization-trajectory distribution $\mathcal{T} \in \mathcal{M}_+(\Theta)$ and ϵ is sampled from a Bernoulli random vector $\mathcal{B}$ with each dimension being 1 with probability 0.5.

For the first term in Eq. (11), we optimize the empirical risk of each sampled Gibbs classifier,
$$\mathbb{E}_{\rho|\theta}[\hat R^{\ell_{PJS}}_P(h)] = \frac{1}{nm} \sum_{i=1}^{n} \sum_{\epsilon\sim\mathcal{B}} \ell_{PJS}(x_i, y_i, h_{r(\epsilon,\theta)}). \quad (12)$$
For the diversity measure $\hat D_P(\rho)$ in Eq. (11), it is difficult to handle this term among the Gibbs classifiers sampled along the optimization trajectory simultaneously. Therefore, we only explicitly model and promote the diversity measure for each fixed θ on the trajectory with different dropout realizations,
$$\hat D_P(\rho|\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}_{\epsilon|\theta}\Big( \sqrt{\ell_{PJS}(x_i, y_i, h_{r(\epsilon,\theta)})} \Big), \quad (13)$$
where $\mathrm{Var}_{\epsilon|\theta}(\cdot)$ is the variance calculated over the $m$ samples of $\epsilon\sim\mathcal{B}$ for a fixed θ.

The KL term $D_{KL}(\rho\|\pi)$ in Eq. (11) regularizes the quasi-posterior ρ to stay close to the prior π over the hypothesis space. The $D_{KL}(\rho\|\pi)$ is introduced into the upper bound by applying the Donsker and Varadhan variational formula (cf. the proof of Thm. 3.10 in Appendix A for more details). Assured by the KL condition (Gal, 2016; Gal & Ghahramani, 2016), when the prior is approximately Gaussian, $D_{KL}(\rho\|\pi)$ can be approximated by weight decay for the Bernoulli (and Gaussian) variational families. Moreover, prior literature (Loshchilov & Hutter, 2018; Maddox et al., 2019) suggests the approximately Gaussian prior of modern deep learning. Thus, the optimization objective of the proposed DNA is
$$\min_{\theta\in\Theta} \ \mathbb{E}_{\rho|\theta}[\hat R^{\ell_{PJS}}_P(h)] - \eta \hat D_P(\rho|\theta), \quad (14)$$
where $\eta > 0$ is a hyperparameter controlling the trade-off between minimizing the source risk and promoting diversity, and weight decay regularization is used in DNA as a default setting.

In DNA, the target samples are classified by the ρ-ensemble $\bar\rho$, the averaged classifier over the quasi-posterior ρ. The classifier averaging over ρ, i.e., $\bar\rho = \mathbb{E}_{\theta\sim\mathcal{T}}\mathbb{E}_{\epsilon\sim\mathcal{B}}(h_{r(\epsilon,\theta)})$, comprises two steps. In the first step, classifier averaging is performed across different dropout realizations for a fixed θ, which is approximated well by shutting down the dropout (Srivastava et al., 2014). In the second step, classifier averaging is performed along the optimization trajectory. In order to reduce computational overhead, we do not save every checkpoint and ensemble these classifiers.
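To make Eqs. (12)-(14) concrete, below is a minimal PyTorch-style sketch of the per-batch DNA training objective: it draws m dropout realizations for a fixed θ, averages the PJS loss over them, and subtracts η times the batch estimate of the diversity term (the per-sample variance of the square-root PJS loss across dropout realizations). The function and variable names (`pjs_loss`, `dna_batch_loss`, `model`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def pjs_loss(logits, labels):
    """Per-sample PJS loss of Eq. (3): log(2/(p+1)) + p*log(2p/(p+1)),
    where p is the predicted probability of the correct class."""
    p = F.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1).clamp_min(1e-8)
    return torch.log(2.0 / (p + 1.0)) + p * torch.log(2.0 * p / (p + 1.0))

def dna_batch_loss(model, x, y, m=5, eta=0.1):
    """Mini-batch estimate of the DNA objective in Eq. (14): empirical PJS risk
    averaged over m dropout realizations (Eq. 12) minus eta times the
    diversity term (Eq. 13)."""
    model.train()  # keep dropout active so each forward pass is one Gibbs classifier
    losses = torch.stack([pjs_loss(model(x), y) for _ in range(m)])   # shape (m, batch)
    risk = losses.mean()                                              # batch version of Eq. (12)
    diversity = losses.clamp_min(0).sqrt().var(dim=0, unbiased=False).mean()  # Eq. (13)
    return risk - eta * diversity
```

At test time, dropout is shut off and the average over the optimization trajectory is handled by weight averaging, as described next.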
Considering the strong connections between the classifiers and the parameter space of DNNs (Fort et al., 2019), we approximate classifier averaging by adopting a dense stochastic weight averaging (SWA) (Izmailov et al., 2018; Cha et al., 2021) over the parameter space in a single run, making our DNA a fast-ensemble method. To be specific, we use SWAD (Cha et al., 2021) for weight averaging, which is a modified version of SWA (Izmailov et al., 2018) with a dense and overfit-aware stochastic weight sampling strategy. The entire procedure of the DNA method is given in Algorithm 1. (Code is available at https://github.com/JinYujie99/DNA.)

4.1. Complexity Analysis

Compared to the baseline (Gulrajani & Lopez-Paz, 2021), the additional time complexity of DNA comes from the higher evaluation frequency and the dropout sampling. We use the same technique as in (Cha et al., 2021) to analyze the overhead: let $n$ be the total number of source domain samples, $r_s$ the training-validation split ratio, and $f$ the evaluation frequency. For one epoch, let $t_f$ be the forward time, $t_b$ the backward time, and $r_t$ the ratio of the forward (backward) time cost of the fully connected layer to that of the convolutional backbone in a DNN. For simplicity, we assume that $t_f = t_b = t$. For one epoch, the training time is $2tn r_s (1 + m r_t) / ((r_s + 1)(r_t + 1))$ and the evaluation time is $f t n / (r_s + 1)$. Thus, the total time of DNA is $\frac{tn}{r_s+1}[f + 2 r_s (1 + m r_t)/(r_t + 1)]$, while the total time of ERM is $\frac{tn}{r_s+1}[f_d + 2 r_s]$, where $f_d$ is the default evaluation frequency in DomainBed (Gulrajani & Lopez-Paz, 2021). The final time overhead ratio is $[f + 2 r_s (1 + m r_t)/(r_t + 1)]/(f_d + 2 r_s)$. Since $f$ is about $3 f_d$ (in our experimental settings), $f_d < 1$, $r_s = 4$ and $r_t \ll 1$, the ratio indicates a low computational overhead.

Algorithm 1 DNA: DG with Diversified Neural Averaging.
Input: source dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, dropout model $h$ with weight matrices θ, batch size $b$, total training iterations $I$, trade-off hyperparameter η, number of dropout samples $m$
Output: learned ρ-ensemble $\bar\rho$
  iter ← 0; initialize θ, $\bar\theta$
  while iter < I do
    L ← 0
    $\{(x_i, y_i)\}_{i=1}^{b}$ ← randomly sample $b$ instances from $D$
    $\{\epsilon_j\}_{j=1}^{m}$ ← randomly sample $m$ dropout realizations
    L ← L + $\frac{1}{bm}\sum_{i=1}^{b}\sum_{j=1}^{m} \ell_{PJS}(x_i, y_i, h_{r(\epsilon_j,\theta)})$
    L ← L − $\frac{\eta}{b}\sum_{i=1}^{b} \mathrm{Var}_j\big(\sqrt{\ell_{PJS}(x_i, y_i, h_{r(\epsilon_j,\theta)})}\big)$
    update θ using $\nabla_\theta L$
    $\bar\theta$ ← (iter · $\bar\theta$ + θ)/(iter + 1)
    iter ← iter + 1
  end while
  $\bar\rho$ ← $h_{\bar\theta}$ with dropout shut down

5. Experiment

5.1. Experimental Settings

Datasets. Following Gulrajani & Lopez-Paz (2021), we exhaustively conduct experiments on various benchmark datasets to validate the proposed DNA. PACS (Li et al., 2017) comprises four domains d ∈ {photo, art, cartoon, sketch}, containing 9991 images of 7 categories. VLCS (Fang et al., 2013) comprises four photographic domains d ∈ {VOC2007, LabelMe, Caltech101, SUN09}, with 10729 samples of 5 classes. OfficeHome (Venkateswara et al., 2017) has four domains d ∈ {art, clipart, product, real}, containing 15500 images with a larger label set of 65 categories. TerraIncognita (Beery et al., 2018) comprises photos of wild animals taken by cameras at different locations; following Gulrajani & Lopez-Paz (2021), we use the domains d ∈ {L100, L38, L43, L46}, which include 24778 samples and 10 classes. DomainNet (Peng et al., 2019) is a large-scale dataset containing 596006 images and 345 classes, over six domains d ∈ {clipart, infograph, painting, quickdraw, real, sketch}.
Evaluation protocol. For a fair comparison, we follow the training and evaluation protocol in DomainBed (Gulrajani & Lopez-Paz, 2021). We choose a domain as the target domain for testing and use the remaining domains as source domains for training. We split each source domain into 8:2 training/validation splits and integrate the validation subsets of each source domain to create an overall validation set, which is used for validation and model selection. The chosen model is tested on the unseen target domain, and we report the mean and standard deviation of out-of-domain classification accuracies from three different runs with different training-validation splits.

Implementation details. We use ResNet-50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) as the backbone network for all datasets. All the batch normalization (BN) layers are frozen during training. We replace the last FC layer of the ResNet-50 with a 2-layer classifier with 1024 hidden units and employ dropout regularization on the 1024-dimensional output (2048 hidden units are used for DomainNet, adjusting for the larger label space). The network is trained using the Adam (Kingma & Ba, 2015) optimizer. The number of dropout samples m is set to 5. For weight averaging, we use the dense and overfit-aware sampling strategy in (Cha et al., 2021). We follow the hyperparameter search protocol in DomainBed (Gulrajani & Lopez-Paz, 2021) and follow Cha et al. (2021) in using a reduced search space for computational efficiency. We search the trade-off hyperparameter η ∈ {0.01, 0.1, 1.0} and set η = 0.1 by model selection on the validation sets. (More implementation details are given in Appendix B.2.)
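For reference, a minimal sketch of the network described above: a ResNet-50 backbone whose final FC layer is replaced by a 2-layer classifier head with 1024 hidden units and dropout on the hidden output. The exact layer ordering, activation, and dropout rate are assumptions on our part; only the stated sizes come from the paper, and the function name is hypothetical.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_dna_network(num_classes, hidden_dim=1024, drop_p=0.5):
    """ResNet-50 backbone with its last FC layer replaced by a 2-layer dropout
    classifier head, as in the implementation details above.
    (hidden_dim=2048 would be used for DomainNet; BN freezing is omitted here.)"""
    backbone = resnet50(pretrained=True)
    feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
    backbone.fc = nn.Identity()             # expose backbone features
    head = nn.Sequential(
        nn.Linear(feat_dim, hidden_dim),
        nn.ReLU(inplace=True),              # activation choice is an assumption
        nn.Dropout(p=drop_p),               # dropout on the hidden output; m realizations drawn per batch
        nn.Linear(hidden_dim, num_classes),
    )
    return nn.Sequential(backbone, head)
```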
Table 1. Benchmark Comparisons. Out-of-domain classification accuracies (%) on PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet are shown. The results of mDSDI and SWAD are from the original literature, and the other numbers are from DomainBed (Gulrajani & Lopez-Paz, 2021). Each experiment is repeated 3 times and we highlight the best results.

Method | PACS | VLCS | OfficeHome | TerraIncognita | DomainNet | Avg
ERM (Vapnik, 1999) | 85.5 ± 0.2 | 77.5 ± 0.4 | 66.5 ± 0.3 | 46.1 ± 1.8 | 40.9 ± 0.1 | 63.3
IRM (Arjovsky et al., 2019) | 83.5 ± 0.8 | 78.5 ± 0.5 | 64.3 ± 2.2 | 47.6 ± 0.8 | 33.9 ± 2.8 | 61.6
DRO (Sagawa et al., 2020) | 84.4 ± 0.8 | 76.7 ± 0.6 | 66.0 ± 0.7 | 43.2 ± 1.1 | 33.3 ± 0.2 | 60.7
MMD (Li et al., 2018c) | 84.6 ± 0.5 | 77.5 ± 0.9 | 66.3 ± 0.1 | 42.2 ± 1.6 | 23.4 ± 9.5 | 58.8
DANN (Ganin et al., 2016) | 83.6 ± 0.4 | 78.6 ± 0.4 | 65.9 ± 0.6 | 46.7 ± 0.5 | 38.3 ± 0.1 | 62.6
CDANN (Li et al., 2018d) | 82.6 ± 0.9 | 77.5 ± 0.1 | 65.8 ± 1.3 | 45.8 ± 1.6 | 38.3 ± 0.3 | 62.0
MTL (Blanchard et al., 2021) | 84.6 ± 0.5 | 77.2 ± 0.4 | 66.4 ± 0.5 | 45.6 ± 1.2 | 40.6 ± 0.1 | 62.9
ARM (Zhang et al., 2020) | 85.1 ± 0.4 | 77.6 ± 0.3 | 64.8 ± 0.3 | 45.5 ± 0.3 | 35.5 ± 0.2 | 61.7
VREx (Krueger et al., 2021) | 84.9 ± 0.6 | 78.3 ± 0.2 | 66.4 ± 0.6 | 46.4 ± 0.6 | 33.6 ± 2.9 | 61.9
RSC (Huang et al., 2020) | 85.2 ± 0.9 | 77.1 ± 0.5 | 65.5 ± 0.9 | 46.6 ± 1.0 | 38.9 ± 0.5 | 62.7
Mixup (Wang et al., 2020b) | 84.6 ± 0.6 | 77.4 ± 0.6 | 68.1 ± 0.3 | 47.9 ± 0.8 | 39.2 ± 0.1 | 63.4
MLDG (Li et al., 2018a) | 84.9 ± 1.0 | 77.2 ± 0.4 | 66.8 ± 0.6 | 47.7 ± 0.9 | 41.2 ± 0.1 | 63.6
SagNet (Nam et al., 2021) | 86.3 ± 0.2 | 77.8 ± 0.5 | 68.1 ± 0.1 | 48.6 ± 1.0 | 40.3 ± 0.1 | 64.2
CORAL (Sun & Saenko, 2016) | 86.2 ± 0.3 | 78.8 ± 0.6 | 68.7 ± 0.3 | 47.6 ± 1.0 | 41.5 ± 0.1 | 64.5
mDSDI (Bui et al., 2021) | 86.2 ± 0.2 | 79.0 ± 0.3 | 69.2 ± 0.4 | 48.1 ± 1.4 | 42.8 ± 0.1 | 65.1
SWAD (Cha et al., 2021) | 88.1 ± 0.4 | 79.1 ± 0.4 | 70.6 ± 0.3 | 50.0 ± 0.4 | 46.5 ± 0.2 | 66.9
DNA (ours) | 88.4 ± 0.1 | 79.0 ± 0.1 | 71.2 ± 0.1 | 52.2 ± 0.4 | 47.2 ± 0.1 | 67.6

5.2. Results

We compare our proposed DNA with various domain generalization methods under the same evaluation protocol and report the out-of-domain classification accuracies in Tab. 1 (baseline descriptions and full results per dataset per domain are detailed in Appendix B.1 and Appendix C). For all baseline methods, ResNet-50 is used as the backbone network. Our DNA method achieves state-of-the-art performance on four of the five benchmarks, including PACS, OfficeHome, TerraIncognita, and DomainNet, while on VLCS the performance of our method (79.0%) is on par with the previous best result (79.1%). Thus our DNA method achieves a significantly better average accuracy of 67.6% against the baselines. Moreover, the smaller standard errors of DNA indicate that our method is relatively more stable and robust.

5.3. Ablation Studies

To verify the effectiveness of the design of our DNA method, we conduct ablation studies with regard to the weight averaging, the diversity enforcement $D_P$, the dropout variational family, and the PJS loss. In all experiments, the DNN architecture of the different DNA variants is kept consistent.

Effects of weight averaging. To inspect the effect of weight averaging, we use only the PJS loss (no dropout or any diversity enforcement term $D_P$ in the optimization objective) to train the DNN, with (w/) and without (w/o) SWA on OfficeHome, TerraIncognita and DomainNet. The results are shown in Tab. 2, instantiating the advantage of ensembles and the result of Thm. 3.7. (Tab. 2 rediscovers the advantage of applying SWA for DG (Cha et al., 2021). In contrast to SWAD (Cha et al., 2021), the proposed DNA is motivated from the lens of fast classifier ensemble, and the results are obtained by optimizing the PJS loss.)

Figure 1. Losses visualization (panels: (a) PJS (single), (b) PJS (SWA), (c) CE (single), (d) CE (SWA)). "single" and "SWA" indicate the single model and the SWA (from the beginning of the optimization trajectory) ensemble model. "tr loss", "val loss" and "te loss" denote the loss on the training set, validation set, and test set, respectively. Settings: OfficeHome dataset, the target domain is art while the source domains are clipart, product, and real.

Table 2. Out-of-domain accuracies (%) of the PJS loss with (w/) or without (w/o) SWA. Each experiment is repeated 3 times.
 | OfficeHome | TerraIncognita | DomainNet
w/ SWA | 70.5 ± 0.1 | 51.4 ± 0.4 | 46.5 ± 0.1
w/o SWA | 66.5 ± 0.2 | 49.4 ± 0.3 | 43.7 ± 0.1

Table 3. Out-of-domain accuracies (%) with (w/) or without (w/o) D_P for optimization. Each experiment is repeated 3 times.
 | PACS | VLCS | DomainNet
w/ D_P | 88.4 ± 0.1 | 79.0 ± 0.1 | 47.2 ± 0.1
w/o D_P | 88.2 ± 0.2 | 78.7 ± 0.1 | 46.8 ± 0.1

To validate the reasonableness of approximating the ρ-ensemble by DNN weight averaging in a single run, we conduct an ablation study comparing the high-cost classifier averaging with the low-cost weight averaging. The results are shown in Tab. 4. Specifically, we save checkpoints at every evaluation step in every single run, then uniformly sample these models and ensemble them for prediction. The results show that the performance of DNA can be approached by classifier averaging as the number of models for the ensemble increases. DNA, being a fast-ensemble method that does not save checkpoints, greatly reduces the computational overhead.

Effects of explicitly enforcing diversity and the choice of variational family. To validate the efficacy of explicitly enforcing diversity of classifiers for the ensemble, i.e., encouraging a positive $D_P$ in the optimization objective, we compare the with (w/) and without (w/o) $D_P$ variants of DNA on all five benchmarks. The results are shown in Tab. 3 (PACS, VLCS, and DomainNet) and Tab.
5 (Office Home and Terra Incognita according to the 1st column). The performance of explicitly enforcing diversity of classifiers consistently outperforms, which substantiates the first and the second inequalities in Eq.(7) of Thm. 3.7. We also compare DNA with variants regarding the Gaussian dropout variational family. The Bernoulli (the default configuration for DNA) and Gaussian variational families are different choices for optimizing the PAC-Bayes bound approximately. The Bernoulli family is more appropriate for multi-modality distributions. The results of two variational family are reported in Tab. 5 (according to the 2nd column). We can find that the performances of DNA with the Bernoulli dropout variational family are better than that of the Gaussian variational family. We interpret the superiority as a result of better multi-modality posterior tolerance. Effects of the PJS loss. We compare the models trained with the proposed pruned Jensen-Shannon (PJS) loss function with the models trained with cross-entropy (CE) loss function. The results of optimizing the CE loss and PJS loss with SWA and with explicitly enforcing diversity DP are reported in Tab. 6. Note that the diversity measure proposed recently in Masegosa (2020) is used to adapt the CE loss for the CE scenario. From the results, we can find that each PJS model outperforms its CE variant, indicating the effectiveness of the proposed PJS loss and the diversity measure in the presence of distributional shift. The similar accuracies of PJS and CE on Office Home (15,500 images of 65 classes) are more of the consequence of the dataset limitation than the limitation of the method. When the dataset is easy or small, the DNN tends to give a high-confidence output, i.e., a h = prob(ycorrect|x) close to 1. From the formulation of the PJS loss (cf. Eq. (3)) and the CE loss, the PJS loss and CE loss will behave similarly when h is close to 1. On the other hand, when the dataset is hard or large, two losses behave dissimilarly, and the target risk optimization w.r.t. the PJS loss is theoretically guaranteed by our theory, leading to better accuracy. Additionally, comparing the with DNA: Domain Generalization with Diversified Neural Averaging Table 4. Out-of-domain accuracies(%) of ensemble of different number of models are shown. K denotes the number of sampled checkpoints for averaging. Each experiment is repeated 3 times. Office Home Terra Incognita A C P R Avg L100 L38 L43 L46 Avg K=5 64.6 0.3 56.1 0.4 77.8 0.2 79.5 0.3 69.5 53.0 1.4 42.0 0.2 58.2 0.5 42.6 1.6 49.0 K=10 65.9 0.3 56.6 0.1 78.5 0.2 80.0 0.1 70.3 55.9 1.3 43.9 1.2 60.0 0.4 42.8 1.0 50.6 K=20 66.4 0.1 57.0 0.2 78.8 0.1 80.5 0.1 70.7 57.1 1.2 43.7 0.9 60.0 0.4 43.1 1.0 51.0 DNA 67.7 0.2 57.7 0.3 78.9 0.2 80.5 0.2 71.2 56.8 1.2 47.0 0.9 61.0 0.5 44.0 1.0 52.2 Table 5. Out-of-domain accuracies (%) of different DNA variants are shown. DP indicates whether the diversity measure is explicitly enforced. Bernoulli and Gaussian denote the type of dropout variational family. Each experiment is repeated 3 times. DP Variational Family Office Home Terra Incognita A C P R Avg L100 L38 L43 L46 Avg Gaussian 67.1 0.1 57.4 0.1 78.6 0.1 80.2 0.4 70.8 54.8 1.6 48.2 0.4 60.2 1.0 43.2 0.9 51.6 Gaussian 67.2 0.2 57.5 0.1 78.7 0.3 80.3 0.4 70.9 56.0 0.9 47.2 1.3 60.9 0.7 43.4 0.7 51.9 Bernoulli 67.1 0.3 57.6 0.1 78.7 0.2 80.4 0.2 70.8 55.6 1.8 47.9 1.0 60.5 0.4 43.2 0.7 51.8 Bernoulli 67.7 0.2 57.7 0.3 78.9 0.2 80.5 0.2 71.2 56.8 1.2 47.0 0.9 61.0 0.5 44.0 1.0 52.2 Table 6. 
Out-of-domain accuracies(%) by optimizing the crossentropy (CE) loss and pruned Jensen-Shannon (PJS) loss with SWA and with explicitly enforcing diversity DP. Each experiment is repeated 3 times. loss Office Home Terra Incognita Domain Net CE 71.1 0.1 51.9 0.5 46.9 0.1 PJS 71.2 0.1 52.2 0.4 47.2 0.1 SWA results in Tab. 2 to the ERM results in Tab. 1 demonstrates the individual influence of loss functions: similar accuracies of two losses on Office Home, accuracies of PJS on Terra Incognita and Domain Net outperform. We also inspect the optimization process by observing the loss functions during the training time between the model trained with the PJS loss and the model trained with the CE loss. Specifically, we visualize the empirical estimate of PJS/CE loss on the training set, validation set and test set for the single model and the SWA ensemble model along the optimization trajectory. The results are shown in Fig. 1. We observe that in the averaging and non-averaging scenario, our proposed PJS loss suffers a lower level of overfitting, i.e., smaller gaps between training losses and testing/validation losses. Whereas the CE loss suffers a relatively higher level of overfitting, in the sense that the test loss goes up in late iterations. The second intriguing aspect of our PJS loss is the smoothness over the validation and test data, indicating that our method suffers less influence of model selection. For both loss functions, the averaging version exhibits a lower level of overfitting and a smoother plot of loss values in late iterations, substantiating our motivation of considering classifier ensemble. 6. Conclusion and Future Works This paper bridges DG and classifier ensemble both theoretically and methodologically. We propose a novel PJS divergence and a novel PJS loss, upon which we derive generalization bounds for the risk of the target domain. Grounded on the theoretical results, we propose a fastensemble method for DG, DNA. The DNA method is a highly competitive method with a low computational overhead. Extensive experiments validate the effectiveness of the proposed DNA. There are many possible directions for future works, to name a few: (a) For fairness without demographics (Hashimoto et al., 2018). The proposed DNA does not utilize the domain labels, highlighting its potential in protecting private features of the minority subgroups. (b) For learning under label noise. The proposed PJS loss is a bounded and symmetric loss. The boundedness and symmetry of a loss are empirically beneficial for robustness against label noise (Wang et al., 2019; Englesson & Azizpour, 2021). (c) For more competitive DG methods. DNA is a drop-in framework that tackles DG from the model misspecification perspective and is potentially compatible with existing DG principles for higher empirical performance. Acknowledgements This work is supported by the National Key Research and Development Program of China No. 2020AAA0106300 and National Natural Science Foundation of China No. 62102222. Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of gibbs posteriors. The DNA: Domain Generalization with Diversified Neural Averaging Journal of Machine Learning Research, 17(1):8374 8414, 2016. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Arpit, D., Wang, H., Zhou, Y., and Xiong, C. Ensemble of averages: Improving model selection and boosting performance in domain generalization. 
ar Xiv preprint ar Xiv:2110.10832, 2021. Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision, pp. 456 473, 2018. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2007. Blanchard, G., Lee, G., and Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, volume 24, pp. 2178 2186, 2011. Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. Domain generalization by marginal transfer learning. Journal of Machine Learning Research, 2021. Bui, M.-H., Tran, T., Tran, A., and Phung, D. Exploiting domain-specific features to enhance domain generalization. In Advances in Neural Information Processing Systems, volume 34, 2021. Cha, J., Chun, S., Lee, K., Cho, H.-C., Park, S., Lee, Y., and Park, S. Swad: Domain generalization by seeking flat minima. In Advances in Neural Information Processing Systems, 2021. David, S. B., Lu, T., Luu, T., and P al, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129 136. JMLR Workshop and Conference Proceedings, 2010. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2009. Domingos, P. M. Why does bagging work? a bayesian account and its implications. In KDD, pp. 155 158. Citeseer, 1997. Donsker, M. D. and Varadhan, S. R. S. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183 212, 1983. Endres, D. M. and Schindelin, J. E. A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7):1858 1860, 2003. Englesson, E. and Azizpour, H. Generalized jensen-shannon divergence loss for learning with noisy labels. In Advances in Neural Information Processing Systems, 2021. Fang, C., Xu, Y., and Rockmore, D. N. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1657 1664, 2013. Fort, S., Hu, H., and Lakshminarayanan, B. Deep ensembles: A loss landscape perspective. ar Xiv preprint ar Xiv:1912.02757, 2019. Gal, Y. Computational Complexity of Machine Learning. Ph D thesis, Department of Engineering, University of Cambridge, 2016. Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050 1059. PMLR, 2016. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2016. Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A pac-bayesian approach for domain adaptation with specialization to linear classifiers. In International Conference on Machine Learning, pp. 738 746. PMLR, 2013. Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. Pac-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1876 1884, 2016a. Germain, P., Habrard, A., Laviolette, F., and Morvant, E. 
A new pac-bayesian perspective on domain adaptation. In International Conference on Machine Learning, pp. 859 868. PMLR, 2016b. Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2021. Hardy, G. H., Littlewood, J. E., P olya, G., P olya, G., et al. Inequalities. Cambridge University Press, 1952. Hashimoto, T., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pp. 1929 1938. PMLR, 2018. DNA: Domain Generalization with Diversified Neural Averaging He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2016. Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. Bayesian model averaging: a tutorial (with comments by m. clyde, david draper and ei george, and a rejoinder by the authors. Statistical science, 14(4):382 417, 1999. Huang, Z., Wang, H., Xing, E. P., and Huang, D. Selfchallenging improves cross-domain generalization. In Proceedings of the European Conference on Computer Vision, 2020. Iwasawa, Y. and Matsuo, Y. Test-time classifier adjustment module for model-agnostic domain generalization. In Advances in Neural Information Processing Systems, 2021. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence, 2018. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Outof-distribution generalization via risk extrapolation. In International Conference on Machine Learning, 2021. Li, B., Shen, Y., Wang, Y., Zhu, W., Reed, C. J., Zhang, J., Li, D., Keutzer, K., and Zhao, H. Invariant information bottleneck for domain generalization. ar Xiv preprint ar Xiv:2106.06333, 2021. Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5542 5550, 2017. Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018a. Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400 5409, 2018b. Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2018c. Li, H., Wang, Y., Wan, R., Wang, S., Li, T.-Q., and Kot, A. Domain generalization for medical imaging classification with linear-dependency regularization. In Advances in Neural Information Processing Systems, volume 33, pp. 3118 3129, 2020. Li, Y., Gong, M., Tian, X., Liu, T., and Tao, D. Domain generalization via conditional invariant representations. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018d. Li, Y., Yang, Y., Zhou, W., and Hospedales, T. Featurecritic networks for heterogeneous domain generalization. In International Conference on Machine Learning, pp. 3915 3924. PMLR, 2019. Loshchilov, I. and Hutter, F. 
Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32, 2019. Mahajan, D., Tople, S., and Sharma, A. Domain generalization using causal matching. In International Conference on Machine Learning, pp. 7313 7324. PMLR, 2021. Masegosa, A. Learning under model misspecification: Applications to variational and ensemble methods. In Advances in Neural Information Processing Systems, volume 33, pp. 5479 5491, 2020. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., Bethge, M., and Brendel, W. Benchmarking robustness in object detection: Autonomous driving when winter is coming. ar Xiv preprint ar Xiv:1907.07484, 2019. Muandet, K., Balduzzi, D., and Sch olkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10 18. PMLR, 2013. Nam, H. and Kim, H.-E. Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems, 2018. Nam, H., Lee, H., Park, J., Yoon, W., and Yoo, D. Reducing domain gap by reducing style bias. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021. Ortega, L. A., Caba nas, R., and Masegosa, A. R. Diversity and generalization in neural network ensembles. ar Xiv preprint ar Xiv:2110.13786, 2021. DNA: Domain Generalization with Diversified Neural Averaging Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., and Wang, B. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406 1415, 2019. Piratla, V., Netrapalli, P., and Sarawagi, S. Efficient domain generalization via common-specific low-rank decomposition. In International Conference on Machine Learning, pp. 7728 7738. PMLR, 2020. Polyanskiy, Y. and Wu, Y. Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and, 6(20122016):7, 2014. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389 5400. PMLR, 2019. Royden, H. L. and Fitzpatrick, P. Real analysis, volume 32. Macmillan New York, 1988. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations, 2020. Seo, S., Suh, Y., Kim, D., Kim, G., Han, J., and Han, B. Learning to optimize domain specific normalization for domain generalization. In European Conference on Computer Vision, pp. 68 83. Springer, 2020. Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958, 2014. Su, J., Vargas, D. V., and Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5):828 841, 2019. Sun, B. and Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. 
In Proceedings of the European Conference on Computer Vision, 2016. Thomas, X., Mahajan, D., Pentland, A., and Dubey, A. Adaptive methods for aggregated domain generalization. ar Xiv preprint ar Xiv:2112.04766, 2021. Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1 5. IEEE, 2015. Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. ar Xiv preprint physics/0004057, 2000. Vapnik, V. N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 1999. Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2017. Volpi, R., Namkoong, H., Sener, O., Duchi, J. C., Murino, V., and Savarese, S. Generalizing to unseen domains via adversarial data augmentation. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, 2018. Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2020a. Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. Symmetric cross entropy for robust learning with noisy labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 322 330, 2019. Wang, Y., Li, H., and Kot, A. C. Heterogeneous domain generalization via domain mixup. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3622 3626, 2020b. Xiao, Z., Shen, J., Zhen, X., Shao, L., and Snoek, C. G. A bit more bayesian: Domain-invariant learning with uncertainty. In International Conference on Machine Learning, 2021. Yi, M., Hou, L., Sun, J., Shang, L., Jiang, X., Liu, Q., and Ma, Z. Improved ood generalization via adversarial training and pretraing. In International Conference on Machine Learning, 2021. Zhang, G., Zhao, H., Yu, Y., and Poupart, P. Quantifying and improving transferability in domain generalization. In Advances in Neural Information Processing Systems, 2021a. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018. Zhang, H., Zhang, Y.-F., Liu, W., Weller, A., Sch olkopf, B., and Xing, E. P. Towards principled disentanglement for domain generalization. ar Xiv preprint ar Xiv:2111.13839, 2021b. DNA: Domain Generalization with Diversified Neural Averaging Zhang, J., Qi, L., Shi, Y., and Gao, Y. More is better: A novel multi-view framework for domain generalization. ar Xiv preprint ar Xiv:2112.12329, 2021c. Zhang, M., Marklund, H., Dhawan, N., Gupta, A., Levine, S., and Finn, C. Adaptive risk minimization: A metalearning approach for tackling group distribution shift. ar Xiv, 2020. Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, 2019. Zhou, K., Yang, Y., Qiao, Y., and Xiang, T. Domain adaptive ensemble learning. IEEE Transactions on Image Processing, 30:8008 8018, 2021. Zhu, W., Cui, P., Wang, Z., and Hua, G. Multimedia big data computing. IEEE multimedia, 22(3):96 c3, 2015. DNA: Domain Generalization with Diversified Neural Averaging A. Proofs for Section 3 Definition A.1 (pruned Jensen-Shannon divergence). 
Consider the subset of $\mathcal{M}_+(\mathcal{X}\times\mathcal{Y})$ induced by a dominating measure $\lambda$, $\mathcal{M}_+^{\lambda} = \{\mu \in \mathcal{M}_+(\mathcal{X}\times\mathcal{Y}) : \mu \ll \lambda\}$, such that each measure in $\mathcal{M}_+^{\lambda}$ is absolutely continuous w.r.t. the dominating measure $\lambda$. For measures $\Upsilon, P \in \mathcal{M}_+^{\lambda}$ with $P \ll \Upsilon$, let $p = \frac{dP}{d\lambda}$ and $\upsilon = \frac{d\Upsilon}{d\lambda}$ denote the corresponding Radon-Nikodym derivatives. The pruned Jensen-Shannon (PJS) divergence $D_{PJS}(P\|\Upsilon) : \mathcal{M}_+^{\lambda} \times \mathcal{M}_+^{\lambda} \to \mathbb{R}$ from a measure $\Upsilon$ to a measure $P$ is defined by an integral over the support set $A_P$ of the measure $P$:
\[
D_{PJS}(P\|\Upsilon) = \int_{A_P} \Big( p\log\frac{2p}{p+\upsilon} + \upsilon\log\frac{2\upsilon}{p+\upsilon} \Big)\, d\lambda. \tag{1}
\]

Theorem A.2 (properties of the PJS divergence). Consider measures $Q, P, \Upsilon \in \mathcal{M}_+^{\lambda}$. Suppose that $Q \ll \Upsilon$, $P \ll \Upsilon$ and $A_P = A_Q$. Then the PJS divergence satisfies
(a) $D_{PJS}(P\|Q) = 2 D_{JS}(P\|Q)$, where $D_{JS}$ is the vanilla JS divergence;
(b) $D_{PJS}(P\|\Upsilon) \ge 0$;
(c) $\sqrt{D_{PJS}(Q\|\Upsilon)} \le \sqrt{D_{PJS}(P\|\Upsilon)} + \sqrt{D_{PJS}(P\|Q)}$.

Proof. Let $p = \frac{dP}{d\lambda}$, $q = \frac{dQ}{d\lambda}$ and $\upsilon = \frac{d\Upsilon}{d\lambda}$ denote the Radon-Nikodym derivatives accordingly.

(a) The claim follows from the definition of the original Jensen-Shannon divergence (Endres & Schindelin, 2003). Taking $A = A_P = A_Q$, the JS integrand vanishes outside $A$ because $p = q = 0$ there, hence
\[
D_{JS}(P\|Q) = \int \Big( \tfrac{1}{2}p\log\frac{2p}{p+q} + \tfrac{1}{2}q\log\frac{2q}{p+q} \Big) d\lambda
= \tfrac{1}{2}\int_{A_P} \Big( p\log\frac{2p}{p+q} + q\log\frac{2q}{p+q} \Big) d\lambda
= \tfrac{1}{2} D_{PJS}(P\|Q). \tag{2}
\]

(b) We show equivalently that $-D_{PJS}(P\|\Upsilon) \le 0$:
\[
-D_{PJS}(P\|\Upsilon) = \int_{A_P} \Big( p\log\frac{p+\upsilon}{2p} + \upsilon\log\frac{p+\upsilon}{2\upsilon} \Big) d\lambda
\overset{(i)}{\le} \int_{A_P} \Big( p\Big(\frac{p+\upsilon}{2p}-1\Big) + \upsilon\Big(\frac{p+\upsilon}{2\upsilon}-1\Big) \Big) d\lambda
= \int_{A_P} \big( (p+\upsilon) - p - \upsilon \big)\, d\lambda = 0, \tag{3}
\]
where the inequality (i) uses the fact that $\log x \le x - 1$ for $x > 0$.

(c) The triangle inequality relies on the following lemma from Endres & Schindelin (2003).

Lemma A.3 (Lemma 2 in Endres & Schindelin (2003)). Let $p, q, \upsilon \in \mathbb{R}_+$ and let $L(p,q) \triangleq p\log\frac{2p}{p+q} + q\log\frac{2q}{p+q}$. Then
\[
\sqrt{L(q,\upsilon)} \le \sqrt{L(p,\upsilon)} + \sqrt{L(p,q)}. \tag{4}
\]
We skip the proof of this lemma; please refer to Lemma 2 in Endres & Schindelin (2003) for the details.

By the definition of $D_{PJS}$ and taking $A = A_P = A_Q$, there is
\[
D_{PJS}(Q\|\Upsilon) = \int_{A} \Big( q\log\frac{2q}{q+\upsilon} + \upsilon\log\frac{2\upsilon}{q+\upsilon} \Big) d\lambda
= \int_{A} \big(\sqrt{L(q,\upsilon)}\big)^2 d\lambda
\overset{(i)}{\le} \int_{A} \big(\sqrt{L(p,\upsilon)}+\sqrt{L(p,q)}\big)^2 d\lambda
\overset{(ii)}{\le} \Big( \Big(\int_{A} L(p,\upsilon)\, d\lambda\Big)^{\frac12} + \Big(\int_{A} L(p,q)\, d\lambda\Big)^{\frac12} \Big)^2
= \Big( \sqrt{D_{PJS}(P\|\Upsilon)} + \sqrt{D_{PJS}(P\|Q)} \Big)^2, \tag{5}
\]
where the inequality (i) invokes Lem. A.3 and the inequality (ii) invokes Minkowski's inequality (Royden & Fitzpatrick (1988), Sect. 7.2).

Definition A.4 (pruned Jensen-Shannon loss). Let $A$ be the support of the underlying (empirical) distribution from which the realization $(x, y)$ is drawn. The pruned Jensen-Shannon (PJS) loss $\ell_{PJS} : \mathcal{X}\times\mathcal{I}\times\mathcal{H} \to \mathbb{R}_+$ is
\[
\ell_{PJS}(x, y, h) = \Big( \log\frac{2}{h^{\star}+1} + h^{\star}\log\frac{2h^{\star}}{h^{\star}+1} \Big)\, \mathbf{1}\{(x,y)\in A\}, \tag{6}
\]
where $h^{\star} \in [0,1]$ is the component of the output vector $h$ that corresponds to the correct class, and $\mathbf{1}\{\cdot\}$ is the indicator function.

Proposition A.5 (connecting risk and PJS divergence). For a classifier $h$ and a domain $P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, let $\Upsilon_P$ be the measure induced by the density $\upsilon_P(x,y) = p(x)h(y|x)$. Assuming $p_P(y|x) \in \{0,1\}$, then
\[
R_P^{\ell_{PJS}}(h) = \mathbb{E}_{(x,y)\sim P}[\ell_{PJS}(x,y,h)] = D_{PJS}(P\|\Upsilon_P). \tag{7}
\]

Proof. For ease of notation, we omit the subscript $P$ of $p_P(x)$ and $\Upsilon_P$, writing $p(x)$ and $\Upsilon$ instead, and let $y$ range over the label set with the counting measure $dy$. We use two facts: (a) $p(y^{\star}|x)=1$ for the correct class $y^{\star}$ and $p(y|x)=0$ for any wrong class (one-hot labels), so on the support $A_P$ we have $p(x,y)=p(x)$ and $\upsilon(x,y)=p(x)h^{\star}$ with $h^{\star}=h(y^{\star}|x)$; (b) because the integral is pruned to $A_P$, the mass that $\Upsilon$ places on the wrong classes does not contribute. Therefore
\[
D_{PJS}(P\|\Upsilon) = \int_{A_P} \Big( p(x,y)\log\frac{2p(x,y)}{p(x,y)+\upsilon(x,y)} + \upsilon(x,y)\log\frac{2\upsilon(x,y)}{p(x,y)+\upsilon(x,y)} \Big)\, dx\, dy
= \int_{\{x:p(x)>0\}} p(x)\Big( \log\frac{2}{1+h^{\star}} + h^{\star}\log\frac{2h^{\star}}{1+h^{\star}} \Big) dx
= \mathbb{E}_{(x,y)\sim P}[\ell_{PJS}(x,y,h)] = R_P^{\ell_{PJS}}(h). \tag{8}
\]
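To make Definition A.4 concrete, here is a minimal PyTorch sketch of the PJS loss computed from softmax outputs; the function name, the clamping constant, and the omission of the indicator (every observed training pair lies in the empirical support) are our choices, not part of the paper's released code.

```python
import torch

def pjs_loss(logits: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-example pruned Jensen-Shannon loss of Eq. (6).

    logits: (N, C) raw classifier outputs; labels: (N,) integer class indices.
    The loss is 0 when the correct-class probability is 1 and tends to log(2)
    as that probability goes to 0, so it is bounded in [0, log 2].
    """
    probs = torch.softmax(logits, dim=1)
    h_star = probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # correct-class probability h*
    h_star = h_star.clamp(min=eps)                             # numerical guard for the log terms
    return torch.log(2.0 / (h_star + 1.0)) + h_star * torch.log(2.0 * h_star / (h_star + 1.0))
```

By Prop. A.5, averaging this quantity over a source batch gives an empirical estimate of $D_{PJS}(P\|\Upsilon_P)$, the risk in which the bounds below are stated.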
Theorem A.6 (bounds of the target risk). Let $Q$ and $P$ be the (fixed) target domain and the source domain in $\mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, with $p_Q(x)$ and $p_P(x)$ being their density functions over the input space $\mathcal{X}$, respectively. Suppose the supports of $Q$ and $P$ are identical, i.e., $A_P = A_Q$, with $p_P(y|x), p_Q(y|x) \in \{0,1\}$. Let $\mathcal{H}$ be a hypothesis space such that for any $h \in \mathcal{H}$, the measure $\Upsilon_P$ induced by the density $p_P(x)h$ dominates $P$, and $\Upsilon_Q$ dominates $Q$, i.e., $P \ll \Upsilon_P$ and $Q \ll \Upsilon_Q$ (this is not a strong requirement as long as the component of the correct class in the vector $h$ is non-zero). Then for any $h \in \mathcal{H}$,
\[
\sqrt{R_Q^{\ell_{PJS}}(h)} \le \sqrt{R_P^{\ell_{PJS}}(h)} + 2\sqrt{2 D_{JS}(P\|Q)}, \tag{9}
\]
\[
\sqrt{R_Q^{\ell_{PJS}}(h)} \ge \sqrt{R_P^{\ell_{PJS}}(h)} - 2\sqrt{2 D_{JS}(P\|Q)}. \tag{10}
\]

Proof. We use the following lemma to prove the first inequality.

Lemma A.7. Given $P, Q \in \mathcal{M}_+(\mathcal{X}\times\mathcal{I})$, suppose that $\Upsilon_Q$ is the measure induced by a classifier $h$ and the marginal density $p_Q(x)$ of $Q$, i.e., the density of $\Upsilon_Q$ is $\upsilon_Q = p_Q(x)h$; similarly, let $\Upsilon_P$ be the measure induced by $h$ and the marginal density $p_P(x)$ of $P$, with $\upsilon_P = p_P(x)h$. If $p_P(y|x), p_Q(y|x) \in \{0,1\}$, $P \ll Q$, $P \ll \Upsilon_P$ and $Q \ll \Upsilon_Q \ll \lambda$, then
\[
D_{PJS}(\Upsilon_P\|\Upsilon_Q) \le D_{PJS}(P\|Q). \tag{11}
\]
Proof of the lemma: by the definition of the PJS divergence, and because the classifier density $h(y|x)$ cancels inside the logarithms,
\[
D_{PJS}(\Upsilon_P\|\Upsilon_Q) = \int_{A_{\Upsilon_P}} \Big( \upsilon_P \log\frac{2\upsilon_P}{\upsilon_P+\upsilon_Q} + \upsilon_Q \log\frac{2\upsilon_Q}{\upsilon_P+\upsilon_Q} \Big) d\lambda
= \int_{\{x:p_P(x)>0\}} \sum_{y\in\mathcal{I}} \mathbf{1}\{(x,y)\in A_{\Upsilon_P}\}\, h(y|x) \Big( p_P(x)\log\frac{2p_P(x)}{p_P(x)+p_Q(x)} + p_Q(x)\log\frac{2p_Q(x)}{p_P(x)+p_Q(x)} \Big) dx
\]
\[
= \int_{\{x:p_P(x)>0\}} \Big( p_P(x)\log\frac{2p_P(x)}{p_P(x)+p_Q(x)} + p_Q(x)\log\frac{2p_Q(x)}{p_P(x)+p_Q(x)} \Big) dx
\overset{(i)}{=} \int_{\{x:p_P(x)>0\}} \Big( p_P(x,y^{\star})\log\frac{2p_P(x,y^{\star})}{p_P(x,y^{\star})+p_Q(x,y^{\star})} + p_Q(x,y^{\star})\log\frac{2p_Q(x,y^{\star})}{p_P(x,y^{\star})+p_Q(x,y^{\star})} \Big) dx
= D_{PJS}(P\|Q), \tag{12}
\]
where the second line uses $\sum_{y} \mathbf{1}\{(x,y)\in A_{\Upsilon_P}\}\, h(y|x) = 1$ whenever $p_P(x) > 0$, and the equality (i) invokes the facts that $p_P(y^{\star}|x) = p_Q(y^{\star}|x) = 1$ and $P \ll Q$, since the labels are one-hot vectors and the probability of the correct class is 1. This establishes Eq. (11). Q.E.D. of Lemma A.7.

We then apply the triangle inequality of Thm. A.2(c) to the measures $P$, $Q$, $\Upsilon_P$ and $\Upsilon_Q$, leading to
\[
\sqrt{D_{PJS}(Q\|\Upsilon_Q)} \le \sqrt{D_{PJS}(Q\|\Upsilon_P)} + \sqrt{D_{PJS}(\Upsilon_Q\|\Upsilon_P)}
\le \sqrt{D_{PJS}(P\|\Upsilon_P)} + \sqrt{D_{PJS}(P\|Q)} + \sqrt{D_{PJS}(\Upsilon_Q\|\Upsilon_P)}
\overset{(ii)}{\le} \sqrt{D_{PJS}(P\|\Upsilon_P)} + 2\sqrt{2 D_{JS}(P\|Q)}, \tag{13}
\]
where Lem. A.7 and Thm. A.2(a) are applied in the inequality (ii). Invoking Prop. A.5, there are
\[
R_P^{\ell_{PJS}}(h) = D_{PJS}(P\|\Upsilon_P), \tag{14}
\]
\[
R_Q^{\ell_{PJS}}(h) = D_{PJS}(Q\|\Upsilon_Q), \tag{15}
\]
which, combined with Eq. (13), lead to the desired inequality
\[
\sqrt{R_Q^{\ell_{PJS}}(h)} \le \sqrt{R_P^{\ell_{PJS}}(h)} + 2\sqrt{2 D_{JS}(P\|Q)}. \tag{16}
\]
Similarly, by switching the arguments $Q$ and $P$, we have
\[
\sqrt{D_{PJS}(P\|\Upsilon_P)} \le \sqrt{D_{PJS}(P\|\Upsilon_Q)} + \sqrt{D_{PJS}(\Upsilon_P\|\Upsilon_Q)}
\le \sqrt{D_{PJS}(Q\|\Upsilon_Q)} + \sqrt{D_{PJS}(P\|Q)} + \sqrt{D_{PJS}(\Upsilon_P\|\Upsilon_Q)}
\le \sqrt{D_{PJS}(Q\|\Upsilon_Q)} + 2\sqrt{2 D_{JS}(P\|Q)}. \tag{17}
\]
Therefore, invoking Prop. A.5 again yields the second inequality:
\[
\sqrt{R_P^{\ell_{PJS}}(h)} \le \sqrt{R_Q^{\ell_{PJS}}(h)} + 2\sqrt{2 D_{JS}(P\|Q)}
\;\Longrightarrow\;
\sqrt{R_Q^{\ell_{PJS}}(h)} \ge \sqrt{R_P^{\ell_{PJS}}(h)} - 2\sqrt{2 D_{JS}(P\|Q)}. \tag{18}
\]
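Theorem A.6 can also be checked numerically. The NumPy sketch below builds two discrete domains with a common deterministic labeling, evaluates both sides of Eq. (9) for a fixed stochastic classifier, and confirms the inequality; the toy distributions and all variable names are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 6, 3                                   # number of inputs and classes
p_x = rng.dirichlet(np.ones(K))               # source marginal p_P(x)
q_x = rng.dirichlet(np.ones(K))               # target marginal p_Q(x), same support
y_star = rng.integers(0, C, size=K)           # shared one-hot labeling y*(x)
h = rng.dirichlet(np.ones(C), size=K)         # a fixed classifier h(.|x), rows sum to 1

def pjs_risk(marg):
    """R^{l_PJS}(h) = D_PJS(P || Upsilon_P) for a discrete domain (Prop. A.5)."""
    h_star = h[np.arange(K), y_star]
    return np.sum(marg * (np.log(2.0 / (1.0 + h_star))
                          + h_star * np.log(2.0 * h_star / (1.0 + h_star))))

def js_domains():
    """D_JS(P || Q); with a shared labeling it reduces to JS between the input marginals."""
    m = 0.5 * (p_x + q_x)
    return 0.5 * np.sum(p_x * np.log(p_x / m)) + 0.5 * np.sum(q_x * np.log(q_x / m))

lhs = np.sqrt(pjs_risk(q_x))                               # target square-root risk
rhs = np.sqrt(pjs_risk(p_x)) + 2.0 * np.sqrt(2.0 * js_domains())
assert lhs <= rhs + 1e-12                                  # Eq. (9) holds on this example
print(f"sqrt target risk {lhs:.4f} <= bound {rhs:.4f}")
```

Because the labeling is shared across the two domains, the Jensen-Shannon divergence between the joint distributions reduces to the divergence between the input marginals, which is what js_domains computes.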
Definition A.8 (ρ-ensemble). Suppose that $\rho$ is a measure on a hypothesis space $\mathcal{H}$. The $\rho$-ensemble $\hat{h}_\rho$ is the $\rho$-weighted averaged classifier,
\[
\hat{h}_\rho = \mathbb{E}_{h\sim\rho}[h]. \tag{19}
\]

Theorem A.9 (inequalities related to the ρ-ensemble). Let $\mathcal{H}$ be a hypothesis space as stated in Thm. A.6. For any $\rho \in \mathcal{M}_+(\mathcal{H})$ and any measure $P$, take $D_P = \mathbb{E}_P[\mathrm{Var}_\rho(\sqrt{\ell_{PJS}})]$. Then
\[
\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} \le \mathbb{E}_\rho\Big[\sqrt{R_P^{\ell_{PJS}}(h)}\Big]. \tag{20}
\]
Moreover, $D_P > 0$ is a necessary condition for the second inequality being strict. If the ratio $\ell_{PJS}(x_1,y,h)/\ell_{PJS}(x_2,y,h)$ varies in $h$ on (a non-zero-measure subset of) the set $\{(x,y,h): \mathrm{Var}_\rho(\sqrt{\ell_{PJS}}) > 0\}$ for some $x_1 \ne x_2$, then $D_P > 0$ is a sufficient condition for the second inequality being strict.

Proof. The inequalities rely on the convexity induced by the design of the PJS loss, captured by the following lemma.

Lemma A.10. The function
\[
g(t) = \sqrt{\log\frac{2}{t+1} + t\log\frac{2t}{t+1}} \tag{21}
\]
is convex on $t \in (0,1)$.

Proof of the lemma: we check the sign of the second derivative of $g(t)$. Since $g^2(t) = \log\frac{2}{t+1} + t\log\frac{2t}{t+1}$,
\[
\frac{dg}{dt} = \frac{\log\frac{2t}{t+1}}{2g(t)}, \tag{22}
\]
\[
\frac{d^2g}{dt^2} = \frac{2g^2(t) - t(t+1)\log^2\frac{2t}{t+1}}{4t(t+1)g^3(t)}
= \frac{2\log\frac{2}{t+1} + 2t\log\frac{2t}{t+1} - t(t+1)\log^2\frac{2t}{t+1}}{4t(t+1)g^3(t)}. \tag{23}
\]
It suffices to check the sign of the numerator of Eq. (23), since the denominator $4t(t+1)g^3(t)$ is positive on $t\in(0,1)$. Let $G(t) = 2\log\frac{2}{t+1} + 2t\log\frac{2t}{t+1} - t(t+1)\log^2\frac{2t}{t+1}$; then
\[
\frac{dG}{dt} = -(2t+1)\log^2\frac{2t}{t+1} < 0, \quad \text{for } t\in(0,1). \tag{24}
\]
Hence $G(t)$ is decreasing on $t\in(0,1)$, and
\[
G(t) > G(1) = 2\log\frac{2}{2} + 2\log\frac{2}{2} - 2\log^2\frac{2}{2} = 0. \tag{25}
\]
Therefore $\frac{d^2g}{dt^2} > 0$ on $t\in(0,1)$, which implies $g(t)$ is convex on $t\in(0,1)$. Q.E.D. of Lemma A.10.

Remark A.11. The inequalities in Thm. 3.7, Cor. 3.8, Cor. 3.9 and Thm. 3.10 rely on the design of the PJS divergence. Specifically, the key is the convexity of $g(t) = \sqrt{\log\frac{2}{t+1} + t\log\frac{2t}{t+1}}$ on $t\in(0,1)$, whereas the corresponding function of the original JS divergence, $\sqrt{\log\frac{2}{t+1} + t\log\frac{2t}{t+1} + (1-t)\log 2}$, is neither convex nor concave on $t\in(0,1)$. Hence, within the framework of the original JS divergence, the corresponding inequalities do not necessarily hold.

Next we prove the first inequality, $\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P}$ with $D_P = \mathbb{E}_P[\mathrm{Var}_\rho(\sqrt{\ell_{PJS}})]$. Write $h^{\star}$ for the correct-class component of $h$. By Lem. A.10 and Jensen's inequality (Royden & Fitzpatrick (1988), Sect. 6.6),
\[
g(\mathbb{E}_\rho[h^{\star}]) \le \mathbb{E}_\rho[g(h^{\star})]. \tag{26}
\]
Since $g(t)$ is non-negative,
\[
g^2(\mathbb{E}_\rho[h^{\star}]) \le \big(\mathbb{E}_\rho[g(h^{\star})]\big)^2. \tag{27}
\]
Taking the expectation over $P$ on both sides of Eq. (27) leads to
\[
\mathbb{E}_P\big[g^2(\mathbb{E}_\rho[h^{\star}])\big] \le \mathbb{E}_P\big[\big(\mathbb{E}_\rho[g(h^{\star})]\big)^2\big]. \tag{28}
\]
It suffices to show that the left-hand side (L.H.S.) of Eq. (28) equals $R_P^{\ell_{PJS}}(\hat{h}_\rho)$ and the right-hand side (R.H.S.) equals $\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P$, which follow from the definitions:
\[
\text{L.H.S.:}\quad g^2(\mathbb{E}_\rho[h^{\star}]) = \log\frac{2}{\mathbb{E}_\rho[h^{\star}]+1} + \mathbb{E}_\rho[h^{\star}]\log\frac{2\mathbb{E}_\rho[h^{\star}]}{\mathbb{E}_\rho[h^{\star}]+1} = \ell_{PJS}(x,y,\hat{h}_\rho),
\;\text{ so }\;
\mathbb{E}_P\big[g^2(\mathbb{E}_\rho[h^{\star}])\big] = R_P^{\ell_{PJS}}(\hat{h}_\rho); \tag{29}
\]
\[
\text{R.H.S.:}\quad \mathbb{E}_P\big[\big(\mathbb{E}_\rho[g(h^{\star})]\big)^2\big]
= \mathbb{E}_P\big[\mathbb{E}_\rho[g^2(h^{\star})] - \mathrm{Var}_\rho(g(h^{\star}))\big]
= \mathbb{E}_P\big[\mathbb{E}_\rho[\ell_{PJS}(x,y,h)]\big] - \mathbb{E}_P\big[\mathrm{Var}_\rho(\sqrt{\ell_{PJS}})\big]
\overset{(i)}{=} \mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P, \tag{30}
\]
where the equality (i) invokes Fubini's theorem (Royden & Fitzpatrick (1988), Sect. 20.1). Thus we have proved the first inequality:
\[
R_P^{\ell_{PJS}}(\hat{h}_\rho) \le \mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P
\;\Longrightarrow\;
\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P}. \tag{31}
\]
Next we prove the second inequality, $\sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} \le \mathbb{E}_\rho[\sqrt{R_P^{\ell_{PJS}}(h)}]$. By Eq. (30), it suffices to show $\sqrt{\mathbb{E}_P[(\mathbb{E}_\rho[g(h^{\star})])^2]} \le \mathbb{E}_\rho[\sqrt{R_P^{\ell_{PJS}}(h)}]$, which is equivalent to
\[
\Big( \int_{A_P} \Big( \int_{A_\rho} g(h^{\star})\, d\rho \Big)^2 dP \Big)^{\frac12}
\le \int_{A_\rho} \Big( \int_{A_P} g^2(h^{\star})\, dP \Big)^{\frac12} d\rho, \tag{32}
\]
which is guaranteed by Minkowski's integral inequality (Hardy et al. (1952), Thm. 202). The inequality in Eq. (32) becomes an equality if and only if $\ell_{PJS}(x,y,h) = g^2(h^{\star})\mathbf{1}\{(x,y)\in A_P\} = \psi(x,y)\varphi(h)$ for some fixed measurable functions $\psi(\cdot)$ and $\varphi(\cdot)$ almost everywhere. When there is a non-zero-measure set $B \subset A_P \times A_\rho$ on which the ratio $\ell_{PJS}(x_1,y,h)/\ell_{PJS}(x_2,y,h)$ varies in $h$, the existence of $B$ contradicts the existence of such $\psi(\cdot)$ and $\varphi(\cdot)$: if both existed, then on $B$, for $h_1 \ne h_2$ and $x_1 \ne x_2$, the factorization would force
\[
\frac{\ell_{PJS}(x_1,y,h_1)}{\ell_{PJS}(x_2,y,h_1)} = \frac{\psi(x_1,y)\varphi(h_1)}{\psi(x_2,y)\varphi(h_1)} = \frac{\psi(x_1,y)}{\psi(x_2,y)} = \frac{\psi(x_1,y)\varphi(h_2)}{\psi(x_2,y)\varphi(h_2)} = \frac{\ell_{PJS}(x_1,y,h_2)}{\ell_{PJS}(x_2,y,h_2)}, \tag{33}
\]
contradicting the assumption that the ratio varies on $B$. Therefore, when such a set $B$ exists, the second inequality of the theorem is strict.

Necessity of $D_P > 0$: if $D_P = 0$, then for $P$-almost every $(x,y)$, $\sqrt{\ell_{PJS}(x,y,\cdot)}$ is $\rho$-almost surely constant in $h$, so infinitely many factorizations $\psi(\cdot), \varphi(\cdot)$ exist and the second inequality of the theorem becomes an equality. Therefore, if the inequality is strict, $D_P$ cannot be $0$, which proves the necessity.
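Both the closed form in Eq. (23) and the Jensen step in Eq. (26) are easy to verify numerically; the short sketch below does so on a grid (the names and tolerances are ours).

```python
import numpy as np

def g(t):
    # g(t) from Eq. (21): square root of the PJS loss as a function of the correct-class probability
    return np.sqrt(np.log(2.0 / (t + 1.0)) + t * np.log(2.0 * t / (t + 1.0)))

def g_second_derivative(t):
    # Closed form of Eq. (23)
    num = 2.0 * g(t) ** 2 - t * (t + 1.0) * np.log(2.0 * t / (t + 1.0)) ** 2
    return num / (4.0 * t * (t + 1.0) * g(t) ** 3)

grid = np.linspace(0.01, 0.99, 99)
assert np.all(g_second_derivative(grid) > 0)        # Lemma A.10: g'' > 0 on (0, 1)

samples = np.random.default_rng(1).uniform(0.05, 0.95, size=1000)
assert g(samples.mean()) <= g(samples).mean()       # Jensen's inequality, Eq. (26)
print("convexity and Jensen checks passed")
```

Running the same check on the square root of the original JS-based loss, which adds the $(1-t)\log 2$ term, yields second derivatives of both signs, consistent with Remark A.11.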
Corollary A.12 (target risk upper bound of the ensembles). Given a fixed source domain $P$ and a target domain $Q$, for any measure $\rho \in \mathcal{M}_+(\mathcal{H})$ on the hypothesis space $\mathcal{H}$,
\[
\sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} + 2\sqrt{2 D_{JS}(P\|Q)}. \tag{34}
\]
Proof. The proof follows by combining Thm. A.6 and Thm. A.9:
\[
\sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} + 2\sqrt{2 D_{JS}(P\|Q)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} + 2\sqrt{2 D_{JS}(P\|Q)}. \tag{35}
\]

Corollary A.13 (joint risk upper bound of the ensembles). Given a fixed source domain $P$ and a target domain $Q$, for any measure $\rho \in \mathcal{M}_+(\mathcal{H})$ on the hypothesis space $\mathcal{H}$,
\[
\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} + \sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)}
\le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} + \sqrt{\mathbb{E}_\rho[R_Q^{\ell_{PJS}}(h)] - D_Q}
\le \mathbb{E}_\rho\Big[\sqrt{R_P^{\ell_{PJS}}(h)}\Big] + \mathbb{E}_\rho\Big[\sqrt{R_Q^{\ell_{PJS}}(h)}\Big]. \tag{36}
\]
Proof. It is easy to check that the requirements of Thm. A.9 are satisfied by both the source domain $P$ and the target domain $Q$, so the proof follows by invoking Thm. A.9 twice, w.r.t. $P$ and $Q$:
\[
\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} \le \mathbb{E}_\rho\Big[\sqrt{R_P^{\ell_{PJS}}(h)}\Big], \tag{37}
\]
\[
\sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)} \le \sqrt{\mathbb{E}_\rho[R_Q^{\ell_{PJS}}(h)] - D_Q} \le \mathbb{E}_\rho\Big[\sqrt{R_Q^{\ell_{PJS}}(h)}\Big]. \tag{38}
\]
Summing Eq. (37) and Eq. (38) term by term gives
\[
\sqrt{R_P^{\ell_{PJS}}(\hat{h}_\rho)} + \sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)}
\le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} + \sqrt{\mathbb{E}_\rho[R_Q^{\ell_{PJS}}(h)] - D_Q}
\le \mathbb{E}_\rho\Big[\sqrt{R_P^{\ell_{PJS}}(h)}\Big] + \mathbb{E}_\rho\Big[\sqrt{R_Q^{\ell_{PJS}}(h)}\Big]. \tag{39}
\]
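To see the role of the diversity term in Thm. A.9 and Cor. A.13 numerically, the sketch below draws M stochastic classifiers on a toy batch, forms the ρ-ensemble by averaging their predictive probabilities, and checks the chain of inequalities in Eq. (20); the toy data and every name here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, C = 8, 256, 5                         # Gibbs classifiers, examples, classes
labels = rng.integers(0, C, size=N)

# softmax outputs of M sampled classifiers (e.g. dropout draws) on a weakly informative toy task
logits = rng.normal(size=(M, N, C)) + 2.0 * np.eye(C)[labels]
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
h_star = probs[:, np.arange(N), labels]                  # correct-class probabilities, shape (M, N)

def pjs(h):
    return np.log(2.0 / (h + 1.0)) + h * np.log(2.0 * h / (h + 1.0))

gibbs_risk = pjs(h_star).mean()                          # E_rho[R_P(h)]
diversity = np.sqrt(pjs(h_star)).var(axis=0).mean()      # D_P = E_P[Var_rho(sqrt(l_PJS))]
ensemble_risk = pjs(h_star.mean(axis=0)).mean()          # R_P(rho-ensemble): average the predictions

# Eq. (20): sqrt ensemble risk <= sqrt(Gibbs risk - D_P) <= average of the sqrt Gibbs risks
assert np.sqrt(ensemble_risk) <= np.sqrt(gibbs_risk - diversity) + 1e-12
assert np.sqrt(gibbs_risk - diversity) <= np.sqrt(pjs(h_star).mean(axis=1)).mean() + 1e-12
print(f"ensemble {ensemble_risk:.4f} <= Gibbs - diversity {gibbs_risk - diversity:.4f}")
```

With the population variance used above, the identity of Eq. (30) holds exactly for the empirical measures, so the gap between the averaged Gibbs risk and the ensemble risk is at least the measured diversity.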
Theorem A.14 (PAC-Bayesian generalization upper bound). For a fixed source domain $P$ and a fixed target domain $Q$, let $\mathcal{H}$ be a hypothesis space as stated in Thm. A.6. Suppose that $\pi$ is a prior over $\mathcal{H}$ which is independent of the draw of the source realizations $D_n = \{(x_i,y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P^n$. Then for any $c > 0$, any $\rho \in \mathcal{M}_+(\mathcal{H})$, and any $\delta \in (0,1)$, with probability at least $1-\delta$,
\[
\sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)} \le 2\sqrt{2 D_{JS}(P\|Q)} + \sqrt{\mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P + \frac{2 D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}^{\ell_{PJS}}(c,n)}{cn}}, \tag{40}
\]
where $\Psi_{P,\pi}^{\ell_{PJS}}(c,n) = \log \mathbb{E}_{(h',h)\sim\pi^2}\mathbb{E}_{P^n}\big[e^{cn\,(L_P^{\ell_{PJS}}(h',h) - \hat{L}_P^{\ell_{PJS}}(h',h))}\big]$, with $L_P^{\ell_{PJS}}(h',h) = \mathbb{E}_{(x,y)\sim P}\big[\sqrt{\ell_{PJS}(x,y,h')}\sqrt{\ell_{PJS}(x,y,h)}\big]$ and $\hat{L}_P^{\ell_{PJS}}$ its empirical counterpart on $D_n$, is constant w.r.t. $\rho$ for fixed $c$, $n$, $\pi$, $\ell$, and $\delta$.

Proof. The main part of the proof employs the same technique as the proof of Thm. 3 in Masegosa (2020), which relies on a lemma (Thm. 3 in Germain et al. (2016a)) derived from the acknowledged Donsker and Varadhan's variational formula (Donsker & Varadhan, 1983).

Lemma A.15 (Alquier et al. (2016), Germain et al. (2016a)). Given a distribution $P \in \mathcal{M}_+(\mathcal{X}\times\mathcal{Y})$, a hypothesis space $\mathcal{H}$, a loss function $\ell : \mathcal{X}\times\mathcal{Y}\times\mathcal{H} \to \mathbb{R}$, and a prior distribution $\pi \in \mathcal{M}_+(\mathcal{H})$, for any $\delta \in (0,1)$ and any real number $\zeta > 0$, with probability at least $1-\delta$ over the choice of $P^n$, for any $\rho \in \mathcal{M}_+(\mathcal{H})$,
\[
\mathbb{E}_{h\sim\rho}[\mathcal{L}_P^{\ell}(h)] \le \mathbb{E}_{h\sim\rho}[\hat{\mathcal{L}}_P^{\ell}(h)] + \frac{1}{\zeta}\Big[ D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}(\zeta,n) \Big], \tag{41}
\]
where $\Psi_{P,\pi}(\zeta,n) = \log \mathbb{E}_{h\sim\pi}\mathbb{E}_{P^n}\big[e^{\zeta(\mathcal{L}_P^{\ell}(h) - \hat{\mathcal{L}}_P^{\ell}(h))}\big]$. We skip the proof of this lemma; please refer to Thm. 3 in Germain et al. (2016a) for the details.

Recalling Eq. (30) in the proof of Thm. A.9, where we obtained $\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P = \mathbb{E}_P\big[(\mathbb{E}_\rho[g(h^{\star})])^2\big]$, we have
\[
\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P
= \mathbb{E}_{(x,y)\sim P}\Big[\big(\mathbb{E}_{h\sim\rho}\sqrt{\ell_{PJS}(x,y,h)}\big)^2\Big]
= \mathbb{E}_{(h',h)\sim\rho^2}\Big[\mathbb{E}_{(x,y)\sim P}\big[\sqrt{\ell_{PJS}(x,y,h')}\sqrt{\ell_{PJS}(x,y,h)}\big]\Big]
= \mathbb{E}_{(h',h)\sim\rho^2}\big[L_P^{\ell_{PJS}}(h',h)\big], \tag{42}
\]
and, by the same manipulation on the empirical measure, $\mathbb{E}_{(h',h)\sim\rho^2}[\hat{L}_P^{\ell_{PJS}}(h',h)] = \mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P$. Regarding $\mathcal{L}_P^{\ell}(\cdot)$ in Lem. A.15 as $L_P^{\ell_{PJS}}(h',h)$ over the product hypothesis space $\mathcal{H}\times\mathcal{H}$ with prior $\pi^2$ and posterior $\rho^2$, taking $\zeta = cn$, and invoking the fact that $D_{KL}(\rho^2\|\pi^2) = 2 D_{KL}(\rho\|\pi)$, Lem. A.15 gives, with probability at least $1-\delta$,
\[
\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P
\le \mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P + \frac{1}{cn}\Big[ 2 D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}^{\ell_{PJS}}(c,n) \Big]. \tag{43}
\]
Note that $\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P = \mathbb{E}_{(x,y)\sim P}\big[(\mathbb{E}_\rho\sqrt{\ell_{PJS}(x,y,h)})^2\big] \ge 0$, so both sides of Eq. (43) are non-negative and taking square roots preserves the inequality:
\[
\sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P}
\le \sqrt{\mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P + \frac{1}{cn}\Big[ 2 D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}^{\ell_{PJS}}(c,n) \Big]}. \tag{44}
\]
Combining Cor. A.12 and Eq. (44) completes the proof:
\[
\sqrt{R_Q^{\ell_{PJS}}(\hat{h}_\rho)}
\le \sqrt{\mathbb{E}_\rho[R_P^{\ell_{PJS}}(h)] - D_P} + 2\sqrt{2 D_{JS}(P\|Q)}
\le \sqrt{\mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P + \frac{1}{cn}\Big[ 2 D_{KL}(\rho\|\pi) + \log\frac{1}{\delta} + \Psi_{P,\pi}^{\ell_{PJS}}(c,n) \Big]} + 2\sqrt{2 D_{JS}(P\|Q)}. \tag{45}
\]
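The empirical quantity inside the square root of Eq. (45), $\mathbb{E}_\rho[\hat{R}_P^{\ell_{PJS}}(h)] - \hat{D}_P$, equals the batch average of $(\mathbb{E}_\rho\sqrt{\ell_{PJS}})^2$ by the empirical analogue of Eq. (30). Below is a minimal sketch of using it as a differentiable training objective, assuming dropout draws play the role of the Gibbs classifiers and the KL/complexity term is left to ordinary regularization such as weight decay; the function and variable names are ours, not the paper's released code.

```python
import torch

def pjs_loss(logits: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Pruned Jensen-Shannon loss of Eq. (6) for the observed (in-support) pairs.
    h_star = torch.softmax(logits, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1).clamp(min=eps)
    return torch.log(2.0 / (h_star + 1.0)) + h_star * torch.log(2.0 * h_star / (h_star + 1.0))

def diversified_objective(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                          num_samples: int = 4) -> torch.Tensor:
    """Empirical E_rho[R_hat_P(h)] - D_hat_P of Eq. (44): batch mean of (E_rho sqrt(l_PJS))^2."""
    model.train()                                    # keep dropout active: each pass is one Gibbs draw
    sqrt_losses = torch.stack(
        [torch.sqrt(pjs_loss(model(x), y) + 1e-12) for _ in range(num_samples)], dim=0)
    return (sqrt_losses.mean(dim=0) ** 2).mean()     # average sqrt-loss over draws, square, batch mean
```

Because the squared average of the square-root losses equals their average minus their (population) variance, minimizing this objective trades the averaged Gibbs loss against the diversity of the dropout draws, mirroring how $\hat{D}_P$ enters Eq. (45).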
B. Experimental details

B.1. Baseline details

This appendix provides a detailed description of the baseline methods used for the benchmark comparisons.

Empirical Risk Minimization (ERM, (Vapnik, 1999)) aggregates the data from all source domains and minimizes the cross-entropy loss for classification.

Invariant Risk Minimization (IRM, (Arjovsky et al., 2019)) learns a feature mapping such that the optimal linear classifier on top of that representation matches across source domains.

Group Distributionally Robust Optimization (DRO, (Sagawa et al., 2020)) performs ERM while increasing the importance of domains with larger error by re-weighting minibatches.

Inter-domain Mixup (Mixup, (Wang et al., 2020b)) employs the mixup technique (Zhang et al., 2018) across multiple domains and performs ERM on the augmented heterogeneous mixup distribution.

Meta-Learning for Domain Generalization (MLDG, (Li et al., 2018a)) divides the source domains into meta-train domains and a meta-test domain to simulate domain shift, and regularizes the model trained on the meta-train domains to perform well on the meta-test domain.

Deep CORrelation ALignment (CORAL, (Sun & Saenko, 2016)) matches the first-order (mean) and the second-order (covariance) statistics of feature distributions across source domains.

Maximum Mean Discrepancy (MMD, (Li et al., 2018c)) achieves distribution alignment in the latent space of an autoencoder by using adversarial learning and the maximum mean discrepancy criterion.

Domain Adversarial Neural Network (DANN, (Ganin et al., 2016)) employs a domain discriminator to align feature distributions across domains using adversarial learning.

Class-conditional Domain Adversarial Neural Network (CDANN, (Li et al., 2018d)) matches conditional feature distributions across domains, enabling alignment of multimodal distributions for all class labels.

Marginal Transfer Learning (MTL, (Blanchard et al., 2021)) estimates a kernel mean embedding per domain, passed as a second argument to the classifier; at test time, these embeddings are estimated from single test examples.

Style Agnostic Networks (SagNet, (Nam et al., 2021)) disentangles style encodings from class categories to prevent style-biased predictions and focus more on content.

Adaptive Risk Minimization (ARM, (Zhang et al., 2020)) is an extension of MLDG that introduces an additional module to compute domain embeddings, which are used by the prediction module to infer information about the input distribution.

Variance Risk Extrapolation (VREx, (Krueger et al., 2021)) is a form of robust optimization over a perturbation set of extrapolated domains and minimizes the variance of training risks across domains.

Representation Self-Challenging (RSC, (Huang et al., 2020)) iteratively discards the dominant features activated on the training data, and forces the CNN to activate the remaining features that correlate with labels.

Meta-Domain Specific-Domain Invariant (mDSDI, (Bui et al., 2021)) disentangles features in the latent space while jointly learning the domain-invariant and domain-specific features in a unified meta-learning framework; the domain-specific information is utilized to enhance predictions.

Stochastic Weight Averaging Densely (SWAD, (Cha et al., 2021)) finds flatter minima and suffers less from overfitting by modifying the vanilla SWA with a dense and overfit-aware sampling strategy.

B.2. Implementation details

Training and Evaluation protocol. We follow the training and evaluation protocol described in Gulrajani & Lopez-Paz (2021), and our implementation is built upon the released SWAD library (https://github.com/khanrc/swad). For training, we choose one domain as the target domain and use the remaining domains as source domains. We split each source domain into an 8:2 training/validation split and merge the validation subsets of the source domains into an overall validation set, which is used for validation and model selection.

Model Architectures. We use a ResNet-50 pre-trained on ImageNet as the backbone network, and all the BN layers are frozen during training. We replace the last FC layer of the ResNet-50 with a 2-layer classifier with 1024 hidden units and apply dropout regularization to the 1024-dimensional hidden output (except for DomainNet, where 2048 hidden units are used since its label set is much larger); a minimal sketch of this head is given at the end of this subsection.

Hyperparameters. We use the hyperparameter search protocol of Gulrajani & Lopez-Paz (2021) with slight modifications following Cha et al. (2021): we do not search independently for each test domain, we reuse the hyperparameters searched on the first random data split for the other splits, and we search algorithm-specific hyperparameters independently from the universal ones. We use a smaller search space (shown in Table 1) for computational efficiency. The batch size and the ResNet dropout rate are fixed at 32 and 0, respectively. The SWAD-specific hyperparameters are not searched; their default values are used. For the DNA-specific hyperparameters, we search the FC dropout rate and η. The FC dropout rate is searched in [0.1, 0.3, 0.5] per dataset: 0.1 is used for DomainNet and 0.5 for the other datasets. η is searched in [0.01, 0.1, 1.0] on PACS, and the selected value 0.1 is used for all experiments. The total number of training iterations is 20,000 for DomainNet and 5,000 for the other datasets, which is sufficient for our method to converge. The evaluation frequency is 500 for DomainNet, 50 for VLCS, and 100 for the other datasets.
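As a concrete reading of the architecture description above, the following is a minimal sketch; it assumes a ReLU activation between the two FC layers (the activation is not specified in the text) and the torchvision 0.8-era pretrained-weights flag, and it is not the released implementation.

```python
import torch.nn as nn
import torchvision

def build_backbone(num_classes: int, fc_dropout: float = 0.5, hidden: int = 1024) -> nn.Module:
    """ResNet-50 backbone with a 2-layer classifier head and dropout on the hidden output."""
    net = torchvision.models.resnet50(pretrained=True)        # ImageNet-pretrained backbone
    # Freeze every BatchNorm layer: fix running statistics and stop affine updates.
    # NOTE: re-apply .eval() to the BN modules after every model.train() call to keep them frozen.
    for m in net.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad_(False)
    # Replace the last FC layer with a 2-layer classifier; dropout acts on the hidden units.
    net.fc = nn.Sequential(
        nn.Linear(net.fc.in_features, hidden),
        nn.ReLU(inplace=True),                                 # assumed activation
        nn.Dropout(p=fc_dropout),
        nn.Linear(hidden, num_classes),
    )
    return net
```

During training the dropout in this head supplies the stochastic forward passes; switching the head to eval() disables it for deterministic prediction.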
Table 1. Hyperparameter search space comparisons. Uni and list denote uniform distribution and random choice, respectively.

Hyper-parameter    Default value    DomainBed            Ours
Batch size         32               2^Uni(3, 5.5)        32
Learning rate      5e-5             10^Uni(-5, -3.5)     [3e-5, 4e-5, 5e-5]
ResNet dropout     0                [0, 0.1, 0.5]        0
Weight decay       0                10^Uni(-6, -2)       [1e-6, 1e-4]

B.3. Experimental environments

Hardware environments. We perform our experiments on three machines: two with 8 Nvidia RTX 3090 GPUs and Xeon E5-2680 CPUs, and one with 4 Nvidia V100 GPUs and Xeon Platinum 8163 CPUs.

Software environments. Our experiments are conducted with Python 3.7.9 and the following packages: PyTorch 1.7.1, torchvision 0.8.2, and NumPy 1.19.4.

C. Full results

This section provides the full per-domain results for each dataset.

C.1. PACS

Table 2. Out-of-domain accuracies (%) on PACS.

Method                         A           C           P           S           Avg
ERM (Vapnik, 1999)             84.7 ± 0.4  80.8 ± 0.6  97.2 ± 0.3  79.3 ± 1.0  85.5
IRM (Arjovsky et al., 2019)    84.8 ± 1.3  76.4 ± 1.1  96.7 ± 0.6  76.1 ± 1.0  83.5
DRO (Sagawa et al., 2020)      83.5 ± 0.9  79.1 ± 0.6  96.7 ± 0.3  78.3 ± 2.0  84.4
Mixup (Wang et al., 2020b)     86.1 ± 0.5  78.9 ± 0.8  97.6 ± 0.1  75.8 ± 1.8  84.6
MLDG (Li et al., 2018a)        85.5 ± 1.4  80.1 ± 1.7  97.4 ± 0.3  76.6 ± 1.1  84.9
CORAL (Sun & Saenko, 2016)     88.3 ± 0.2  80.0 ± 0.5  97.5 ± 0.3  78.8 ± 1.3  86.2
MMD (Li et al., 2018c)         86.1 ± 1.4  79.4 ± 0.9  96.6 ± 0.2  76.5 ± 0.5  84.6
DANN (Ganin et al., 2016)      86.4 ± 0.8  77.4 ± 0.8  97.3 ± 0.4  73.5 ± 2.3  83.6
CDANN (Li et al., 2018d)       84.6 ± 1.8  75.5 ± 0.9  96.8 ± 0.3  73.5 ± 0.6  82.6
MTL (Blanchard et al., 2021)   87.5 ± 0.8  77.1 ± 0.5  96.4 ± 0.8  77.3 ± 1.8  84.6
SagNet (Nam et al., 2021)      87.4 ± 1.0  80.7 ± 0.6  97.1 ± 0.1  80.0 ± 0.4  86.3
ARM (Zhang et al., 2020)       86.8 ± 0.6  76.8 ± 0.5  97.4 ± 0.3  79.3 ± 1.2  85.1
VREx (Krueger et al., 2021)    86.0 ± 1.6  79.1 ± 0.6  96.9 ± 0.5  77.7 ± 1.7  84.9
RSC (Huang et al., 2020)       85.4 ± 0.8  79.7 ± 1.8  97.6 ± 0.3  78.2 ± 1.2  85.2
mDSDI (Bui et al., 2021)       87.7 ± 0.4  80.4 ± 0.7  98.1 ± 0.3  78.4 ± 1.2  86.2
SWAD (Cha et al., 2021)        89.3 ± 0.2  83.4 ± 0.6  97.3 ± 0.3  82.5 ± 0.5  88.1
DNA (ours)                     89.8 ± 0.2  83.4 ± 0.4  97.7 ± 0.1  82.6 ± 0.2  88.4
C.2. VLCS

Table 3. Out-of-domain accuracies (%) on VLCS.

Method                         C           L           S           V           Avg
ERM (Vapnik, 1999)             97.7 ± 0.4  64.3 ± 0.9  73.4 ± 0.5  74.6 ± 1.3  77.5
IRM (Arjovsky et al., 2019)    98.6 ± 0.1  64.9 ± 0.9  73.4 ± 0.6  77.3 ± 0.9  78.5
DRO (Sagawa et al., 2020)      97.3 ± 0.3  63.4 ± 0.9  69.5 ± 0.8  76.7 ± 0.7  76.7
Mixup (Wang et al., 2020b)     98.3 ± 0.6  64.8 ± 1.0  72.1 ± 0.5  74.3 ± 0.8  77.4
MLDG (Li et al., 2018a)        97.4 ± 0.2  65.2 ± 0.7  71.0 ± 1.4  75.3 ± 1.0  77.2
CORAL (Sun & Saenko, 2016)     98.3 ± 0.1  66.1 ± 1.2  73.4 ± 0.3  77.5 ± 1.2  78.8
MMD (Li et al., 2018c)         97.7 ± 0.1  64.0 ± 1.1  72.8 ± 0.2  75.3 ± 3.3  77.5
DANN (Ganin et al., 2016)      99.0 ± 0.3  65.1 ± 1.4  73.1 ± 0.3  77.2 ± 0.6  78.6
CDANN (Li et al., 2018d)       97.1 ± 0.3  65.1 ± 1.2  70.7 ± 0.8  77.1 ± 1.5  77.5
MTL (Blanchard et al., 2021)   97.8 ± 0.4  64.3 ± 0.3  71.5 ± 0.7  75.3 ± 1.7  77.2
SagNet (Nam et al., 2021)      97.9 ± 0.4  64.5 ± 0.5  71.4 ± 1.3  77.5 ± 0.5  77.8
ARM (Zhang et al., 2020)       98.7 ± 0.2  63.6 ± 0.7  71.3 ± 1.2  76.7 ± 0.6  77.6
VREx (Krueger et al., 2021)    98.4 ± 0.3  64.4 ± 1.4  74.1 ± 0.4  76.2 ± 1.3  78.3
RSC (Huang et al., 2020)       97.9 ± 0.1  62.5 ± 0.7  72.3 ± 1.2  75.6 ± 0.8  77.1
mDSDI (Bui et al., 2021)       97.6 ± 0.1  66.4 ± 0.4  74.0 ± 0.6  77.8 ± 0.7  79.0
SWAD (Cha et al., 2021)        98.8 ± 0.1  63.3 ± 0.3  75.3 ± 0.5  79.2 ± 0.6  79.1
DNA (ours)                     98.8 ± 0.1  63.6 ± 0.2  74.1 ± 0.1  79.5 ± 0.4  79.0

C.3. OfficeHome

Table 4. Out-of-domain accuracies (%) on OfficeHome.

Method                         A           C           P           R           Avg
ERM (Vapnik, 1999)             61.3 ± 0.7  52.4 ± 0.3  75.8 ± 0.1  76.6 ± 0.3  66.5
IRM (Arjovsky et al., 2019)    58.9 ± 2.3  52.2 ± 1.6  72.1 ± 2.9  74.0 ± 2.5  64.3
DRO (Sagawa et al., 2020)      60.4 ± 0.7  52.7 ± 1.0  75.0 ± 0.7  76.0 ± 0.7  66.0
Mixup (Wang et al., 2020b)     62.4 ± 0.8  54.8 ± 0.6  76.9 ± 0.3  78.3 ± 0.2  68.1
MLDG (Li et al., 2018a)        61.5 ± 0.9  53.2 ± 0.6  75.0 ± 1.2  77.5 ± 0.4  66.8
CORAL (Sun & Saenko, 2016)     65.3 ± 0.4  54.4 ± 0.5  76.5 ± 0.1  78.4 ± 0.5  68.7
MMD (Li et al., 2018c)         60.4 ± 0.2  53.3 ± 0.3  74.3 ± 0.1  77.4 ± 0.6  66.3
DANN (Ganin et al., 2016)      59.9 ± 1.3  53.0 ± 0.3  73.6 ± 0.7  76.9 ± 0.5  65.9
CDANN (Li et al., 2018d)       61.5 ± 1.4  50.4 ± 2.4  74.4 ± 0.9  76.6 ± 0.8  65.8
MTL (Blanchard et al., 2021)   61.5 ± 0.7  52.4 ± 0.6  74.9 ± 0.4  76.8 ± 0.4  66.4
SagNet (Nam et al., 2021)      63.4 ± 0.2  54.8 ± 0.4  75.8 ± 0.4  78.3 ± 0.3  68.1
ARM (Zhang et al., 2020)       58.9 ± 0.8  51.0 ± 0.5  74.1 ± 0.1  75.2 ± 0.3  64.8
VREx (Krueger et al., 2021)    60.7 ± 0.9  53.0 ± 0.9  75.3 ± 0.1  76.6 ± 0.5  66.4
RSC (Huang et al., 2020)       60.7 ± 1.4  51.4 ± 0.3  74.8 ± 1.1  75.1 ± 1.3  65.5
mDSDI (Bui et al., 2021)       68.1 ± 0.3  52.1 ± 0.4  76.0 ± 0.2  80.4 ± 0.2  69.2
SWAD (Cha et al., 2021)        66.1 ± 0.4  57.7 ± 0.4  78.4 ± 0.1  80.2 ± 0.2  70.6
DNA (ours)                     67.7 ± 0.2  57.7 ± 0.3  78.9 ± 0.2  80.5 ± 0.2  71.2

C.4. Terra Incognita

Table 5. Out-of-domain accuracies (%) on Terra Incognita.
Method                         L100        L38         L43         L46         Avg
ERM (Vapnik, 1999)             49.8 ± 4.4  42.1 ± 1.4  56.9 ± 1.8  35.7 ± 3.9  46.1
IRM (Arjovsky et al., 2019)    54.6 ± 1.3  39.8 ± 1.9  56.2 ± 1.8  39.6 ± 0.8  47.6
DRO (Sagawa et al., 2020)      41.2 ± 0.7  38.6 ± 2.1  56.7 ± 0.9  36.4 ± 2.1  43.2
Mixup (Wang et al., 2020b)     59.6 ± 2.0  42.2 ± 1.4  55.9 ± 0.8  33.9 ± 1.4  47.9
MLDG (Li et al., 2018a)        54.2 ± 3.0  44.3 ± 1.1  55.6 ± 0.3  36.9 ± 2.2  47.7
CORAL (Sun & Saenko, 2016)     51.6 ± 2.4  42.2 ± 1.0  57.0 ± 1.0  39.8 ± 2.9  47.6
MMD (Li et al., 2018c)         41.9 ± 3.0  34.8 ± 1.0  57.0 ± 1.9  35.2 ± 1.8  42.2
DANN (Ganin et al., 2016)      51.1 ± 3.5  40.6 ± 0.6  57.4 ± 0.5  37.7 ± 1.8  46.7
CDANN (Li et al., 2018d)       47.0 ± 1.9  41.3 ± 4.8  54.9 ± 1.7  39.8 ± 2.3  45.8
MTL (Blanchard et al., 2021)   49.3 ± 1.2  39.6 ± 6.3  55.6 ± 1.1  37.8 ± 0.8  45.6
SagNet (Nam et al., 2021)      53.0 ± 2.9  43.0 ± 2.5  57.9 ± 0.6  40.4 ± 1.3  48.6
ARM (Zhang et al., 2020)       49.3 ± 0.7  38.3 ± 2.4  55.8 ± 0.8  38.7 ± 1.3  45.5
VREx (Krueger et al., 2021)    48.2 ± 4.3  41.7 ± 1.3  56.8 ± 0.8  38.7 ± 3.1  46.4
RSC (Huang et al., 2020)       50.2 ± 2.2  39.2 ± 1.4  56.3 ± 1.4  40.8 ± 0.6  46.6
mDSDI (Bui et al., 2021)       53.2 ± 3.0  43.3 ± 1.0  56.7 ± 0.5  39.2 ± 1.3  48.1
SWAD (Cha et al., 2021)        55.4 ± 0.0  44.9 ± 1.1  59.7 ± 0.4  39.9 ± 0.2  50.0
DNA (ours)                     56.8 ± 1.2  47.0 ± 0.9  61.0 ± 0.5  44.0 ± 1.0  52.2

C.5. DomainNet

Table 6. Out-of-domain accuracies (%) on DomainNet.

Method                         C            I           P            Q           R            S            Avg
ERM (Vapnik, 1999)             58.1 ± 0.3   18.8 ± 0.3  46.7 ± 0.3   12.2 ± 0.4  59.6 ± 0.1   49.8 ± 0.4   40.9
IRM (Arjovsky et al., 2019)    48.5 ± 2.8   15.0 ± 1.5  38.3 ± 4.3   10.9 ± 0.5  48.2 ± 5.2   42.3 ± 3.1   33.9
DRO (Sagawa et al., 2020)      47.2 ± 0.5   17.5 ± 0.4  33.8 ± 0.5   9.3 ± 0.3   51.6 ± 0.4   40.1 ± 0.6   33.3
Mixup (Wang et al., 2020b)     55.7 ± 0.3   18.5 ± 0.5  44.3 ± 0.5   12.5 ± 0.4  55.8 ± 0.3   48.2 ± 0.5   39.2
MLDG (Li et al., 2018a)        59.1 ± 0.2   19.1 ± 0.3  45.8 ± 0.7   13.4 ± 0.3  59.6 ± 0.2   50.2 ± 0.4   41.2
CORAL (Sun & Saenko, 2016)     59.2 ± 0.1   19.7 ± 0.2  46.6 ± 0.3   13.4 ± 0.4  59.8 ± 0.2   50.1 ± 0.6   41.5
MMD (Li et al., 2018c)         32.1 ± 13.3  11.0 ± 4.6  26.8 ± 11.3  8.7 ± 2.1   32.7 ± 13.8  28.9 ± 11.9  23.4
DANN (Ganin et al., 2016)      53.1 ± 0.2   18.3 ± 0.1  44.2 ± 0.7   11.8 ± 0.1  55.5 ± 0.4   46.8 ± 0.6   38.3
CDANN (Li et al., 2018d)       54.6 ± 0.4   17.3 ± 0.1  43.7 ± 0.9   12.1 ± 0.7  56.2 ± 0.4   45.9 ± 0.5   38.3
MTL (Blanchard et al., 2021)   57.9 ± 0.5   18.5 ± 0.4  46.0 ± 0.1   12.5 ± 0.1  59.5 ± 0.3   49.2 ± 0.1   40.6
SagNet (Nam et al., 2021)      57.7 ± 0.3   19.0 ± 0.2  45.3 ± 0.3   12.7 ± 0.5  58.1 ± 0.5   48.8 ± 0.2   40.3
ARM (Zhang et al., 2020)       49.7 ± 0.3   16.3 ± 0.5  40.9 ± 1.1   9.4 ± 0.1   53.4 ± 0.4   43.5 ± 0.4   35.5
VREx (Krueger et al., 2021)    47.3 ± 3.5   16.0 ± 1.5  35.8 ± 4.6   10.9 ± 0.3  49.6 ± 4.9   42.0 ± 3.0   33.6
RSC (Huang et al., 2020)       55.0 ± 1.2   18.3 ± 0.5  44.4 ± 0.6   12.2 ± 0.2  55.7 ± 0.7   47.8 ± 0.9   38.9
mDSDI (Bui et al., 2021)       62.1 ± 0.3   19.1 ± 0.4  49.4 ± 0.4   12.8 ± 0.7  62.9 ± 0.3   50.4 ± 0.4   42.8
SWAD (Cha et al., 2021)        66.0 ± 0.1   22.4 ± 0.3  53.5 ± 0.1   16.1 ± 0.2  65.8 ± 0.4   55.5 ± 0.3   46.5
DNA (ours)                     66.1 ± 0.2   23.0 ± 0.1  54.6 ± 0.1   16.7 ± 0.1  65.8 ± 0.2   56.8 ± 0.1   47.2