# Quantifying and Improving Transferability in Domain Generalization

Guojun Zhang (School of Computer Science, University of Waterloo; Vector Institute) guojun.zhang@uwaterloo.ca
Han Zhao (Department of Computer Science, University of Illinois at Urbana-Champaign) hanzhao@illinois.edu
Yaoliang Yu (School of Computer Science, University of Waterloo; Vector Institute) yaoliang.yu@uwaterloo.ca
Pascal Poupart (School of Computer Science, University of Waterloo; Vector Institute) ppoupart@uwaterloo.ca

Abstract

Out-of-distribution generalization is one of the key challenges when transferring a model from the lab to the real world. Existing efforts mostly focus on building invariant features among source and target domains. Based on invariant features, a high-performing classifier on the source domains should hopefully behave equally well on a target domain. In other words, we hope that the invariant features are transferable. However, in practice there are no perfectly transferable features, and some algorithms seem to learn more transferable features than others. How can we understand and quantify such transferability? In this paper, we formally define a notion of transferability that one can quantify and compute in domain generalization. We point out its difference from and connection to common discrepancy measures between domains, such as total variation and Wasserstein distance. We then prove that our transferability can be estimated with enough samples, and give a new upper bound on the target error based on our transferability. Empirically, we evaluate the transferability of the feature embeddings learned by existing algorithms for domain generalization. Surprisingly, we find that many algorithms are not quite learning transferable features, although a few still survive. In light of this, we propose a new algorithm for learning transferable features and test it on various benchmark datasets, including Rotated MNIST, PACS, Office-Home and WILDS-FMoW. Experimental results show that the proposed algorithm achieves consistent improvement over many state-of-the-art algorithms, corroborating our theoretical findings.¹

¹Code available at https://github.com/Gordon-Guojun-Zhang/Transferability-NeurIPS2021.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

1 Introduction

One of the cornerstone assumptions underlying the recent success of deep learning models is that the test data should share the same distribution as the training data. However, faced with ubiquitous distribution shifts in various real-world applications, this assumption hardly holds in practice. For example, a self-driving recognition system trained using data collected in the daytime may continually degrade during nightfall. The system may also encounter weather or traffic conditions in a new city that never appear in the training set. In light of these potentially unseen scenarios, it is of paramount importance that the trained model can generalize Out-Of-Distribution (OOD): even if the target domain is not exactly the same as the source domain(s), the learned model should behave robustly under slight distribution shift. To this end, one line of work focuses on learning so-called invariant representations [2, 16, 57, 58]. At a colloquial level, the goal here is to learn feature embeddings that lead to indistinguishable feature distributions across domains.
In practice, both the feature embeddings and the domain discriminators are often parametrized by neural networks, leading to an adversarial game between the two. Furthermore, in order to avoid degenerate solutions, the learned features are required to be informative about the output variable as well. This is enforced by placing a predictor over the features and simultaneously minimizing the corresponding supervised loss [17, 32, 48, 49]. Another line of recent works aims to learn features that induce invariant predictors, first termed the invariant risk minimization (IRM) paradigm [3, 39]. Roughly speaking, the goal of IRM is to discover a feature embedding upon which the optimal predictor, i.e., the Bayes predictor, is invariant across the training domains. Again, at the same time, the features should be informative about the output variable as well. However, the optimization problem of IRM is rather difficult, and several follow-up works have proposed different relaxations of the original formulation [1, 26].

Despite being extensively studied, both theoretical [41, 59] and empirical [20, 23] works have shown the insufficiency of existing algorithms for domain generalization (DG). Methods based on invariant features ignore the potential shift in the marginal label distributions across domains [59], and methods based on invariant predictors are not robust to covariate shift [26]. Perhaps surprisingly, empirical works have shown that with proper data augmentation and careful model tuning, the very basic algorithm of empirical risk minimization (ERM) demonstrates superior performance on domain generalization over existing methods on benchmark image datasets [20, 23]. This sharp gap between theory and practice calls for a fundamental understanding of the following question:

What kind of invariance should we look for, in order to ensure that a good model on source domains also achieves decent accuracy on a related target domain?

In this work we attempt to answer the above question by proposing a criterion for models, dubbed transferability, which asks for invariance of the excess risks of a predictor across domains. Different from existing proposals of invariant features and invariant predictors, which seek feature embeddings that respectively induce invariant marginal and conditional distributions, our notion of transferability depends on the excess risk, hence it directly takes into account the joint distribution over both the features and the labels. We show how it can be used to naturally derive a new upper bound on the target error, and then we discuss how to estimate the transferability empirically with enough samples. Our definition also inspires a method that aims to find more transferable features via representation learning using adversarial training.

Figure 1: The target and source (test) accuracies of ERM on MNIST.

Empirically, we measure the transferability of several existing algorithms, on both small- and large-scale datasets. We show that many algorithms, including ERM, are not quite transferable under our definition (Fig. 1; see Section 5 for more details): when we move away from the optimal classifier (by distance δ in parameter space), it can happen that the source accuracy remains high but the target accuracy drops significantly.
This implies that during the training process, an existing algorithm may find a good source classifier with low target accuracy, hence violating the requirement of invariance of excess risks. In contrast, our algorithm is more transferable, and achieves consistent improvement over existing state-of-the-art algorithms, corroborating our findings.

2 What is Transferability?

In this section we present our definition of transferability in the classification setting. The setup of domain generalization is the following.

Settings and Notation. Given n labeled source domains S_1, ..., S_n, the problem of domain generalization is to learn a model from these source domains, in the hope that it performs well on an unseen target domain T that is similar to the source domains. Throughout the paper, we assume that both the source domains and the unseen target domain share the same input and output spaces, denoted as X and Y, respectively. For multi-class classification, the output space is Y = [K], a set of K labels; for binary classification, we consider Y = {−1, +1}. Denote H as the hypothesis class. We define the classification error of a classifier h ∈ H on a domain D (or S for source domains, or T for target domains) as (throughout the paper, we use the terms domain and distribution interchangeably):

\[ \epsilon_D(h) = \mathbb{E}_{(x,y)\sim D}\big[\ell(h(x), y)\big]. \tag{1} \]

For ℓ(h(x), y) = 1(h(x) ≠ y), where 1(·) is the usual indicator function, we write ϵ^{0-1}_D(h) to emphasize that it is the 0-1 loss. In domain generalization we often have several source domains. For ease of presentation, we only consider a single source domain in this section, and extend to the general case in Section 5.

Given two domains, the source domain S and the target domain T, the task of domain generalization is to transfer a classifier h that performs well on S to T. We ask: how much of the success of h on S can be transferred to T? Note that in order to evaluate the transferability from S to T, we need information from the target domain, similar to the test phase in traditional supervised learning. We believe a good criterion of transferability should satisfy the following properties:

1. Quantifiable: the notion should be quantifiable and computable in practice;
2. Any near-optimal source classifier should be near-optimal on the target domain;
3. If the two domains are similar, as measured by e.g. total variation, then they are transferable to each other, but the converse may not be true.

At first glance the second criterion above might seem too strong and restrictive. However, we argue that in the task of domain generalization we only have labeled source data, and there is no way to distinguish one classifier from another if both perform equally well on the source domain. Based on the second property, we first propose the following definition of transferability:

Definition 1 (transferability). S is (δ_S, δ_T)_H-transferable to T if for δ_S > 0, there exists δ_T > 0 such that argmin(ϵ_S, δ_S)_H ⊆ argmin(ϵ_T, δ_T)_H, where

\[ \operatorname{argmin}(\epsilon_D, \delta_D)_{\mathcal H} := \{h \in \mathcal H : \epsilon_D(h) \le \inf_{h' \in \mathcal H} \epsilon_D(h') + \delta_D\}. \]

In the literature the set argmin(ϵ_D, δ_D)_H is also known as a δ_D-minimal set [24] of ϵ_D, which represents the set of near-optimal classifiers. Note that the δ-minimal set depends on the hypothesis class H. Throughout the paper, we omit the subscript H when there is no confusion. Def. 1 says that near-optimal source classifiers are also near-optimal target classifiers.
Furthermore, it is easy to verify that our transferability is transitive: if S is (δ_S, δ_P)-transferable to P, and P is (δ_P, δ_T)-transferable to T, then S is (δ_S, δ_T)-transferable to T. Next we define transfer measures, which we will show to be equivalent with Def. 1 in Prop. 5.

Definition 2 (quantifiable transfer measures). Given some Γ ⊆ H, ϵ*_S := inf_{h∈Γ} ϵ_S(h) and ϵ*_T := inf_{h∈Γ} ϵ_T(h), we define the one-sided transfer measure, the symmetric transfer measure and the realizable transfer measure respectively as:

\[ T_\Gamma(S \to T) := \sup_{h\in\Gamma}\ \epsilon_T(h) - \epsilon_T^* - \big(\epsilon_S(h) - \epsilon_S^*\big), \tag{2} \]

\[ T_\Gamma(S, T) := \max\{T_\Gamma(S\to T),\ T_\Gamma(T\to S)\} = \sup_{h\in\Gamma}\ \big|\epsilon_S(h) - \epsilon_S^* - \big(\epsilon_T(h) - \epsilon_T^*\big)\big|, \tag{3} \]

\[ T^r_\Gamma(S, T) := \sup_{h\in\Gamma}\ \big|\epsilon_S(h) - \epsilon_T(h)\big|. \tag{4} \]

The distinction between Γ and H will become apparent in Prop. 5. Note that the one-sided transfer measure is not symmetric. If we want the two domains S and T to be mutually transferable to each other, we can use the symmetric transfer measure. We call both quantities transfer measures. Furthermore, the symmetric transfer measure reduces to (4) in the realizable case when ϵ*_S = ϵ*_T = 0. In statistical learning theory, ϵ_D(h) − ϵ*_D is known as an excess risk [24], i.e., the error relative to the optimal classifier. The transfer measures can thus be expressed as differences of excess risks. With Def. 2, we immediately obtain the following upper bound on the target error:

Proposition 3 (target error bound). Given Γ ⊆ H, for any h ∈ Γ, the target error is bounded by:

\[ \epsilon_T(h) \le \epsilon_S(h) + \epsilon_T^* - \epsilon_S^* + T_\Gamma(S\to T) \le \epsilon_S(h) + \epsilon_T^* - \epsilon_S^* + T_\Gamma(S, T). \tag{5} \]

The first error bound of this type for a target domain uses the H-divergence [8, 9, 12] for binary classification (or, more rigorously, the HΔH-divergence). The main difference between ours and the H-divergence is that the H-divergence only concerns the marginal input distributions, whereas the transfer measures depend on the joint distributions over both the inputs and the labels. We note that Proposition 3 is general and works in the multi-class case as well. Moreover, even in the binary classification case we can prove that our Proposition 3 is tighter than the H-divergence bound (see Proposition 27 in the appendix). In practice we may not know the optimal errors. In this case, we can use the realizable transfer measure to upper bound the symmetric transfer measure (note that ϵ*_S or ϵ*_T may not be zero):

Proposition 4. For Γ ⊆ H and domains S, T we have: T_Γ(S, T) ≤ 2 T^r_Γ(S, T).

Since Def. 1 essentially asks that the excess risks of approximately optimal source classifiers are comparable between the source and target domains, we can show that Def. 1 and Def. 2 are equivalent if Γ is a δ-minimal set:

Proposition 5 (equivalence between transferability and transfer measures). Let δ_S > 0, Γ = argmin(ϵ_S, δ_S), and suppose inf_{h∈Γ} ϵ_T(h) = inf_{h∈H} ϵ_T(h). If T_Γ(S→T) ≤ δ or T_Γ(S, T) ≤ δ, then S is (δ_S, δ + δ_S)-transferable to T. Furthermore, if S is (δ_S, δ_T)-transferable to T, then T_Γ(S→T) ≤ δ_T and T_Γ(S, T) ≤ max{δ_S, δ_T}.

In Prop. 5, we do not require Γ = H, since it is unnecessary to impose that all classifiers in H have similar excess risks on the source and target domains. Instead, we only constrain Γ to be a δ-minimal set, i.e., Γ includes approximately optimal classifiers of S. See also Example 8. An additional assumption is that Γ also includes the optimal classifier of T, which can be ensured by controlling δ_S.
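Before comparing with other discrepancy measures, the following is a minimal NumPy sketch of how the three quantities in Def. 2 can be estimated empirically over a finite stand-in for Γ (e.g., a collection of near-optimal checkpoints); the helper names are illustrative assumptions and are not part of the released code.

```python
import numpy as np

def zero_one_error(h, X, y):
    """Empirical 0-1 error of classifier h on samples (X, y)."""
    return np.mean(h(X) != y)

def transfer_measures(classifiers, source, target):
    """Empirical transfer measures over a finite candidate set standing in for Γ.

    classifiers: list of callables h(X) -> predicted labels
    source, target: (X, y) sample tuples drawn from S and T
    Returns the one-sided (2), symmetric (3) and realizable (4) measures.
    """
    err_S = np.array([zero_one_error(h, *source) for h in classifiers])
    err_T = np.array([zero_one_error(h, *target) for h in classifiers])
    exc_S = err_S - err_S.min()            # empirical excess risks on S
    exc_T = err_T - err_T.min()            # empirical excess risks on T
    one_sided = np.max(exc_T - exc_S)      # Eq. (2)
    symmetric = np.max(np.abs(exc_S - exc_T))   # Eq. (3)
    realizable = np.max(np.abs(err_S - err_T))  # Eq. (4)
    return one_sided, symmetric, realizable
```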
2.1 Comparison with other discrepancy measures between domains

In this subsection, we compare the realizable transfer measure (4) with other discrepancy measures between domains, focusing on the 0-1 loss ϵ^{0-1}_D. We first note that T^r_Γ(S, T) can be written as an integral probability metric (IPM) [36, 46]. The l.h.s. of (4) can be written as:

\[ T^r_\Gamma(S, T) = d_{F_\Gamma}(S, T), \quad \text{where } d_F(S, T) = \sup_{f\in F} \int f(x, y)\,\big(p_S(x, y) - p_T(x, y)\big)\,\mathrm{d}x\,\mathrm{d}y \tag{6} \]

and F_Γ := {(x, y) ↦ 1(h(x) ≠ y) : h ∈ Γ}. Typical IPMs [46] include MMD, the Wasserstein distance, the Dudley metric and the Kolmogorov–Smirnov distance (see Appendix B.2 for more details). However, F_Γ is fundamentally different from these, since it relies on an underlying function class Γ. Our realizable transfer measure shares some similarity with Arora et al. [4], where a changeable function class is used, but the exact choices of the function class are different. Even though the transferability can be written as an IPM, it is in fact only a pseudo-metric:

Proposition 6 (pseudo-metric). For a general loss ϵ_D as in (1), T^r_Γ(S, T) is a pseudo-metric, i.e., for any distributions S, T, P on the same underlying space, we have T^r_Γ(S, S) = 0, T^r_Γ(S, T) = T^r_Γ(T, S) (symmetry), and T^r_Γ(S, T) ≤ T^r_Γ(S, P) + T^r_Γ(P, T) (triangle inequality).

However, in general T^r_Γ(S, T) is not a metric, since T^r_Γ(S, T) = 0 is possible even when S ≠ T. For instance, taking Γ = {h*} with h* optimal on both S and T, we have T^r_Γ(S, T) = 0, yet S and T could differ a lot (see Figure 2). In the next result we discuss the connection between realizable transfer measures and total variation (cf. Appendix B.2).

Proposition 7 (equivalence with total variation). For binary classification with labels {−1, 1}, given the 0-1 loss ϵ_D = ϵ^{0-1}_D, we have T^r_Γ(S, T) ≤ d_TV(S, T) for domains S, T and any Γ ⊆ H. Denote H_t as the set of all binary classifiers. Then d_TV(S, T) ≤ 4 T^r_{H_t}(S, T).

Prop. 7 tells us that transfer measures (see also Prop. 4) are no stronger than total variation, and in the realizable case, (3) is equivalent to the similarity of domains (as measured by total variation) if Γ is unconstrained. We can moreover show that transfer measures are strictly weaker if we choose Γ to be some δ-minimal set:

Example 8 (very dissimilar joint distributions but transferable). We study the distributions described in Figure 2. The joint distributions are very dissimilar, i.e., for any X, Y in the domain, |p_S(X, Y) − p_T(X, Y)| = 0.8. Define

\[ h_\rho(X) = \begin{cases} +1 & \text{if } -1 \le X < \rho, \\ -1 & \text{if } \rho \le X < 1. \end{cases} \tag{7} \]

We choose the hypothesis class H = {h_ρ : ρ ∈ [−1, 1]} and Γ = {h_ρ : |ρ| ≤ δ/0.8} (for small δ, say δ < 0.01) to be a neighborhood of the optimal source classifier h* = h_0. Then T_Γ(S, T) = sup_{h∈Γ} |ϵ_S(h) − ϵ_T(h)| = δ, and S is (δ_S, δ + δ_S)-transferable to T on Γ for any δ_S > 0 according to Prop. 5. Note that ϵ*_S = ϵ*_T = 0.

Figure 2: Visualization of Example 8. Source domain: P_S(Y = 1, −1 ≤ X < 0) = 0.1, P_S(Y = −1, 0 ≤ X < 1) = 0.9. Target domain: P_T(Y = 1, −1 ≤ X < 0) = 0.9, P_T(Y = −1, 0 ≤ X < 1) = 0.1. The dark and light colors show the intensity of the probability mass. The vertical axis denotes whether it is the target or source domain (above or below the x-axis).
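A quick numerical check of Example 8, under the assumption (consistent with Figure 2) that the probability mass is uniform within each half of the interval; the sampling helpers below are ours, written only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, p_left):
    """Example 8: X uniform within each half of [-1, 1); Y = +1 on [-1, 0), -1 on [0, 1).
    p_left is the probability mass of the left half (0.1 for S, 0.9 for T)."""
    left = rng.random(n) < p_left
    X = np.where(left, rng.uniform(-1, 0, n), rng.uniform(0, 1, n))
    Y = np.where(left, 1, -1)
    return X, Y

def err(rho, X, Y):
    pred = np.where(X < rho, 1, -1)        # threshold classifier h_rho
    return np.mean(pred != Y)

XS, YS = sample_domain(1_000_000, 0.1)     # source domain S
XT, YT = sample_domain(1_000_000, 0.9)     # target domain T

delta = 0.008                               # pick δ < 0.01 as in the example
rhos = np.linspace(-delta / 0.8, delta / 0.8, 41)   # Γ = {h_rho : |rho| <= δ/0.8}
gap = max(abs(err(r, XS, YS) - err(r, XT, YT)) for r in rhos)
print(gap)   # ≈ delta, even though d_TV(S, T) = 0.8
```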
3 Computing Transferability

In the last section we proposed a new concept called transferability. However, although Def. 1 provides a theoretically sound notion of transferability, it is hard to verify in practice, since we cannot exhaust all approximately good classifiers, especially for rich models such as deep neural networks. Nevertheless, Prop. 3 and Prop. 5 provide a framework to compute transferability through transfer measures, despite their simplicity. In this section we discuss how to compute these quantities by making the necessary approximations based on transfer measures. There are two difficulties we need to overcome: (1) in practice we only have finite samples drawn from the true distributions; (2) we need a surrogate loss, such as cross entropy, for training, and the 0-1 loss for evaluation. In Section 3.1 we show that our transfer measures can be estimated with enough samples, and in Section 3.2 we discuss transferability with a surrogate loss. These results will be used in our algorithms in the next section.

3.1 Estimation of transferability

We show how to estimate the transfer measure T_Γ(S → T) from finite samples. Other versions of the transfer measures in Def. 2 follow analogously (see Appendix A for more details).

Lemma 9 (reduction to estimation error). Given a general loss ϵ_D as in (1), suppose Ŝ and T̂ are i.i.d. sample distributions drawn from S and T. Then for any Γ ⊆ H we have:

\[ T_\Gamma(S\to T) \le T_\Gamma(\hat S \to \hat T) + 2\,\mathrm{est}_\Gamma(S) + 2\,\mathrm{est}_\Gamma(T), \]

with the estimation errors est_Γ(S) = sup_{h∈Γ} |ϵ_S(h) − ϵ_Ŝ(h)| and est_Γ(T) = sup_{h∈Γ} |ϵ_T(h) − ϵ_T̂(h)|.

This lemma tells us that estimating transferability is no harder than computing the estimation errors on both domains. If the function class Γ has the uniform convergence property [44], then we can guarantee efficient estimation of transferability. We first bound the sample complexity through Rademacher complexity, a standard tool for bounding estimation errors [5]:

Theorem 10 (estimation error with Rademacher complexity). Given the 0-1 loss ϵ_D = ϵ^{0-1}_D, suppose Ŝ and T̂ are sample sets with m and k samples drawn i.i.d. from S and T, respectively. For any Γ ⊆ H the following holds with probability at least 1 − δ:

\[ T_\Gamma(S\to T) \le T_\Gamma(\hat S \to \hat T) + 4\mathfrak{R}_m(F_\Gamma) + 4\mathfrak{R}_k(F_\Gamma) + 2\sqrt{\frac{\log(2/\delta)}{2m}} + 2\sqrt{\frac{\log(2/\delta)}{2k}}, \]

where F_Γ := {(x, y) ↦ 1(h(x) ≠ y) : h ∈ Γ}. If, furthermore, Γ is a set of binary classifiers with labels {−1, 1}, then 2ℜ_m(F_Γ) = ℜ_m(Γ) and 2ℜ_k(F_Γ) = ℜ_k(Γ).

We also provide estimation error results using the Vapnik–Chervonenkis (VC) dimension and the Natarajan dimension in Appendix B.3. It is worth mentioning that the VC dimension of piecewise-polynomial neural networks has been upper bounded by Bartlett et al. [7]. Since transfer measures can be estimated, in later sections we do not distinguish between the sample sets Ŝ, T̂ and the underlying distributions S, T.

3.2 Transferability with a surrogate loss

Due to the intractability of minimizing the 0-1 loss, we need to use a surrogate loss [6] for training in practice. In this section, we discuss this nuance w.r.t. transferability. We focus on the most commonly used surrogate loss, cross entropy (CE), although some of the results can be easily adapted to other loss functions. To distinguish a surrogate loss from the 0-1 loss, from now on we write ϵ_D for a surrogate loss and ϵ^{0-1}_D for the 0-1 loss. One difficulty is the non-equivalence between δ-minimal sets w.r.t. the 0-1 loss and a surrogate loss, i.e., argmin(ϵ_D, δ_D) might be quite different from argmin(ϵ^{0-1}_D, δ_D). Moreover, it is not practical to find all elements of argmin(ϵ^{0-1}_D, δ_D), since the 0-1 loss is nonconvex and nonsmooth. In light of these difficulties, we propose a more practical notion of transferability based on a surrogate loss ϵ_D:

Proposition 11 (transfer measure with a surrogate loss). Given a surrogate loss ϵ_D ≥ ϵ^{0-1}_D on a general domain D.
Suppose Γ = argmin(ϵ_S, δ_S) and denote ϵ*_T = inf_{h∈Γ} ϵ_T(h), ϵ*_S = inf_{h∈Γ} ϵ_S(h), and (ϵ^{0-1}_T)* = inf_{h∈H} ϵ^{0-1}_T(h). If the following holds:

\[ T_\Gamma(S\to T) = \sup_{h\in\Gamma}\ \epsilon_T(h) - \epsilon_T^* - \big(\epsilon_S(h) - \epsilon_S^*\big) \le \delta, \tag{8} \]

then we have argmin(ϵ_S, δ_S) ⊆ argmin(ϵ^{0-1}_T, δ + δ_S + ϵ*_T − (ϵ^{0-1}_T)*).

This proposition implies that if the transfer measure is small, then a near-optimal classifier for the surrogate loss on the source domain is also near-optimal on the target domain for the 0-1 loss. It also gives us a practical framework to guarantee transferability, which we discuss in more depth in Section 4. Assume ϵ_D: H → R is Lipschitz continuous and strongly convex, which is satisfied by the cross-entropy loss (see Appendix B.4). We can then translate the δ-minimal set into L^p balls in function space:

\[ C_1 \|h - h^*\|_{2,D}^2 \le \epsilon_D(h) - \epsilon_D(h^*) \le C_2 \|h - h^*\|_{1,D}, \tag{9} \]

where C_1 and C_2 are absolute constants and h* is an optimal classifier. The function norms ‖·‖_{1,D} and ‖·‖_{2,D} are the usual L^p norms over the distribution D. Since the classifier h = q(θ, ·) is usually parameterized by, say, a neural network, we can further upper bound the function norms by the distance between parameters: for 1 ≤ p < ∞, h = q(θ, ·) and h' = q(θ', ·), we have ‖h − h'‖_{p,D} ≤ L‖θ − θ'‖_2, with L a Lipschitz constant of q (Appendix B.4). Combined with (9), we obtain:

\[ \epsilon_D(h) - \epsilon_D(h^*) \le L C_2 \|\theta - \theta^*\|_2. \tag{10} \]

In other words, if the parameters are close enough, then the losses do not differ by much. We denote ‖·‖_2 as the Euclidean norm, and for later convenience we omit the subscript 2.

4 Algorithms for Evaluating and Improving Transferability

The notion of transferability is defined w.r.t. domains; hence, by learning feature embeddings that induce certain feature distributions, one can aim to improve the transferability of two given domains. In this section we design algorithms to evaluate and improve transferability by learning such transformations. To start with, let g: X → Z be a feature embedding (a.k.a. featurizer), where Z is understood to be a feature space. By a joint distribution D^g (or S^g, T^g) we mean a distribution on g(X) × Y. Formally, we are dealing with push-forwards of distributions:

\[ S^g := (g, \mathrm{id})_{\#} S, \qquad T^g := (g, \mathrm{id})_{\#} T, \tag{11} \]

where (g, id): (x, y) ↦ (g(x), y) is a function on X × Y. Here S and T are joint distributions on X × Y, and we take X to be the space of the original signal, such as an image. Since S and T cannot be changed, what we are evaluating is the feature embedding g. The key quantity is the transfer measure, as in (8):

\[ T_\Gamma(S^g \to T^g) = \sup_{h\in\Gamma}\ \epsilon_{T^g}(h) - \epsilon^*_{T^g} - \big(\epsilon_{S^g}(h) - \epsilon^*_{S^g}\big), \qquad \Gamma = \operatorname{argmin}(\epsilon_{S^g}, \delta_{S^g}). \tag{12} \]

Although Γ is hard to compute, we can use (10) to obtain a lower bound on (12). That is, given a parametrization of the classifier h = q(θ, ·) and the optimal classifier h* = q(θ*, ·), we have:

\[
\begin{aligned}
T_\Gamma(S^g \to T^g) &\ge \sup_{\|\theta - \theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon^*_{T^g} + \epsilon^*_{S^g} \\
&\ge \sup_{\|\theta - \theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon_{T^g}(\hat h^*) \\
&\approx \sup_{\|\theta - \hat\theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon_{T^g}(\hat h^*),
\end{aligned}
\tag{13}
\]

where δ > 0 depends on Γ and the constant in (10). In the second and third lines, we approximate the optimal errors ϵ*_{T^g} and ϵ*_{S^g} using 0 ≤ ϵ*_{S^g} ≤ ϵ_{S^g}(ĥ*) and 0 ≤ ϵ*_{T^g} ≤ ϵ_{T^g}(ĥ*), and we use the learned classifier ĥ* = q(θ̂*, ·) as a surrogate for the optimal classifier. As a result, if the r.h.s. of (13) is large, then S^g is not quite transferable to T^g.
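Before stating the full procedure (Algorithm 1 below), the following PyTorch-style sketch shows how the last line of (13) could be estimated for a single source–target pair by projected gradient ascent on the classifier head. All function and argument names are illustrative assumptions and do not reproduce the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def transfer_lower_bound(g, h_hat, source_loader, target_loader,
                         delta=1.0, steps=20, lr=0.01):
    """Estimate the r.h.s. of (13): ascend eps_T(h∘g) - eps_S(h∘g) over a head h
    whose parameters stay in a Euclidean ball of radius delta around theta_hat."""
    for p in g.parameters():            # the featurizer g is kept fixed
        p.requires_grad_(False)
    h = copy.deepcopy(h_hat)
    theta_hat = torch.nn.utils.parameters_to_vector(h_hat.parameters()).detach()
    opt = torch.optim.SGD(h.parameters(), lr=lr)

    def loss_on(head, loader):
        return torch.stack([F.cross_entropy(head(g(x)), y) for x, y in loader]).mean()

    for _ in range(steps):
        gap = loss_on(h, target_loader) - loss_on(h, source_loader)
        opt.zero_grad()
        (-gap).backward()               # gradient *ascent* on the gap
        opt.step()
        with torch.no_grad():           # project back onto ||theta - theta_hat|| <= delta
            theta = torch.nn.utils.parameters_to_vector(h.parameters())
            diff = theta - theta_hat
            if diff.norm() > delta:
                torch.nn.utils.vector_to_parameters(
                    theta_hat + delta * diff / diff.norm(), h.parameters())

    with torch.no_grad():               # subtract eps_{T^g}(h_hat) as in the last line of (13)
        return (loss_on(h, target_loader) - loss_on(h, source_loader)
                - loss_on(h_hat, target_loader)).item()
```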
We can thus design an algorithm to evaluate transferability, presented in Section 4.1. By computing the lower bound in (13), we can disprove transferability in the sense of Prop. 5 and Prop. 11. Computing the lower bound in (13) can be regarded as an attack: an adversary tries to show that S^g is not transferable to T^g. Against this attack, we can also design a defence that aims to minimize the lower bound and thereby learn more transferable features.

4.1 Algorithm for evaluating transferability

In domain generalization we have one target domain and more than one source domain. To ease the presentation, we denote S_0 = T (and thus S^g_0 = T^g) and extend the index set to {0, 1, ..., n}. We need to evaluate the transferability (13) between all pairs S^g_i and S^g_j. Algorithm 1 gives an efficient method to compute the worst-case gap sup_{‖θ−θ̂*‖≤δ} ϵ_{S^g_i}(h) − ϵ_{S^g_j}(h) among all pairs (i, j). Essentially, at each step it finds the worst pair (i, j), i.e., the one for which the gap ϵ_{S^g_i}(h) − ϵ_{S^g_j}(h) is largest, and then maximizes this gap over the parameter θ through gradient ascent. Note that the computation of (13) also depends on information from the target domain. This is valid since we are only evaluating, not training, on these domains.

Algorithm 1: Evaluating transferability among multiple domains
  Input: learned feature embedding g, learned classifier ĥ* = q(θ̂*, ·), target sample training set T = S_0, sample training sets S_1, ..., S_n, ascent optimizer, minimal errors ϵ*_{S_i} ≈ ϵ_{S_i}(ĥ*), adversarial radius δ
  Initialize: a classifier h = q(θ, ·) with θ = θ̂*, gap = −∞
  for t in 1 ... T do
    Find max_i ϵ_{S_i}(h ∘ g) and min_i ϵ_{S_i}(h ∘ g) and the corresponding indices j and k
    Run the ascent optimizer on h to maximize gap_0 = ϵ_{S_j}(h ∘ g) − ϵ_{S_k}(h ∘ g)
    Project θ onto the Euclidean ball ‖θ − θ̂*‖ ≤ δ
    if gap_0 > gap then set gap = gap_0 and save the accuracies and losses of each domain
  Output: j, k, h, ϵ_{S_j}(h ∘ g) − ϵ_{S_k}(h ∘ g), ϵ_{S_j}(ĥ*), ϵ_{S_k}(ĥ*)

4.2 Algorithm for improving transferability

The evaluation sub-procedure provides a way to pick a pair of non-transferable domains (S^g_i, S^g_j), which in turn can be used to improve transferability among all source domains by updating the feature embedding g so that the gap sup_{‖θ'−θ‖≤δ} ϵ_{S^g_i}(h') − ϵ_{S^g_j}(h') is small for all (i, j) ∈ [n] × [n]. Simultaneously, we require that the feature embedding g preserves information for the target task of interest. With the parametrization h = q(θ, ·) and h' = q(θ', ·), the overall optimization problem can be formulated as:

\[ \min_{g, h}\ \max_{\|\theta' - \theta\| \le \delta}\ \frac{1}{n}\sum_{i=1}^{n} \epsilon_{S_i}(h \circ g) + \max_i \epsilon_{S_i}(h' \circ g) - \min_i \epsilon_{S_i}(h' \circ g). \tag{14} \]

Intuitively, we want to learn a common feature embedding and a classifier such that all source errors are small and the pairwise transferability gap between source domains is also small. If the optimization problem is properly solved, then we have the following guarantee:

Theorem 12 (optimization guarantee). Assume that the function q(·, x) is L_θ-Lipschitz continuous for any x. Suppose we have learned a feature embedding g and a classifier h such that the loss functional ϵ_{S^g_i}: H → R is L_ℓ-Lipschitz continuous w.r.t. the distribution S^g_i for i ∈ [n] and

\[ \max_{\|\theta' - \theta\| \le \delta}\ \frac{1}{n}\sum_{i=1}^{n} \epsilon_{S_i}(h \circ g) + \max_i \epsilon_{S_i}(h' \circ g) - \min_i \epsilon_{S_i}(h' \circ g) \le \eta, \tag{15} \]

where θ, θ' are the parameters of h and h'. Then for any h' ∈ Γ = {q(θ', ·): ‖θ' − θ‖ ≤ δ}, we have:

\[ T^r_\Gamma(T^g_1, T^g_2) \le \eta, \qquad \epsilon_{S_i}(h' \circ g) \le \eta + L_\ell L_\theta \delta, \qquad \epsilon_T(h' \circ g) \le 2\eta + L_\ell L_\theta \delta, \tag{16} \]

for any T^g_1, T^g_2, T^g ∈ conv(S^g_1, ..., S^g_n) and any i ∈ [n].

The Lipschitzness assumption on ϵ_{S^g_i} is mild and is satisfied by the cross-entropy loss (cf. Appendix B.4.1). Here conv(·) denotes the convex hull in the same sense as Albuquerque et al. [2], i.e., each element is a mixture of the source distributions. Thm. 12 tells us that if we can solve the optimization problem (14) properly, we can guarantee transferability on a neighborhood of the classifier, as an approximation of the δ-minimal set.
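Concretely, one outer step of objective (14) alternates an inner ascent on a copy of the classifier head with an outer descent on the featurizer and classifier; the PyTorch-style sketch below illustrates this under our own naming and hyper-parameter choices (it is a sketch only, and the full procedure is given as Algorithm 2 below).

```python
import copy
import torch
import torch.nn.functional as F

def transfer_training_step(g, h, source_batches, opt_gh,
                           n_inner=20, lr_adv=0.01, delta=10.0, lam=1.0):
    """One outer step of objective (14).

    source_batches: list of (x, y) batches, one per source domain S_1..S_n.
    opt_gh: optimizer over the parameters of g and h. Names are illustrative.
    """
    def domain_losses(head):
        return torch.stack([F.cross_entropy(head(g(x)), y) for x, y in source_batches])

    # inner loop: adversarial head h' starts at h and ascends the max-min gap, g fixed
    h_adv = copy.deepcopy(h)
    theta = torch.nn.utils.parameters_to_vector(h.parameters()).detach()
    opt_adv = torch.optim.SGD(h_adv.parameters(), lr=lr_adv)
    for _ in range(n_inner):
        losses = domain_losses(h_adv)
        gap = losses.max() - losses.min()
        opt_adv.zero_grad()
        (-gap).backward()                     # ascent on the gap
        opt_adv.step()
        with torch.no_grad():                 # project onto ||theta' - theta|| <= delta
            theta_adv = torch.nn.utils.parameters_to_vector(h_adv.parameters())
            diff = theta_adv - theta
            if diff.norm() > delta:
                torch.nn.utils.vector_to_parameters(
                    theta + delta * diff / diff.norm(), h_adv.parameters())

    # outer step: average source loss of h plus the gap of the fixed h', both through g
    objective = domain_losses(h).mean()
    losses_adv = domain_losses(h_adv)
    objective = objective + lam * (losses_adv.max() - losses_adv.min())
    opt_gh.zero_grad()
    objective.backward()
    opt_gh.step()
    return objective.item()
```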
We thus propose Algorithm 2, which shares similarities with existing frameworks such as DANN [16] and Distributionally Robust Optimization [43, 45], in the sense that they all involve adversarial training and minimax optimization. However, the objective in our case is different, and we provide a more detailed comparison with existing methods in Appendix B.5.

Algorithm 2: Transfer algorithm for domain generalization
  Input: sample sets of source domains S_1, ..., S_n, feature embedding g, classifier h = q(θ, ·), adversarial classifier h' = q(θ', ·), surrogate loss ϵ_D, adversarial radius δ, ascent optimizer, descent optimizer, weight parameter λ, number of epochs T
  for t in 1 ... T do
    Compute max_i ϵ_{S_i}(h ∘ g) and min_i ϵ_{S_i}(h ∘ g)
    Initialize h' = h (i.e., θ' = θ)
    for k in 1 ... N do
      Run the ascent optimizer on h' to maximize max_i ϵ_{S_i}(h' ∘ g) − min_i ϵ_{S_i}(h' ∘ g), fixing g
      Project θ' onto the Euclidean ball ‖θ' − θ‖ ≤ δ
    Fixing h', run the descent optimizer on g, h to minimize error = (1/n) Σ_i ϵ_{S_i}(h ∘ g) + λ (max_i ϵ_{S_i}(h' ∘ g) − min_i ϵ_{S_i}(h' ∘ g))
  Output: feature embedding g, classifier h

5 Experiments

Gulrajani and Lopez-Paz [20] performed extensive experiments comparing DG algorithms, using the same neural architecture and data split. Specifically, they show that with data augmentation, ERM performs relatively well against a large array of algorithms. Our experiments are based on their settings. We run Algorithm 1 on standard benchmarks, including Rotated MNIST [18], PACS [28], Office-Home [51] and WILDS-FMoW [23] (cf. Appendix C.1). In particular, WILDS-FMoW is a large dataset with nearly half a million images. Detailed experimental settings are given in Appendix C.

Evaluating transferability. From Figure 3 it can be seen that in a neighborhood of the learned classifier, there exists a classifier such that the target accuracy degrades significantly while some source domain still has high accuracy. This calls into question whether currently popular algorithms such as ERM [50], DANN [17] and Mixup [53, 54] are really learning invariant and transferable features. If they were, the target accuracy should be high given a high source accuracy. However, for the PACS dataset and the Mixup model (second column of Figure 3), the target accuracy decreases by more than 30% while the source accuracy remains roughly at the same level. We can also read from the first column that, e.g., with a small decrease of the source (test) accuracy by 2% (at δ = 2), the target accuracy of DANN drops by 10%.

Figure 3: Top row: test accuracy of the target domain; bottom row: test accuracy of one of the source domains. Each column corresponds to one dataset, with the dataset name in the middle; the legends on the bottom row are the same as those on the top row. δ is the parameter in Algorithm 1.

From Figure 3 we can also see that Correlation Alignment (CORAL) [47] and Spectral Decoupling (SD) [40] have better transferability than the other algorithms. In some sense, they are in fact learning robust classifiers, i.e., all classifiers in the neighborhood of the learned classifier achieve good accuracies. With such a robust classifier, the target accuracy does not decrease much even if the classifier is perturbed.
Improving transferability. Algorithm 2 performs well on all four datasets we tried, comparably to CORAL and SD. Note that CORAL and SD do not always perform well, e.g., on the Office-Home and WILDS-FMoW datasets, whereas our Transfer algorithm does. However, in our experiments we found two limitations of Algorithm 2: (1) we need a large number of inner maximization steps to compute the gap, which requires more training time; this is similar to adversarial training for robustness [33], which is slower than standard training. To overcome this difficulty we used pretraining from other algorithms in the experiments on Office-Home and WILDS-FMoW. (2) Moderate hyper-parameter tuning is needed. For example, we need to tune N in Algorithm 2, the learning rate (lr) of SGA and the choice of δ. We find that N = 20 or 30 is usually a good choice, and δ can be taken so large that the projection step is never active. We take lr = 0.01 for Rotated MNIST and lr = 0.001 for the other datasets.

Label shift. To show the difference from the well-known H-divergence [8], we compute the label shift in the PACS dataset. As shown in Zhao et al. [59], the optimal joint error ϵ_S(h*) + ϵ_T(h*) (where h* is the classifier minimizing ϵ_S(h) + ϵ_T(h)) can be large under a shift of the label distributions. We follow [59] and compute the label shift between pairs of domains in the PACS dataset, measured by total variation. From Table 1 we can see that the label shift is large in this case, and thus the H-divergence bound [8] can be quite loose. In comparison, our transfer measure bound (Prop. 3) is tighter (cf. Prop. 27) and therefore still useful in practice.

Table 1: Label shift between pairs of domains in the PACS dataset, measured by total variation (TV), which always lies between zero and one. A: art painting; C: cartoon; P: photo; S: sketch.

|   | A    | C    | P    | S    |
|---|------|------|------|------|
| A | 0.0  | 0.12 | 0.11 | 0.3  |
| C | 0.12 | 0.0  | 0.18 | 0.24 |
| P | 0.11 | 0.18 | 0.0  | 0.37 |
| S | 0.3  | 0.24 | 0.37 | 0.0  |
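For reference, each entry of Table 1 is the total variation distance between two label marginals, i.e., half the ℓ1 distance between their class-probability vectors. A minimal sketch of the computation follows; the class counts below are made-up placeholders, not the actual PACS statistics.

```python
import numpy as np

def label_tv(counts_a, counts_b):
    """d_TV(P_a(Y), P_b(Y)) = 0.5 * sum_y |P_a(y) - P_b(y)|, always in [0, 1]."""
    p = np.asarray(counts_a, dtype=float); p /= p.sum()
    q = np.asarray(counts_b, dtype=float); q /= q.sum()
    return 0.5 * np.abs(p - q).sum()

# illustrative per-class image counts for two domains (7 classes, as in PACS)
counts_domain_1 = [100, 200, 150, 120, 180, 160, 140]
counts_domain_2 = [300, 120, 100, 220, 90, 160, 110]
print(label_tv(counts_domain_1, counts_domain_2))
```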
6 Related Work

Multi-task learning. Multi-task learning (MTL) [56] is related to but different from DG. In MTL, there are several tasks, and one hopes to improve the performance of each task by jointly training all tasks simultaneously, exploiting the relationships between them. This differs from DG in that in DG the target domain is unknown a priori, whereas MTL focuses on better generalization on existing tasks that appear in training. Hence, there is no distribution shift in MTL per se. Furthermore, in MTL the output spaces of different tasks are not necessarily the same.

Zero-shot learning / Few-shot learning / Meta-learning. DG is different from zero-shot learning [27]. In zero-shot learning, one has labeled training data and the goal is to make predictions on a new, unseen label set. In DG, however, the label set remains the same for the source and target domains. The focus of few-shot learning, on the other hand, is fast adaptation: the test distribution remains the same as the training distribution, but the learner only has access to a few labeled samples. Domain generalization also shares similarity with meta-learning. However, in meta-learning the learner is allowed to fine-tune on the target domain; in other words, the protocol of meta-learning allows access to a small amount of labeled data from future unseen domains. Meta-learning is more or less one specific method used to tackle few-shot learning. Because of the similarity, some meta-learning algorithms can be applied to DG [29].

Self-supervised learning. Self-supervised learning (SSL) is a popular unsupervised feature learning approach [13, 19, 21]. The goal of SSL is to learn representations that are invariant w.r.t. different views of the same image. Although it is a promising feature learning method, it differs from our DG setting in that no labels are used in SSL.

Domain generalization. Many algorithms, old and new, have been proposed for domain generalization. The simplest is Empirical Risk Minimization (ERM), which simply minimizes the empirical risk of (the sum of) all source domains. Blanchard et al. [10] and Muandet et al. [35] proposed kernel methods for DG. Arjovsky et al. [3] proposed Invariant Risk Minimization (IRM), which aims to learn invariant predictors across source domains; follow-up discussions can be found in [25, 41]. Another approach is distributional robustness [43, 52], where the model is optimized over a worst-case distribution under the constraint that this distribution is generated from a small perturbation around the source distributions. Albuquerque et al. [2] proposed a DG scheme based on distribution matching. Moreover, many domain adaptation algorithms can be directly adapted to domain generalization, such as CORAL [47] and DANN [16]. Last but not least, we mention a concurrent work [55] on the theory of domain generalization, which focuses more on proposing a model selection rule based on accuracy and variation.

Adversarial robustness. Our evaluation and training methods in Section 4 are reminiscent of the adversarial training method [33] in the literature on adversarial robustness. Perturbing the classifier in our case corresponds to perturbing the input data in adversarial robustness. From this perspective, our Transfer algorithm is parallel to adversarial training. It would be interesting to design certified robust feature embeddings, by analogy with certified robust classifiers [15].

7 Conclusions

In this paper we formally define a notion of transferability that we can quantify, estimate and compute. Our transfer measures can be understood as a special class of IPMs. They are weaker than total variation, and even very dissimilar distributions can be transferable to each other. Our definition of transferability can also be used naturally to derive a generalization bound for the prediction error on the target domain. Based on our theory, we propose algorithms to evaluate and improve transferability by learning feature representations. Experiments show that, somewhat surprisingly, many existing algorithms are not quite learning transferable features. From this perspective, our transfer measures offer a novel way to evaluate the features learned by different DG algorithms. We hope that our proposal of transferability draws the community's attention to further investigating and better understanding the fundamental quantity that allows robust models under distribution shift.

Broader Impact

Reliable domain generalization models are important for practical use. Our work points out the reliability issue of DG algorithms. It is worth mentioning that our evaluation method can only disprove transferability, and surviving our attack should not be treated as a guarantee. Misunderstanding this could lead to potential harm in practical applications.
Acknowledgements and Funding Transparency Statement

We thank the anonymous reviewers for their constructive comments, as well as the area chair and the senior area chair for overseeing the review process. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We thank NSERC and the Canada CIFAR AI Chairs program for funding support. GZ is also supported by the David R. Cheriton scholarship and a research grant from the Vector Institute. HZ is supported by startup funding from the Department of Computer Science at UIUC. Finally, we thank the Vector Institute for providing the GPU cluster.

References

[1] Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145–155. PMLR, 2020.
[2] Isabela Albuquerque, João Monteiro, Mohammad Darvishi, Tiago H Falk, and Ioannis Mitliagkas. Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804, 2019.
[3] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[4] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232. PMLR, 2017.
[5] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[6] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[7] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
[8] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
[9] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[10] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. Advances in Neural Information Processing Systems, 24:2178–2186, 2011.
[11] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. Journal of Machine Learning Research, 22(2):1–55, 2021.
[12] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–1607, 2020. URL http://proceedings.mlr.press/v119/chen20j.html.
[14] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.
[15] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
[16] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
[17] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[18] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html.
[20] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
[21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[22] Zeyi Huang, Haohan Wang, Eric P. Xing, and Dong Huang. Self-challenging improves cross-domain generalization. In ECCV, 2020.
[23] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
[24] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.
[25] Masanori Koyama and Shoichiro Yamaguchi. When is invariance useful in an out-of-distribution generalization problem?, 2021.
[26] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688, 2020.
[27] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
[28] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
[29] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[30] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[31] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[32] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217. PMLR, 2017.
[33] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[34] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[35] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
[36] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429–443, 1997.
[37] Hyeonseob Nam, Hyun Jae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap via style-agnostic networks. arXiv preprint arXiv:1910.11645, 2019.
[38] Balas K Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989.
[39] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
[40] Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. arXiv preprint arXiv:2011.09468, 2020.
[41] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=BbNIbVPJ-42.
[42] Walter Rudin. Real and Complex Analysis. McGraw-Hill Education, 1987.
[43] Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[45] Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[46] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
[47] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
[48] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33, 2020.
[49] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
[50] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
[51] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[52] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, 2018.
[53] Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6502–6509, 2020.
[54] Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.
[55] Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization, 2021.
[56] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2021. doi: 10.1109/TKDE.2021.3070203.
[57] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413. PMLR, 2019.
[58] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in Neural Information Processing Systems, 31:8559–8570, 2018.
[59] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR, 2019.