# Quantifying and Improving Transferability in Domain Generalization

Guojun Zhang (School of Computer Science, University of Waterloo; Vector Institute) guojun.zhang@uwaterloo.ca
Han Zhao (Department of Computer Science, University of Illinois at Urbana-Champaign) hanzhao@illinois.edu
Yaoliang Yu (School of Computer Science, University of Waterloo; Vector Institute) yaoliang.yu@uwaterloo.ca
Pascal Poupart (School of Computer Science, University of Waterloo; Vector Institute) ppoupart@uwaterloo.ca

Abstract

Out-of-distribution generalization is one of the key challenges when transferring a model from the lab to the real world. Existing efforts mostly focus on building invariant features among source and target domains. Based on invariant features, a high-performing classifier on the source domains should hopefully behave equally well on a target domain. In other words, we hope that the invariant features are transferable. However, in practice there are no perfectly transferable features, and some algorithms seem to learn more transferable features than others. How can we understand and quantify such transferability? In this paper, we formally define a notion of transferability that one can quantify and compute in domain generalization. We point out its difference from and connection to common discrepancy measures between domains, such as total variation and Wasserstein distance. We then prove that our transferability can be estimated with enough samples, and give a new upper bound on the target error based on our transferability. Empirically, we evaluate the transferability of the feature embeddings learned by existing algorithms for domain generalization. Surprisingly, we find that many algorithms are not quite learning transferable features, although a few still survive. In light of this, we propose a new algorithm for learning transferable features and test it on various benchmark datasets, including Rotated MNIST, PACS, Office-Home and WILDS-FMoW. Experimental results show that the proposed algorithm achieves consistent improvement over many state-of-the-art algorithms, corroborating our theoretical findings.¹

¹Code available at https://github.com/Gordon-Guojun-Zhang/Transferability-NeurIPS2021.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

1 Introduction

One of the cornerstone assumptions underlying the recent success of deep learning models is that the test data should share the same distribution as the training data. However, faced with ubiquitous distribution shifts in various real-world applications, this assumption hardly holds in practice. For example, a self-driving recognition system trained using data collected in the daytime may continually degrade during nightfall. The system may also encounter weather or traffic conditions in a new city that never appear in the training set. In light of these potentially unseen scenarios, it is of paramount importance that the trained model can generalize Out-Of-Distribution (OOD): even if the target domain is not exactly the same as the source domain(s), the learned model should behave robustly under slight distribution shift. To this end, one line of work focuses on learning so-called invariant representations [2, 16, 57, 58]. At a colloquial level, the goal here is to learn feature embeddings that lead to indistinguishable feature distributions across domains.
In practice, both the feature embeddings and the domain discriminators are often parametrized by neural networks, leading to an adversarial game between the two. Furthermore, in order to avoid degenerate solutions, the learned features are required to be informative about the output variable as well. This is enforced by placing a predictor over the features and simultaneously minimizing the corresponding supervised loss [17, 32, 48, 49]. Another line of recent works aims to learn features that induce invariant predictors, first termed the invariant risk minimization (IRM) paradigm [3, 39]. Roughly speaking, the goal of IRM is to discover a feature embedding upon which the optimal predictor, i.e., the Bayes predictor, is invariant across the training domains. Again, at the same time, the features should be informative about the output variable as well. However, the optimization problem of IRM is rather difficult, and several follow-up works have proposed different relaxations of the original formulation [1, 26].

Despite being extensively studied, both theoretical [41, 59] and empirical [20, 23] works have shown the insufficiency of existing algorithms for domain generalization (DG). Methods based on invariant features ignore the potential shift in the marginal label distributions across domains [59], and methods based on invariant predictors are not robust to covariate shift [26]. Perhaps surprisingly, empirical works have shown that with proper data augmentation and careful model tuning, the very basic algorithm of empirical risk minimization (ERM) demonstrates superior performance on domain generalization over existing methods on benchmark image datasets [20, 23]. This sharp gap between theory and practice calls for a fundamental understanding of the following question:

What kind of invariance should we look for, in order to ensure that a good model on source domains also achieves decent accuracy on a related target domain?

In this work we attempt to answer the above question by proposing a criterion for models, dubbed transferability, which asks for invariance of the excess risks of a predictor across domains. Different from existing proposals of invariant features and invariant predictors, which seek feature embeddings that respectively induce invariant marginal and conditional distributions, our notion of transferability depends on the excess risk, hence it directly takes into account the joint distribution over both the features and the labels. We show how it can be used to naturally derive a new upper bound on the target error, and then we discuss how to estimate the transferability empirically with enough samples. Our definition also inspires a method that aims to find more transferable features via representation learning using adversarial training.

Figure 1: The target and source (test) accuracies of ERM on MNIST.

Empirically, we measure the transferability of several existing algorithms, on both small- and large-scale datasets. We show that many algorithms, including ERM, are not quite transferable under our definition (Fig. 1; see Section 5 for more details): when we move away from the optimal classifier (by distance δ in parameter space), it can happen that the source accuracy remains high but the target accuracy drops significantly.
This implies that during the training process, an existing algorithm may find a good source classifier with low target accuracy, hence violating the requirement of invariance of excess risks. In contrast, our algorithm is more transferable, and achieves consistent improvement over existing state-of-the-art algorithms, corroborating our findings.

2 What is Transferability?

In this section we present our definition of transferability in the classification setting. The setup of domain generalization is the following.

Settings and Notation. Given n labeled source domains S_1, ..., S_n, the problem of domain generalization is to learn a model from these source domains, in the hope that it performs well on an unseen target domain T that is similar to the source domains. Throughout the paper, we assume that both the source domains and the unseen target domain share the same input and output spaces, denoted as X and Y, respectively. For multi-class classification, the output space is Y = [K], a set of K labels; for binary classification, we consider Y = {−1, +1}. Denote H as the hypothesis class. We define the classification error of a classifier h ∈ H on a domain D (or S for source domains, or T for target domains) as (throughout the paper, we use the terms domain and distribution interchangeably):

\[ \epsilon_D(h) = \mathbb{E}_{(x,y)\sim D}\big[\ell(h(x), y)\big]. \tag{1} \]

For ℓ(h(x), y) = 1(h(x) ≠ y), where 1(·) is the usual indicator function, we write ϵ^{0-1}_D(h) to emphasize that it is the 0-1 loss. In domain generalization we often have several source domains. For ease of presentation, we only consider a single source domain in this section, and extend to the general case in Section 5.

Given two domains, the source domain S and the target domain T, the task of domain generalization is to transfer a classifier h that performs well on S to T. We ask: how much of the success of h on S can be transferred to T? Note that in order to evaluate the transferability from S to T, we need information from the target domain, similar to the test phase in traditional supervised learning. We believe a good criterion of transferability should satisfy the following properties:

1. Quantifiable: the notion should be quantifiable and computable in practice;
2. Any near-optimal source classifier should be near-optimal on the target domain;
3. If the two domains are similar, as measured by e.g. total variation, then they are transferable to each other, but the converse may not be true.

At first glance the second criterion above might seem too strong and restrictive. However, we argue that in the task of domain generalization we only have labeled source data, and there is no way to distinguish one classifier from another if both perform equally well on the source domain. Based on the second property, we first propose the following definition of transferability:

Definition 1 (transferability). S is (δ_S, δ_T)_H-transferable to T if for δ_S > 0, there exists δ_T > 0 such that argmin(ϵ_S, δ_S)_H ⊆ argmin(ϵ_T, δ_T)_H, where

\[ \operatorname{argmin}(\epsilon_D, \delta_D)_{\mathcal H} := \{h \in \mathcal H : \epsilon_D(h) \le \inf_{h' \in \mathcal H} \epsilon_D(h') + \delta_D\}. \]

In the literature the set argmin(ϵ_D, δ_D)_H is also known as a δ_D-minimal set [24] of ϵ_D, which represents the set of near-optimal classifiers. Note that the δ-minimal set depends on the hypothesis class H. Throughout the paper, we omit the subscript H when there is no confusion. Def. 1 says that near-optimal source classifiers are also near-optimal target classifiers.
Furthermore, it is easy to verify that our transferability is transitive: if S is (δ_S, δ_P)-transferable to P, and P is (δ_P, δ_T)-transferable to T, then S is (δ_S, δ_T)-transferable to T. Next we define transfer measures, which we will show to be equivalent with Def. 1 in Prop. 5.

Definition 2 (quantifiable transfer measures). Given some Γ ⊆ H, ϵ*_S := inf_{h∈Γ} ϵ_S(h) and ϵ*_T := inf_{h∈Γ} ϵ_T(h), we define the one-sided transfer measure, the symmetric transfer measure and the realizable transfer measure respectively as:

\[ T_\Gamma(S \to T) := \sup_{h\in\Gamma}\ \epsilon_T(h) - \epsilon_T^* - \big(\epsilon_S(h) - \epsilon_S^*\big), \tag{2} \]

\[ T_\Gamma(S, T) := \max\{T_\Gamma(S\to T),\ T_\Gamma(T\to S)\} = \sup_{h\in\Gamma}\ \big|\epsilon_S(h) - \epsilon_S^* - \big(\epsilon_T(h) - \epsilon_T^*\big)\big|, \tag{3} \]

\[ T^r_\Gamma(S, T) := \sup_{h\in\Gamma}\ \big|\epsilon_S(h) - \epsilon_T(h)\big|. \tag{4} \]

The distinction between Γ and H will become apparent in Prop. 5. Note that the one-sided transfer measure is not symmetric. If we want the two domains S and T to be mutually transferable to each other, we can use the symmetric transfer measure. We call both quantities transfer measures. Furthermore, the symmetric transfer measure reduces to (4) in the realizable case when ϵ*_S = ϵ*_T = 0. In statistical learning theory, ϵ_D(h) − ϵ*_D is known as an excess risk [24], i.e., the error relative to the optimal classifier. The transfer measures can thus be expressed as differences of excess risks. With Def. 2, we immediately obtain the following upper bound on the target error:

Proposition 3 (target error bound). Given Γ ⊆ H, for any h ∈ Γ, the target error is bounded by:

\[ \epsilon_T(h) \le \epsilon_S(h) + \epsilon_T^* - \epsilon_S^* + T_\Gamma(S\to T) \le \epsilon_S(h) + \epsilon_T^* - \epsilon_S^* + T_\Gamma(S, T). \tag{5} \]

The first error bound of this type for a target domain uses the H-divergence [8, 9, 12] for binary classification (or, more rigorously, the HΔH-divergence). The main difference between ours and the H-divergence is that the H-divergence only concerns the marginal input distributions, whereas the transfer measures depend on the joint distributions over both the inputs and the labels. We note that Proposition 3 is general and works in the multi-class case as well. Moreover, even in the binary classification case we can prove that our Proposition 3 is tighter than the H-divergence bound (see Proposition 27 in the appendix). In practice we may not know the optimal errors. In this case, we can use the realizable transfer measure to upper bound the symmetric transfer measure (note that ϵ*_S or ϵ*_T may not be zero):

Proposition 4. For Γ ⊆ H and domains S, T we have: T_Γ(S, T) ≤ 2 T^r_Γ(S, T).

Since Def. 1 essentially asks that the excess risks of approximately optimal source classifiers are comparable between the source and target domains, we can show that Def. 1 and Def. 2 are equivalent if Γ is a δ-minimal set:

Proposition 5 (equivalence between transferability and transfer measures). Let δ_S > 0, Γ = argmin(ϵ_S, δ_S), and suppose inf_{h∈Γ} ϵ_T(h) = inf_{h∈H} ϵ_T(h). If T_Γ(S→T) ≤ δ or T_Γ(S, T) ≤ δ, then S is (δ_S, δ + δ_S)-transferable to T. Furthermore, if S is (δ_S, δ_T)-transferable to T, then T_Γ(S→T) ≤ δ_T and T_Γ(S, T) ≤ max{δ_S, δ_T}.

In Prop. 5, we do not require Γ = H, since it is unnecessary to impose that all classifiers in H have similar excess risks on the source and target domains. Instead, we only constrain Γ to be a δ-minimal set, i.e., Γ includes approximately optimal classifiers of S. See also Example 8. An additional assumption is that Γ also includes the optimal classifier of T, which can be ensured by controlling δ_S.
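Before comparing with other discrepancy measures, the following is a minimal NumPy sketch of how the three quantities in Def. 2 can be estimated empirically over a finite stand-in for Γ (e.g., a collection of near-optimal checkpoints); the helper names are illustrative assumptions and are not part of the released code.

```python
import numpy as np

def zero_one_error(h, X, y):
    """Empirical 0-1 error of classifier h on samples (X, y)."""
    return np.mean(h(X) != y)

def transfer_measures(classifiers, source, target):
    """Empirical transfer measures over a finite candidate set standing in for Γ.

    classifiers: list of callables h(X) -> predicted labels
    source, target: (X, y) sample tuples drawn from S and T
    Returns the one-sided (2), symmetric (3) and realizable (4) measures.
    """
    err_S = np.array([zero_one_error(h, *source) for h in classifiers])
    err_T = np.array([zero_one_error(h, *target) for h in classifiers])
    exc_S = err_S - err_S.min()            # empirical excess risks on S
    exc_T = err_T - err_T.min()            # empirical excess risks on T
    one_sided = np.max(exc_T - exc_S)      # Eq. (2)
    symmetric = np.max(np.abs(exc_S - exc_T))   # Eq. (3)
    realizable = np.max(np.abs(err_S - err_T))  # Eq. (4)
    return one_sided, symmetric, realizable
```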
2.1 Comparison with other discrepancy measures between domains

In this subsection, we compare the realizable transfer measure (4) with other discrepancy measures between domains, focusing on the 0-1 loss ϵ^{0-1}_D. We first note that T^r_Γ(S, T) can be written as an integral probability metric (IPM) [36, 46]. The l.h.s. of (4) can be written as:

\[ T^r_\Gamma(S, T) = d_{F_\Gamma}(S, T), \quad \text{where } d_F(S, T) = \sup_{f\in F} \int f(x, y)\,\big(p_S(x, y) - p_T(x, y)\big)\,\mathrm{d}x\,\mathrm{d}y \tag{6} \]

and F_Γ := {(x, y) ↦ 1(h(x) ≠ y) : h ∈ Γ}. Typical IPMs [46] include MMD, the Wasserstein distance, the Dudley metric and the Kolmogorov–Smirnov distance (see Appendix B.2 for more details). However, F_Γ is fundamentally different from these, since it relies on an underlying function class Γ. Our realizable transfer measure shares some similarity with Arora et al. [4], where a changeable function class is used, but the exact choices of the function class are different. Even though the transferability can be written as an IPM, it is in fact only a pseudo-metric:

Proposition 6 (pseudo-metric). For a general loss ϵ_D as in (1), T^r_Γ(S, T) is a pseudo-metric, i.e., for any distributions S, T, P on the same underlying space, we have T^r_Γ(S, S) = 0, T^r_Γ(S, T) = T^r_Γ(T, S) (symmetry), and T^r_Γ(S, T) ≤ T^r_Γ(S, P) + T^r_Γ(P, T) (triangle inequality).

However, in general T^r_Γ(S, T) is not a metric, since T^r_Γ(S, T) = 0 is possible even when S ≠ T. For instance, taking Γ = {h*} with h* optimal on both S and T, we have T^r_Γ(S, T) = 0, yet S and T could differ a lot (see Figure 2). In the next result we discuss the connection between realizable transfer measures and total variation (cf. Appendix B.2).

Proposition 7 (equivalence with total variation). For binary classification with labels {−1, 1}, given the 0-1 loss ϵ_D = ϵ^{0-1}_D, we have T^r_Γ(S, T) ≤ d_TV(S, T) for domains S, T and any Γ ⊆ H. Denote H_t as the set of all binary classifiers. Then d_TV(S, T) ≤ 4 T^r_{H_t}(S, T).

Prop. 7 tells us that transfer measures (see also Prop. 4) are no stronger than total variation, and in the realizable case, (3) is equivalent to the similarity of domains (as measured by total variation) if Γ is unconstrained. We can moreover show that transfer measures are strictly weaker if we choose Γ to be some δ-minimal set:

Example 8 (very dissimilar joint distributions but transferable). We study the distributions described in Figure 2. The joint distributions are very dissimilar, i.e., for any X, Y in the domain, |p_S(X, Y) − p_T(X, Y)| = 0.8. Define

\[ h_\rho(X) = \begin{cases} +1 & \text{if } -1 \le X < \rho, \\ -1 & \text{if } \rho \le X < 1. \end{cases} \tag{7} \]

We choose the hypothesis class H = {h_ρ : ρ ∈ [−1, 1]} and Γ = {h_ρ : |ρ| ≤ δ/0.8} (for small δ, say δ < 0.01) to be a neighborhood of the optimal source classifier h* = h_0. Then T_Γ(S, T) = sup_{h∈Γ} |ϵ_S(h) − ϵ_T(h)| = δ, and S is (δ_S, δ + δ_S)-transferable to T on Γ for any δ_S > 0 according to Prop. 5. Note that ϵ*_S = ϵ*_T = 0.

Figure 2: Visualization of Example 8. Source domain: P_S(Y = 1, −1 ≤ X < 0) = 0.1, P_S(Y = −1, 0 ≤ X < 1) = 0.9. Target domain: P_T(Y = 1, −1 ≤ X < 0) = 0.9, P_T(Y = −1, 0 ≤ X < 1) = 0.1. The dark and light colors show the intensity of the probability mass. The vertical axis denotes whether it is the target or source domain (above or below the x-axis).
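A quick numerical check of Example 8, under the assumption (consistent with Figure 2) that the probability mass is uniform within each half of the interval; the sampling helpers below are ours, written only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, p_left):
    """Example 8: X uniform within each half of [-1, 1); Y = +1 on [-1, 0), -1 on [0, 1).
    p_left is the probability mass of the left half (0.1 for S, 0.9 for T)."""
    left = rng.random(n) < p_left
    X = np.where(left, rng.uniform(-1, 0, n), rng.uniform(0, 1, n))
    Y = np.where(left, 1, -1)
    return X, Y

def err(rho, X, Y):
    pred = np.where(X < rho, 1, -1)        # threshold classifier h_rho
    return np.mean(pred != Y)

XS, YS = sample_domain(1_000_000, 0.1)     # source domain S
XT, YT = sample_domain(1_000_000, 0.9)     # target domain T

delta = 0.008                               # pick δ < 0.01 as in the example
rhos = np.linspace(-delta / 0.8, delta / 0.8, 41)   # Γ = {h_rho : |rho| <= δ/0.8}
gap = max(abs(err(r, XS, YS) - err(r, XT, YT)) for r in rhos)
print(gap)   # ≈ delta, even though d_TV(S, T) = 0.8
```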
3 Computing Transferability

In the last section we proposed a new concept called transferability. However, although Def. 1 provides a theoretically sound notion of transferability, it is hard to verify in practice, since we cannot exhaust all approximately good classifiers, especially for rich models such as deep neural networks. Nevertheless, Prop. 3 and Prop. 5 provide a framework to compute transferability through transfer measures, despite their simplicity. In this section we discuss how to compute these quantities by making the necessary approximations based on transfer measures. There are two difficulties we need to overcome: (1) in practice we only have finite samples drawn from the true distributions; (2) we need a surrogate loss, such as cross entropy, for training, and the 0-1 loss for evaluation. In Section 3.1 we show that our transfer measures can be estimated with enough samples, and in Section 3.2 we discuss transferability with a surrogate loss. These results will be used in our algorithms in the next section.

3.1 Estimation of transferability

We show how to estimate the transfer measure T_Γ(S → T) from finite samples. Other versions of the transfer measures in Def. 2 follow analogously (see Appendix A for more details).

Lemma 9 (reduction to estimation error). Given a general loss ϵ_D as in (1), suppose Ŝ and T̂ are i.i.d. sample distributions drawn from S and T. Then for any Γ ⊆ H we have:

\[ T_\Gamma(S\to T) \le T_\Gamma(\hat S \to \hat T) + 2\,\mathrm{est}_\Gamma(S) + 2\,\mathrm{est}_\Gamma(T), \]

with the estimation errors est_Γ(S) = sup_{h∈Γ} |ϵ_S(h) − ϵ_Ŝ(h)| and est_Γ(T) = sup_{h∈Γ} |ϵ_T(h) − ϵ_T̂(h)|.

This lemma tells us that estimating transferability is no harder than computing the estimation errors on both domains. If the function class Γ has the uniform convergence property [44], then we can guarantee efficient estimation of transferability. We first bound the sample complexity through Rademacher complexity, a standard tool for bounding estimation errors [5]:

Theorem 10 (estimation error with Rademacher complexity). Given the 0-1 loss ϵ_D = ϵ^{0-1}_D, suppose Ŝ and T̂ are sample sets with m and k samples drawn i.i.d. from S and T, respectively. For any Γ ⊆ H the following holds with probability at least 1 − δ:

\[ T_\Gamma(S\to T) \le T_\Gamma(\hat S \to \hat T) + 4\mathfrak{R}_m(F_\Gamma) + 4\mathfrak{R}_k(F_\Gamma) + 2\sqrt{\frac{\log(2/\delta)}{2m}} + 2\sqrt{\frac{\log(2/\delta)}{2k}}, \]

where F_Γ := {(x, y) ↦ 1(h(x) ≠ y) : h ∈ Γ}. If, furthermore, Γ is a set of binary classifiers with labels {−1, 1}, then 2ℜ_m(F_Γ) = ℜ_m(Γ) and 2ℜ_k(F_Γ) = ℜ_k(Γ).

We also provide estimation error results using the Vapnik–Chervonenkis (VC) dimension and the Natarajan dimension in Appendix B.3. It is worth mentioning that the VC dimension of piecewise-polynomial neural networks has been upper bounded by Bartlett et al. [7]. Since transfer measures can be estimated, in later sections we do not distinguish between the sample sets Ŝ, T̂ and the underlying distributions S, T.

3.2 Transferability with a surrogate loss

Due to the intractability of minimizing the 0-1 loss, we need to use a surrogate loss [6] for training in practice. In this section, we discuss this nuance w.r.t. transferability. We focus on the most commonly used surrogate loss, cross entropy (CE), although some of the results can be easily adapted to other loss functions. To distinguish a surrogate loss from the 0-1 loss, from now on we write ϵ_D for a surrogate loss and ϵ^{0-1}_D for the 0-1 loss. One difficulty is the non-equivalence between δ-minimal sets w.r.t. the 0-1 loss and a surrogate loss, i.e., argmin(ϵ_D, δ_D) might be quite different from argmin(ϵ^{0-1}_D, δ_D). Moreover, it is not practical to find all elements of argmin(ϵ^{0-1}_D, δ_D), since the 0-1 loss is nonconvex and nonsmooth. In light of these difficulties, we propose a more practical notion of transferability based on a surrogate loss ϵ_D:

Proposition 11 (transfer measure with a surrogate loss). Given a surrogate loss ϵ_D ≥ ϵ^{0-1}_D on a general domain D.
Suppose Γ = argmin(ϵ_S, δ_S) and denote ϵ*_T = inf_{h∈Γ} ϵ_T(h), ϵ*_S = inf_{h∈Γ} ϵ_S(h), and (ϵ^{0-1}_T)* = inf_{h∈H} ϵ^{0-1}_T(h). If the following holds:

\[ T_\Gamma(S\to T) = \sup_{h\in\Gamma}\ \epsilon_T(h) - \epsilon_T^* - \big(\epsilon_S(h) - \epsilon_S^*\big) \le \delta, \tag{8} \]

then we have argmin(ϵ_S, δ_S) ⊆ argmin(ϵ^{0-1}_T, δ + δ_S + ϵ*_T − (ϵ^{0-1}_T)*).

This proposition implies that if the transfer measure is small, then a near-optimal classifier for the surrogate loss on the source domain is also near-optimal on the target domain for the 0-1 loss. It also gives us a practical framework to guarantee transferability, which we discuss in more depth in Section 4. Assume ϵ_D: H → R is Lipschitz continuous and strongly convex, which is satisfied by the cross-entropy loss (see Appendix B.4). We can then translate the δ-minimal set into L^p balls in function space:

\[ C_1 \|h - h^*\|_{2,D}^2 \le \epsilon_D(h) - \epsilon_D(h^*) \le C_2 \|h - h^*\|_{1,D}, \tag{9} \]

where C_1 and C_2 are absolute constants and h* is an optimal classifier. The function norms ‖·‖_{1,D} and ‖·‖_{2,D} are the usual L^p norms over the distribution D. Since the classifier h = q(θ, ·) is usually parameterized by, say, a neural network, we can further upper bound the function norms by the distance between parameters: for 1 ≤ p < ∞, h = q(θ, ·) and h' = q(θ', ·), we have ‖h − h'‖_{p,D} ≤ L‖θ − θ'‖_2, with L a Lipschitz constant of q (Appendix B.4). Combined with (9), we obtain:

\[ \epsilon_D(h) - \epsilon_D(h^*) \le L C_2 \|\theta - \theta^*\|_2. \tag{10} \]

In other words, if the parameters are close enough, then the losses do not differ by much. We denote ‖·‖_2 as the Euclidean norm, and for later convenience we omit the subscript 2.

4 Algorithms for Evaluating and Improving Transferability

The notion of transferability is defined w.r.t. domains; hence, by learning feature embeddings that induce certain feature distributions, one can aim to improve the transferability of two given domains. In this section we design algorithms to evaluate and improve transferability by learning such transformations. To start with, let g: X → Z be a feature embedding (a.k.a. featurizer), where Z is understood to be a feature space. By a joint distribution D^g (or S^g, T^g) we mean a distribution on g(X) × Y. Formally, we are dealing with push-forwards of distributions:

\[ S^g := (g, \mathrm{id})_{\#} S, \qquad T^g := (g, \mathrm{id})_{\#} T, \tag{11} \]

where (g, id): (x, y) ↦ (g(x), y) is a function on X × Y. Here S and T are joint distributions on X × Y, and we take X to be the space of the original signal, such as an image. Since S and T cannot be changed, what we are evaluating is the feature embedding g. The key quantity is the transfer measure, as in (8):

\[ T_\Gamma(S^g \to T^g) = \sup_{h\in\Gamma}\ \epsilon_{T^g}(h) - \epsilon^*_{T^g} - \big(\epsilon_{S^g}(h) - \epsilon^*_{S^g}\big), \qquad \Gamma = \operatorname{argmin}(\epsilon_{S^g}, \delta_{S^g}). \tag{12} \]

Although Γ is hard to compute, we can use (10) to obtain a lower bound on (12). That is, given a parametrization of the classifier h = q(θ, ·) and the optimal classifier h* = q(θ*, ·), we have:

\[
\begin{aligned}
T_\Gamma(S^g \to T^g) &\ge \sup_{\|\theta - \theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon^*_{T^g} + \epsilon^*_{S^g} \\
&\ge \sup_{\|\theta - \theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon_{T^g}(\hat h^*) \\
&\approx \sup_{\|\theta - \hat\theta^*\| \le \delta} \big[\epsilon_{T^g}(h) - \epsilon_{S^g}(h)\big] - \epsilon_{T^g}(\hat h^*),
\end{aligned}
\tag{13}
\]

where δ > 0 depends on Γ and the constant in (10). In the second and third lines, we approximate the optimal errors ϵ*_{T^g} and ϵ*_{S^g} using 0 ≤ ϵ*_{S^g} ≤ ϵ_{S^g}(ĥ*) and 0 ≤ ϵ*_{T^g} ≤ ϵ_{T^g}(ĥ*), and we use the learned classifier ĥ* = q(θ̂*, ·) as a surrogate for the optimal classifier. As a result, if the r.h.s. of (13) is large, then S^g is not quite transferable to T^g.
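Before stating the full procedure (Algorithm 1 below), the following PyTorch-style sketch shows how the last line of (13) could be estimated for a single source–target pair by projected gradient ascent on the classifier head. All function and argument names are illustrative assumptions and do not reproduce the released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def transfer_lower_bound(g, h_hat, source_loader, target_loader,
                         delta=1.0, steps=20, lr=0.01):
    """Estimate the r.h.s. of (13): ascend eps_T(h∘g) - eps_S(h∘g) over a head h
    whose parameters stay in a Euclidean ball of radius delta around theta_hat."""
    for p in g.parameters():            # the featurizer g is kept fixed
        p.requires_grad_(False)
    h = copy.deepcopy(h_hat)
    theta_hat = torch.nn.utils.parameters_to_vector(h_hat.parameters()).detach()
    opt = torch.optim.SGD(h.parameters(), lr=lr)

    def loss_on(head, loader):
        return torch.stack([F.cross_entropy(head(g(x)), y) for x, y in loader]).mean()

    for _ in range(steps):
        gap = loss_on(h, target_loader) - loss_on(h, source_loader)
        opt.zero_grad()
        (-gap).backward()               # gradient *ascent* on the gap
        opt.step()
        with torch.no_grad():           # project back onto ||theta - theta_hat|| <= delta
            theta = torch.nn.utils.parameters_to_vector(h.parameters())
            diff = theta - theta_hat
            if diff.norm() > delta:
                torch.nn.utils.vector_to_parameters(
                    theta_hat + delta * diff / diff.norm(), h.parameters())

    with torch.no_grad():               # subtract eps_{T^g}(h_hat) as in the last line of (13)
        return (loss_on(h, target_loader) - loss_on(h, source_loader)
                - loss_on(h_hat, target_loader)).item()
```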
We can thus design an algorithm to evaluate transferability, presented in Section 4.1. By computing the lower bound in (13), we can disprove transferability in the sense of Prop. 5 and Prop. 11. Computing the lower bound in (13) can be regarded as an attack: an adversary tries to show that S^g is not transferable to T^g. Against this attack, we can also design a defence that aims to minimize the lower bound and thereby learn more transferable features.

4.1 Algorithm for evaluating transferability

In domain generalization we have one target domain and more than one source domain. To ease the presentation, we denote S_0 = T (and thus S^g_0 = T^g) and extend the index set to {0, 1, ..., n}. We need to evaluate the transferability (13) between all pairs S^g_i and S^g_j. Algorithm 1 gives an efficient method to compute the worst-case gap sup_{‖θ−θ̂*‖≤δ} ϵ_{S^g_i}(h) − ϵ_{S^g_j}(h) among all pairs (i, j). Essentially, at each step it finds the worst pair (i, j), i.e., the one for which the gap ϵ_{S^g_i}(h) − ϵ_{S^g_j}(h) is largest, and then maximizes this gap over the parameter θ through gradient ascent. Note that the computation of (13) also depends on information from the target domain. This is valid since we are only evaluating, not training, on these domains.

Algorithm 1: Evaluating transferability among multiple domains
  Input: learned feature embedding g, learned classifier ĥ* = q(θ̂*, ·), target sample training set T = S_0, sample training sets S_1, ..., S_n, ascent optimizer, minimal errors ϵ*_{S_i} ≈ ϵ_{S_i}(ĥ*), adversarial radius δ
  Initialize: a classifier h = q(θ, ·) with θ = θ̂*, gap = −∞
  for t in 1 ... T do
    Find max_i ϵ_{S_i}(h ∘ g) and min_i ϵ_{S_i}(h ∘ g) and the corresponding indices j and k
    Run the ascent optimizer on h to maximize gap_0 = ϵ_{S_j}(h ∘ g) − ϵ_{S_k}(h ∘ g)
    Project θ onto the Euclidean ball ‖θ − θ̂*‖ ≤ δ
    if gap_0 > gap then set gap = gap_0 and save the accuracies and losses of each domain
  Output: j, k, h, ϵ_{S_j}(h ∘ g) − ϵ_{S_k}(h ∘ g), ϵ_{S_j}(ĥ*), ϵ_{S_k}(ĥ*)

4.2 Algorithm for improving transferability

The evaluation sub-procedure provides a way to pick a pair of non-transferable domains (S^g_i, S^g_j), which in turn can be used to improve transferability among all source domains by updating the feature embedding g so that the gap sup_{‖θ'−θ‖≤δ} ϵ_{S^g_i}(h') − ϵ_{S^g_j}(h') is small for all (i, j) ∈ [n] × [n]. Simultaneously, we require that the feature embedding g preserves information for the target task of interest. With the parametrization h = q(θ, ·) and h' = q(θ', ·), the overall optimization problem can be formulated as:

\[ \min_{g, h}\ \max_{\|\theta' - \theta\| \le \delta}\ \frac{1}{n}\sum_{i=1}^{n} \epsilon_{S_i}(h \circ g) + \max_i \epsilon_{S_i}(h' \circ g) - \min_i \epsilon_{S_i}(h' \circ g). \tag{14} \]

Intuitively, we want to learn a common feature embedding and a classifier such that all source errors are small and the pairwise transferability gap between source domains is also small. If the optimization problem is properly solved, then we have the following guarantee:

Theorem 12 (optimization guarantee). Assume that the function q(·, x) is L_θ-Lipschitz continuous for any x. Suppose we have learned a feature embedding g and a classifier h such that the loss functional ϵ_{S^g_i}: H → R is L_ℓ-Lipschitz continuous w.r.t. the distribution S^g_i for i ∈ [n] and

\[ \max_{\|\theta' - \theta\| \le \delta}\ \frac{1}{n}\sum_{i=1}^{n} \epsilon_{S_i}(h \circ g) + \max_i \epsilon_{S_i}(h' \circ g) - \min_i \epsilon_{S_i}(h' \circ g) \le \eta, \tag{15} \]

where θ, θ' are the parameters of h and h'. Then for any h' ∈ Γ = {q(θ', ·): ‖θ' − θ‖ ≤ δ}, we have:

\[ T^r_\Gamma(T^g_1, T^g_2) \le \eta, \qquad \epsilon_{S_i}(h' \circ g) \le \eta + L_\ell L_\theta \delta, \qquad \epsilon_T(h' \circ g) \le 2\eta + L_\ell L_\theta \delta, \tag{16} \]

for any T^g_1, T^g_2, T^g ∈ conv(S^g_1, ..., S^g_n) and any i ∈ [n].

The Lipschitzness assumption on ϵ_{S^g_i} is mild and is satisfied by the cross-entropy loss (cf. Appendix B.4.1). Here conv(·) denotes the convex hull in the same sense as Albuquerque et al. [2], i.e., each element is a mixture of the source distributions. Thm. 12 tells us that if we can solve the optimization problem (14) properly, we can guarantee transferability on a neighborhood of the classifier, as an approximation of the δ-minimal set.
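Concretely, one outer step of objective (14) alternates an inner ascent on a copy of the classifier head with an outer descent on the featurizer and classifier; the PyTorch-style sketch below illustrates this under our own naming and hyper-parameter choices (it is a sketch only, and the full procedure is given as Algorithm 2 below).

```python
import copy
import torch
import torch.nn.functional as F

def transfer_training_step(g, h, source_batches, opt_gh,
                           n_inner=20, lr_adv=0.01, delta=10.0, lam=1.0):
    """One outer step of objective (14).

    source_batches: list of (x, y) batches, one per source domain S_1..S_n.
    opt_gh: optimizer over the parameters of g and h. Names are illustrative.
    """
    def domain_losses(head):
        return torch.stack([F.cross_entropy(head(g(x)), y) for x, y in source_batches])

    # inner loop: adversarial head h' starts at h and ascends the max-min gap, g fixed
    h_adv = copy.deepcopy(h)
    theta = torch.nn.utils.parameters_to_vector(h.parameters()).detach()
    opt_adv = torch.optim.SGD(h_adv.parameters(), lr=lr_adv)
    for _ in range(n_inner):
        losses = domain_losses(h_adv)
        gap = losses.max() - losses.min()
        opt_adv.zero_grad()
        (-gap).backward()                     # ascent on the gap
        opt_adv.step()
        with torch.no_grad():                 # project onto ||theta' - theta|| <= delta
            theta_adv = torch.nn.utils.parameters_to_vector(h_adv.parameters())
            diff = theta_adv - theta
            if diff.norm() > delta:
                torch.nn.utils.vector_to_parameters(
                    theta + delta * diff / diff.norm(), h_adv.parameters())

    # outer step: average source loss of h plus the gap of the fixed h', both through g
    objective = domain_losses(h).mean()
    losses_adv = domain_losses(h_adv)
    objective = objective + lam * (losses_adv.max() - losses_adv.min())
    opt_gh.zero_grad()
    objective.backward()
    opt_gh.step()
    return objective.item()
```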
We thus propose Algorithm 2, which shares similarities with existing frameworks such as DANN [16] and Distributionally Robust Optimization [43, 45], in the sense that they all involve adversarial training and minimax optimization. However, the objective in our case is different, and we provide a more detailed comparison with existing methods in Appendix B.5.

Algorithm 2: Transfer algorithm for domain generalization
  Input: sample sets of source domains S_1, ..., S_n, feature embedding g, classifier h = q(θ, ·), adversarial classifier h' = q(θ', ·), surrogate loss ϵ_D, adversarial radius δ, ascent optimizer, descent optimizer, weight parameter λ, number of epochs T
  for t in 1 ... T do
    Compute max_i ϵ_{S_i}(h ∘ g) and min_i ϵ_{S_i}(h ∘ g)
    Initialize h' = h (i.e., θ' = θ)
    for k in 1 ... N do
      Run the ascent optimizer on h' to maximize max_i ϵ_{S_i}(h' ∘ g) − min_i ϵ_{S_i}(h' ∘ g), fixing g
      Project θ' onto the Euclidean ball ‖θ' − θ‖ ≤ δ
    Fixing h', run the descent optimizer on g, h to minimize error = (1/n) Σ_i ϵ_{S_i}(h ∘ g) + λ (max_i ϵ_{S_i}(h' ∘ g) − min_i ϵ_{S_i}(h' ∘ g))
  Output: feature embedding g, classifier h

5 Experiments

Gulrajani and Lopez-Paz [20] performed extensive experiments comparing DG algorithms, using the same neural architecture and data split. Specifically, they show that with data augmentation, ERM performs relatively well against a large array of algorithms. Our experiments are based on their settings. We run Algorithm 1 on standard benchmarks, including Rotated MNIST [18], PACS [28], Office-Home [51] and WILDS-FMoW [23] (cf. Appendix C.1). In particular, WILDS-FMoW is a large dataset with nearly half a million images. Detailed experimental settings are given in Appendix C.

Evaluating transferability. From Figure 3 it can be seen that in a neighborhood of the learned classifier, there exists a classifier such that the target accuracy degrades significantly while some source domain still has high accuracy. This calls into question whether currently popular algorithms such as ERM [50], DANN [17] and Mixup [53, 54] are really learning invariant and transferable features. If they were, the target accuracy should be high given a high source accuracy. However, for the PACS dataset and the Mixup model (second column of Figure 3), the target accuracy decreases by more than 30% while the source accuracy remains roughly at the same level. We can also read from the first column that, e.g., with a small decrease of the source (test) accuracy by 2% (at δ = 2), the target accuracy of DANN drops by 10%.

Figure 3: Top row: test accuracy of the target domain; bottom row: test accuracy of one of the source domains. Each column corresponds to one dataset, with the dataset name in the middle; the legends on the bottom row are the same as those on the top row. δ is the parameter in Algorithm 1.

From Figure 3 we can also see that Correlation Alignment (CORAL) [47] and Spectral Decoupling (SD) [40] have better transferability than the other algorithms. In some sense, they are in fact learning robust classifiers, i.e., all classifiers in the neighborhood of the learned classifier achieve good accuracies. With such a robust classifier, the target accuracy does not decrease much even if the classifier is perturbed.
Improving transferability. Algorithm 2 performs well on all four datasets we tried, comparably to CORAL and SD. Note that CORAL and SD do not always perform well, e.g., on the Office-Home and WILDS-FMoW datasets, whereas our Transfer algorithm does. However, in our experiments we found two limitations of Algorithm 2: (1) we need a large number of inner maximization steps to compute the gap, which requires more training time; this is similar to adversarial training for robustness [33], which is slower than standard training. To overcome this difficulty we used pretraining from other algorithms in the experiments on Office-Home and WILDS-FMoW. (2) Moderate hyper-parameter tuning is needed. For example, we need to tune N in Algorithm 2, the learning rate (lr) of SGA and the choice of δ. We find that N = 20 or 30 is usually a good choice, and δ can be taken so large that the projection step is never active. We take lr = 0.01 for Rotated MNIST and lr = 0.001 for the other datasets.

Label shift. To show the difference from the well-known H-divergence [8], we compute the label shift in the PACS dataset. As shown in Zhao et al. [59], the optimal joint error ϵ_S(h*) + ϵ_T(h*) (where h* is the classifier minimizing ϵ_S(h) + ϵ_T(h)) can be large under a shift of the label distributions. We follow [59] and compute the label shift between pairs of domains in the PACS dataset, measured by total variation. From Table 1 we can see that the label shift is large in this case, and thus the H-divergence bound [8] can be quite loose. In comparison, our transfer measure bound (Prop. 3) is tighter (cf. Prop. 27) and therefore still useful in practice.

Table 1: Label shift between pairs of domains in the PACS dataset, measured by total variation (TV), which always lies between zero and one. A: art painting; C: cartoon; P: photo; S: sketch.

|   | A    | C    | P    | S    |
|---|------|------|------|------|
| A | 0.0  | 0.12 | 0.11 | 0.3  |
| C | 0.12 | 0.0  | 0.18 | 0.24 |
| P | 0.11 | 0.18 | 0.0  | 0.37 |
| S | 0.3  | 0.24 | 0.37 | 0.0  |
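For reference, each entry of Table 1 is the total variation distance between two label marginals, i.e., half the ℓ1 distance between their class-probability vectors. A minimal sketch of the computation follows; the class counts below are made-up placeholders, not the actual PACS statistics.

```python
import numpy as np

def label_tv(counts_a, counts_b):
    """d_TV(P_a(Y), P_b(Y)) = 0.5 * sum_y |P_a(y) - P_b(y)|, always in [0, 1]."""
    p = np.asarray(counts_a, dtype=float); p /= p.sum()
    q = np.asarray(counts_b, dtype=float); q /= q.sum()
    return 0.5 * np.abs(p - q).sum()

# illustrative per-class image counts for two domains (7 classes, as in PACS)
counts_domain_1 = [100, 200, 150, 120, 180, 160, 140]
counts_domain_2 = [300, 120, 100, 220, 90, 160, 110]
print(label_tv(counts_domain_1, counts_domain_2))
```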
6 Related Work

Multi-task learning. Multi-task learning (MTL) [56] is related to but different from DG. In MTL, there are several tasks, and one hopes to improve the performance of each task by jointly training all tasks simultaneously, exploiting the relationships between them. This differs from DG in that in DG the target domain is unknown a priori, whereas MTL focuses on better generalization on existing tasks that appear in training. Hence, there is no distribution shift in MTL per se. Furthermore, in MTL the output spaces of different tasks are not necessarily the same.

Zero-shot learning / Few-shot learning / Meta-learning. DG is different from zero-shot learning [27]. In zero-shot learning, one has labeled training data and the goal is to make predictions on a new, unseen label set. In DG, however, the label set remains the same for the source and target domains. The focus of few-shot learning, on the other hand, is fast adaptation: the test distribution remains the same as the training distribution, but the learner only has access to a few labeled samples. Domain generalization also shares similarity with meta-learning. However, in meta-learning the learner is allowed to fine-tune on the target domain; in other words, the protocol of meta-learning allows access to a small amount of labeled data from future unseen domains. Meta-learning is more or less one specific method used to tackle few-shot learning. Because of the similarity, some meta-learning algorithms can be applied to DG [29].

Self-supervised learning. Self-supervised learning (SSL) is a popular unsupervised feature learning approach [13, 19, 21]. The goal of SSL is to learn representations that are invariant w.r.t. different views of the same image. Although it is a promising feature learning method, it differs from our DG setting in that no labels are used in SSL.

Domain generalization. Many algorithms, old and new, have been proposed for domain generalization. The simplest is Empirical Risk Minimization (ERM), which simply minimizes the empirical risk of (the sum of) all source domains. Blanchard et al. [10] and Muandet et al. [35] proposed kernel methods for DG. Arjovsky et al. [3] proposed Invariant Risk Minimization (IRM), which aims to learn invariant predictors across source domains; follow-up discussions can be found in [25, 41]. Another approach is distributional robustness [43, 52], where the model is optimized over a worst-case distribution under the constraint that this distribution is generated from a small perturbation around the source distributions. Albuquerque et al. [2] proposed a DG scheme based on distribution matching. Moreover, many domain adaptation algorithms can be directly adapted to domain generalization, such as CORAL [47] and DANN [16]. Last but not least, we mention a concurrent work [55] on the theory of domain generalization, which focuses more on proposing a model selection rule based on accuracy and variation.

Adversarial robustness. Our evaluation and training methods in Section 4 are reminiscent of the adversarial training method [33] in the literature on adversarial robustness. Perturbing the classifier in our case corresponds to perturbing the input data in adversarial robustness. From this perspective, our Transfer algorithm is parallel to adversarial training. It would be interesting to design certified robust feature embeddings, by analogy with certified robust classifiers [15].

7 Conclusions

In this paper we formally define a notion of transferability that we can quantify, estimate and compute. Our transfer measures can be understood as a special class of IPMs. They are weaker than total variation, and even very dissimilar distributions can be transferable to each other. Our definition of transferability can also be used naturally to derive a generalization bound for the prediction error on the target domain. Based on our theory, we propose algorithms to evaluate and improve transferability by learning feature representations. Experiments show that, somewhat surprisingly, many existing algorithms are not quite learning transferable features. From this perspective, our transfer measures offer a novel way to evaluate the features learned by different DG algorithms. We hope that our proposal of transferability draws the community's attention to further investigating and better understanding the fundamental quantity that allows robust models under distribution shift.

Broader Impact

Reliable domain generalization models are important for practical use. Our work points out the reliability issue of DG algorithms. It is worth mentioning that our evaluation method can only disprove transferability, and surviving our attack should not be treated as a guarantee. Misunderstanding this could lead to potential harm in practical applications.
Acknowledgements and Funding Transparency Statement

We thank the anonymous reviewers for their constructive comments, as well as the area chair and the senior area chair for overseeing the review process. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. We thank NSERC and the Canada CIFAR AI Chairs program for funding support. GZ is also supported by the David R. Cheriton scholarship and a research grant from the Vector Institute. HZ is supported by startup funding from the Department of Computer Science at UIUC. Finally, we thank the Vector Institute for providing the GPU cluster.

References

[1] Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145–155. PMLR, 2020.
[2] Isabela Albuquerque, João Monteiro, Mohammad Darvishi, Tiago H Falk, and Ioannis Mitliagkas. Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804, 2019.
[3] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[4] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232. PMLR, 2017.
[5] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[6] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[7] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
[8] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137–144, 2007.
[9] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[10] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. Advances in Neural Information Processing Systems, 24:2178–2186, 2011.
[11] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. Journal of Machine Learning Research, 22(2):1–55, 2021.
[12] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–1607, 2020. URL http://proceedings.mlr.press/v119/chen20j.html.
[14] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.
[15] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
[16] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189. PMLR, 2015.
[17] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[18] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.
[19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/f3ada80d5c4ee70142b17b8192b2958e-Abstract.html.
[20] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
[21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[22] Zeyi Huang, Haohan Wang, Eric P. Xing, and Dong Huang. Self-challenging improves cross-domain generalization. In ECCV, 2020.
[23] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.
[24] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.
[25] Masanori Koyama and Shoichiro Yamaguchi. When is invariance useful in an out-of-distribution generalization problem?, 2021.
[26] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688, 2020.
[27] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
[28] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
[29] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[30] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[31] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[32] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208–2217. PMLR, 2017.
[33] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[34] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[35] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
[36] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429–443, 1997.
[37] Hyeonseob Nam, Hyun Jae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap via style-agnostic networks. arXiv preprint arXiv:1910.11645, 2019.
[38] Balas K Natarajan. On learning sets and functions. Machine Learning, 4(1):67–97, 1989.
[39] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
[40] Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. arXiv preprint arXiv:2011.09468, 2020.
[41] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=BbNIbVPJ-42.
[42] Walter Rudin. Real and Complex Analysis. McGraw-Hill Education, 1987.
[43] Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
[44] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[45] Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[46] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
[47] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pages 443–450. Springer, 2016.
[48] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33, 2020.
[49] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
[50] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
[51] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[52] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In NeurIPS, 2018.
[53] Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6502–6509, 2020.
[54] Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.
[55] Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization, 2021.
[56] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2021. doi: 10.1109/TKDE.2021.3070203.
[57] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404–7413. PMLR, 2019.
[58] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in Neural Information Processing Systems, 31:8559–8570, 2018.
[59] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523–7532. PMLR, 2019.