# Distribution Free Domain Generalization

Peifeng Tong¹, Wu Su², He Li³ ⁴, Jialin Ding³, Haoxiang Zhan³, Song Xi Chen³ ¹

¹Guanghua School of Management, Peking University, Beijing 100871, China. ²Center for Big Data Research, Peking University, Beijing 100871, China. ³School of Mathematical Science, Peking University, Beijing 100871, China. ⁴Pazhou Lab, Guangzhou 510330, China. Correspondence to: Song Xi Chen.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Accurate prediction on out-of-distribution data is desirable for a learning algorithm. In domain generalization, the training data from the source domains tend to have distributions different from that of the target domain, while the target data are absent from the training process. We propose a Distribution Free Domain Generalization (DFDG) procedure for classification that applies standardization to avoid the dominance of a few domains in the training process. The essence of DFDG is to reformulate the cross domain/class discrepancy as pairwise two-sample test statistics, and to equally weight their importance or their covariance structures so that no domain or class dominates. A theoretical generalization bound is established for the multi-class classification problem. The DFDG is shown to offer superior performance in empirical studies with fewer hyperparameters, which means faster and easier implementation.

1. Introduction

Domain generalization (DG) aims at transferring knowledge from the source domains to the target domains without the target data in the training process (Blanchard et al., 2011). A major challenge of DG is that the source and target data are not identically distributed, so an algorithm trained on the source domains tends to perform worse in the target domain. DG is designed to attain robust performance in the target domain. Compared with domain adaptation, where the target data are accessible in training to obtain a target-specific predictor (Long et al., 2015; Li et al., 2021), DG seeks a single global predictor or classifier that performs well in both the source and target domains (Blanchard et al., 2021). Many approaches have been proposed for DG (Zhou et al., 2021; Fan et al., 2021; Shu et al., 2021), such as the kernel-based domain-invariant feature representation (Hu et al., 2020), the meta-learning framework (Balaji et al., 2018), and model selection or model averaging (Ye et al., 2021). See Wang et al. (2022) and Zhou et al. (2022) for reviews.

Among the existing DG methods, we follow the kernel DG methods (Muandet et al., 2013; Ghifary et al., 2017; Li et al., 2018; Hu et al., 2020) for new development. These methods first map the data to a high dimensional reproducing kernel Hilbert space (RKHS), then construct metrics to measure the cross domain and cross class discrepancies, and finally obtain a low dimensional feature representation that minimizes the cross domain dissimilarity while keeping features of different classes well separated. The metrics are usually constructed as variants of the maximum mean discrepancy (MMD) (Gretton et al., 2012). A common challenge in DG is to counter the different mean levels and variations among the discrepancy measures of different domains in the training stage.
A robust DG procedure has to prevent domains with higher mean levels or larger variations from dictating the feature selection, as features heavily influenced by outlying domains are doomed to generalize poorly. Existing kernel DG methods have to use more hyperparameters to balance the between-domain discrepancy measures, which may reduce the generalization ability of the methods. We propose two standardization procedures designed to reduce the heterogeneity in the kernel DG discrepancy statistics among the domains by conducting mean and variance adjustments. These standardizations are based on an asymptotic analysis ((12) and Proposition 1) of the pairwise MMD statistics, which reduces the number of hyperparameters and speeds up the training process, and hence allows more computationally intensive classifiers in the DG procedure. Specifically, we put forward a distribution-free DG (DFDG) approach that provides superior performance using fewer hyperparameters, which is well suited for DG. We unify the kernel DG methods as an optimization problem based on pairwise two-sample test statistics with a concise matrix form in terms of a sandwich structure. Two distribution-free standardized metrics are proposed: one reweights the weighting matrix by the means of the null distributions, and the other de-correlates the averaged Gram matrix. A generalization bound for multi-class classification based on the DFDG is derived, which provides a theoretical guarantee for the proposed approach.

The paper is organized as follows. Section 2 gives the unified framework of the DG problem for classification. Section 3 proposes two distribution free metrics. Section 4 is for the generalization bound. Simulation and case studies are provided in Section 5, followed by a conclusion in Section 6. Some technical and numerical details are relegated to the supplementary material (SM).

2. Unified framework of DG problem

Throughout the paper, we use bold lowercase letters for column vectors and bold uppercase letters for matrices. We consider a classification task. Let X ⊂ R^p denote the observation space and Y ⊂ R the set of class labels. Let P_{X×Y} denote the set of joint distributions on X × Y. It is assumed that there exists a unimodal super distribution P with finite variance over P_{X×Y}, such that P^(1)_XY, ..., P^(m)_XY are independent and identically distributed (IID) realizations from P in P_{X×Y}. For a domain s, there is a sample {(x^s_i, y^s_i)}_{i=1}^{n_s} of n_s IID realizations of (x, y) according to the distribution P^(s)_XY. In general, for any s ≠ s', P^(s)_XY ≠ P^(s')_XY, implying non-identical distributions across the domains. Consider a target distribution P^(t)_XY ~ P and a target sample {(x^t_i, y^t_i)}_{i=1}^{n_t}, where the class labels {y^t_i} are not available and {x^t_i} are not used in the training. This forces us to establish a global model without retraining for a specific target domain. Our goal is to extract domain-invariant features that simultaneously have minimum cross domain discrepancy and maximum cross class discrepancy.

The kernel method is founded on an RKHS H associated with a kernel k and inner product ⟨·,·⟩_H, with the reproducing property that for any function f : X → R in H, ⟨f(·), k(x, ·)⟩_H = f(x). The canonical map φ : X → H is denoted by φ(x) := k(x, ·) and satisfies k(x, x') = φ(x)^T φ(x'). To map a probability distribution to the RKHS, we define the kernel mean embedding µ : P_X → H induced by k as

µ_{P_X} := E_X[φ(X)] = ∫_X φ(x) dP_X(x).

If k is a bounded and characteristic kernel, the mapping is injective, so that ||µ_{P_X} − µ_{P'_X}||_H = 0 if and only if (iff) P_X = P'_X. The sample estimator is µ̂_{P_X} = (1/n) Σ_{i=1}^n φ(x_i). Denote the kernel mean embeddings of P^(s)_X and P^(s)_{X|Y=j} by µ_s and µ^s_j, respectively. These mean maps are all high dimensional, and we assume that µ_P ∈ R^N for a large integer N, where N can be infinity.
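To fix ideas, the following is a minimal sketch (in Python/NumPy; the function names and the plug-in V-statistic form are our own choices, not the paper's notation) of the empirical embedding µ̂_{P_X} = (1/n) Σ_i φ(x_i) and the squared RKHS distance ||µ̂_{P_X} − µ̂_{P'_X}||²_H between two samples, computed entirely from Gram matrices of a Gaussian kernel with a median-heuristic bandwidth, the kernel choice used later in Section 5.

```python
import numpy as np
from scipy.spatial.distance import cdist


def gaussian_gram(X, Y, h):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * h ** 2))


def median_heuristic(X, Y):
    """Median of the pairwise Euclidean distances, a common default bandwidth."""
    Z = np.vstack([X, Y])
    D = cdist(Z, Z, "euclidean")
    return np.median(D[np.triu_indices_from(D, k=1)])


def mmd2(X, Y, h=None):
    """Plug-in (V-statistic) estimate of ||mu_PX - mu_P'X||_H^2.

    Expanding the squared RKHS norm of the difference of the two empirical
    embeddings gives mean(Kxx) - 2 mean(Kxy) + mean(Kyy), so the statistic
    only requires Gram matrices.
    """
    if h is None:
        h = median_heuristic(X, Y)
    Kxx = gaussian_gram(X, X, h)
    Kxy = gaussian_gram(X, Y, h)
    Kyy = gaussian_gram(Y, Y, h)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 2))  # sample from P_X
    Y = rng.normal(0.5, 1.0, size=(200, 2))  # sample from a shifted P'_X
    print(mmd2(X, Y))  # positive in expectation when P_X differs from P'_X
```

The same quantity, applied to class-conditional samples from pairs of domains, is the building block of the discrepancy measures below.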
2.1. Cross domain discrepancy

The cross domain discrepancy can be regarded as the sum of the pairwise distances between domains under each class condition, averaged over the classes, as follows.

Definition 1 (pairwise cross domain discrepancy (PDD)). Given the class-conditional distributions {P^(s)_{X|Y=j}} for s ∈ {1, ..., m} and j ∈ {1, ..., c}, the PDD is

Ψ_pdd := 1/(c·C(m,2)) · Σ_{j=1}^c Σ_{1≤s<s'≤m} ||µ^s_j − µ^{s'}_j||²_H,

where C(m,2) = m(m−1)/2 is the number of domain pairs.

4. Generalization bound

For a score function g with margin function r_g, the empirical ρ-margin risk for a ρ > 0 is

R̂_{n,ρ}(g) = 1/(cm) · Σ_{s=1}^m Σ_{j=1}^c (1/n^s_j) Σ_{i=1}^{n^s_j} l_ρ(r_g(x̃^s_{j,i}, j)),

where x̃^s_{j,i} = (P̂^(s)_{X|Y=j}, x^s_{j,i}), and l_ρ(x) = min(1, max(0, 1 − x/ρ)) is the ρ-margin loss, which is 1/ρ-Lipschitz. The expected loss (risk) of the classification is R(g) = E_{(x̃,y)} I(r_g(x̃, y) ≤ 0), where I(·) is the indicator function. Since I(x ≤ 0) ≤ l_ρ(x), we have R(g) ≤ E_{(x̃,y)} l_ρ(r_g(x̃, y)) for any g.

For the DG problem, the widely used product kernel k̄ is

k̄((P^(s)_X, x^s_i), (P^(s')_X, x^{s'}_j)) = k_P(P^(s)_X, P^(s')_X) · k_1(x^s_i, x^{s'}_j),    (24)

with RKHS H_k̄ (Blanchard et al., 2011). For the choice of k_P, let k_2 denote a kernel on X with RKHS H_{k_2} and feature map φ_{k_2}. Define the k_2-induced kernel mean embedding µ : P_X → H_{k_2} by µ_{P_X} := ∫_X φ_{k_2}(x) dP_X(x), and introduce another kernel K on H_{k_2} such that

k_P(P^(s)_X, P^(s')_X) = K(µ_{P^(s)_X}, µ_{P^(s')_X}).

Combining the classifier with the kernel k̄, a family of DG-based score functions is

G_k̄ = {(x̃, y) ∈ X̃ × {1, ..., c} ↦ a_y^T W^T φ_k̄(x̃) : A = (a_1, ..., a_c)^T, ||A W^T||_{H_k̄} ≤ qΛ}.

The following assumption makes k̄ a bounded universal kernel.

Assumption 2. (i) The kernel k_1 is universal on X, k_2 is universal and continuous on X, and K is universal on any compact subset of H_{k_2}; the kernels k_1, k_2 and K are bounded by U_1², U_2² and U_K², respectively. (ii) The canonical feature map φ_K associated with K is L_K-Lipschitz, and the observation space X is a compact metric space.

We have the following theorem regarding the multi-class generalization bound.

Theorem 1. Under Assumption 2, and assuming balanced sample sizes n^s_j = n, for a ρ > 0 and any δ > 0, with probability at least 1 − δ, the following multi-class classification generalization bound holds for all g ∈ G_k̄:

R(g) ≤ R̂_{n,ρ}(g) + (qΛ U_1 U_2 L_K / ρ) · (6√(m/n) + 4√(c ⋯)).

Theorem 1 generalizes the results in Hu et al. (2020) by quantifying the effects of the class number c and the feature dimension q introduced by the proposed standardization methods; a larger c or q leads to a weaker guarantee. Given the confidence level 1 − δ, the excess risk converges to zero if n grows faster than log(cm) and m increases.
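As an illustration of the product kernel (24) under the kernel choices used in Section 5 (Gaussian k_1 and k_2 with bandwidth h, Gaussian K with bandwidth one), the minimal sketch below reuses the gaussian_gram and mmd2 helpers from the earlier sketch: since K is Gaussian, K(µ_s, µ_{s'}) = exp(−||µ_s − µ_{s'}||²_H / 2), and the squared RKHS distance between the two embeddings is exactly the squared MMD between the two domain samples. The function name and this particular Gaussian parametrization are our assumptions.

```python
import numpy as np


def product_kernel(x, x_prime, X_s, X_sprime, h):
    """Sketch of the product kernel (24):
        kbar((P_s, x), (P_s', x')) = k_P(P_s, P_s') * k_1(x, x'),
    with k_P(P_s, P_s') = K(mu_s, mu_s') and K a Gaussian kernel of bandwidth
    one on the embeddings, so that K(mu_s, mu_s') = exp(-||mu_s - mu_s'||_H^2 / 2).
    The squared RKHS distance is estimated by the squared MMD between the
    domain samples X_s and X_sprime via the `mmd2` helper sketched earlier.
    """
    k_P = np.exp(-0.5 * mmd2(X_s, X_sprime, h=h))                # kernel between domains
    k_1 = gaussian_gram(x[None, :], x_prime[None, :], h)[0, 0]   # kernel between points
    return k_P * k_1
```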
5. Empirical results

We compare the proposed DFDG with existing DG methods on a synthetic dataset and two real image classification tasks. The two proposed DFDG metrics, DFDG-Eig (Section 3.1) and DFDG-Cov (Section 3.2), are each paired with two classifiers, the 1-nearest neighbor (1-NN) and the support vector machine (SVM). The proposed DFDG is compared with the conventional k-NN and SVM without dimension reduction, and with the kernel DG methods, namely the domain invariant component analysis (DICA, Muandet et al. 2013), the scatter component analysis (SCA, Ghifary et al. 2017), the conditional invariant DG (CIDG, Li et al. 2018) and the multi-domain discriminant analysis (MDA, Hu et al. 2020), where 1-NN was used as the classifier for these kernel DG methods. The product kernel (24) was used for all the kernel-based DG methods, where k_1, k_2 and K are Gaussian kernels with bandwidths h, h and one, respectively. The bandwidth h is chosen by the median heuristic unless specified otherwise.

Even with the 1-NN classifier, the existing kernel-based DG methods typically have three hyperparameters, as listed in Table 2. In contrast, the proposed DFDG with the 1-NN classifier has one hyperparameter, while the DFDG with the SVM has three hyperparameters, including a penalty parameter and the kernel bandwidth. The tuning of the kernel bandwidths has been largely ignored in the existing DG methods (Ramdas et al., 2015). For both the existing and the proposed methods, the hyperparameters were selected by grid search on a validation set, where 30% of each source domain was held out as the validation set during training, the so-called training-domain validation method (Gulrajani & Lopez-Paz, 2021). The candidate hyperparameters are listed in the SM. After selecting the best hyperparameters on the validation set, the classification accuracy was calculated on the target domain. We randomly split the source domains into training and validation sets 5 times to calculate the mean and standard deviation of the classification accuracy in the target domain.

5.1. Synthetic Data

A two-dimensional dataset with 4 domains and 3 classes was drawn from Gaussian distributions N(µ, σ²) with means µ given in Table 1 and variance σ². To investigate the influence of the class prior distribution on the different DG methods, the class sizes may be imbalanced as displayed in Figure 1, while the sample size of each domain was kept at 600. The first three domains were the source domains and the last one was the target domain. All the data were fed into the DG methods without any preprocessing.

Figure 1. The prior distributions and the variances of the 6 data generation cases (panels (a)-(f) correspond to Cases 1-6). The bars show the prior probabilities of the different classes within each domain, where the center indexes indicate the domains. The light color indicates that the data are generated with variance one, while the darker color (Cases 5 and 6) means the variance is four.

Table 1. Center points for the synthetic data, with 600 instances per domain (D = domain, C = class).

|    | D1/C1 | D1/C2 | D1/C3 | D2/C1 | D2/C2 | D2/C3 | D3/C1 | D3/C2 | D3/C3 | D4/C1 | D4/C2 | D4/C3 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| X1 | 1     | 4     | 4     | 0.5   | 3.5   | 3.5   | 1     | 4     | 4     | 0.5   | 3.5   | 3.5   |
| X2 | 2     | 2     | -2    | 1.5   | 1.5   | -2.5  | -1.5  | -1.5  | -5.5  | -1.5  | -1.5  | -5.5  |

As shown in Table 2, the proposed DFDG outperformed all the kernel DG methods even when using only one hyperparameter with the 1-NN classifier. The performance was further lifted by using the SVM classifier with additional hyperparameters for the kernel bandwidths and the SVM penalty. See Figure S2 in the SM for the features extracted by the proposed DFDG methods. The sensitivity analysis provided in the SM demonstrates the robustness of the proposed methods to the hyperparameter choices.
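For concreteness, a minimal sketch of this data-generating mechanism follows; the class centers are taken from Table 1, while the class priors vary across the six cases of Figure 1 and are only shown graphically there, so the equal-prior default below is a placeholder of ours rather than the paper's setting.

```python
import numpy as np

# Class centers (X1, X2) from Table 1; keys are domains, entries are classes 1-3.
CENTERS = {
    1: [(1.0, 2.0), (4.0, 2.0), (4.0, -2.0)],
    2: [(0.5, 1.5), (3.5, 1.5), (3.5, -2.5)],
    3: [(1.0, -1.5), (4.0, -1.5), (4.0, -5.5)],
    4: [(0.5, -1.5), (3.5, -1.5), (3.5, -5.5)],
}


def sample_domain(domain, n=600, priors=(1 / 3, 1 / 3, 1 / 3), var=1.0, seed=0):
    """Draw n observations for one domain: labels from `priors`, features from
    N(mu_class, var * I_2) with the centers of Table 1; var = 4 corresponds to
    the darker bars of Cases 5 and 6 in Figure 1."""
    rng = np.random.default_rng(seed)
    y = rng.choice(3, size=n, p=priors)
    mu = np.array([CENTERS[domain][j] for j in y])
    x = mu + rng.normal(scale=np.sqrt(var), size=(n, 2))
    return x, y + 1  # classes are labeled 1, 2, 3


# Domains 1-3 are the source domains; domain 4 is the unseen target.
sources = {s: sample_domain(s, seed=s) for s in (1, 2, 3)}
x_target, y_target = sample_domain(4, seed=4)
```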
Table 2. Mean and standard deviation of the classification accuracy (%) for the synthetic experiments over the 6 cases, where bold red and bold black in the original indicate the best and second best results, respectively, and #hp denotes the number of hyperparameters.

| Method | #hp | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
|---|---|---|---|---|---|---|---|
| k-NN | 1 | 77.31±0.55 | 78.14±0.64 | 76.17±0.46 | 83.42±1.30 | 71.17±0.49 | 51.44±1.48 |
| SVM | 2 | 73.86±1.27 | 74.86±0.99 | 73.11±0.89 | 84.56±0.80 | 67.28±0.87 | 44.83±1.04 |
| DICA 1-NN | 2 | 87.25±2.05 | 84.67±3.36 | 84.08±1.39 | 87.03±1.31 | 78.53±5.18 | 66.28±1.11 |
| SCA 1-NN | 2 | 87.31±1.17 | 83.61±0.89 | 84.69±1.18 | 86.81±1.12 | 80.89±1.12 | 66.58±1.74 |
| MDA 1-NN | 3 | 88.47±1.01 | 81.00±1.41 | 82.00±0.51 | 87.64±1.25 | 81.14±0.82 | 64.89±1.41 |
| CIDG 1-NN | 4 | 91.03±0.52 | 86.58±0.69 | 84.56±0.81 | 90.36±0.68 | 84.52±1.71 | 69.06±5.79 |
| DFDG-Eig SVM | 3 | 93.90±0.48 | 87.57±1.73 | 90.03±1.85 | 93.40±0.63 | 87.53±0.77 | 79.30±1.84 |
| DFDG-Eig 1-NN | 1 | 91.13±0.83 | 86.87±1.83 | 90.23±0.30 | 90.57±1.22 | 84.57±1.63 | 75.77±0.67 |
| DFDG-Cov SVM | 3 | 92.97±0.61 | 89.43±1.18 | 92.50±0.35 | 93.57±0.52 | 86.37±0.84 | 71.23±1.45 |
| DFDG-Cov 1-NN | 1 | 89.20±1.20 | 85.83±1.46 | 88.83±2.00 | 90.83±1.13 | 82.33±0.73 | 69.60±3.39 |

5.2. Case study

We considered three datasets in the case study: Office+Caltech, VLCS and Terra Incognita. The Office+Caltech dataset (Gong et al., 2012) consists of 2533 images from ten classes over four domains: AMAZON (A), Caltech-256 (C), DSLR (D) and WEBCAM (W). The VLCS dataset (Fang et al., 2013) consists of four domains, PASCAL VOC (V), LabelMe (L), Caltech101 (C) and SUN09 (S), and has 10729 images and five categories. The Terra Incognita data (Beery et al., 2018) were acquired from the DomainBed benchmark (Gulrajani & Lopez-Paz, 2021) and contain four locations (domains), 24788 examples and 10 classes.

Table 3. Accuracy (%) on the Office+Caltech and VLCS datasets, where bold red and bold black in the original indicate the best and second best, respectively.

Office+Caltech:

| Target | A | C | A,C | W,D | W,C | D,C |
|---|---|---|---|---|---|---|
| Source | C,D,W | A,D,W | D,W | A,C | A,D | A,W |
| k-NN | 79.7 | 68.6 | 48.8 | 61.2 | 71.5 | 70.6 |
| SVM | 92.2 | 82.8 | 68.7 | 80.5 | 84.9 | 84.4 |
| DICA 1-NN | 91.8 | 83.2 | 61.7 | 80.2 | 84.9 | 85.4 |
| SCA 1-NN | 92.2 | 82.3 | 65.0 | 81.2 | 85.2 | 83.8 |
| MDA 1-NN | 90.3 | 75.1 | 56.7 | 75.9 | 80.9 | 78.5 |
| CIDG 1-NN | 92.5 | 82.4 | 68.6 | 79.5 | 82.0 | 83.4 |
| DFDG-Eig SVM | 92.3 | 83.2 | 72.3 | 81.2 | 83.8 | 85.0 |
| DFDG-Eig 1-NN | 91.9 | 82.6 | 66.2 | 82.7 | 82.3 | 84.9 |
| DFDG-Cov SVM | 92.5 | 83.9 | 73.1 | 81.6 | 83.8 | 84.9 |
| DFDG-Cov 1-NN | 90.5 | 82.3 | 68.2 | 81.2 | 81.5 | 84.3 |

VLCS:

| Target | V | L | C | S | V,L | V,C | V,S | L,C | L,S | C,S |
|---|---|---|---|---|---|---|---|---|---|---|
| Source | L,C,S | V,C,S | V,L,S | V,L,C | C,S | L,S | L,C | V,S | V,C | V,L |
| k-NN | 46.8 | 49.5 | 72.9 | 48.9 | 52.5 | 50.7 | 42.1 | 57.5 | 49.6 | 56.3 |
| SVM | 64.7 | 58.6 | 84.9 | 63.9 | 59.5 | 63.3 | 53.6 | 66.8 | 64.9 | 70.3 |
| DICA 1-NN | 61.7 | 56.8 | 87.5 | 58.7 | 57.3 | 55.1 | 53.7 | 68.8 | 60.0 | 70.0 |
| SCA 1-NN | 65.3 | 58.0 | 89.4 | 60.7 | 58.4 | 56.8 | 54.8 | 69.8 | 61.1 | 70.9 |
| MDA 1-NN | 64.4 | 57.8 | 90.1 | 61.0 | 57.1 | 61.6 | 54.4 | 70.6 | 59.1 | 69.3 |
| CIDG 1-NN | 59.6 | 55.3 | 88.9 | 59.5 | 56.4 | 56.7 | 52.0 | 68.7 | 58.3 | 70.4 |
| DFDG-Eig SVM | 60.8 | 58.4 | 90.2 | 66.2 | 58.4 | 64.2 | 56.4 | 70.8 | 63.4 | 71.2 |
| DFDG-Eig 1-NN | 61.4 | 57.2 | 91.6 | 64.5 | 57.0 | 63.8 | 51.2 | 68.8 | 63.7 | 68.9 |
| DFDG-Cov SVM | 64.6 | 59.5 | 91.4 | 65.0 | 57.6 | 63.4 | 56.5 | 70.2 | 64.5 | 72.4 |
| DFDG-Cov 1-NN | 62.6 | 56.0 | 93.0 | 62.9 | 56.1 | 62.0 | 51.5 | 68.3 | 61.6 | 72.0 |

Six cases (single domains or combinations of domains) were considered as the target domains for the Office+Caltech data, and ten cases for the VLCS data, as shown in Table 3. To be consistent with the existing studies, we did not consider the four target domains D, W, A&D and A&W for Office+Caltech, since they all attained more than 80% accuracy with the k-NN classifier. For the Terra Incognita dataset, we only considered the four single-target cases to make the results comparable with those in Gulrajani & Lopez-Paz (2021).

All the images from Office+Caltech and VLCS were preprocessed by feeding them into the DeCAF network to extract 4096-dimensional DeCAF features (Donahue et al., 2014). For Terra Incognita, we obtained features by training a ResNet-50 with Empirical Risk Minimization (ERM, Vapnik 1998) and extracting 2048-dimensional features from its last hidden layer.
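As a rough sketch of that feature-extraction step, the snippet below pulls last-hidden-layer (2048-dimensional) features from a torchvision ResNet-50; note that we use an ImageNet-pretrained network as a stand-in, whereas the paper first adjusts the network by ERM training on the source domains following the DomainBed protocol, which is not reproduced here.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained ResNet-50 as a stand-in for the ERM-adjusted network.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()  # drop the classification head: outputs become 2048-dim
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def extract_features(image_paths):
    """Return an (N, 2048) tensor of last-hidden-layer features."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    return model(batch)
```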
As shown in Table 3, the DFDG with the 1-NN classifier achieved performance similar to the other DG methods but with fewer hyperparameters, while the DFDG with the SVM classifier outperformed the others in 9 of the 16 cases. Collectively, the proposed DFDG methods achieved the best performance in 11 out of 16 cases and the second best in 12 out of 16 cases. The DFDG with the SVM classifier significantly outperformed the others with a p-value less than 0.002, as shown in Table S4 of the SM. The full results with means and standard deviations of the classification accuracy are given in Tables S5 and S6. We note that since the SVM classifier requires two more hyperparameters, it is hard to apply the SVM to the existing kernel DG methods, as the time complexity of the hyperparameter search grows exponentially with the number of hyperparameters. In contrast, the proposed method can accommodate the extra computation required by the SVM, as there is only one hyperparameter in the feature selection.

Table 4 demonstrates the outstanding performance of the proposed methods compared with the ERM baseline and the existing kernel DG methods. This lends support to the suitability of the proposed approach, and provides a way to couple it with any deep learning based DG method. Our results show that the DFDG method with a 1-NN classifier achieved approximately a 0.8% performance gain over the ERM baseline, while equipping the DFDG method with the SVM classifier increased the classification accuracy by 1.7%. Notably, all the best performances were achieved by the DFDG-based methods. In contrast, the existing kernel DG methods failed to outperform the ERM baseline. A possible reason for this outcome is the highly imbalanced classes in the Terra Incognita dataset: the smallest class in L38 had only three observations, while the largest class contained 4485 examples. In such a situation, standardization is crucial for handling the domain/class dominance issue.

Table 4. Accuracy (%) on the Terra Incognita dataset, where bold red and bold black in the original indicate the best and second best, respectively.

| Method | L100 | L38 | L43 | L46 |
|---|---|---|---|---|
| ERM baseline | 53.12 | 41.07 | 54.66 | 36.13 |
| DICA 1-NN | 43.81 | 32.76 | 48.88 | 32.51 |
| SCA 1-NN | 44.57 | 39.21 | 49.00 | 30.14 |
| MDA 1-NN | 39.74 | 35.44 | 47.77 | 26.04 |
| CIDG 1-NN | 45.88 | 38.04 | 50.43 | 33.83 |
| DFDG-Eig SVM | 55.28 | 42.71 | 56.60 | 38.31 |
| DFDG-Eig 1-NN | 53.49 | 41.59 | 55.68 | 36.88 |
| DFDG-Cov SVM | 55.45 | 41.58 | 55.92 | 37.66 |
| DFDG-Cov 1-NN | 53.66 | 41.59 | 54.97 | 38.36 |

6. Conclusion

This paper proposes a kernel DG algorithm that addresses the fundamental problem of the universal generalizability of a learning approach by introducing two standardization procedures within a unified DG framework, which requires fewer hyperparameters. The standardized distribution free metrics balance the importance of each domain, treat each domain and class equally, and are thus applicable to imbalanced data. We also derive a generalization bound for the multi-class classification problem for the kernel DG methods, and show that the proposed DFDG algorithm produces superior performance on synthetic data and in two real image classification experiments. The proposed framework can be extended to incorporate weighted coefficients for domains and classes, which enables assigning a higher weight to a domain of interest or to a minority class. By reducing the number of hyperparameters, one attains a more efficient invariant feature extraction procedure, which allows for more powerful classifiers with increased generalization ability.
One limitation of our work is the lack of a connection between the number of hyperparameters and the generalization bound, as fewer hyperparameters should reduce the model complexity and tighten the generalization bound. We leave this to future work.

Supplementary Materials

Further technical details, proofs and example code are available with this paper at https://github.com/tongpf/Distribution-Free-Domain-Generalization.

Acknowledgements

This research was supported by National Natural Science Foundation of China Grant 12026607.

References

Balaji, Y., Sankaranarayanan, S., and Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems, 31, 2018.

Barachant, A., Bonnet, S., Congedo, M., and Jutten, C. Multiclass brain-computer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59(4):920-928, 2012.

Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

Blanchard, G., Lee, G., and Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. Domain generalization by marginal transfer learning. The Journal of Machine Learning Research, 22(1):46-100, 2021.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 647-655, Beijing, China, 2014. PMLR.

Fan, X., Wang, Q., Ke, J., Yang, F., Gong, B., and Zhou, M. Adversarially adaptive normalization for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8208-8217, June 2021.

Fang, C., Xu, Y., and Rockmore, D. N. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1657-1664, 2013.

Ghifary, M., Balduzzi, D., Kleijn, W. B., and Zhang, M. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1414-1430, 2017.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066-2073, 2012.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2021.

Hu, S., Zhang, K., Chen, Z., and Chan, L. Domain generalization via multidomain discriminant analysis. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pp. 292-302. PMLR, 2020.
Li, B., Wang, Y., Zhang, S., Li, D., Keutzer, K., Darrell, T., and Zhao, H. Learning invariant representations and risks for semi-supervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1104-1113, 2021.

Li, Y., Gong, M., Tian, X., Liu, T., and Tao, D. Domain generalization via conditional invariant representations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.

Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97-105. PMLR, 2015.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10-18. PMLR, 2013.

Ramdas, A., Jakkam Reddi, S., Poczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), 2015.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.

Shawe-Taylor, J., Williams, C. K. I., Cristianini, N., and Kandola, J. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510-2522, 2005.

Shu, Y., Cao, Z., Wang, C., Wang, J., and Long, M. Open domain generalization with domain-augmented meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9624-9633, June 2021.

Vapnik, V. N. Statistical Learning Theory. Wiley, 1998.

Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, pp. 1-1, 2022.

Yan, J. and Zhang, X. Kernel two-sample tests in high dimensions: interplay between moment discrepancy and dimension-and-sample orders. Biometrika, 2022.

Ye, H., Xie, C., Cai, T., Li, R., Li, Z., and Wang, L. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:23519-23531, 2021.

Zhou, K., Yang, Y., Qiao, Y., and Xiang, T. Domain generalization with MixStyle. In International Conference on Learning Representations, 2021.

Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-20, 2022.