# Distribution Free Domain Generalization

Peifeng Tong¹, Wu Su², He Li³ ⁴, Jialin Ding³, Haoxiang Zhan³, Song Xi Chen³ ¹

¹Guanghua School of Management, Peking University, Beijing 100871, China. ²Center for Big Data Research, Peking University, Beijing 100871, China. ³School of Mathematical Science, Peking University, Beijing 100871, China. ⁴Pazhou Lab, Guangzhou 510330, China. Correspondence to: Song Xi Chen.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Accurate prediction on out-of-distribution data is desirable for a learning algorithm. In domain generalization, the training data from the source domains tend to have distributions different from that of the target domain, while the target data are absent from the training process. We propose a Distribution Free Domain Generalization (DFDG) procedure for classification that applies standardization to avoid the dominance of a few domains in the training process. The essence of DFDG is to reformulate the cross domain/class discrepancy as pairwise two-sample test statistics, and to equally weight their importance or their covariance structures so that no domain or class dominates. A theoretical generalization bound is established for the multi-class classification problem. The DFDG is shown to offer superior performance in empirical studies with fewer hyperparameters, which means faster and easier implementation.

1. Introduction

Domain generalization (DG) aims at transferring knowledge from the source domains to the target domains without the target data in the training process (Blanchard et al., 2011). A major challenge of DG is that the source and target data are not identically distributed, so an algorithm trained on the source domains tends to perform worse in the target domain. DG is designed to attain robust performance in the target domain. Compared with domain adaptation, where the target data are accessible in training to obtain a target-specific predictor (Long et al., 2015; Li et al., 2021), DG seeks a single global predictor or classifier that performs well in both the source and target domains (Blanchard et al., 2021). Many approaches have been proposed for DG (Zhou et al., 2021; Fan et al., 2021; Shu et al., 2021), such as the kernel-based domain-invariant feature representation (Hu et al., 2020), the meta-learning framework (Balaji et al., 2018), and model selection or model averaging (Ye et al., 2021). See Wang et al. (2022) and Zhou et al. (2022) for reviews.

Among the existing DG methods, we follow the kernel DG methods (Muandet et al., 2013; Ghifary et al., 2017; Li et al., 2018; Hu et al., 2020) for new development. These methods first map the data to a high dimensional reproducing kernel Hilbert space (RKHS), then construct metrics to measure the cross domain and cross class discrepancies, and finally obtain a low dimensional feature representation that minimizes the cross domain dissimilarity while keeping features of different classes well separated. The metrics are usually constructed as variants of the maximum mean discrepancy (MMD) (Gretton et al., 2012). A common challenge in DG is to counter the different mean levels and variations among the discrepancy measures of different domains in the training stage.
A robust DG procedure has to prevent domains with higher mean levels or larger variations from dictating the feature selection, as features heavily influenced by outlying domains are doomed to generalize poorly. Existing kernel DG methods have to use more hyperparameters to balance the between-domain discrepancy measures, which may reduce the generalization ability of the methods. We propose two standardization procedures designed to reduce the heterogeneity in the kernel DG discrepancy statistics among the domains by conducting mean and variance adjustments. These standardizations are based on an asymptotic analysis ((12) and Proposition 1) of the pairwise MMD statistics, which reduces the number of hyperparameters and speeds up the training process, and hence allows more computationally intensive classifiers in the DG procedure. Specifically, we put forward a distribution-free DG (DFDG) approach that provides superior performance using fewer hyperparameters, which is well suited for DG. We unify the kernel DG methods as an optimization problem based on pairwise two-sample test statistics with a concise matrix form in terms of a sandwich structure. Two distribution-free standardized metrics are proposed: one reweights the weighting matrix by the means of the null distributions, and the other de-correlates the averaged Gram matrix. A generalization bound for multi-class classification based on the DFDG is derived, which provides a theoretical guarantee for the proposed approach.

The paper is organized as follows. Section 2 gives the unified framework of the DG problem for classification. Section 3 proposes two distribution free metrics. Section 4 is for the generalization bound. Simulation and case studies are provided in Section 5, followed by a conclusion in Section 6. Some technical and numerical details are relegated to the supplementary material (SM).

2. Unified framework of DG problem

Throughout the paper, we use bold lowercase letters for column vectors and bold uppercase letters for matrices. We consider a classification task. Let X ⊂ R^p denote the observation space and Y ⊂ R the set of class labels. Let P_{X×Y} denote the set of joint distributions on X × Y. It is assumed that there exists a unimodal super distribution P with finite variance over P_{X×Y}, such that P^(1)_XY, ..., P^(m)_XY are independent and identically distributed (IID) realizations from P in P_{X×Y}. For a domain s, there is a sample {(x^s_i, y^s_i)}_{i=1}^{n_s} of n_s IID realizations of (x, y) according to the distribution P^(s)_XY. In general, for any s ≠ s', P^(s)_XY ≠ P^(s')_XY, implying non-identical distributions across the domains. Consider a target distribution P^(t)_XY ~ P and a target sample {(x^t_i, y^t_i)}_{i=1}^{n_t}, where the class labels {y^t_i} are not available and {x^t_i} are not used in the training. This forces us to establish a global model without retraining for a specific target domain. Our goal is to extract domain-invariant features that simultaneously have minimum cross domain discrepancy and maximum cross class discrepancy.

The kernel method is founded on an RKHS H associated with a kernel k and inner product ⟨·,·⟩_H, with the reproducing property that for any function f : X → R in H, ⟨f(·), k(x, ·)⟩_H = f(x). The canonical map φ : X → H is denoted by φ(x) := k(x, ·) and satisfies k(x, x') = φ(x)^T φ(x'). To map a probability distribution to the RKHS, we define the kernel mean embedding µ : P_X → H induced by k as

µ_{P_X} := E_X[φ(X)] = ∫_X φ(x) dP_X(x).

If k is a bounded and characteristic kernel, the mapping is injective, so that ||µ_{P_X} − µ_{P'_X}||_H = 0 if and only if (iff) P_X = P'_X. The sample estimator is µ̂_{P_X} = (1/n) Σ_{i=1}^n φ(x_i). Denote the kernel mean embeddings of P^(s)_X and P^(s)_{X|Y=j} by µ_s and µ^s_j, respectively. These mean maps are all high dimensional, and we assume that µ_P ∈ R^N for a large integer N, where N can be infinity.
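To fix ideas, the following is a minimal sketch (in Python/NumPy; the function names and the plug-in V-statistic form are our own choices, not the paper's notation) of the empirical embedding µ̂_{P_X} = (1/n) Σ_i φ(x_i) and the squared RKHS distance ||µ̂_{P_X} − µ̂_{P'_X}||²_H between two samples, computed entirely from Gram matrices of a Gaussian kernel with a median-heuristic bandwidth, the kernel choice used later in Section 5.

```python
import numpy as np
from scipy.spatial.distance import cdist


def gaussian_gram(X, Y, h):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * h ** 2))


def median_heuristic(X, Y):
    """Median of the pairwise Euclidean distances, a common default bandwidth."""
    Z = np.vstack([X, Y])
    D = cdist(Z, Z, "euclidean")
    return np.median(D[np.triu_indices_from(D, k=1)])


def mmd2(X, Y, h=None):
    """Plug-in (V-statistic) estimate of ||mu_PX - mu_P'X||_H^2.

    Expanding the squared RKHS norm of the difference of the two empirical
    embeddings gives mean(Kxx) - 2 mean(Kxy) + mean(Kyy), so the statistic
    only requires Gram matrices.
    """
    if h is None:
        h = median_heuristic(X, Y)
    Kxx = gaussian_gram(X, X, h)
    Kxy = gaussian_gram(X, Y, h)
    Kyy = gaussian_gram(Y, Y, h)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 2))  # sample from P_X
    Y = rng.normal(0.5, 1.0, size=(200, 2))  # sample from a shifted P'_X
    print(mmd2(X, Y))  # positive in expectation when P_X differs from P'_X
```

The same quantity, applied to class-conditional samples from pairs of domains, is the building block of the discrepancy measures below.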
2.1. Cross domain discrepancy

The cross domain discrepancy can be regarded as the sum of the pairwise distances between domains under each class condition, averaged over the classes, as follows.

Definition 1 (pairwise cross domain discrepancy (PDD)). Given the class-conditional distributions {P^(s)_{X|Y=j}} for s ∈ {1, ..., m} and j ∈ {1, ..., c}, the PDD is

Ψ_pdd := 1/(c·C(m,2)) · Σ_{j=1}^c Σ_{1≤s<s'≤m} ||µ^s_j − µ^{s'}_j||²_H,

where C(m,2) = m(m−1)/2 is the number of domain pairs.

4. Generalization bound

For a score function g with margin function r_g, the empirical ρ-margin risk for a ρ > 0 is

R̂_{n,ρ}(g) = 1/(cm) · Σ_{s=1}^m Σ_{j=1}^c (1/n^s_j) Σ_{i=1}^{n^s_j} l_ρ(r_g(x̃^s_{j,i}, j)),

where x̃^s_{j,i} = (P̂^(s)_{X|Y=j}, x^s_{j,i}), and l_ρ(x) = min(1, max(0, 1 − x/ρ)) is the ρ-margin loss, which is 1/ρ-Lipschitz. The expected loss (risk) of the classification is R(g) = E_{(x̃,y)} I(r_g(x̃, y) ≤ 0), where I(·) is the indicator function. Since I(x ≤ 0) ≤ l_ρ(x), we have R(g) ≤ E_{(x̃,y)} l_ρ(r_g(x̃, y)) for any g.

For the DG problem, the widely used product kernel k̄ is

k̄((P^(s)_X, x^s_i), (P^(s')_X, x^{s'}_j)) = k_P(P^(s)_X, P^(s')_X) · k_1(x^s_i, x^{s'}_j),    (24)

with RKHS H_k̄ (Blanchard et al., 2011). For the choice of k_P, let k_2 denote a kernel on X with RKHS H_{k_2} and feature map φ_{k_2}. Define the k_2-induced kernel mean embedding µ : P_X → H_{k_2} by µ_{P_X} := ∫_X φ_{k_2}(x) dP_X(x), and introduce another kernel K on H_{k_2} such that

k_P(P^(s)_X, P^(s')_X) = K(µ_{P^(s)_X}, µ_{P^(s')_X}).

Combining the classifier with the kernel k̄, a family of DG-based score functions is

G_k̄ = {(x̃, y) ∈ X̃ × {1, ..., c} ↦ a_y^T W^T φ_k̄(x̃) : A = (a_1, ..., a_c)^T, ||A W^T||_{H_k̄} ≤ qΛ}.

The following assumption makes k̄ a bounded universal kernel.

Assumption 2. (i) The kernel k_1 is universal on X, k_2 is universal and continuous on X, and K is universal on any compact subset of H_{k_2}; the kernels k_1, k_2 and K are bounded by U_1², U_2² and U_K², respectively. (ii) The canonical feature map φ_K associated with K is L_K-Lipschitz, and the observation space X is a compact metric space.

We have the following theorem regarding the multi-class generalization bound.

Theorem 1. Under Assumption 2, and assuming balanced sample sizes n^s_j = n, for a ρ > 0 and any δ > 0, with probability at least 1 − δ, the following multi-class classification generalization bound holds for all g ∈ G_k̄:

R(g) ≤ R̂_{n,ρ}(g) + (qΛ U_1 U_2 L_K / ρ) · (6√(m/n) + 4√(c ⋯)).

Theorem 1 generalizes the results in Hu et al. (2020) by quantifying the effects of the class number c and the feature dimension q introduced by the proposed standardization methods; a larger c or q leads to a weaker guarantee. Given the confidence level 1 − δ, the excess risk converges to zero if n grows faster than log(cm) and m increases.
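As an illustration of the product kernel (24) under the kernel choices used in Section 5 (Gaussian k_1 and k_2 with bandwidth h, Gaussian K with bandwidth one), the minimal sketch below reuses the gaussian_gram and mmd2 helpers from the earlier sketch: since K is Gaussian, K(µ_s, µ_{s'}) = exp(−||µ_s − µ_{s'}||²_H / 2), and the squared RKHS distance between the two embeddings is exactly the squared MMD between the two domain samples. The function name and this particular Gaussian parametrization are our assumptions.

```python
import numpy as np


def product_kernel(x, x_prime, X_s, X_sprime, h):
    """Sketch of the product kernel (24):
        kbar((P_s, x), (P_s', x')) = k_P(P_s, P_s') * k_1(x, x'),
    with k_P(P_s, P_s') = K(mu_s, mu_s') and K a Gaussian kernel of bandwidth
    one on the embeddings, so that K(mu_s, mu_s') = exp(-||mu_s - mu_s'||_H^2 / 2).
    The squared RKHS distance is estimated by the squared MMD between the
    domain samples X_s and X_sprime via the `mmd2` helper sketched earlier.
    """
    k_P = np.exp(-0.5 * mmd2(X_s, X_sprime, h=h))                # kernel between domains
    k_1 = gaussian_gram(x[None, :], x_prime[None, :], h)[0, 0]   # kernel between points
    return k_P * k_1
```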
5. Empirical results

We compare the proposed DFDG with existing DG methods on a synthetic dataset and two real image classification tasks. The two proposed DFDG metrics, DFDG-Eig (Section 3.1) and DFDG-Cov (Section 3.2), are each paired with two classifiers, the 1-nearest neighbor (1-NN) and the support vector machine (SVM). The proposed DFDG is compared with the conventional k-NN and SVM without dimension reduction, and with the kernel DG methods, namely the domain invariant component analysis (DICA, Muandet et al. 2013), the scatter component analysis (SCA, Ghifary et al. 2017), the conditional invariant DG (CIDG, Li et al. 2018) and the multi-domain discriminant analysis (MDA, Hu et al. 2020), where 1-NN was used as the classifier for these kernel DG methods. The product kernel (24) was used for all the kernel-based DG methods, where k_1, k_2 and K are Gaussian kernels with bandwidths h, h and one, respectively. The bandwidth h is chosen by the median heuristic unless specified otherwise.

Even with the 1-NN classifier, the existing kernel-based DG methods typically have three hyperparameters, as listed in Table 2. In contrast, the proposed DFDG with the 1-NN classifier has one hyperparameter, while the DFDG with the SVM has three hyperparameters, including a penalty parameter and the kernel bandwidth. The tuning of the kernel bandwidths has been largely ignored in the existing DG methods (Ramdas et al., 2015). For both the existing and the proposed methods, the hyperparameters were selected by grid search on a validation set, where 30% of each source domain was held out as the validation set during training, the so-called training-domain validation method (Gulrajani & Lopez-Paz, 2021). The candidate hyperparameters are listed in the SM. After selecting the best hyperparameters on the validation set, the classification accuracy was calculated on the target domain. We randomly split the source domains into training and validation sets 5 times to calculate the mean and standard deviation of the classification accuracy in the target domain.

5.1. Synthetic Data

A two-dimensional dataset with 4 domains and 3 classes was drawn from Gaussian distributions N(µ, σ²) with means µ given in Table 1 and variance σ². To investigate the influence of the class prior distribution on the different DG methods, the class sizes may be imbalanced as displayed in Figure 1, while the sample size of each domain was kept at 600. The first three domains were the source domains and the last one was the target domain. All the data were fed into the DG methods without any preprocessing.

Figure 1. The prior distributions and the variances of the 6 data generation cases (panels (a)-(f) correspond to Cases 1-6). The bars show the prior probabilities of the different classes within each domain, where the center indexes indicate the domains. The light color indicates that the data are generated with variance one, while the darker color (Cases 5 and 6) means the variance is four.

Table 1. Center points for the synthetic data, with 600 instances per domain (D = domain, C = class).

|    | D1/C1 | D1/C2 | D1/C3 | D2/C1 | D2/C2 | D2/C3 | D3/C1 | D3/C2 | D3/C3 | D4/C1 | D4/C2 | D4/C3 |
|----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| X1 | 1     | 4     | 4     | 0.5   | 3.5   | 3.5   | 1     | 4     | 4     | 0.5   | 3.5   | 3.5   |
| X2 | 2     | 2     | -2    | 1.5   | 1.5   | -2.5  | -1.5  | -1.5  | -5.5  | -1.5  | -1.5  | -5.5  |

As shown in Table 2, the proposed DFDG outperformed all the kernel DG methods even when using only one hyperparameter with the 1-NN classifier. The performance was further lifted by using the SVM classifier with additional hyperparameters for the kernel bandwidths and the SVM penalty. See Figure S2 in the SM for the features extracted by the proposed DFDG methods. The sensitivity analysis provided in the SM demonstrates the robustness of the proposed methods to the hyperparameter choices.
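For concreteness, a minimal sketch of this data-generating mechanism follows; the class centers are taken from Table 1, while the class priors vary across the six cases of Figure 1 and are only shown graphically there, so the equal-prior default below is a placeholder of ours rather than the paper's setting.

```python
import numpy as np

# Class centers (X1, X2) from Table 1; keys are domains, entries are classes 1-3.
CENTERS = {
    1: [(1.0, 2.0), (4.0, 2.0), (4.0, -2.0)],
    2: [(0.5, 1.5), (3.5, 1.5), (3.5, -2.5)],
    3: [(1.0, -1.5), (4.0, -1.5), (4.0, -5.5)],
    4: [(0.5, -1.5), (3.5, -1.5), (3.5, -5.5)],
}


def sample_domain(domain, n=600, priors=(1 / 3, 1 / 3, 1 / 3), var=1.0, seed=0):
    """Draw n observations for one domain: labels from `priors`, features from
    N(mu_class, var * I_2) with the centers of Table 1; var = 4 corresponds to
    the darker bars of Cases 5 and 6 in Figure 1."""
    rng = np.random.default_rng(seed)
    y = rng.choice(3, size=n, p=priors)
    mu = np.array([CENTERS[domain][j] for j in y])
    x = mu + rng.normal(scale=np.sqrt(var), size=(n, 2))
    return x, y + 1  # classes are labeled 1, 2, 3


# Domains 1-3 are the source domains; domain 4 is the unseen target.
sources = {s: sample_domain(s, seed=s) for s in (1, 2, 3)}
x_target, y_target = sample_domain(4, seed=4)
```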
Table 2. Mean and standard deviation of the classification accuracy (%) for the synthetic experiments over the 6 cases, where bold red and bold black in the original indicate the best and second best results, respectively, and #hp denotes the number of hyperparameters.

| Method | #hp | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
|---|---|---|---|---|---|---|---|
| k-NN | 1 | 77.31±0.55 | 78.14±0.64 | 76.17±0.46 | 83.42±1.30 | 71.17±0.49 | 51.44±1.48 |
| SVM | 2 | 73.86±1.27 | 74.86±0.99 | 73.11±0.89 | 84.56±0.80 | 67.28±0.87 | 44.83±1.04 |
| DICA 1-NN | 2 | 87.25±2.05 | 84.67±3.36 | 84.08±1.39 | 87.03±1.31 | 78.53±5.18 | 66.28±1.11 |
| SCA 1-NN | 2 | 87.31±1.17 | 83.61±0.89 | 84.69±1.18 | 86.81±1.12 | 80.89±1.12 | 66.58±1.74 |
| MDA 1-NN | 3 | 88.47±1.01 | 81.00±1.41 | 82.00±0.51 | 87.64±1.25 | 81.14±0.82 | 64.89±1.41 |
| CIDG 1-NN | 4 | 91.03±0.52 | 86.58±0.69 | 84.56±0.81 | 90.36±0.68 | 84.52±1.71 | 69.06±5.79 |
| DFDG-Eig SVM | 3 | 93.90±0.48 | 87.57±1.73 | 90.03±1.85 | 93.40±0.63 | 87.53±0.77 | 79.30±1.84 |
| DFDG-Eig 1-NN | 1 | 91.13±0.83 | 86.87±1.83 | 90.23±0.30 | 90.57±1.22 | 84.57±1.63 | 75.77±0.67 |
| DFDG-Cov SVM | 3 | 92.97±0.61 | 89.43±1.18 | 92.50±0.35 | 93.57±0.52 | 86.37±0.84 | 71.23±1.45 |
| DFDG-Cov 1-NN | 1 | 89.20±1.20 | 85.83±1.46 | 88.83±2.00 | 90.83±1.13 | 82.33±0.73 | 69.60±3.39 |

5.2. Case study

We considered three datasets in the case study: Office+Caltech, VLCS and Terra Incognita. The Office+Caltech dataset (Gong et al., 2012) consists of 2533 images from ten classes over four domains: AMAZON (A), Caltech-256 (C), DSLR (D) and WEBCAM (W). The VLCS dataset (Fang et al., 2013) consists of four domains, PASCAL VOC (V), LabelMe (L), Caltech101 (C) and SUN09 (S), and has 10729 images and five categories. The Terra Incognita data (Beery et al., 2018) were acquired from the DomainBed benchmark (Gulrajani & Lopez-Paz, 2021) and contain four locations (domains), 24788 examples and 10 classes.

Table 3. Accuracy (%) on the Office+Caltech and VLCS datasets, where bold red and bold black in the original indicate the best and second best, respectively.

Office+Caltech:

| Target | A | C | A,C | W,D | W,C | D,C |
|---|---|---|---|---|---|---|
| Source | C,D,W | A,D,W | D,W | A,C | A,D | A,W |
| k-NN | 79.7 | 68.6 | 48.8 | 61.2 | 71.5 | 70.6 |
| SVM | 92.2 | 82.8 | 68.7 | 80.5 | 84.9 | 84.4 |
| DICA 1-NN | 91.8 | 83.2 | 61.7 | 80.2 | 84.9 | 85.4 |
| SCA 1-NN | 92.2 | 82.3 | 65.0 | 81.2 | 85.2 | 83.8 |
| MDA 1-NN | 90.3 | 75.1 | 56.7 | 75.9 | 80.9 | 78.5 |
| CIDG 1-NN | 92.5 | 82.4 | 68.6 | 79.5 | 82.0 | 83.4 |
| DFDG-Eig SVM | 92.3 | 83.2 | 72.3 | 81.2 | 83.8 | 85.0 |
| DFDG-Eig 1-NN | 91.9 | 82.6 | 66.2 | 82.7 | 82.3 | 84.9 |
| DFDG-Cov SVM | 92.5 | 83.9 | 73.1 | 81.6 | 83.8 | 84.9 |
| DFDG-Cov 1-NN | 90.5 | 82.3 | 68.2 | 81.2 | 81.5 | 84.3 |

VLCS:

| Target | V | L | C | S | V,L | V,C | V,S | L,C | L,S | C,S |
|---|---|---|---|---|---|---|---|---|---|---|
| Source | L,C,S | V,C,S | V,L,S | V,L,C | C,S | L,S | L,C | V,S | V,C | V,L |
| k-NN | 46.8 | 49.5 | 72.9 | 48.9 | 52.5 | 50.7 | 42.1 | 57.5 | 49.6 | 56.3 |
| SVM | 64.7 | 58.6 | 84.9 | 63.9 | 59.5 | 63.3 | 53.6 | 66.8 | 64.9 | 70.3 |
| DICA 1-NN | 61.7 | 56.8 | 87.5 | 58.7 | 57.3 | 55.1 | 53.7 | 68.8 | 60.0 | 70.0 |
| SCA 1-NN | 65.3 | 58.0 | 89.4 | 60.7 | 58.4 | 56.8 | 54.8 | 69.8 | 61.1 | 70.9 |
| MDA 1-NN | 64.4 | 57.8 | 90.1 | 61.0 | 57.1 | 61.6 | 54.4 | 70.6 | 59.1 | 69.3 |
| CIDG 1-NN | 59.6 | 55.3 | 88.9 | 59.5 | 56.4 | 56.7 | 52.0 | 68.7 | 58.3 | 70.4 |
| DFDG-Eig SVM | 60.8 | 58.4 | 90.2 | 66.2 | 58.4 | 64.2 | 56.4 | 70.8 | 63.4 | 71.2 |
| DFDG-Eig 1-NN | 61.4 | 57.2 | 91.6 | 64.5 | 57.0 | 63.8 | 51.2 | 68.8 | 63.7 | 68.9 |
| DFDG-Cov SVM | 64.6 | 59.5 | 91.4 | 65.0 | 57.6 | 63.4 | 56.5 | 70.2 | 64.5 | 72.4 |
| DFDG-Cov 1-NN | 62.6 | 56.0 | 93.0 | 62.9 | 56.1 | 62.0 | 51.5 | 68.3 | 61.6 | 72.0 |

Six cases (single domains or combinations of domains) were considered as the target domains for the Office+Caltech data, and ten cases for the VLCS data, as shown in Table 3. To be consistent with the existing studies, we did not consider the four target domains D, W, A&D and A&W for Office+Caltech, since they all attained more than 80% accuracy with the k-NN classifier. For the Terra Incognita dataset, we only considered the four single-target cases to make the results comparable with those in Gulrajani & Lopez-Paz (2021).

All the images from Office+Caltech and VLCS were preprocessed by feeding them into the DeCAF network to extract 4096-dimensional DeCAF features (Donahue et al., 2014). For Terra Incognita, we obtained features by training a ResNet-50 with Empirical Risk Minimization (ERM, Vapnik 1998) and extracting 2048-dimensional features from its last hidden layer.
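As a rough sketch of that feature-extraction step, the snippet below pulls last-hidden-layer (2048-dimensional) features from a torchvision ResNet-50; note that we use an ImageNet-pretrained network as a stand-in, whereas the paper first adjusts the network by ERM training on the source domains following the DomainBed protocol, which is not reproduced here.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained ResNet-50 as a stand-in for the ERM-adjusted network.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()  # drop the classification head: outputs become 2048-dim
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def extract_features(image_paths):
    """Return an (N, 2048) tensor of last-hidden-layer features."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    return model(batch)
```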
As shown in Table 3, the DFDG with the 1-NN classifier achieved performance similar to the other DG methods but with fewer hyperparameters, while the DFDG with the SVM classifier outperformed the others in 9 of the 16 cases. Collectively, the proposed DFDG methods achieved the best performance in 11 out of 16 cases and the second best in 12 out of 16 cases. The DFDG with the SVM classifier significantly outperformed the others with a p-value less than 0.002, as shown in Table S4 of the SM. The full results with means and standard deviations of the classification accuracy are given in Tables S5 and S6. We note that since the SVM classifier requires two more hyperparameters, it is hard to apply the SVM to the existing kernel DG methods, as the time complexity of the hyperparameter search grows exponentially with the number of hyperparameters. In contrast, the proposed method can accommodate the extra computation required by the SVM, as there is only one hyperparameter in the feature selection.

Table 4 demonstrates the outstanding performance of the proposed methods compared with the ERM baseline and the existing kernel DG methods. This lends support to the suitability of the proposed approach, and provides a way to couple it with any deep learning based DG method. Our results show that the DFDG method with a 1-NN classifier achieved approximately a 0.8% performance gain over the ERM baseline, while equipping the DFDG method with the SVM classifier increased the classification accuracy by 1.7%. Notably, all the best performances were achieved by the DFDG-based methods. In contrast, the existing kernel DG methods failed to outperform the ERM baseline. A possible reason for this outcome is the highly imbalanced classes in the Terra Incognita dataset: the smallest class in L38 had only three observations, while the largest class contained 4485 examples. In such a situation, standardization is crucial for handling the domain/class dominance issue.

Table 4. Accuracy (%) on the Terra Incognita dataset, where bold red and bold black in the original indicate the best and second best, respectively.

| Method | L100 | L38 | L43 | L46 |
|---|---|---|---|---|
| ERM baseline | 53.12 | 41.07 | 54.66 | 36.13 |
| DICA 1-NN | 43.81 | 32.76 | 48.88 | 32.51 |
| SCA 1-NN | 44.57 | 39.21 | 49.00 | 30.14 |
| MDA 1-NN | 39.74 | 35.44 | 47.77 | 26.04 |
| CIDG 1-NN | 45.88 | 38.04 | 50.43 | 33.83 |
| DFDG-Eig SVM | 55.28 | 42.71 | 56.60 | 38.31 |
| DFDG-Eig 1-NN | 53.49 | 41.59 | 55.68 | 36.88 |
| DFDG-Cov SVM | 55.45 | 41.58 | 55.92 | 37.66 |
| DFDG-Cov 1-NN | 53.66 | 41.59 | 54.97 | 38.36 |

6. Conclusion

This paper proposes a kernel DG algorithm that addresses the fundamental problem of the universal generalizability of a learning approach by introducing two standardization procedures within a unified DG framework, which requires fewer hyperparameters. The standardized distribution free metrics balance the importance of each domain, treat each domain and class equally, and are thus applicable to imbalanced data. We also derive a generalization bound for the multi-class classification problem for the kernel DG methods, and show that the proposed DFDG algorithm produces superior performance on synthetic data and in two real image classification experiments. The proposed framework can be extended to incorporate weighted coefficients for domains and classes, which enables assigning a higher weight to a domain of interest or to a minority class. By reducing the number of hyperparameters, one attains a more efficient invariant feature extraction procedure, which allows for more powerful classifiers with increased generalization ability.
One limitation of our work is the lack of a connection between the number of hyperparameters and the generalization bound, as fewer hyperparameters should reduce the model complexity and tighten the generalization bound. We leave this to future work.

Supplementary Materials

Further technical details, proofs and example code are available with this paper at https://github.com/tongpf/Distribution-Free-Domain-Generalization.

Acknowledgements

This research was supported by National Natural Science Foundation of China Grant 12026607.

References

Balaji, Y., Sankaranarayanan, S., and Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems, 31, 2018.

Barachant, A., Bonnet, S., Congedo, M., and Jutten, C. Multiclass brain-computer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59(4):920-928, 2012.

Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

Blanchard, G., Lee, G., and Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., and Scott, C. Domain generalization by marginal transfer learning. The Journal of Machine Learning Research, 22(1):46-100, 2021.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 647-655, Beijing, China, 2014. PMLR.

Fan, X., Wang, Q., Ke, J., Yang, F., Gong, B., and Zhou, M. Adversarially adaptive normalization for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8208-8217, June 2021.

Fang, C., Xu, Y., and Rockmore, D. N. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1657-1664, 2013.

Ghifary, M., Balduzzi, D., Kleijn, W. B., and Zhang, M. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1414-1430, 2017.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066-2073, 2012.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723-773, 2012.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2021.

Hu, S., Zhang, K., Chen, Z., and Chan, L. Domain generalization via multidomain discriminant analysis. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pp. 292-302. PMLR, 2020.
Li, B., Wang, Y., Zhang, S., Li, D., Keutzer, K., Darrell, T., and Zhao, H. Learning invariant representations and risks for semi-supervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1104-1113, 2021.

Li, Y., Gong, M., Tian, X., Liu, T., and Tao, D. Domain generalization via conditional invariant representations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.

Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97-105. PMLR, 2015.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10-18. PMLR, 2013.

Ramdas, A., Jakkam Reddi, S., Poczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), 2015.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.

Shawe-Taylor, J., Williams, C. K. I., Cristianini, N., and Kandola, J. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51(7):2510-2522, 2005.

Shu, Y., Cao, Z., Wang, C., Wang, J., and Long, M. Open domain generalization with domain-augmented meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9624-9633, June 2021.

Vapnik, V. N. Statistical Learning Theory. Wiley, 1998.

Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, pp. 1-1, 2022.

Yan, J. and Zhang, X. Kernel two-sample tests in high dimensions: interplay between moment discrepancy and dimension-and-sample orders. Biometrika, 2022.

Ye, H., Xie, C., Cai, T., Li, R., Li, Z., and Wang, L. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 34:23519-23531, 2021.

Zhou, K., Yang, Y., Qiao, Y., and Xiang, T. Domain generalization with MixStyle. In International Conference on Learning Representations, 2021.

Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-20, 2022.