# Exploiting Domain-Specific Features to Enhance Domain Generalization

Manh-Ha Bui¹, Toan Tran¹, Anh Tuan Tran¹, Dinh Phung¹,²
¹ VinAI Research, Vietnam; ² Monash University, Australia
{v.habm1, v.toantm3, v.anhtt152, v.dinhpq2}@vinai.io
Correspondence to Manh-Ha Bui. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Domain Generalization (DG) aims to train a model on multiple observed source domains so that it performs well on unseen target domains. To obtain this generalization capability, prior DG approaches have focused on extracting domain-invariant information across sources, while useful domain-specific information, which strongly correlates with labels in individual domains and with generalization to target domains, is usually ignored. In this paper, we propose meta-Domain Specific-Domain Invariant (mDSDI), a novel theoretically sound framework that extends beyond the invariance view to further capture the usefulness of domain-specific information. Our key insight is to disentangle features in the latent space while jointly learning both domain-invariant and domain-specific features in a unified framework. The domain-specific representation is optimized through a meta-learning framework to adapt from source domains, targeting robust generalization on unseen domains. We empirically show that mDSDI provides competitive results with state-of-the-art techniques in DG. A further ablation study on our generated dataset, Background-Colored-MNIST, confirms the hypothesis that domain-specific information is essential, leading to better results than using domain-invariant information alone.

## 1 Introduction and Related work

Domain Generalization (DG) has recently become an important research topic in machine learning due to its real-world applicability and its close connection to the way humans generalize when learning in a new domain. In a DG framework, the learner is trained on multiple datasets collected under different environments, without any access to data from the target domain (1). One of the most notable approaches to this problem is to learn domain-invariant features across these training datasets, under the assumption that these invariant representations also hold in unseen target domains (2; 3; 4; 5; 6). While this has been shown to work well in practice, its key drawback is that it completely ignores domain-specific information that could aid generalization performance, especially when the number of source domains increases (7). For instance, consider the problem of classifying dog or fish images from two source domains: sketch and photo. While a sketch contains a conceptual drawing of the animal, a photo shows the animal within a background. In this case, the sketch information is domain-invariant since it is kept across domains, while domain-specific information, e.g., a dog in a house or a fish in the ocean, will be discarded because it exists only in the photo domain. However, this background information, when present, could improve classification performance in target domains due to the common association between objects and their backgrounds, especially when sketches are indistinct and hard to distinguish. From a theoretical standpoint, there is also strong recent evidence indicating the insufficiency of learning domain-invariant representations for successful adaptation in domain adaptation problems (8; 9). For example, Zhao et al.
(8) point out that forcing domain-invariant representations degrades target predictive performance when the marginal label distributions of the source and target domains differ substantially.

Utilizing domain-specific features in DG has been widely studied in recent work (e.g., (10; 7)). Ding and Fu (10) introduce a separate domain-specific network for each domain and then use structured low-rank constraints to align them with a domain-invariant one. While this encourages better knowledge transfer, its main problem is that it requires too many domain-specific networks. More recently, Chattopadhyay et al. (7) proposed a masking strategy to disentangle domain-invariant and domain-specific features and further boost domain-specific learning, but its key drawback is that the domain-invariant and domain-specific representations might not actually be disentangled, since the learning and inference procedures are performed implicitly (i.e., without any theoretical guarantee) through a mask generalization process; the approach therefore lacks a clear motivation as well as theoretical justification. Regarding meta-learning, a typical approach in DG is MLDG (11), which is based on gradient updates that simulate train/test domain shift within each mini-batch. It mainly learns transferable weight representations from meta-source domains so that they quickly adapt to the meta-target domain and thus improve generalization. However, its task objective adapts all representation features, including the domain-invariant ones, which is of limited benefit: since domain-invariant features are stable across domains, forcing them to adapt might harm their stability and lower generalization performance on the target domain.

To handle these shortcomings, in this paper we propose a novel, theoretically sound DG approach that extracts label-informative domain-specific features and then explicitly disentangles the domain-invariant and domain-specific representations in an efficient way, without training multiple networks for the domain-specific part. Following the meta-learning idea while mitigating the drawbacks of previous work, we apply a meta-learning technique specifically to the domain-specific features, which are the ones that need to be adapted from source domains to unseen domains. Our contributions in this work are summarized as follows:

- We provide a theoretical analysis based on the information bottleneck principle that points out the limitation of learning only invariant features and the importance of the domain-specific representation, under a plausible assumption.
- We develop a rigorous framework to formulate the domain-invariant and domain-specific representations, in which our key insight is to introduce an effective meta-optimization training framework (11) to learn the domain-specific representation from multiple training domains. Without accessing any data from unseen target domains, the meta-training procedure provides a suitable mechanism to self-learn the domain-specific representation. We term our approach meta-Domain Specific-Domain Invariant (mDSDI) and provide the necessary theoretical verification for it.
- To demonstrate the merit of the proposed mDSDI framework, we extensively evaluate it on several state-of-the-art DG benchmark datasets, including Colored-MNIST, Rotated-MNIST, VLCS, PACS, Office-Home, Terra Incognita, and DomainNet, in addition to our newly created Background-Colored-MNIST used in an ablation study to examine the behavior of mDSDI.
## 2 Methodology

### 2.1 Problem setting and definitions

Let $\mathcal{X} \subseteq \mathbb{R}^D$ be the sample space and $\mathcal{Y} \subseteq \mathbb{R}$ the label space. Denote the set of joint probability distributions on $\mathcal{X} \times \mathcal{Y}$ by $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$, and the set of marginal distributions on $\mathcal{X}$ by $\mathcal{P}_{\mathcal{X}}$. A domain is defined by a joint distribution $P(x, y) \in \mathcal{P}_{\mathcal{X}\mathcal{Y}}$, and let $\mathscr{P}$ be a measure on $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$, i.e., whose realizations are distributions on $\mathcal{X} \times \mathcal{Y}$. Denote the $N$ source domains by $S^{(i)} = \{(x^{(i)}_j, y^{(i)}_j)\}_{j=1}^{n_i}$, $i = 1, \dots, N$, where $n_i$ is the number of data points in $S^{(i)}$, i.e., $(x^{(i)}_j, y^{(i)}_j) \overset{iid}{\sim} P^{(i)}(x, y)$ with $P^{(i)}(x, y) \sim \mathscr{P}$, and $x^{(i)}_j \sim P^{(i)}_X$ with $P^{(i)}_X \in \mathcal{P}_{\mathcal{X}}$. In a typical DG framework, a learning model trained only on the set of source domains $\{S^{(i)}\}_{i=1}^{N}$, without any access to the (even unlabeled) data points of the target domain, should achieve good generalization performance on the test dataset $S^T = \{(x^T_j, y^T_j)\}_{j=1}^{n_T}$, where $(x^T_j, y^T_j) \overset{iid}{\sim} P^T(x, y)$ and $P^T(x, y) \sim \mathscr{P}$.

First, we present the definition of a domain-invariant representation in a latent space $\mathcal{Z}$ under the covariate shift assumption (i.e., the conditional distribution $P(y \mid x)$ is unchanged across the source domains):

**Definition 1.** A feature extraction mapping $Q: \mathcal{X} \to \mathcal{Z}$ is said to be domain-invariant if the distribution $P_Q(Q(X))$ is unchanged across the source domains, i.e., for all $i, j = 1, \dots, N$ with $i \neq j$ we have $P^{(i)}_Q(Q(X)) \equiv P^{(j)}_Q(Q(X))$, where $P^{(i)}_Q(Q(X)) = P_Q(Q(X) \mid X \sim P^{(i)}_X)$, $i = 1, \dots, N$. In this case, the corresponding latent representation $Z_I = Q(X)$ is called the domain-invariant representation (see also (3)).

As in the example from the introduction, Definition 1 states that the extracted domain-invariant latent $Z_I$ could be the conceptual drawing of the animal, which is shared by both the sketch and photo domains. However, when background information such as a house or an ocean is present in a picture, it is important to take it into account, because the domain-invariant feature extractor $Q$ might ignore it since it exists only in the photo domain. We therefore next define domain-specific features in the latent space:

**Definition 2.** A feature extraction mapping $R: \mathcal{X} \to \mathcal{Z}$ is said to be domain-specific if for all $i, j = 1, \dots, N$ with $i \neq j$, $P^{(i)}_R(R(X)) \neq P^{(j)}_R(R(X))$, where $P^{(i)}_R(R(X)) = P_R(R(X) \mid X \sim P^{(i)}_X)$, $i = 1, \dots, N$. In this case, given $X \sim P^{(i)}_X$, the corresponding latent representation $Z^{(i)}_S = R(X)$ is called the domain-specific representation w.r.t. the domain $S^{(i)}$.

Definition 2 states that for any domains $i$ and $j$, the distributions of the domain-specific latents $P^{(i)}_R(R(X))$ and $P^{(j)}_R(R(X))$ must differ. For instance, following the example above, given two fish images from domains $i$ and $j$ (sketch and photo), the mapping $R(X)$ should extract information that belongs only to each domain, such as the shading of the fish drawing in the sketch domain and the ocean background in the photo domain.

To this end, this paper aims to show that learning only domain-invariant features limits prediction performance and generalization ability. We next provide a formal explanation of the motivation for learning domain-specific features, by showing the potential drawback of learning only domain-invariance in terms of predicting class labels.
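To make this protocol concrete, below is a minimal sketch of the leave-one-domain-out setup assumed throughout the paper; the domain names and loader code in the comments are hypothetical illustrations, not part of the original work.

```python
# Minimal sketch of the DG data protocol (hypothetical per-domain datasets).
# Each source domain S^(i) provides labelled pairs; the target domain is never
# touched during training and is only used to measure generalization.
from typing import Dict, List, Tuple
from torch.utils.data import DataLoader, Dataset


def leave_one_domain_out(domains: Dict[str, Dataset], target: str
                         ) -> Tuple[List[Dataset], Dataset]:
    """Split a dict of per-domain datasets into N source domains and one target."""
    sources = [ds for name, ds in domains.items() if name != target]
    return sources, domains[target]

# Example with hypothetical PACS-style domain names:
# domains = {"art": art_ds, "cartoon": cartoon_ds, "sketch": sketch_ds, "photo": photo_ds}
# sources, target_ds = leave_one_domain_out(domains, target="photo")
# source_loaders = [DataLoader(ds, batch_size=32, shuffle=True) for ds in sources]
# target_loader = DataLoader(target_ds, batch_size=32)   # evaluation only
```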
### 2.2 A theoretical analysis under the information bottleneck method

*Figure 1: Venn diagram showing the relationships between the source domains represented by $X_1$ and $X_2$, the target domain represented by $X_T$, and the label $Y$. (a) The learning procedure for the minimal and sufficient label-related representation of Definition 3. (b) Illustration of Theorem 1, where the domain-invariant based method provides inferior prediction performance compared to our proposed method, which incorporates both the domain-invariant information $I(Z_I; Y)$ and the label-related domain-specific value $\varepsilon$ introduced in Assumption 1. (c) A case where the unseen (target) domain has a different $X_T$ and $Y_T$: while domain-invariant information is still stable across domains, some domain-specific information from the source domains becomes redundant.*

**Notations.** Given three arbitrary random variables $A$, $B$, and $C$, we use $I(A; B)$ to denote the mutual information between $A$ and $B$; $I(A; B \mid C)$ the conditional mutual information of $A$ and $B$ given $C$; $H(A)$ the entropy of $A$; and $H(A \mid B)$ the conditional entropy of $A$ given $B$. For simplicity, we consider the case of two source domains $S_1, S_2$ (the results extend naturally to multiple source domains). We also define two corresponding random variables $X_i \sim P^{(i)}_X$, sampled from the marginal distribution $P_X$ of domain $S_i$, $i = 1, 2$. Figure 1 illustrates the definitions and assumptions used in our theoretical verification (Theorem 1). In that figure, each of the four colored rectangles represents an individual entropy: $H(X_1)$ for domain $S_1$ in blue, $H(X_2)$ for domain $S_2$ in red, $H(X_T)$ for the target domain in black, and $H(Y)$ for the class label in the green-bordered rectangle.

We first show the ineffectiveness of learning only domain-invariant information compared with also incorporating domain-specific information from source domain $S_1$ (and similarly for domain $S_2$). Our justification relies in part on the following assumption about the correlation between the domain-specific representation and the class label:

**Assumption 1.** (Label-correlated domain-specificity) Assume there exists a domain-specific representation $Z^{(1)}_S$, extracted by the deterministic mapping $Z^{(1)}_S = R(X_1)$ of Definition 2, which correlates with the label in domain $S_1$ such that $I(Z^{(1)}_S; Y \mid X_2) = I(X_1; Y \mid X_2) = \varepsilon_1$, where $\varepsilon_1 > 0$ is a constant.

Assumption 1 indicates that, for the source domain $S_1$, we can learn $Z^{(1)}_S = R(X_1)$ such that $I(Z^{(1)}_S; Y \mid X_2)$ is strictly positive and equal to $I(X_1; Y \mid X_2)$, where $I(X_1; Y \mid X_2)$ is the specific information that correlates with the label in domain $S_1$ but not in domain $S_2$ (12). For instance, in the example from the introduction, if domain $S_1$ is photo while $S_2$ is sketch, the value of $\varepsilon_1$ should be positive, because background information such as a house or the ocean also helps predict whether the object is a dog or a fish, independently of its conceptual drawing. This assumption is practically plausible and is supported by several examples observed in our experiments. For instance, in the real-world domain of the DomainNet benchmark, many bed pictures show a bed in a room, and many bike pictures show bicycles parked on a street.
Other examples appear in PACS, such as dogs in a yard or guitars lying on a table in the photo and art domains. These examples are strongly related to Assumption 1, in which specific information correlates with labels in a particular domain.

We next present supervised learning frameworks under the umbrella of information theory (13; 14) and the information bottleneck method (13; 15), which generalizes minimal sufficient statistics to minimal (i.e., low-complexity) and sufficient (i.e., high-fidelity) representations. Learning such representations is equivalent to solving the following objectives:

**Definition 3.** (Minimal and sufficient representations w.r.t. the label (14)) Let $Z_{X_1} = G(X_1)$ be the output of a deterministic latent mapping $G$. A representation $Z_{\sup}$ is said to be the sufficient label-related representation, and $Z^{*}_{\sup}$ the minimal and sufficient representation, if
$$Z_{\sup} = \underset{G}{\arg\max}\; I(Z_{X_1}; Y) \quad \text{and} \quad Z^{*}_{\sup} = \underset{Z_{\sup}}{\arg\min}\; I(Z_{\sup}; X_1) \;\text{ s.t. } I(Z_{\sup}; Y) \text{ is maximized.}$$

The learning procedure of Definition 3 is illustrated in Figure 1(a). It amounts to compressing the representation to reduce complexity (redundant information) by minimizing $I(Z_{X_1}; X_1)$, while providing a representation sufficient for the class label $Y$ by maximizing $I(Z_{X_1}; Y)$. Similarly, motivated by the multi-view information bottleneck setting (16), we state the objective of learning sufficient (and minimal) representations with domain-invariant information:

**Definition 4.** (Minimal and sufficient representations w.r.t. domain-invariance (16)) Let $Z_{X_1} = Q(X_1)$ be the output of a deterministic domain-invariant mapping $Q$ of Definition 1. Then $Z_I$ is said to be the sufficient domain-invariant representation, and $Z^{*}_I$ the minimal and sufficient representation, if
$$Z_I = \underset{Q}{\arg\max}\; I(Z_{X_1}; X_2) \quad \text{and} \quad Z^{*}_I = \underset{Z_I}{\arg\min}\; I(Z_I; X_1) \;\text{ s.t. } I(Z_I; X_2) \text{ is maximized.}$$

Definition 4 introduces a learning strategy for domain-invariance across domains (or views): it preserves the information shared by the two domains by maximizing $I(Z_{X_1}; X_2)$, and reduces the specificity (redundant information) of domain $S_1$ by minimizing $I(Z_{X_1}; X_1)$. We next present a lemma about the conditional independence between the latent representation $Z_{X_1}$ and both the label $Y$ and the random variable $X_2$, when $Q$, $R$, and $G$ are deterministic functions of the random variable $X_1$:

**Lemma 1.** (Determinism (14)) If $P(Z_{X_1} \mid X_1)$ is a Dirac delta function, then the following conditional independences hold: $Y \perp Z_{X_1} \mid X_1$ and $X_2 \perp Z_{X_1} \mid X_1$, inducing the Markov chain $X_2 \leftrightarrow Y \leftrightarrow X_1 \leftrightarrow Z_{X_1}$.

The proof of Lemma 1 is provided in Appendix A.1. Lemma 1 simply states that $Z_{X_1}$ contains no more information than $X_1$. We now show the ineffectiveness of learning only domain-invariant features, based on the existence of label-related domain-specific information:

**Theorem 1.** (Label-related information with domain-specificity) Assume there exists a domain-specific value $\varepsilon_1 > 0$ in domain $S_1$ (Assumption 1). Then the label-related representation based learning approach (i.e., using $Z_{\sup}$ and $Z^{*}_{\sup}$) provides better prediction performance than the domain-invariant representation based method (i.e., using $Z_I$ and $Z^{*}_I$). Formally,
$$I(X_1; Y) = I(Z_{\sup}; Y) = I(Z^{*}_{\sup}; Y) = I(Z^{*}_I; Y) + \varepsilon_1 > I(Z^{*}_I; Y).$$

The proof of Theorem 1 relies mainly on Lemma 1 and is provided in Appendix A.2. Theorem 1 is visualized in Figure 1(b), where the domain-specific value $\varepsilon_2$ for domain $S_2$ is obtained in the same way as $\varepsilon_1$.
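For intuition, the decomposition behind Theorem 1 can be sketched as follows. This is our paraphrase under Assumption 1; identifying the second term with $I(Z_I^{*}; Y)$ is precisely the step that Lemma 1 and Definition 4 make rigorous in Appendix A.2.

```latex
% Sketch of the decomposition behind Theorem 1 (paraphrase; full proof in Appendix A.2).
\begin{aligned}
I(X_1; Y) &= I(X_1; Y \mid X_2) \;+\; \bigl[\, I(X_1; Y) - I(X_1; Y \mid X_2) \,\bigr]
   && \text{(add and subtract)} \\
          &= \underbrace{\varepsilon_1}_{\substack{\text{label-related domain-specific} \\ \text{information (Assumption 1)}}}
   \;+\; \underbrace{I(Z_I^{*}; Y)}_{\substack{\text{label information retained by the minimal} \\ \text{sufficient domain-invariant representation}}} \\
I(Z_{\sup}^{*}; Y) &= I(X_1; Y) \;>\; I(Z_I^{*}; Y)
   && \text{since } \varepsilon_1 > 0 .
\end{aligned}
```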
This means that whenever a domain-specific representation carries a strictly positive label-related value $\varepsilon$, a learning method based only on domain-invariance yields inferior prediction performance compared to our proposed method, which incorporates both domain-invariant and label-related domain-specific information.

Theorem 1 therefore suggests that, besides optimizing a domain-invariant mapping $Q$ as usual, we should jointly optimize a domain-specific mapping $R$ to achieve better generalization performance. However, in domain generalization we cannot access the target domain during training and must learn $Q$ and $R$ from the source domains alone. As pointed out in (10; 17; 7), although the domain-invariant part carries over because it is unchanged across source domains, there is no guarantee that domain-specific information from the source domains is relevant to the target domain at prediction time. Figure 1(c) illustrates this case: when the target domain has different $X_T$ and $Y_T$, some of the domain-specific information extracted from the source domains becomes redundant. The question, then, is how to learn domain-invariant and domain-specific features effectively. To handle these shortcomings, we next propose a unified framework that jointly optimizes both $Q$ and $R$ while disentangling their feature representations. In particular, the deterministic mapping $Q$ is optimized by adversarial learning to extract useful domain-invariant features across domains. Meanwhile, by transferring weight representations learned on meta-source domains to a meta-target domain, we apply meta-learning to the deterministic mapping $R$ to force it to extract domain-specific features that are relevant to the target domain, thereby improving generalization.

### 2.3 Algorithm: meta-Domain Specific-Domain Invariant (mDSDI)

So far, we have discussed the main ideas of our proposed method; here we describe the implementation details of mDSDI. Figure 2 shows the graphical model and an overview of the mDSDI framework. Our unified network consists of the following components: a domain-invariant representation $Z_I = Q_{\theta_Q}(X)$; a domain-specific representation $Z_S = R_{\theta_R}(X)$; a domain discriminator $D_{\theta_{DI}}: \mathcal{Z}_I \to \{1, \dots, N\}$; a domain classifier $D_{\theta_{DS}}: \mathcal{Z}_S \to \{1, \dots, N\}$; and a classifier $F_{\theta_F}: \mathcal{Z}_I \oplus \mathcal{Z}_S \to \mathcal{Y}$. We also denote the domain random variable by $D$, its sample space by $\mathcal{D}$, and an outcome by $d$.

**Domain-invariant and domain-specific extraction.** The domain-invariant representation $Z_I$ of Definition 1 is obtained with an adversarial training framework (3), in which the domain discriminator $D_{\theta_{DI}}$ tries to maximize the probability of predicting the domain label from the latent $Z_I$, while the goal of the encoder $Q_{\theta_Q}$ is to map the sample $X$ to a latent $Z_I$ such that $D_{\theta_{DI}}$ cannot discriminate the domain of $X$. This task is performed by solving the following min-max game:
$$\min_{\theta_Q}\; \max_{\theta_{DI}} \Big\{ \mathcal{L}_{Z_I} := \mathbb{E}_{x,d \sim X,D}\big[ d \cdot \log D_I(Q(x)) \big] \Big\}. \tag{1}$$
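The paper realizes the min-max game in Eq. (1) with adversarial training (3); one standard way to implement such a game is a gradient reversal layer, as in DANN (31). The sketch below is a minimal illustration under that assumption, with placeholder layer sizes and hypothetical module names (`Q`, `discriminator`); alternating discriminator/encoder updates would be an equivalent alternative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way
    back, so the encoder Q that produced z_i is pushed to confuse D_I (Eq. (1))."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """D_I: maps the domain-invariant latent Z_I to a distribution over the N domains."""
    def __init__(self, feat_dim: int, n_domains: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_domains))

    def forward(self, z_i: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
        return self.net(GradReverse.apply(z_i, lambd))


# One adversarial step: cross-entropy on the domain labels d.
# Minimizing this loss trains D_I to recognize the domain, while the reversed
# gradient flowing into Q pushes Z_I towards domain-invariance.
# z_i = Q(x)                                   # Q is the (hypothetical) encoder
# loss_zi = F.cross_entropy(discriminator(z_i), d)
```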
*Figure 2: The graphical model (a) and overall architecture (b) of our proposed mDSDI. The domain-invariant latent $Z_I$ is optimized via adversarial training with the domain discriminator $D_I$; the domain-specific latent $Z_S$ is optimized via the domain classifier $D_S$; the two latents are disentangled using their covariance matrix. To push them to contain label information, both latents are fed into a classifier $F$ optimized via cross-entropy with the label $Y$. To let the model adapt specific information from the source domains to the unseen domain while retaining domain-invariant information across domains, we additionally push $Z_S$ through a meta-learning procedure.*

To extract the domain-specific $Z_S$ of Definition 2, we use the domain classifier $D_{\theta_{DS}}$, trained to predict the domain label from $Z_S$. The corresponding parameters $\theta_{DS}$ and $\theta_R$ are therefore optimized with the objective
$$\min_{\theta_{DS}, \theta_R} \Big\{ \mathcal{L}_{Z_S} := \mathbb{E}_{x,d \sim X,D}\big[ -d \cdot \log D_S(R(x)) \big] \Big\}. \tag{2}$$

**Disentanglement between domain-invariant and domain-specific.** The disentanglement condition between the two random vectors $Z_I$ and $Z_S$ is enforced by pushing their covariance matrix, denoted $\mathrm{Cov}(Z_I, Z_S)$, towards $0$. A detailed discussion of disentangling the two representations is provided in Appendix B.3. The related parameters $(\theta_Q, \theta_R)$ are updated by solving
$$\min_{\theta_Q, \theta_R} \Big\{ \mathcal{L}_{D} := \mathbb{E}_{x \sim X}\big[ \, \| \mathrm{Cov}(Q(x), R(x)) \|_2 \, \big] \Big\}, \tag{3}$$
where $\| \cdot \|_2$ is the L2 norm.

**Sufficiency of domain-specific and domain-invariant features w.r.t. the classification task.** The goal of the classifier $F$, parameterized by $\theta_F$, is to predict the label of the original sample $X$ from the domain-invariant $Z_I$ and the domain-specific $Z_S$, i.e.,
$$\hat{Y} = F_{\theta_F}(Z_I \oplus Z_S), \tag{4}$$
where $\oplus$ denotes concatenation. The training of $F$ is then performed by solving
$$\min_{\theta_Q, \theta_R, \theta_F} \Big\{ \mathcal{L}_{T} := \mathbb{E}_{x,y \sim X,Y}\big[ -y \cdot \log F(Q(x), R(x)) \big] \Big\}. \tag{5}$$

**Meta-training for domain-specific information.** To encourage the domain-specific representation $Z_S$ to adapt information learned from the source domains to the unseen target domain, we use a meta-learning framework (11), targeting robust generalization. Note that the domain-invariant feature $Z_I$ is kept fixed during the meta-learning procedure. In particular, each source domain $S_m$, $m \in \{1, \dots, N\}$, is split into two sub-domains, a meta-train split $S_{mr}$ and a meta-test split $S_{me}$. The domain-specific parameters $\theta_R$ and the classifier parameters $\theta_F$ are then jointly optimized as
$$\min_{w} \Big\{ \mathcal{L}_{T_m} := f\big( w - \eta\, \nabla_w f(w, S_{mr}),\; S_{me} \big) \Big\}, \tag{6}$$
where $w = (\theta_R, \theta_F)$ and
$$f(w, S_m) = \mathbb{E}_{x,y \sim X,Y}\big[ -y \cdot \log F(Z_I, R(x)) \big]. \tag{7}$$

**Training and inference.** The pseudo-code for the training and inference processes of mDSDI is presented in Algorithm 1. Each training iteration consists of two steps. i) First, we integrate the objective functions (1), (2), (3), and (5) into a single objective $\mathcal{L}_A$:
$$\min_{\theta_Q, \theta_{DS}, \theta_R, \theta_F}\; \max_{\theta_{DI}} \Big\{ \mathcal{L}_{A} := \lambda_{Z_I}\mathcal{L}_{Z_I} + \lambda_{Z_S}\mathcal{L}_{Z_S} + \lambda_{D}\mathcal{L}_{D} + \mathcal{L}_{T} \Big\}, \tag{8}$$
where $\lambda_{Z_I}$, $\lambda_{Z_S}$, and $\lambda_D$ are balancing hyper-parameters. ii) The second step employs meta-training to adapt task-related domain-specific information from the source domains to unseen domains. In each mini-batch, the meta-train and meta-test splits are formed, and the gradient-transfer step from the meta-train domains to the meta-test domain is performed by solving the optimization problem (6).

**Algorithm 1: Training and inference processes of mDSDI**

Training
Input: source domains $S^{(i)}$; encoders $Q_{\theta_Q}$, $R_{\theta_R}$; domain discriminator $D_{\theta_{DI}}$ and domain classifier $D_{\theta_{DS}}$ for $Z_I$, $Z_S$; task classifier $F_{\theta_F}$; batch size $B$; learning rate $\eta$.
Output: the optimal $Q_{\theta_Q^*}$, $R_{\theta_R^*}$, $F_{\theta_F^*}$.
for ite = 1, ..., iterations do
  Sample $S_B$ with a mini-batch of size $B$ from each domain $S^{(i)}$;
  Compute $\mathcal{L}_A$ using Eq. (8) and perform the gradient update $\nabla_{\theta_Q, \theta_R, \theta_{DI}, \theta_{DS}, \theta_F} \mathcal{L}_A$ with $\eta$;
  for j = 1, ..., N (number of source domains) do
    Split meta-train $S_{B \setminus j}$ and meta-test $S_j$;
    Meta-train: perform the gradient update $\nabla_{\theta_R, \theta_F}$ by minimizing Eq. (7) with $S_{B \setminus j}$ and $\eta$;
    Meta-test: compute $\mathcal{L}_{T_m}$ using Eq. (6) with $S_j$ and the updated gradient from the meta-train step;
    Meta-optimization: perform the gradient update $\nabla_{\theta_R, \theta_F} \mathcal{L}_{T_m}$ with $\eta$;
  end
end

Inference
Input: target domain $S^T$; optimal $Q_{\theta_Q^*}$, $R_{\theta_R^*}$, $F_{\theta_F^*}$.
Output: $\hat{Y}^T$ using Eq. (4).
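The following sketch illustrates two of the pieces above against hypothetical modules `Q`, `R`, and `Fc`: the covariance penalty of Eq. (3) and one MLDG-style meta step for $(\theta_R, \theta_F)$ as in Eqs. (6)-(7) and Algorithm 1. For brevity it uses a first-order approximation (no backpropagation through the inner update), which is our simplification rather than the authors' released implementation; the exact second-order variant would differentiate the meta-test loss through the inner gradient step.

```python
import copy
import torch
import torch.nn.functional as F


def covariance_penalty(z_i: torch.Tensor, z_s: torch.Tensor) -> torch.Tensor:
    """L_D of Eq. (3): norm of the cross-covariance between Z_I and Z_S,
    estimated over the mini-batch dimension."""
    z_i = z_i - z_i.mean(dim=0, keepdim=True)
    z_s = z_s - z_s.mean(dim=0, keepdim=True)
    cov = z_i.t() @ z_s / (z_i.size(0) - 1)        # (d_I, d_S) cross-covariance matrix
    return torch.sqrt((cov ** 2).sum())


def meta_step(Q, R, Fc, meta_train, meta_test, eta: float):
    """One first-order MLDG-style step for (theta_R, theta_F), cf. Eqs. (6)-(7) and
    Algorithm 1. Q, R, Fc are hypothetical encoder/classifier modules; Z_I is frozen."""
    x_tr, y_tr = meta_train                        # batch from the meta-train domains
    x_te, y_te = meta_test                         # batch from the held-out meta-test domain
    fast_R, fast_F = copy.deepcopy(R), copy.deepcopy(Fc)
    inner_opt = torch.optim.SGD(
        list(fast_R.parameters()) + list(fast_F.parameters()), lr=eta)

    with torch.no_grad():                          # domain-invariant features are not adapted
        z_i_tr, z_i_te = Q(x_tr), Q(x_te)

    # Meta-train: one gradient step on f(w, S_mr), Eq. (7).
    inner_opt.zero_grad()
    F.cross_entropy(fast_F(torch.cat([z_i_tr, fast_R(x_tr)], dim=1)), y_tr).backward()
    inner_opt.step()

    # Meta-test: evaluate L_Tm = f(w - eta * grad, S_me), Eq. (6), at the fast weights
    # and copy its gradients back onto (theta_R, theta_F) for the meta-optimization step.
    loss_tm = F.cross_entropy(fast_F(torch.cat([z_i_te, fast_R(x_te)], dim=1)), y_te)
    grads = torch.autograd.grad(
        loss_tm, list(fast_R.parameters()) + list(fast_F.parameters()))
    for p, g in zip(list(R.parameters()) + list(Fc.parameters()), grads):
        p.grad = g.clone()                         # caller then steps an optimizer on R, Fc
    return loss_tm.detach()
```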
## 3 Experiments

### 3.1 Experimental settings

**Dataset.** To evaluate the effectiveness of the proposed method, we use 7 commonly used datasets:

- Colored-MNIST (18): 70000 samples of dimension (2, 28, 28) in a binary classification problem with noisy labels, built from MNIST over 3 domains with noise rate $d \in \{0.1, 0.3, 0.9\}$.
- Rotated-MNIST (19): 70000 samples of dimension (1, 28, 28) and 10 classes, rotated from MNIST over 6 domains $d \in \{0, 15, 30, 45, 60, 75\}$.
- VLCS (20): 10729 samples of dimension (3, 224, 224) and 5 classes, over 4 photographic domains $d \in \{$Caltech101, LabelMe, SUN09, VOC2007$\}$.
- PACS (2): 9991 images of dimension (3, 224, 224) and 7 classes, over 4 domains $d \in \{$art-paint, cartoon, sketches, photo$\}$.
- Office-Home (21): 15500 daily images of dimension (3, 224, 224) and 65 categories, over 4 domains $d \in \{$art, clipart, product, real$\}$.
- Terra Incognita (22): 24778 wildlife photographs of dimension (3, 224, 224) and 10 animals, over 4 camera-trap domains $d \in \{$L100, L38, L43, L46$\}$.
- DomainNet (23): 586575 images of dimension (3, 224, 224) and 345 classes, over 6 domains $d \in \{$clipart, infograph, painting, quickdraw, real, sketch$\}$.

Details of each dataset are provided in Appendix C.1.

**Baseline.** Following the DomainBed (24) setting, we compare our model with 14 related DG methods, grouped into five common techniques: standard Empirical Risk Minimization: ERM (25); domain-specific learning: Group Distributionally Robust Optimization (GroupDRO (26)), Marginal Transfer Learning (MTL (1; 27)), Adaptive Risk Minimization (ARM (28)); meta-learning: Meta-Learning for DG (MLDG (11)); domain-invariant learning: Invariant Risk Minimization (IRM (18)), Deep CORrelation ALignment (CORAL (29)), Maximum Mean Discrepancy (MMD (30)), Domain Adversarial Neural Networks (DANN (31)), Class-conditional DANN (CDANN (32)), Risk Extrapolation (VREx (33)); data augmentation: Inter-domain Mixup (Mixup (34; 35; 36)), Style-Agnostic Networks (SagNets (37)), Representation Self-Challenging (RSC (38)). Details of each method are provided in Appendix C.2.

We use the training-domain validation set technique proposed in DomainBed (24) for model selection. In particular, for all datasets, we first merge the raw training and validation data, then run the test three times with three different seeds. For each random seed, we randomly split training and validation, choose the model maximizing accuracy on the validation set, and compute performance on the given test sets. We report the mean and standard deviation of classification accuracy over these three runs. We evaluate generalization performance using the MNIST-ConvNet backbone (24) for MNIST datasets and ResNet-50 (39) for non-MNIST datasets, to be comparable with the mentioned methods. Data-processing techniques, model architectures, hyper-parameters, and the behavior of the objective functions during training are presented in detail in Appendices C.3 to C.5. All source code to reproduce the results is available at https://github.com/VinAIResearch/mDSDI.
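A minimal sketch of the training-domain validation protocol described above follows; the split fraction and seed handling here are illustrative, and the exact values used in the paper are given in Appendix C.3.

```python
import torch
from torch.utils.data import random_split


def train_val_split(source_datasets, val_fraction: float = 0.2, seed: int = 0):
    """Training-domain validation: hold out a fraction of each source domain as
    validation; the target domain is never touched during model selection."""
    gen = torch.Generator().manual_seed(seed)
    train_sets, val_sets = [], []
    for ds in source_datasets:
        n_val = int(len(ds) * val_fraction)
        tr, va = random_split(ds, [len(ds) - n_val, n_val], generator=gen)
        train_sets.append(tr)
        val_sets.append(va)
    return train_sets, val_sets

# For each of the three random seeds: split, train, keep the checkpoint with the
# highest accuracy on the pooled validation sets, then evaluate once on the held-out
# target domain; report the mean and standard deviation over the three runs.
```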
### 3.2 Results

Table 1: Classification accuracy (%) for all algorithms and datasets. Our mDSDI method achieves the highest accuracy on average when compared with 14 popular DG algorithms across 7 benchmark datasets.

| Method | CMNIST | RMNIST | VLCS | PACS | Office-Home | TerraInc | DomainNet | Average |
|---|---|---|---|---|---|---|---|---|
| ERM (25) | 51.5 ± 0.1 | 98.0 ± 0.0 | 77.5 ± 0.4 | 85.5 ± 0.2 | 66.5 ± 0.3 | 46.1 ± 1.8 | 40.9 ± 0.1 | 66.6 |
| IRM (18) | 52.0 ± 0.1 | 97.7 ± 0.1 | 78.5 ± 0.5 | 83.5 ± 0.8 | 64.3 ± 2.2 | 47.6 ± 0.8 | 33.9 ± 2.8 | 65.4 |
| GroupDRO (26) | 52.1 ± 0.0 | 98.0 ± 0.0 | 76.7 ± 0.6 | 84.4 ± 0.8 | 66.0 ± 0.7 | 43.2 ± 1.1 | 33.3 ± 0.2 | 64.8 |
| Mixup (34; 35; 36) | 52.1 ± 0.2 | 98.0 ± 0.1 | 77.4 ± 0.6 | 84.6 ± 0.6 | 68.1 ± 0.3 | 47.9 ± 0.8 | 39.2 ± 0.1 | 66.7 |
| MLDG (11) | 51.5 ± 0.1 | 97.9 ± 0.0 | 77.2 ± 0.4 | 84.9 ± 1.0 | 66.8 ± 0.6 | 47.7 ± 0.9 | 41.2 ± 0.1 | 66.7 |
| CORAL (29) | 51.5 ± 0.1 | 98.0 ± 0.1 | 78.8 ± 0.6 | 86.2 ± 0.3 | 68.7 ± 0.3 | 47.6 ± 1.0 | 41.5 ± 0.1 | 67.5 |
| MMD (30) | 51.5 ± 0.2 | 97.9 ± 0.0 | 77.5 ± 0.9 | 84.6 ± 0.5 | 66.3 ± 0.1 | 42.2 ± 1.6 | 23.4 ± 9.5 | 63.3 |
| DANN (31) | 51.5 ± 0.3 | 97.8 ± 0.1 | 78.6 ± 0.4 | 83.6 ± 0.4 | 65.9 ± 0.6 | 46.7 ± 0.5 | 38.3 ± 0.1 | 66.1 |
| CDANN (32) | 51.7 ± 0.1 | 97.9 ± 0.1 | 77.5 ± 0.1 | 82.6 ± 0.9 | 65.8 ± 1.3 | 45.8 ± 1.6 | 38.3 ± 0.3 | 65.6 |
| MTL (1; 27) | 51.4 ± 0.1 | 97.9 ± 0.0 | 77.2 ± 0.4 | 84.6 ± 0.5 | 66.4 ± 0.5 | 45.6 ± 1.2 | 40.6 ± 0.1 | 66.2 |
| SagNets (37) | 51.7 ± 0.0 | 98.0 ± 0.0 | 77.8 ± 0.5 | 86.3 ± 0.2 | 68.1 ± 0.1 | 48.6 ± 1.0 | 40.3 ± 0.1 | 67.2 |
| ARM (28) | 56.2 ± 0.2 | 98.2 ± 0.1 | 77.6 ± 0.3 | 85.1 ± 0.4 | 64.8 ± 0.3 | 45.5 ± 0.3 | 35.5 ± 0.2 | 66.1 |
| VREx (33) | 51.8 ± 0.1 | 97.9 ± 0.1 | 78.3 ± 0.2 | 84.9 ± 0.6 | 66.4 ± 0.6 | 46.4 ± 0.6 | 33.6 ± 2.9 | 65.6 |
| RSC (38) | 51.7 ± 0.2 | 97.6 ± 0.1 | 77.1 ± 0.5 | 85.2 ± 0.9 | 65.5 ± 0.9 | 46.6 ± 1.0 | 38.9 ± 0.5 | 66.1 |
| mDSDI (Ours) | 52.2 ± 0.2 | 98.0 ± 0.1 | 79.0 ± 0.3 | 86.2 ± 0.2 | 69.2 ± 0.4 | 48.1 ± 1.4 | 42.8 ± 0.1 | 67.9 |

Table 1 summarizes the results of our experiments on the 7 benchmark datasets compared with the methods above. Full results per dataset and per domain are provided in Appendix C.4. From these results, we draw three conclusions about our mDSDI model.

**mDSDI still preserves domain-invariant information.** In target domains whose images lack backgrounds and can be assumed to contain only domain-invariant information, such as Colored-MNIST, Rotated-MNIST, and Terra Incognita (with similar observations for cartoon and sketch in PACS, clip-art and product in Office-Home, and quickdraw in DomainNet; full results in Appendix C.4), our mDSDI model still achieves results competitive with baselines built on domain-invariant learning techniques such as DANN, C-DANN, CORAL, MMD, IRM, and VREx (e.g., 52.2% on Colored-MNIST, 98.0% on Rotated-MNIST, and 48.1% on Terra Incognita). These results demonstrate the effectiveness of our adversarial training technique for extracting domain-invariant features. Furthermore, because the domain-invariant and domain-specific latents are disentangled, even when samples carry no domain-specific information our model still performs well by retaining informative domain-invariance.

**mDSDI captures the usefulness of domain-specific information.** In contrast, in target domains whose domain-specific information is relevant to the source domains, such as the landscape backgrounds of object classes in the photographic images of VLCS (with similar observations in PACS, e.g., dogs in a yard or guitars lying on a table in the photo and art domains, and beds in a room or bikes parked on the street in the Art and Real-world domains of Office-Home; full results in Appendix C.4), mDSDI achieves significantly higher results than other methods (e.g., 79.0% on VLCS and 86.2% on PACS). This means that not only do our domain-invariant features support generalization, but our domain-specific features also capture helpful information in scenarios where backgrounds and colors relate to the objects in the classification task.
Moreover, compared with other domain-specific based techniques such as GroupDRO, MTL, and ARM, the results show that the domain-specific features learned via meta-training in our model are more helpful than theirs and capture useful domain-specific information from these object-background relations.

**Extending beyond the invariance view to useful domain-specific information is important.** As shown in Table 1, mDSDI has the highest average accuracy, 67.9% (statistically significant according to a t-test at significance level α = 0.05). The reason our model outperforms the other baselines is that the domain-invariant methods cannot capture domain-specific information and therefore perform poorly, while the domain-specific based methods concentrate only on domain-specific techniques and thus provide weaker domain-invariant information than ours on background-less images. In contrast, by disentangling the domain-invariant and domain-specific features and learning each latent with the right strategy, our model captures both kinds of useful information and hence outperforms them. Beyond the highest average, our method also dominates on large-scale datasets, e.g., 69.2% on Office-Home and 42.8% on DomainNet. This implies that, in addition to the essential combination of domain-invariant and domain-specific features, our method extracts more relevant information for complex tasks as the amount of data increases, such as classifying the 345 classes in DomainNet. These results also indicate that our model balances informative domain-invariant and domain-specific features to adapt to different environments better than the others, which is why it shows the highest average across all settings.

### 3.3 How does mDSDI work?

*Figure 3: Feature visualization for the domain-invariant representation: (a) different colors represent different classes; (b) different colors indicate different domains. Feature visualization for the domain-specific representation: (c) different colors represent different classes; (d) different colors indicate different domains. The source domains are art (red), cartoon (green), and sketch (blue), while the target domain is photo (black) in the domain plots. Best viewed in color (zoom in for details).*

To better understand our framework, we visualize the distribution of the learned features with t-SNE (40) to analyze the feature spaces of the domain-invariant and domain-specific representations. As shown in Figure 3 for the PACS dataset, our domain-invariant extractor minimizes the distance between the domain distributions (Figure 3(b)). However, these domain-invariant features still make mistakes on the classification task, indicated by the mixture of points from different class labels in the middle of Figure 3(a), many of which come from the target domain (black in Figure 3(b)). Meanwhile, the domain-specific representation distinguishes points by class label more clearly (Figure 3(c)). More importantly, the photo domain's specific features (black) lie close to those of the art domain (red) (Figure 3(d)). This is reasonable because only these two domains include backgrounds related to the object class. It implies that the meta-training in our model learns specific features that can be adapted to the new unseen domain.
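A plot like Figure 3 can be reproduced along the following lines; the sketch assumes trained encoders `Q` and `R` and per-domain data loaders, and the plotting details are illustrative rather than the authors' exact script.

```python
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE


@torch.no_grad()
def plot_features(encoder, loaders_by_domain, title: str):
    """Embed Z = encoder(x) for every domain with t-SNE and colour the scatter twice:
    once by class label and once by domain index (cf. Figure 3)."""
    feats, labels, domains = [], [], []
    for d_idx, loader in enumerate(loaders_by_domain):
        for x, y in loader:
            feats.append(encoder(x).cpu())
            labels.append(y.cpu())
            domains.append(torch.full_like(y, d_idx).cpu())
    z2d = TSNE(n_components=2, init="pca").fit_transform(torch.cat(feats).numpy())
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(z2d[:, 0], z2d[:, 1], c=torch.cat(labels).numpy(), s=3)   # colour = class
    axes[1].scatter(z2d[:, 0], z2d[:, 1], c=torch.cat(domains).numpy(), s=3)  # colour = domain
    axes[0].set_title(f"{title} (classes)")
    axes[1].set_title(f"{title} (domains)")
    plt.show()

# plot_features(Q, all_loaders, "Domain-invariant")   # panels (a)-(b) of Figure 3
# plot_features(R, all_loaders, "Domain-specific")    # panels (c)-(d) of Figure 3
```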
### 3.4 Ablation study: Importance of mDSDI on the Background-Colored-MNIST dataset

This section examines our system design by checking its performance under different settings in a controlled scenario with our generated dataset (a similar experiment on the PACS benchmark is in Appendix C.6).

**Background-Colored-MNIST dataset.** Figure 4 illustrates our Background-Colored-MNIST, generated from the original MNIST. We take the digit's sketch to be domain-invariant and design the dataset so that the background color is domain-specific; as a result, domain-specific information is useful for the classification task on the unseen domain. Specifically, the dataset includes three source domains $d_{tr}$, distinguished by digit color (red, green, blue), each generated from a subset of 1000 MNIST training images. In each source domain, the background color is the same for intra-class images but differs across classes. In the target domain, 10000 MNIST test images are colored with the digit color orange; each class's background color matches the background color of the same class in one of the three source domains.

*Figure 4: Background-Colored-MNIST dataset, where the source domains use red, green, and blue digit colors and the target domain uses orange.*

**Importance of mDSDI.** We aim to show that learning disentangled domain-specific and domain-invariant representations, together with meta-training on the domain-specific part, is important in this scenario. To do so, we compare our model under nine settings: learning domain-invariant only (DI); learning domain-specific only (DS); meta-training on domain-invariant (DI-Meta); meta-training on domain-specific (DS-Meta); combining domain-invariant and domain-specific without the disentanglement loss $\mathcal{L}_D$ (DSDI-Without $\mathcal{L}_D$); combining them without meta-training (DSDI-Without Meta); meta-training on both representations $Z_I$ and $Z_S$ (DSDI-Meta); meta-training applied only to the domain-invariant representation (DSDI-Meta DI); and our proposed framework (mDSDI-Meta DS), which applies meta-training only to the domain-specific representation.

Table 2 shows that our model is the best setting, with 89.7%. Combining domain-invariant and domain-specific is crucial: it dominates the settings with only domain-invariant or only domain-specific features (DI, DI-Meta, DS, DS-Meta). Regarding disentangling the two representations $Z_I$ and $Z_S$, it is worth noting that without the disentanglement loss $\mathcal{L}_D$ the model reaches only 81.4%, lower than mDSDI, revealing that $\mathcal{L}_D$ is essential to boost performance. Compared with meta-training on both representations (82.1%) and meta-training on domain-invariant only (79.0%), this implies that meta-training is necessary, but only for the domain-specific part, which supports our arguments. It also shows that our combination of domain-invariant and domain-specific is not simply a deep ensemble of two neural networks: DSDI-Without Meta, which can be seen as such an ensemble, reaches only around 80.4%, indicating a deeper insight behind our framework for DG.
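To make the construction of Background-Colored-MNIST described above concrete, here is a hypothetical generation sketch; the specific per-class background colors and the exact coloring rules are illustrative, and the authors' released script may differ.

```python
import numpy as np
from torchvision import datasets

# Digit (foreground) colors per domain as described in the text.
DIGIT_COLORS = {"red": (255, 0, 0), "green": (0, 255, 0),
                "blue": (0, 0, 255), "orange": (255, 165, 0)}


def colorize(images: np.ndarray, labels: np.ndarray, digit_color, bg_palette):
    """Paint the digit strokes with one per-domain color and fill the background
    with a per-class color drawn from bg_palette[class]."""
    out = np.zeros((*images.shape, 3), dtype=np.uint8)
    for k, (img, y) in enumerate(zip(images, labels)):
        mask = img > 32                                  # foreground = digit strokes
        out[k][mask] = digit_color
        out[k][~mask] = bg_palette[int(y)]
    return out

# Each source domain gets its own (illustrative, random) class -> background map;
# the target domain reuses, per class, the background color of that class in one of
# the three source domains, so the background is predictive on the target.
mnist = datasets.MNIST(root="data", train=True, download=True)
imgs, ys = mnist.data.numpy()[:1000], mnist.targets.numpy()[:1000]
bg_palette = {c: tuple(np.random.RandomState(c).randint(0, 256, 3)) for c in range(10)}
red_domain = colorize(imgs, ys, DIGIT_COLORS["red"], bg_palette)
```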
Table 2: Classification accuracy (%) on Background-Colored-MNIST. The ablation study shows the impact of domain-invariant features when combined with meta-training on domain-specific features in our method.

| Method | Accuracy |
|---|---|
| DI | 65.7 ± 4.6 |
| DI-Meta | 63.6 ± 5.1 |
| DS | 70.7 ± 4.8 |
| DS-Meta | 75.3 ± 3.4 |
| DSDI-Without $\mathcal{L}_D$ | 81.4 ± 2.6 |
| DSDI-Without Meta | 80.4 ± 1.7 |
| DSDI-Meta | 82.1 ± 1.4 |
| DSDI-Meta DI | 79.0 ± 2.3 |
| mDSDI-Meta DS (Ours) | 89.7 ± 0.8 |

## 4 Conclusion and Discussion

Despite awareness of the importance of domain-specific information, there has been little investigation into its theory or into a rigorous algorithm for learning its representation. To the best of our knowledge, our work provides the first theoretical analysis to understand and exploit the efficiency of domain-specific information in domain generalization. Domain-specific features contain unique characteristics and, when combined with domain-invariant information, can significantly aid performance on unseen domains. Following our theoretical insights based on the information bottleneck principle, we propose the mDSDI algorithm, which disentangles these features. We further introduce a meta-training scheme that helps the domain-specific representation adapt information from source domains to unseen domains. Our experimental results demonstrate that mDSDI yields competitive results with related approaches in domain generalization. In addition, the ablation study on our Background-Colored-MNIST further illustrates the efficiency of combining domain-invariant and domain-specific features via the proposed mDSDI. Our theoretical analysis and proposed framework can facilitate fundamental progress in understanding the behavior of both domain-invariant and domain-specific representations in domain generalization. Towards a robust algorithm that can effectively learn both kinds of features, several challenges remain: a theorem explaining when domain-specific features may hurt performance on the unseen domain, a stronger connection between the theory and the implementation, a method to make the two representations non-linearly independent, and a lower computational cost for the covariance matrix. In the future, we plan to continue tackling these challenges to provide a better understanding and learning framework for domain generalization, and to extend it to broader transfer learning settings. A detailed clarification, discussion, and plausible future directions are provided in Appendix B.

## References

[1] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 2178–2186. Curran Associates, Inc., 2011.
[2] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
[3] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. Deep domain generalization via conditional invariant adversarial networks. In ECCV, 2018.
[4] Shoubo Hu, Kun Zhang, Zhitang Chen, and Laiwan Chan. Domain generalization via multidomain discriminant analysis, 2019.
[5] Maximilian Ilse, Jakub M. Tomczak, Christos Louizos, and Max Welling. DIVA: Domain invariant variational autoencoders, 2019.
[6] Ching-Yao Chuang, Antonio Torralba, and Stefanie Jegelka. Estimating generalization under distribution shifts via domain-invariant representations, 2020.
[7] Prithvijit Chattopadhyay, Yogesh Balaji, and Judy Hoffman.
Learning to balance specificity and invariance for in and out of domain generalization, 2020.
[8] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. On learning invariant representation for domain adaptation. CoRR, abs/1901.09453, 2019.
[9] Fredrik D. Johansson, David Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations, 2019.
[10] Z. Ding and Y. Fu. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 27(1):304–313, 2018.
[11] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to generalize: Meta-learning for domain generalization, 2017.
[12] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
[13] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
[14] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. In International Conference on Learning Representations, 2021.
[15] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations, 2018.
[16] Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. CoRR, abs/2002.07017, 2020.
[17] Vihari Piratla, Praneeth Netrapalli, and Sunita Sarawagi. Efficient domain generalization via common-specific low-rank decomposition, 2020.
[18] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization, 2020.
[19] Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. CoRR, abs/1508.07680, 2015.
[20] Chen Fang, Ye Xu, and Daniel N. Rockmore. Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias. In 2013 IEEE International Conference on Computer Vision, pages 1657–1664, 2013.
[21] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation, 2017.
[22] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. CoRR, abs/1807.04975, 2018.
[23] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation, 2019.
[24] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. CoRR, abs/2007.01434, 2020.
[25] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[26] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, 2020.
[27] Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning, 2021.
[28] Marvin Zhang, Henrik Marklund, Nikita Dhawan, Abhishek Gupta, Sergey Levine, and Chelsea Finn. Adaptive risk minimization: A meta-learning approach for tackling group distribution shift, 2021.
[29] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation, 2016.
[30] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C. Kot. Domain generalization with adversarial feature learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[31] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks, 2016.
[32] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representation, 2018.
[33] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx), 2021.
[34] Minghao Xu, Jian Zhang, Bingbing Ni, Teng Li, Chengjie Wang, Qi Tian, and Wenjun Zhang. Adversarial domain adaptation with domain mixup, 2019.
[35] Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training, 2020.
[36] Yufei Wang, Haoliang Li, and Alex C. Kot. Heterogeneous domain generalization via domain mixup. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[37] Hyeonseob Nam, Hyun Jae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias, 2021.
[38] Zeyi Huang, Haohan Wang, Eric P. Xing, and Dong Huang. Self-challenging improves cross-domain generalization, 2020.
[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[40] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[41] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
[42] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
[43] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[44] Seonguk Seo, Yumin Suh, Dongwan Kim, Jongwoo Han, and Bohyung Han. Learning to optimize domain specific normalization for domain generalization. CoRR, abs/1907.04275, 2019.