# Subspace Identification for Multi-Source Domain Adaptation

Zijian Li²,³, Ruichu Cai², Guangyi Chen³,¹, Boyang Sun³, Zhifeng Hao⁴, Kun Zhang³,¹
¹ Carnegie Mellon University  ² School of Computer Science, Guangdong University of Technology  ³ Mohamed bin Zayed University of Artificial Intelligence  ⁴ Shantou University

Multi-source domain adaptation (MSDA) methods aim to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Although current methods achieve target joint distribution identifiability by enforcing minimal changes across domains, they often necessitate stringent conditions, such as an adequate number of domains, monotonic transformation of latent variables, and invariant label distributions. These requirements are challenging to satisfy in real-world applications. To mitigate the need for these strict assumptions, we propose a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints regarding domain numbers and transformation properties, thereby facilitating domain adaptation by minimizing the impact of domain shifts on invariant variables. Based on this theory, we develop a Subspace Identification Guarantee (SIG) model that leverages variational inference. Furthermore, the SIG model incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domains. Experimental results demonstrate that our SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.

## 1 Introduction

Multi-Source Domain Adaptation (MSDA) transfers knowledge from multiple labeled source domains to an unlabeled target domain in order to address the challenge of domain shift between the training data and the test environment. Mathematically, in the context of MSDA, we assume the existence of $M$ source domains $\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_M\}$ and a single target domain $\mathcal{T}$. For each source domain $\mathcal{S}_i$, data are drawn from a distinct distribution, represented as $p^{\mathcal{S}_i}_{x,y|u}$, where the variables $x, y, u$ correspond to features, labels, and domain indices, respectively. In a similar manner, the distribution within the target domain $\mathcal{T}$ is given by $p^{\mathcal{T}}_{x,y|u}$. In the source domains, we have access to $m_i$ annotated feature-label pairs for each domain, denoted by $(\mathbf{x}^{\mathcal{S}_i}, \mathbf{y}^{\mathcal{S}_i}) = \{(\mathbf{x}^{\mathcal{S}_i}_k, \mathbf{y}^{\mathcal{S}_i}_k)\}_{k=1}^{m_i}$, while in the target domain, only $m_{\mathcal{T}}$ unannotated features are observed, represented as $\mathbf{x}^{\mathcal{T}} = \{\mathbf{x}^{\mathcal{T}}_k\}_{k=1}^{m_{\mathcal{T}}}$. The primary goal of MSDA is to effectively leverage the labeled source data and the unlabeled target data to identify the target joint distribution $p^{\mathcal{T}}_{x,y|u}$.

However, identifying the target joint distribution $p^{\mathcal{T}}_{x,y|u}$ using only $\mathbf{x}^{\mathcal{T}}$ as observations presents a significant challenge, since the possible mappings from $p^{\mathcal{T}}_{x,y|u}$ to $p^{\mathcal{T}}_{x|u}$ are infinite when no extra constraints are given. To solve this problem, several assumptions have been proposed to constrain the domain shift, such as covariate shift [44], target shift [58, 1], and conditional shift [3, 56]. For example, the most conventional covariate shift assumption posits that $p_{y|x}$ is fixed across different domains while $p_x$ varies.
Under this assumption, researchers can employ techniques such as importance reweighting [52], invariant representation learning [11], or cycle consistency [17] for distribution alignment. Additionally, target shift assumes that $p_{x|y}$ is fixed while the label distribution $p_y$ changes, whereas conditional shift is represented by a fixed $p_y$ and a varying $p_{x|y}$. More generally, the minimal change principle has been proposed, which not only unifies the aforementioned assumptions but also enables a theoretical guarantee of the identifiability of the target joint distribution. Specifically, it assumes that $p_{y|u}$ and $p_{x|y,u}$ change independently and that the change of $p_{x|y,u}$ is minimal. Please refer to Appendix G for further discussion of related works, including domain adaptation and identification.

Although current methods demonstrate the identifiability of the target joint distribution through the minimal change principle, they often impose strict conditions on the data generation process and the number of domains, limiting their practical applicability. For instance, iMSDA [30] presents component-wise identification of the domain-changed latent variables and subsequently identifies the target joint distribution by modeling the data generation process with variational inference. However, this identification requires the following conditions. First, a sufficient number of auxiliary variables is required for the component-wise theoretical guarantees, meaning that when the dimension of the latent variables is $n$, a total of $2n + 1$ domains are needed. Second, in order to identify the high-level invariant latent variables, a component-wise monotonic function between latent variables must be assumed. Third, these methods implicitly assume that the label distribution remains stable across domains, despite the prevalence of target shift in real-world scenarios. These conditions are often too restrictive to be met in practice, highlighting the need for a more general approach to identifying latent variables across a wider range of domain shifts.

In an effort to alleviate the need for such strict assumptions, we present a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under more relaxed constraints concerning the number of domains and transformation properties. In contrast to component-wise identification, our subspace identification method demands fewer auxiliary variables (i.e., when the dimension of the latent variables is $n$, only $n + 1$ domains are required). Additionally, we design a more general data generation process that accounts for target shift and does not necessitate a monotonic transformation between latent variables. In this process, we categorize the latent variables into four groups based on whether they are influenced by the domain index or the label. Building on this theory and the causal generation process, we develop a Subspace Identification Guarantee (SIG) model that employs variational inference to identify the latent variables. A class-aware conditional alignment is incorporated to mitigate the impact of target shift, ensuring that the most confident cluster embeddings are updated. Our approach is validated through a simulation experiment that evaluates subspace identification and four widely used public domain adaptation benchmarks that evaluate its application. The consistent improvements over state-of-the-art methods demonstrate the effectiveness of our method.
## 2 Identifying Target Joint Distribution with Data Generation Process

### 2.1 Data Generation Process

Figure 1: Data generation process, where the gray and white nodes denote the observed and latent variables, respectively.

We begin by introducing the data generation process. As shown in Figure 1, the observed data $\mathbf{x} \in \mathcal{X}$ are generated by latent variables $\mathbf{z} \in \mathcal{Z} \subseteq \mathbb{R}^n$. We divide the latent variables into four parts, i.e., $\mathbf{z} = \{\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \mathbf{z}_4\} \in \mathcal{Z} \subseteq \mathbb{R}^n$, as follows:

- domain-specific and label-irrelevant variables $\mathbf{z}_1 \in \mathbb{R}^{n_1}$;
- domain-specific but label-relevant variables $\mathbf{z}_2 \in \mathbb{R}^{n_2}$;
- domain-invariant and label-relevant variables $\mathbf{z}_3 \in \mathbb{R}^{n_3}$;
- domain-invariant but label-irrelevant variables $\mathbf{z}_4 \in \mathbb{R}^{n_4}$.

To better understand these latent variables, we provide some examples from the DomainNet dataset. First, $\mathbf{z}_1 \in \mathbb{R}^{n_1}$ denotes the styles of the images, such as "infograph" and "sketch", which are irrelevant to labels. $\mathbf{z}_2 \in \mathbb{R}^{n_2}$ denotes latent variables such as texture information that is relevant to both domains and labels. For example, samples of "clock" and "telephone" contain digits; these digits are a special texture that can be used for classification and is also influenced by different styles such as "infograph". $\mathbf{z}_3 \in \mathbb{R}^{n_3}$ denotes latent variables that are relevant only to the labels; in the DomainNet dataset, they can be interpreted as the semantics of different classes such as "Bicycle" or "Teapot". Finally, $\mathbf{z}_4 \in \mathbb{R}^{n_4}$ denotes label-irrelevant latent variables; for example, $\mathbf{z}_4$ can be interpreted as background that is invariant to domains and labels.

Based on the definitions of these latent variables, we let the observed data be generated from $\mathbf{z}$ through an invertible and smooth mixing function $g: \mathcal{Z} \rightarrow \mathcal{X}$. Due to target shift, we further consider that $p_y$ is influenced by $u$, i.e., $u \rightarrow y$. Compared with existing data generation processes such as that of [30], the proposed process differs in three respects. First, $p_u$ is independent of $p_y$ in iMSDA [30], so target shift is not taken into account. Second, the data generation process of iMSDA requires an invertible and monotonic function between latent variables for component-wise identification, which is too strict to be met in practice. Third, to depict real-world data in a more general way, our data generation process introduces the domain-specific but label-relevant latent variables $\mathbf{z}_2$ and the domain-invariant but label-irrelevant variables $\mathbf{z}_4$.

### 2.2 Identifying the Target Joint Distribution

In this part, we show how to identify the target joint distribution $p^{\mathcal{T}}_{x,y|u}$ with the help of the marginal distribution. By introducing the latent variables and combining the proposed data generation process, we obtain the following derivation:

$$
\begin{aligned}
p^{\mathcal{T}}_{x,y|u} &= \int_{\mathbf{z}} p^{\mathcal{T}}_{x,y,z_1,z_2,z_3,z_4|u}\, d\mathbf{z}_1 d\mathbf{z}_2 d\mathbf{z}_3 d\mathbf{z}_4 \\
&= \int_{\mathbf{z}} p^{\mathcal{T}}_{x,z_1,z_2,z_3,z_4|y,u}\, p^{\mathcal{T}}_{y|u}\, d\mathbf{z}_1 d\mathbf{z}_2 d\mathbf{z}_3 d\mathbf{z}_4 \\
&= \int_{\mathbf{z}} p_{x|z_1,z_2,z_3,z_4}\, p^{\mathcal{T}}_{z_1,z_2,z_3,z_4|y,u}\, p^{\mathcal{T}}_{y|u}\, d\mathbf{z}_1 d\mathbf{z}_2 d\mathbf{z}_3 d\mathbf{z}_4.
\end{aligned}
\qquad (1)
$$

According to the derivation in Equation (1), we can identify the target joint distribution by modeling three distributions. First, we need to model $p_{x|z_1,z_2,z_3,z_4}$, i.e., the conditional distribution of the observed data given the latent variables, which coincides with a generative model of the observed data. Second, we need to estimate the pseudo label distribution of the target domain, $p^{\mathcal{T}}_{y|u}$. Third, we need to model $p^{\mathcal{T}}_{z_1,z_2,z_3,z_4|y,u}$, meaning that the latent variables should be identified with the auxiliary variables $u, y$ under theoretical guarantees.
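To make the data generation process above concrete, the following minimal sketch simulates it numerically. It is only an illustration under assumed parametric forms (Gaussian latents whose means depend on the domain index and/or the label, a Dirichlet-sampled label distribution per domain to mimic target shift, and a random invertible leaky-ReLU network standing in for the mixing function $g$); the dimensions and distributions are hypothetical, not the ones used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, n4 = 2, 2, 2, 2            # dims of z1..z4 (hypothetical)
n = n1 + n2 + n3 + n4
M, C, N = 4, 3, 1000                   # domains, classes, samples per domain

def random_invertible_mixing(dim, depth=2):
    """Random invertible mixing g: orthogonal linear layers + invertible leaky-ReLU."""
    weights = [np.linalg.qr(rng.normal(size=(dim, dim)))[0] for _ in range(depth)]
    def g(z):
        x = z
        for W in weights:
            x = x @ W
            x = np.where(x > 0, x, 0.2 * x)      # leaky-ReLU is invertible
        return x
    return g

g = random_invertible_mixing(n)
X, Y, U = [], [], []
for u in range(M):
    p_y = rng.dirichlet(np.ones(C))              # target shift: p(y|u) varies with u
    y = rng.choice(C, size=N, p=p_y)
    z1 = rng.normal(loc=2.0 * u, size=(N, n1))               # domain-specific, label-irrelevant
    z2 = rng.normal(loc=u + y[:, None], size=(N, n2))        # domain-specific, label-relevant
    z3 = rng.normal(loc=3.0 * y[:, None], size=(N, n3))      # domain-invariant, label-relevant
    z4 = rng.normal(size=(N, n4))                            # domain-invariant, label-irrelevant
    z = np.concatenate([z1, z2, z3, z4], axis=1)
    X.append(g(z)); Y.append(y); U.append(np.full(N, u))

X, Y, U = map(np.concatenate, (X, Y, U))
print(X.shape, Y.shape, U.shape)                 # (M*N, n), (M*N,), (M*N,)
```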
In the next section, we will introduce how to identify these latent variables with subspace identification and block-wise identification results.

## 3 Subspace Identifiability for Latent Variables

Figure 2: A simple data generation process for introducing subspace identification.

In this section, we show how to identify the latent variables in Figure 1. In detail, we first prove that $\mathbf{z}_2$ is subspace identifiable and that $\mathbf{z}_1, \mathbf{z}_3$ can be reconstructed from the estimated $\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2, \hat{\mathbf{z}}_3$. Then we further prove that $\mathbf{z}_4$ is block-wise identifiable.

To clearly introduce the subspace identification theory, we employ a simple data generation process [3] as shown in Figure 2. In this process, $\mathbf{z}_s \in \mathbb{R}^{n_s}$ and $\mathbf{z}_c \in \mathbb{R}^{n_c}$ denote the domain-specific and domain-invariant latent variables, respectively. For convenience, we let $\mathbf{z} = \{\mathbf{z}_s, \mathbf{z}_c\}$ and $n = n_s + n_c$. Moreover, we assume $\mathbf{z}_s = (z_i)_{i=1}^{n_s}$ and $\mathbf{z}_c = (z_i)_{i=n_s+1}^{n}$. Here $\{u, y, \mathbf{x}\}$ denote the domain index, labels, and observed data, respectively, and we further let the observed data be generated from $\mathbf{z}$ through an invertible and smooth mixing function $g: \mathcal{Z} \rightarrow \mathcal{X}$. Subspace identification of $\mathbf{z}_s$ means that for each ground-truth $z_{s,i}$, there exist $\hat{\mathbf{z}}_s$ and an invertible function $h_i: \mathbb{R}^{n_s} \rightarrow \mathbb{R}$ such that $z_{s,i} = h_i(\hat{\mathbf{z}}_s)$.

Theorem 1 (Subspace Identification of $\mathbf{z}_s$). We follow the data generation process in Figure 2 and make the following assumptions:

- A1 (Smooth and Positive Density): The probability density function of the latent variables is smooth and positive, i.e., $p_{z|u} > 0$ over $\mathcal{Z}$ and $\mathcal{U}$.
- A2 (Conditional Independence): Conditioned on $u$, each $z_i$ is independent of any other $z_j$ for $i, j \in \{1, \ldots, n\}, i \neq j$, i.e., $\log p_{z|u}(\mathbf{z}|u) = \sum_{i=1}^{n} q_i(z_i, u)$, where $q_i(z_i, u)$ is the log density of the conditional distribution, i.e., $q_i := \log p_{z_i|u}$.
- A3 (Linear Independence): For any $\mathbf{z}_s \in \mathcal{Z}_s \subseteq \mathbb{R}^{n_s}$, there exist $n_s + 1$ values of $u$, i.e., $u_j$ with $j = 0, 1, \ldots, n_s$, such that the $n_s$ vectors $w(\mathbf{z}, u_j) - w(\mathbf{z}, u_0)$ with $j = 1, \ldots, n_s$ are linearly independent, where the vector $w(\mathbf{z}, u)$ is defined as
$$ w(\mathbf{z}, u) = \left( \frac{\partial q_1(z_1, u)}{\partial z_1}, \ldots, \frac{\partial q_i(z_i, u)}{\partial z_i}, \ldots, \frac{\partial q_{n_s}(z_{n_s}, u)}{\partial z_{n_s}} \right). $$

By modeling the aforementioned data generation process, $\mathbf{z}_s$ is subspace identifiable.

Proof sketch. First, we construct an invertible transformation $h$ between the ground-truth $\mathbf{z}$ and the estimated $\hat{\mathbf{z}}$. Then, we leverage the variability across domains to construct a full-rank linear system whose only solution is $\partial \mathbf{z}_s / \partial \hat{\mathbf{z}}_c = 0$. Since the Jacobian of $h$ is invertible, for each $z_{s,i}, i \in \{1, \ldots, n_s\}$, there exists an $h_i$ such that $z_{s,i} = h_i(\hat{\mathbf{z}}_s)$, and hence $\mathbf{z}_s$ is subspace identifiable. The full proof can be found in Appendix B.1.

The first two assumptions are standard in the component-wise identification results of existing nonlinear ICA [30, 27]. The third assumption means that $p_{z|u}$ should vary sufficiently over $n + 1$ domains. Compared to component-wise identification, which necessitates $2n + 1$ domains and is often challenging to fulfill, subspace identification yields equivalent results in terms of identifying the ground-truth latent variables with only $n + 1$ domains. Therefore, subspace identification benefits from a more relaxed assumption.
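As a numerical illustration of assumption A3, the sketch below checks the linear independence of the score-difference vectors $w(\mathbf{z}, u_j) - w(\mathbf{z}, u_0)$ for conditionally independent Gaussian latents whose means and variances change with the domain. The Gaussian parameterization and the dimension $n_s = 3$ are assumptions made purely for the illustration; they are not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(1)
ns = 3                                            # dimension of the domain-specific part z_s
mu = rng.normal(size=(ns + 1, ns))                # per-domain means for domains u_0, ..., u_{ns}
sigma = rng.uniform(0.5, 2.0, size=(ns + 1, ns))  # per-domain standard deviations

def w(z_s, u):
    """Score vector of a factorized Gaussian p(z_s|u): dq_i/dz_i = -(z_i - mu_i(u)) / sigma_i(u)^2."""
    return -(z_s - mu[u]) / sigma[u] ** 2

z_s = rng.normal(size=ns)                         # an arbitrary point in Z_s
diffs = np.stack([w(z_s, j) - w(z_s, 0) for j in range(1, ns + 1)])
print("rank:", np.linalg.matrix_rank(diffs), "of", ns)
# rank == ns means the ns difference vectors are linearly independent,
# i.e., A3 holds at this point for these ns + 1 domains.
```

With generic (randomly drawn) Gaussian parameters the rank is almost surely $n_s$, which is exactly the sense in which $p_{z|u}$ must vary sufficiently across the $n_s + 1$ domains.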
Based on the theoretical results of subspace identification, we now show that the ground-truth $\mathbf{z}_1$, $\mathbf{z}_2$, and $\mathbf{z}_3$ can be reconstructed from the estimated $\hat{\mathbf{z}}_1$, $\hat{\mathbf{z}}_2$, and $\hat{\mathbf{z}}_3$. For ease of exposition, we assume that $\mathbf{z}_1$, $\mathbf{z}_2$, $\mathbf{z}_3$, and $\mathbf{z}_4$ correspond to the components of $\mathbf{z}$ with indices $\{1, \ldots, n_1\}$, $\{n_1+1, \ldots, n_1+n_2\}$, $\{n_1+n_2+1, \ldots, n_1+n_2+n_3\}$, and $\{n_1+n_2+n_3+1, \ldots, n\}$, respectively.

Corollary 1.1. We follow the data generation process in Section 2.1 and make the following assumptions, which are similar to A1-A3:

- A4 (Smooth and Positive Density): The probability density function of the latent variables is smooth and positive, i.e., $p_{z|u,y} > 0$ over $\mathcal{Z}$, $\mathcal{U}$, and $\mathcal{Y}$.
- A5 (Conditional Independence): Conditioned on $u$ and $y$, each $z_i$ is independent of any other $z_j$ for $i, j \in \{1, \ldots, n\}, i \neq j$, i.e., $\log p_{z|u,y}(\mathbf{z}|u, y) = \sum_{i=1}^{n} q_i(z_i, u, y)$, where $q_i(z_i, u, y)$ is the log density of the conditional distribution, i.e., $q_i := \log p_{z_i|u,y}$.
- A6 (Linear Independence): For any $\mathbf{z} \in \mathcal{Z} \subseteq \mathbb{R}^n$, there exist $n_1 + n_2 + n_3 + 1$ combinations of $(u, y)$, i.e., $u_j$ with $j = 1, \ldots, U$ and $y_c$ with $c = 1, \ldots, C$ and $U \cdot C - 1 = n_1 + n_2 + n_3$, where $U$ and $C$ denote the number of domains and the number of labels, such that the $n_1 + n_2 + n_3$ vectors $w(\mathbf{z}, u_j, y_c) - w(\mathbf{z}, u_0, y_0)$ are linearly independent, where $w(\mathbf{z}, u, y)$ is defined as
$$ w(\mathbf{z}, u, y) = \left( \frac{\partial q_1(z_1, u, y)}{\partial z_1}, \ldots, \frac{\partial q_i(z_i, u, y)}{\partial z_i}, \ldots, \frac{\partial q_n(z_n, u, y)}{\partial z_n} \right). $$

By modeling the data generation process in Section 2.1, $\mathbf{z}_2$ is subspace identifiable, and $\mathbf{z}_1$, $\mathbf{z}_3$ can be reconstructed from $\{\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2\}$ and $\{\hat{\mathbf{z}}_2, \hat{\mathbf{z}}_3\}$, respectively.

Proof sketch. The detailed proof can be found in Appendix B.2. First, we construct an invertible transformation $h$ to bridge the relation between the ground-truth $\mathbf{z}$ and the estimated $\hat{\mathbf{z}}$. Then, we apply Theorem 1 three times, considering the changes of labels and domains, and find that some blocks of the Jacobian of $h$ are zero. Finally, the Jacobian of $h$ can be formalized as in Equation (4):

$$
J_h =
\begin{pmatrix}
J_h^{1,1} & J_h^{1,2} & J_h^{1,3} = 0 & J_h^{1,4} = 0 \\
J_h^{2,1} = 0 & J_h^{2,2} & J_h^{2,3} = 0 & J_h^{2,4} = 0 \\
J_h^{3,1} = 0 & J_h^{3,2} & J_h^{3,3} & J_h^{3,4} = 0 \\
J_h^{4,1} & J_h^{4,2} & J_h^{4,3} & J_h^{4,4}
\end{pmatrix},
\qquad (4)
$$

where $J_h$ denotes the Jacobian of $h$ and $J_h^{i,j} := \partial \mathbf{z}_i / \partial \hat{\mathbf{z}}_j$ with $i, j \in \{1, 2, 3, 4\}$. Since $h(\cdot)$ is invertible, $J_h$ is a full-rank matrix. Therefore, for each $z_{2,i}, i \in \{n_1+1, \ldots, n_1+n_2\}$, there exists an $h_{2,i}$ such that $z_{2,i} = h_{2,i}(\hat{\mathbf{z}}_2)$. Moreover, for each $z_{1,i}, i \in \{1, \ldots, n_1\}$, there exists an $h_{1,i}$ such that $z_{1,i} = h_{1,i}(\hat{\mathbf{z}}_1, \hat{\mathbf{z}}_2)$, and for each $z_{3,i}, i \in \{n_1+n_2+1, \ldots, n_1+n_2+n_3\}$, there exists an $h_{3,i}$ such that $z_{3,i} = h_{3,i}(\hat{\mathbf{z}}_2, \hat{\mathbf{z}}_3)$.

Figure 3: The framework of the Subspace Identification Guarantee model. The pre-trained backbone networks are used to extract the feature $\mathbf{f}$ from the observed data. The bottleneck networks and $f_\mu, f_\sigma$ are used to generate $\mathbf{z}$ with the reparameterization trick. The label predictor takes $\mathbf{z}_2$, $\mathbf{z}_3$, and $u$ as input to model $p_{y|u,z_2,z_3}$. The decoder is used to model the marginal distribution. Finally, $\mathbf{z}_2$ is used for class-aware conditional alignment.

Then we prove that $\mathbf{z}_4$ is block-wise identifiable, which means that there exists an invertible function $h_4$ such that $\mathbf{z}_4 = h_4(\hat{\mathbf{z}}_4)$.

Lemma 2. [30] Following the data generation process in Section 2.1 and the assumptions A4-A6 in Corollary 1.1, we further make the following assumption:

- A7 (Domain Variability): For any set $A_{\mathbf{z}} \subseteq \mathcal{Z}$ with the following two properties: 1) $A_{\mathbf{z}}$ has nonzero probability measure, i.e., $\mathbb{P}[\mathbf{z} \in A_{\mathbf{z}} \mid \{u = u', y = y'\}] > 0$ for any $u' \in \mathcal{U}$ and $y' \in \mathcal{Y}$; and 2) $A_{\mathbf{z}}$ cannot be expressed as $B_{\mathbf{z}_4} \times \mathcal{Z}_1 \times \mathcal{Z}_2 \times \mathcal{Z}_3$ for any $B_{\mathbf{z}_4} \subseteq \mathcal{Z}_4$; there exist $u_1, u_2 \in \mathcal{U}$ and $y_1, y_2 \in \mathcal{Y}$ such that $\int_{\mathbf{z} \in A_{\mathbf{z}}} p_{z|u_1,y_1}\, d\mathbf{z} \neq \int_{\mathbf{z} \in A_{\mathbf{z}}} p_{z|u_2,y_2}\, d\mathbf{z}$.

By modeling the data generation process in Section 2.1, $\mathbf{z}_4$ is block-wise identifiable.

The proof can be found in Appendix B.3. Lemma 2 shows that $\mathbf{z}_4$ is block-wise identifiable when $p_{z|u}$ changes sufficiently across domains. In summary, we can obtain the estimated latent variables $\hat{\mathbf{z}}$ with the help of subspace identification and block-wise identification.
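The block structure in Equation (4) can also be checked numerically for a learned estimate: composing the ground-truth generator with the estimated inverse gives a map $h$ from $\hat{\mathbf{z}}$ to $\mathbf{z}$, whose Jacobian should (approximately) vanish on the indicated blocks. The sketch below uses a toy $h$ with the desired structure in place of a trained model; the particular $h$ and the dimensions are assumptions for illustration only.

```python
import torch
from torch.autograd.functional import jacobian

n1, n2, n3, n4 = 2, 2, 2, 2
n = n1 + n2 + n3 + n4
idx = torch.tensor([0] * n1 + [1] * n2 + [2] * n3 + [3] * n4)   # group id of each coordinate

# Toy h with the structure of Eq. (4): z1 <- (zh1, zh2), z2 <- zh2,
# z3 <- (zh2, zh3), z4 <- (zh1, zh2, zh3, zh4). A trained model's
# h = g^{-1} o ghat would be plugged in here instead.
allowed = {0: [0, 1], 1: [1], 2: [1, 2], 3: [0, 1, 2, 3]}
A = torch.randn(n, n)
mask = torch.tensor([[float(idx[j].item() in allowed[idx[i].item()]) for j in range(n)]
                     for i in range(n)])
h = lambda z_hat: torch.tanh(z_hat @ (A * mask).T)

J = jacobian(h, torch.randn(n))                 # J[i, j] = d z_i / d zhat_j
for bi in range(4):
    for bj in range(4):
        block = J[idx == bi][:, idx == bj]
        print(f"max |J^{bi+1},{bj+1}| = {block.abs().max():.3f}")
# Blocks (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), and (3,4) print ~0,
# matching the zero pattern of Equation (4).
```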
## 4 Subspace Identification Guarantee Model

Based on the theoretical results, we propose the Subspace Identification Guarantee (SIG) model, as shown in Figure 3, which contains a variational-inference-based neural architecture to model the marginal distribution and a class-aware conditional alignment to mitigate the target shift.

### 4.1 Variational-Inference-based Neural Architecture

According to the data generation process in Figure 1, we first derive the evidence lower bound (ELBO) in Equation (5):

$$
\text{ELBO} = \mathbb{E}_{q_{z|x}(\mathbf{z}|\mathbf{x})}\!\left[\ln p_{x|z}(\mathbf{x}|\mathbf{z})\right] + \mathbb{E}_{q_{z|x}(\mathbf{z}|\mathbf{x})}\!\left[\ln p_{y|u,z_2,z_3}(y|u, \mathbf{z}_2, \mathbf{z}_3)\right] + \mathbb{E}_{q_{z|x}(\mathbf{z}|\mathbf{x})}\!\left[\ln p_{u|z}(u|\mathbf{z})\right] - D_{\mathrm{KL}}\!\left(q_{z|x}(\mathbf{z}|\mathbf{x}) \,\|\, p_z(\mathbf{z})\right).
\qquad (5)
$$

Since the reconstruction of $u$ is not an optimization goal, we remove it and rewrite Equation (5) as the objective function in Equation (6), expressed as losses to be minimized:

$$
\begin{aligned}
\mathcal{L}_{elbo} &= \mathcal{L}_{vae} + \mathcal{L}_y, \\
\mathcal{L}_{vae} &= -\mathbb{E}_{q_{z|x}(\mathbf{z}|\mathbf{x})}\!\left[\ln p_{x|z}(\mathbf{x}|\mathbf{z})\right] + D_{\mathrm{KL}}\!\left(q_{z|x}(\mathbf{z}|\mathbf{x}) \,\|\, p_z(\mathbf{z})\right), \\
\mathcal{L}_y &= -\mathbb{E}_{q_{z|x}(\mathbf{z}|\mathbf{x})}\!\left[\ln p_{y|u,z_2,z_3}(y|u, \mathbf{z}_2, \mathbf{z}_3)\right].
\end{aligned}
\qquad (6)
$$

To minimize pairwise class confusion, we further incorporate the minimum class confusion loss [25] into the classification loss $\mathcal{L}_y$. According to the objective function in Equation (6), we illustrate how to implement the SIG model in Figure 3. First, we take the observed data $\mathbf{x}^{\mathcal{S}_i}$ and $\mathbf{x}^{\mathcal{T}}$ from the source domains and the target domain as inputs to pre-trained backbone networks such as ResNet50 and extract the features $\mathbf{f}^{\mathcal{S}_i}$ and $\mathbf{f}^{\mathcal{T}}$. We then employ an MLP-based encoder, which consists of bottleneck networks, $f_\mu$, and $f_\sigma$, to extract the latent variables $\mathbf{z}$. Next, we use $\mathbf{z}$ to reconstruct the pre-trained features via an MLP-based decoder to estimate the marginal distribution $p_{x|u}$. Finally, we take $\mathbf{z}_2$, $\mathbf{z}_3$, and the domain embedding as inputs to predict the source labels, thereby estimating $p_{y|u,z_2,z_3}$.

### 4.2 Class-aware Conditional Alignment

To estimate the target label distribution $p^{\mathcal{T}}_{y|u}$ and mitigate the influence of target shift, we propose a class-aware conditional alignment that automatically adjusts the conditional alignment loss for each sample. Formally, it can be written as

$$
\mathcal{L}_a = \sum_{i=1}^{C} w^{(i)}\, p_{\hat{y}^{(i)}} \left\| \hat{\mathbf{z}}^{(i)}_{3,S} - \hat{\mathbf{z}}^{(i)}_{3,T} \right\|_2, \qquad w^{(i)} = 1 + \exp\!\left(-H\!\left(p_{\hat{y}^{(i)}}\right)\right),
\qquad (7)
$$

where $C$ denotes the number of classes; $\hat{\mathbf{z}}^{(i)}_{3,S}$ and $\hat{\mathbf{z}}^{(i)}_{3,T}$ denote the latent variables of the $i$-th class from the source and target domains, respectively; $w^{(i)}$ denotes the prediction-uncertainty weight of each class in the target domain; $p_{\hat{y}^{(i)}}$ denotes the estimated label probability of the $i$-th class; and $H$ denotes the entropy.

The class-aware conditional alignment builds on the standard conditional alignment loss, which can be formalized as in Equation (8):

$$
\sum_{i=1}^{C} \left\| \hat{\mathbf{z}}^{(i)}_{3,S} - \hat{\mathbf{z}}^{(i)}_{3,T} \right\|_2.
\qquad (8)
$$

However, conventional conditional alignment usually suffers from two drawbacks: misestimated centroids and low-quality pseudo-labels. First, it implicitly assumes that the feature centroids from different domains coincide, but it is hard to estimate the correct target-domain centroids for classes with low probability density. Second, it relies heavily on the quality of the pseudo-labels, but existing methods usually use pseudo-labels without any discrimination, which might result in false alignment. To solve these problems, we consider two types of reweighting.

Distribution-based Reweighting for Misestimated Centroids: To address the first challenge, a straightforward solution is to take the label distribution of the target domain into account. We therefore employ the black box shift estimation (BBSE) method [37] to estimate the target label distribution $p_{\hat{y}}$ and use it to reweight the conditional alignment loss in Equation (8).

Entropy-based Reweighting for Low-quality Pseudo-labels: To address the second challenge, we consider the prediction uncertainty of each class in the target domain. Technically, for each sample in the target dataset, we calculate entropy-based weights from the prediction results, shown as $w^{(i)}$ in Equation (7).

By combining the distribution-based weights and the entropy-based weights, we obtain the class-aware conditional alignment in Equation (7). Hence the total loss of the Subspace Identification Guarantee model can be formalized as follows:

$$
\mathcal{L}_{total} = \mathcal{L}_y + \beta \mathcal{L}_{vae} + \alpha \mathcal{L}_a,
\qquad (9)
$$

where $\alpha, \beta$ denote hyper-parameters.
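The following PyTorch sketch shows how the components described in this section could fit together: the encoder with $f_\mu, f_\sigma$, the decoder over backbone features, the label predictor conditioned on $(\mathbf{z}_2, \mathbf{z}_3, u)$, the losses of Equation (6), and the class-aware alignment of Equation (7). Layer sizes, the backbone feature dimension, and the way the per-class entropy and the BBSE-estimated target label distribution are obtained are all placeholders/assumptions for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SIG(nn.Module):
    """Minimal sketch of the SIG architecture in Figure 3 (all dimensions are placeholders)."""
    def __init__(self, feat_dim=2048, n1=8, n2=8, n3=8, n4=8, n_class=65, n_dom=4):
        super().__init__()
        self.dims = (n1, n2, n3, n4)
        n = sum(self.dims)
        self.bottleneck = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.f_mu, self.f_sigma = nn.Linear(256, n), nn.Linear(256, n)     # encoder heads
        self.decoder = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.dom_emb = nn.Embedding(n_dom, 16)
        self.classifier = nn.Linear(n2 + n3 + 16, n_class)                 # models p(y | u, z2, z3)

    def forward(self, f, u):
        h = self.bottleneck(f)
        mu, logvar = self.f_mu(h), self.f_sigma(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)            # reparameterization trick
        z1, z2, z3, z4 = torch.split(z, list(self.dims), dim=-1)
        logits = self.classifier(torch.cat([z2, z3, self.dom_emb(u)], dim=-1))
        recon = self.decoder(z)
        return z3, logits, recon, mu, logvar

def elbo_losses(f, recon, mu, logvar, logits, y):
    """L_vae and L_y from Eq. (6); backbone features f are reconstructed instead of raw images."""
    l_rec = F.mse_loss(recon, f)
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    l_y = F.cross_entropy(logits, y)
    return l_rec + l_kl, l_y

def class_aware_alignment(z3_s, y_s, z3_t, logits_t, p_y_target):
    """Class-aware conditional alignment, Eq. (7).
    p_y_target: estimated target label distribution (e.g., from BBSE), shape (C,)."""
    probs_t = logits_t.softmax(-1)
    y_t = probs_t.argmax(-1)                              # pseudo-labels for the target batch
    loss = z3_s.new_zeros(())
    for c in range(p_y_target.numel()):
        if (y_s == c).any() and (y_t == c).any():
            mean_p = probs_t[y_t == c].mean(0)            # average prediction for class c
            ent = -(mean_p * mean_p.clamp_min(1e-8).log()).sum()
            w = 1.0 + torch.exp(-ent)                     # entropy-based weight w^(c)
            diff = z3_s[y_s == c].mean(0) - z3_t[y_t == c].mean(0)
            loss = loss + w * p_y_target[c] * diff.norm(p=2)
    return loss

# Total loss, Eq. (9): L_total = L_y + beta * L_vae + alpha * L_a.
```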
## 5 Experiments

### 5.1 Experiments on Simulation Data

In this subsection, we report experimental results on simulation data to evaluate the theoretical results of subspace identification in practice.

#### 5.1.1 Experimental Setup

Data Generation. We generate simulation data for binary classification with 8 domains. To better evaluate our theoretical results, we follow the data generation process in Figure 2, which includes two types of latent variables, i.e., domain-specific latent variables $\mathbf{z}_s$ and domain-invariant latent variables $\mathbf{z}_c$. We let the dimensions of $\mathbf{z}_s$ and $\mathbf{z}_c$ both be 2. Moreover, $\mathbf{z}_s$ is sampled from $u$ different mixtures of Gaussians, and $\mathbf{z}_c$ is sampled from a factorized Gaussian distribution. The data generation process from latent variables to observed variables is an MLP with the Tanh activation function. We further split the simulation dataset into training, validation, and test sets.

Evaluation Metrics. First, we use the accuracy on the target-domain data to measure the classification performance of the model. Second, we compute the Mean Correlation Coefficient (MCC) between the ground-truth $\mathbf{z}_s$ and the estimated $\hat{\mathbf{z}}_s$ on the test set to evaluate the component-wise identifiability of the domain-specific latent variables; a higher MCC denotes better identification performance. Third, to evaluate the subspace identifiability of the domain-specific latent variables, we first use the estimated $\hat{\mathbf{z}}_s$ from the validation set to regress each dimension of the ground-truth $\mathbf{z}_s$ from the validation set with an MLP. We then take $\hat{\mathbf{z}}_s$ from the test set as input and measure how well the MLP reconstructs the ground-truth $\mathbf{z}_s$ of the test set, using the Root Mean Square Error (RMSE) to quantify the extent of subspace identification. A low RMSE indicates that there exists a transformation $h_i$ between $z_{s,i}$ and $(\hat{z}_{s,1}, \hat{z}_{s,2})$, i.e., $z_{s,i} = h_i(\hat{z}_{s,1}, \hat{z}_{s,2})$ for $i \in \{1, 2\}$. For scenarios where the number of domains is less than 8, we first fix the target domain and then try all combinations of source domains, reporting the average performance over all combinations. We repeat each experiment over 3 random seeds.
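The two identifiability metrics described above can be computed as follows. This is a sketch of the evaluation protocol, assuming the ground-truth and estimated latent variables are available as NumPy arrays; the regressor architecture and hyper-parameters are placeholders rather than the exact settings used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.neural_network import MLPRegressor

def mcc(z_true, z_est):
    """Mean Correlation Coefficient: best one-to-one matching of |correlation|
    between true and estimated components (component-wise identifiability)."""
    d = z_true.shape[1]
    corr = np.corrcoef(z_true.T, z_est.T)[:d, d:]        # (d, d) cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))       # maximize total |corr|
    return np.abs(corr[row, col]).mean()

def subspace_rmse(z_est_val, z_true_val, z_est_test, z_true_test):
    """Fit z_{s,i} = h_i(zhat_s) on the validation split, report RMSE on the test split
    (low RMSE indicates z_s is recoverable from zhat_s, i.e., subspace identifiability)."""
    errs = []
    for i in range(z_true_val.shape[1]):
        reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
        reg.fit(z_est_val, z_true_val[:, i])
        pred = reg.predict(z_est_test)
        errs.append(np.sqrt(np.mean((pred - z_true_test[:, i]) ** 2)))
    return float(np.mean(errs))
```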
#### 5.1.2 Results and Discussion

Table 1: Experimental results on simulation data.

| State | U | ACC | MCC | RMSE |
|---|---|---|---|---|
| Component-wise Identification | 8 | 0.9982(0.0004) | 0.9037(0.0087) | 0.0433(0.0051) |
| Component-wise Identification | 6 | 0.9982(0.0007) | 0.8976(0.0162) | 0.0439(0.0073) |
| Component-wise Identification | 5 | 0.9982(0.0007) | 0.8973(0.0131) | 0.0441(0.0055) |
| Subspace Identification | 4 | 0.9233(0.2039) | 0.8484(0.1452) | 0.0582(0.0431) |
| Subspace Identification | 3 | 0.8679(0.2610) | 0.8077(0.1709) | 0.0669(0.0482) |
| No Identification | 2 | 0.5978(0.3039) | 0.6184(0.2093) | 0.1272(0.0608) |

The experimental results on the simulation dataset are shown in Table 1, from which we draw the following conclusions. 1) The MCC values increase with the number of domains; moreover, they are high (around 0.9) and stable when the number of domains is larger than 5. This corresponds to the theoretical result of component-wise identification, where a certain number of domains (i.e., $2n + 1$) is necessary. 2) The RMSE values decrease with the number of domains. Furthermore, they are low and stable (less than 0.07) when the number of domains is larger than 3, but the RMSE rises sharply when $u = 2$. These results coincide with the theoretical results of subspace identification as well as the intuition that a certain number of domains (i.e., $n_s + 1$) is necessary for subspace identification. 3) According to the ACC results, accuracy grows with the number of domains, and its pattern mirrors that of the RMSE, i.e., the performance is stable when the number of domains is larger than 3. The ACC results also indirectly support subspace identification, since one straightforward interpretation of subspace identification is that the domain-specific information is preserved in $\hat{\mathbf{z}}_s$; the latent variables are well disentangled with the help of subspace identification, which benefits the model performance.

### 5.2 Experiments on Real-world Data

#### 5.2.1 Experimental Setup

Table 2: Classification results on the Office-Home and ImageCLEF datasets. For the Office-Home dataset, we employ ResNet50 as the backbone network; for the ImageCLEF dataset, we employ ResNet18.

| Model | Art | Clipart | Product | Real World | Office-Home Avg. | P | C | I | ImageCLEF Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Source Only [16] | 64.5 | 52.3 | 77.6 | 80.7 | 68.8 | 77.2 | 92.3 | 88.1 | 85.8 |
| DANN [11] | 64.2 | 58.0 | 76.4 | 78.8 | 69.3 | 77.9 | 93.7 | 91.8 | 87.8 |
| DAN [40] | 68.2 | 57.9 | 78.4 | 81.9 | 71.6 | 77.6 | 93.3 | 92.2 | 87.7 |
| DCTN [69] | 66.9 | 61.8 | 79.2 | 77.7 | 71.4 | 75.0 | 95.7 | 90.3 | 87.0 |
| MFSAN [81] | 72.1 | 62.0 | 80.3 | 81.8 | 74.1 | 79.1 | 95.4 | 93.6 | 89.4 |
| WADN [54] | 75.2 | 61.0 | 83.5 | 84.4 | 76.1 | 77.7 | 95.8 | 93.2 | 88.9 |
| iMSDA [30] | 75.4 | 61.4 | 83.5 | 84.4 | 76.2 | 79.2 | 96.3 | 94.3 | 90.0 |
| SIG | 76.4 | 63.9 | 85.4 | 85.8 | 77.8 | 79.3 | 97.3 | 94.3 | 90.3 |

Datasets: We consider four benchmarks: Office-Home, PACS, ImageCLEF, and DomainNet. For each dataset, we let each domain in turn be the target domain and the remaining domains be the source domains. For the DomainNet dataset, we equip the ResNet101 backbone network with a cross-attention module for better usage of domain knowledge, and we also employ the alignment of MDD [78]. For the Office-Home and ImageCLEF datasets, we employ a pre-trained ResNet50 with an MLP-based classifier. For the PACS dataset, we use ResNet18 with an MLP-based classifier. The implementation details are provided in Appendix C. We report average results over 3 random seeds.
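The leave-one-domain-out protocol described above (each domain in turn serves as the unlabeled target, the rest as labeled sources, averaged over 3 seeds) can be written as a simple loop. This is a schematic sketch; `run_sig` is a hypothetical placeholder for training and evaluating the model of Section 4 on a given source/target split.

```python
# Leave-one-domain-out evaluation used for every benchmark.
DOMAINS = ["Art", "Clipart", "Product", "Real World"]       # e.g., Office-Home

def run_benchmark(run_sig, domains=DOMAINS, seeds=(0, 1, 2)):
    results = {}
    for target in domains:
        sources = [d for d in domains if d != target]       # all remaining domains as sources
        accs = [run_sig(sources, target, seed) for seed in seeds]
        results[target] = sum(accs) / len(accs)              # average over random seeds
    results["Average"] = sum(results.values()) / len(domains)
    return results
```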
Table 3: Classification results on the PACS dataset. We employ ResNet18 as the backbone network. Results of the compared methods are taken from [30].

| Model | A | C | P | S | Average |
|---|---|---|---|---|---|
| Source Only [16] | 74.9 | 72.1 | 94.5 | 64.7 | 76.7 |
| DANN [11] | 81.9 | 77.5 | 91.8 | 74.6 | 81.5 |
| MDAN [79] | 79.1 | 76.0 | 91.4 | 72.0 | 79.6 |
| WBN [43] | 89.9 | 89.7 | 97.4 | 58.0 | 83.8 |
| MCD [50] | 88.7 | 88.9 | 96.4 | 73.9 | 87.0 |
| M3SDA [46] | 89.3 | 89.9 | 97.3 | 76.7 | 88.3 |
| CMSS [70] | 88.6 | 90.4 | 96.9 | 82.0 | 89.5 |
| LtC-MSDA [63] | 90.1 | 90.4 | 97.2 | 81.5 | 89.8 |
| T-SVDNet [33] | 90.4 | 90.6 | 98.5 | 85.4 | 91.2 |
| iMSDA [30] | 93.7 | 92.4 | 98.4 | 89.2 | 93.4 |
| SIG | 94.1 | 93.6 | 98.6 | 89.5 | 93.9 |

Baselines: Besides classical approaches for single-source domain adaptation such as DANN [11], DAN [40], MCD [50], and ADDA [61], we also compare our method with several state-of-the-art multi-source domain adaptation methods, for example MIAN-γ [45], T-SVDNet [33], LtC-MSDA [63], SPS [64], and PFDA [10]. Moreover, we further consider WADN [54], which is devised for the target shift setting of multi-source domain adaptation; for a fair comparison, we employ the same pre-trained backbone networks instead of the pre-trained features used by WADN in the original paper. We also consider the latest iMSDA [30], which addresses MSDA via component-wise identification.

#### 5.2.2 Results and Discussion

Experimental results on Office-Home, ImageCLEF, PACS, and DomainNet are shown in Tables 2, 3, and 4, respectively. Results of other compared methods are provided in Appendix D.2. According to the results on the Office-Home dataset on the left side of Table 2, our SIG model significantly outperforms all other baselines on all transfer tasks. Specifically, our method outperforms the most competitive baseline by a clear margin of 1.3%-4% and substantially improves classification accuracy on the hard transfer task, e.g., Clipart. Notably, our method achieves better performance than WADN, which is designed for the target shift scenario. This is because our method not only considers how the domain variables influence the label distribution but also identifies the latent variables of the data generation process. Moreover, our SIG method also outperforms iMSDA, indirectly reflecting that the proposed data generation process is closer to the real-world setting and that subspace identification can achieve better disentanglement under limited auxiliary variables.

For the ImageCLEF and PACS datasets, our method also achieves the best average results. In detail, we achieve comparable performance on all transfer tasks of the ImageCLEF dataset, and on the PACS dataset our SIG method still performs better than the latest compared methods such as iMSDA and T-SVDNet on some challenging transfer tasks like Cartoon. Finally, we also consider the most challenging dataset, DomainNet, which contains more classes and more complex domain shifts.

Table 4: Classification results on the DomainNet dataset. We employ ResNet101 as the backbone network. Results of the compared methods are taken from [34] and [64].
| Model | Clipart | Infograph | Painting | Quickdraw | Real | Sketch | Average |
|---|---|---|---|---|---|---|---|
| Source Only [16] | 52.1 | 23.4 | 47.6 | 13.0 | 60.7 | 46.5 | 40.6 |
| ADDA [61] | 47.5 | 11.4 | 36.7 | 14.7 | 49.1 | 33.5 | 32.2 |
| MCD [50] | 54.3 | 22.1 | 45.7 | 7.6 | 58.4 | 43.5 | 38.5 |
| DANN [11] | 60.6 | 25.8 | 50.4 | 7.7 | 62.0 | 51.7 | 43.0 |
| DCTN [69] | 48.6 | 23.5 | 48.8 | 7.2 | 53.5 | 47.3 | 38.2 |
| M3SDA-β [46] | 58.6 | 26.0 | 52.3 | 6.3 | 62.7 | 49.5 | 42.6 |
| ML_MSDA [35] | 61.4 | 26.2 | 51.9 | 19.1 | 57.0 | 50.3 | 44.3 |
| meta-MCD [32] | 62.8 | 21.4 | 50.5 | 15.5 | 64.6 | 50.4 | 44.2 |
| LtC-MSDA [63] | 63.1 | 28.7 | 56.1 | 16.3 | 66.1 | 53.8 | 47.4 |
| CMSS [70] | 64.2 | 28.0 | 53.6 | 16.9 | 63.4 | 53.8 | 46.5 |
| DRT+ST [34] | 71.0 | 31.6 | 61.0 | 12.3 | 71.4 | 60.7 | 51.3 |
| SPS [64] | 70.8 | 24.6 | 55.2 | 19.4 | 67.5 | 57.6 | 49.2 |
| PFDA [10] | 64.5 | 29.2 | 57.6 | 17.2 | 67.2 | 55.1 | 48.5 |
| iMSDA [30] | 68.1 | 25.9 | 57.4 | 17.3 | 64.2 | 52.0 | 47.5 |
| SIG | 72.7 | 32.0 | 61.5 | 20.5 | 72.4 | 59.5 | 53.0 |

The results in Table 4 show the strong performance of the proposed SIG method, which provides a 3.3% average improvement, although its performance on the Sketch task is slightly lower than that of DRT+ST. Compared with iMSDA, our SIG surpasses it by a large margin under a more general data generation process.

Figure 4: Ablation study on the Office-Home dataset, where we explore the impact of different loss terms.

Ablation Study: To evaluate the effectiveness of the individual loss terms, we devise two model variants: 1) SIG-sem, which removes the class-aware alignment loss; and 2) SIG-vae, which removes the reconstruction loss and the KL divergence loss. Experimental results on the Office-Home dataset are shown in Figure 4. We find that the class-aware alignment loss plays an important role in model performance, reflecting that class-aware alignment can mitigate the influence of target shift. We also observe that incorporating the reconstruction and KL divergence losses has a positive impact on the overall performance of the model, which shows the necessity of modeling the marginal distributions.

## 6 Conclusion

This paper presents a general data generation process for multi-source domain adaptation that coincides with real-world scenarios. Based on this data generation process, we prove that the changing latent variables are subspace identifiable, which provides a novel solution for disentangled representation. Compared with existing methods, the proposed subspace identification theory requires fewer auxiliary variables and frees the model from the monotonic transformation of latent variables, making it possible to apply the proposed method to real-world data. Experimental results on several mainstream benchmark datasets further demonstrate the effectiveness of the proposed Subspace Identification Guarantee model. In summary, this paper takes a meaningful step for causal representation learning.

Broader Impacts: SIG disentangles the latent variables to create a model that is more transparent, thereby aiding in the reduction of bias and the promotion of fairness.

Limitation: The proposed subspace identification still requires several assumptions that might not be met in real-world scenarios. Therefore, how to further relax these assumptions, e.g., the conditional independence assumption, would be an interesting future direction.

## 7 Acknowledgements

We are very grateful to the anonymous reviewers for their help in improving the paper.
This research was supported in part by the National Key R&D Program of China (2021ZD0111501), the National Science Fund for Excellent Young Scholars (62122022), the Natural Science Foundation of China (61876043, 61976052), and the major key project of PCL (PCL2021A12). This project is also partially supported by NSF Grant 2229881, the National Institutes of Health (NIH) under Contract R01HL159805, a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research.

References

[1] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. arXiv preprint arXiv:1903.09734, 2019.
[2] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. Advances in Neural Information Processing Systems, 29, 2016.
[3] Ruichu Cai, Zijian Li, Pengfei Wei, Jie Qiao, Kun Zhang, and Zhifeng Hao. Learning disentangled semantic representation for domain adaptation. In IJCAI: Proceedings of the Conference, volume 2019, page 2060. NIH Public Access, 2019.
[4] Woong-Gi Chang, Tackgeun You, Seonguk Seo, Suha Kwak, and Bohyung Han. Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7354-7362, 2019.
[5] Chao Chen, Zhihong Chen, Boyuan Jiang, and Xinyu Jin. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3296-3303, 2019.
[6] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 627-636, 2019.
[7] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International Conference on Machine Learning, pages 1081-1090. PMLR, 2019.
[8] Yuansi Chen and Peter Bühlmann. Domain adaptation under structural causal models. The Journal of Machine Learning Research, 22(1):11856-11935, 2021.
[9] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.
[10] Yangye Fu, Ming Zhang, Xing Xu, Zuo Cao, Chao Ma, Yanli Ji, Kai Zuo, and Huimin Lu. Partial feature selection and alignment for multi-source domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16654-16663, 2021.
[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180-1189. PMLR, 2015.
[12] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. A unified view of label shift estimation. Advances in Neural Information Processing Systems, 33:3290-3300, 2020.
[13] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839-2848. PMLR, 2016.
[14] Hermanni Hälvä and Aapo Hyvärinen. Hidden Markov nonlinear ICA: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pages 939-948. PMLR, 2020.
[15] Hermanni Hälvä, Sylvain Le Corff, Luc Lehéricy, Jonathan So, Yongjie Zhu, Elisabeth Gassiat, and Aapo Hyvärinen. Disentangling identifiable features from noisy data with structured nonlinear ICA. Advances in Neural Information Processing Systems, 34:1624-1633, 2021.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[17] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1989-1998. PMLR, 2018.
[18] Aapo Hyvärinen. Independent component analysis: recent advances. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110534, 2013.
[19] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis. Studies in Informatics and Control, 11(2):205-207, 2002.
[20] Aapo Hyvärinen, Ilyes Khemakhem, and Ricardo Monti. Identifiability of latent-variable and structural-equation models: from linear to nonlinear. arXiv preprint arXiv:2302.02672, 2023.
[21] Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems, 29, 2016.
[22] Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pages 460-469. PMLR, 2017.
[23] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429-439, 1999.
[24] Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859-868. PMLR, 2019.
[25] Ying Jin, Ximei Wang, Mingsheng Long, and Jianmin Wang. Minimum class confusion for versatile domain adaptation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXI, pages 464-480. Springer, 2020.
[26] Guoliang Kang, Lu Jiang, Yunchao Wei, Yi Yang, and Alexander Hauptmann. Contrastive adaptation network for single- and multi-source domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1793-1804, 2020.
[27] Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207-2217. PMLR, 2020.
[28] Ilyes Khemakhem, Ricardo Monti, Diederik Kingma, and Aapo Hyvärinen. ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768-12778, 2020.
[29] David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.
[30] Lingjing Kong, Shaoan Xie, Weiran Yao, Yujia Zheng, Guangyi Chen, Petar Stojanov, Victor Akinwande, and Kun Zhang. Partial disentanglement for domain adaptation. In International Conference on Machine Learning, pages 11455-11472. PMLR, 2022.
[31] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan.
Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
[32] Da Li and Timothy Hospedales. Online meta-learning for multi-source and semi-supervised domain adaptation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XVI, pages 382-403. Springer, 2020.
[33] Ruihuang Li, Xu Jia, Jianzhong He, Shuaijun Chen, and Qinghua Hu. T-SVDNet: Exploring high-order prototypical correlations for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9991-10000, 2021.
[34] Yunsheng Li, Lu Yuan, Yinpeng Chen, Pei Wang, and Nuno Vasconcelos. Dynamic transfer for multi-source domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10998-11007, 2021.
[35] Zhenpeng Li, Zhen Zhao, Yuhong Guo, Haifeng Shen, and Jieping Ye. Mutual learning network for multi-source domain adaptation. arXiv preprint arXiv:2003.12944, 2020.
[36] Zijian Li, Ruichu Cai, Tom ZJ Fu, and Kun Zhang. Transferable time-series forecasting under causal conditional shift. arXiv preprint arXiv:2111.03422, 2021.
[37] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122-3130. PMLR, 2018.
[38] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114-4124. PMLR, 2019.
[39] Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, and Olivier Bachem. Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258, 2019.
[40] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97-105. PMLR, 2015.
[41] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pages 2208-2217. PMLR, 2017.
[42] Sara Magliacane, Thijs Van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. Advances in Neural Information Processing Systems, 31, 2018.
[43] Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulo, Barbara Caputo, and Elisa Ricci. Boosting domain adaptation by discovering latent domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3771-3780, 2018.
[44] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[45] Geon Yeong Park and Sang Wan Lee. Information-theoretic regularization for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9214-9223, 2021.
[46] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406-1415, 2019.
[47] Chuan-Xian Ren, Yong-Hui Liu, Xi-Wen Zhang, and Ke-Kun Huang.
Multi-source unsupervised domain adaptation via pseudo target domain. IEEE Transactions on Image Processing, 31:2122-2135, 2022.
[48] Manley Roberts, Pranav Mani, Saurabh Garg, and Zachary Lipton. Unsupervised learning under latent label shift. Advances in Neural Information Processing Systems, 35:18763-18778, 2022.
[49] Alexander Robey, George J Pappas, and Hamed Hassani. Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210-20229, 2021.
[50] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723-3732, 2018.
[51] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612-634, 2021.
[52] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[53] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.
[54] Changjian Shui, Zijian Li, Jiaqi Li, Christian Gagné, Charles X Ling, and Boyu Wang. Aggregating from multiple target-shifted sources. In International Conference on Machine Learning, pages 9638-9648. PMLR, 2021.
[55] Petar Stojanov, Mingming Gong, Jaime Carbonell, and Kun Zhang. Data-driven approach to multiple-source domain adaptation. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3487-3496. PMLR, 2019.
[56] Petar Stojanov, Zijian Li, Mingming Gong, Ruichu Cai, Jaime Carbonell, and Kun Zhang. Domain adaptation with invariant representation learning: What transformations to learn? Advances in Neural Information Processing Systems, 34:24791-24803, 2021.
[57] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[58] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33:19276-19289, 2020.
[59] Takeshi Teshima, Issei Sato, and Masashi Sugiyama. Few-shot domain adaptation by causal mechanism transfer. In International Conference on Machine Learning, pages 9458-9469. PMLR, 2020.
[60] Frederik Träuble, Elliot Creager, Niki Kilbertus, Francesco Locatello, Andrea Dittadi, Anirudh Goyal, Bernhard Schölkopf, and Stefan Bauer. On disentangled representations learned from correlated data. In International Conference on Machine Learning, pages 10401-10412. PMLR, 2021.
[61] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167-7176, 2017.
[62] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[63] Hang Wang, Minghao Xu, Bingbing Ni, and Wenjun Zhang. Learning to combine: Knowledge aggregation for multi-source domain adaptation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII, pages 727-744.
Springer, 2020.
[64] Zengmao Wang, Chaoyang Zhou, Bo Du, and Fengxiang He. Self-paced supervision for multi-source domain adaptation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, 2022.
[65] Jun Wen, Nenggan Zheng, Junsong Yuan, Zhefeng Gong, and Changyou Chen. Bayesian uncertainty matching for unsupervised domain adaptation. arXiv preprint arXiv:1906.09693, 2019.
[66] Junfeng Wen, Russell Greiner, and Dale Schuurmans. Domain aggregation networks for multi-source domain adaptation. In International Conference on Machine Learning, pages 10214-10224. PMLR, 2020.
[67] Shaoan Xie, Lingjing Kong, Mingming Gong, and Kun Zhang. Multi-domain image generation and translation with identifiability guarantees. In The Eleventh International Conference on Learning Representations.
[68] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5423-5432. PMLR, 2018.
[69] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3964-3973, 2018.
[70] Luyu Yang, Yogesh Balaji, Ser-Nam Lim, and Abhinav Shrivastava. Curriculum manager for source selection in multi-source domain adaptation. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIV, pages 608-624. Springer, 2020.
[71] Weiran Yao, Guangyi Chen, and Kun Zhang. Temporally disentangled representation learning. arXiv preprint arXiv:2210.13647, 2022.
[72] Weiran Yao, Yuewen Sun, Alex Ho, Changyin Sun, and Kun Zhang. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.
[73] Kun Zhang and Laiwan Chan. Kernel-based nonlinear independent component analysis. In Independent Component Analysis and Signal Separation: 7th International Conference, ICA 2007, London, UK, September 9-12, 2007, Proceedings, pages 301-308. Springer, 2007.
[74] Kun Zhang and Laiwan Chan. Minimal nonlinear distortion principle for nonlinear independent component analysis. Journal of Machine Learning Research, 9(Nov):2455-2487, 2008.
[75] Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[76] Kun Zhang, Mingming Gong, Petar Stojanov, Biwei Huang, Qingsong Liu, and Clark Glymour. Domain adaptation as a problem of inference on graphical models. Advances in Neural Information Processing Systems, 33:4965-4976, 2020.
[77] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819-827. PMLR, 2013.
[78] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In International Conference on Machine Learning, pages 7404-7413. PMLR, 2019.
[79] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. Advances in Neural Information Processing Systems, 31, 2018.
[80] Yujia Zheng, Ignavier Ng, and Kun Zhang. On the identifiability of nonlinear ICA with unconditional priors.
In ICLR 2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality, 2022.
[81] Yongchun Zhu, Fuzhen Zhuang, and Deqing Wang. Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5989-5996, 2019.
[82] Yongchun Zhu, Fuzhen Zhuang, Jindong Wang, Guolin Ke, Jingwu Chen, Jiang Bian, Hui Xiong, and Qing He. Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems, 32(4):1713-1722, 2020.