# Label-Noise Robust Domain Adaptation

Xiyu Yu¹, Tongliang Liu², Mingming Gong³, Kun Zhang⁴, Kayhan Batmanghelich⁵, Dacheng Tao²

¹Department of Computer Vision Technology (VIS), Baidu Incorporation. ²UBTECH Sydney AI Centre, The University of Sydney. ³School of Mathematics and Statistics, University of Melbourne. ⁴Department of Biomedical Informatics, University of Pittsburgh. ⁵Department of Philosophy, Carnegie Mellon University. Correspondence to: Xiyu Yu, Tongliang Liu.

**Abstract.** Domain adaptation aims to correct classifiers when faced with distribution shift between source (training) and target (test) domains. State-of-the-art domain adaptation methods make use of deep networks to extract domain-invariant representations. However, existing methods assume that all the instances in the source domain are correctly labeled, while in reality it is unsurprising that we may obtain a source domain with noisy labels. In this paper, we are the first to comprehensively investigate how label noise can adversely affect existing domain adaptation methods in various scenarios. Further, we theoretically prove that there exists a method that can essentially reduce the side effect of noisy source labels in domain adaptation. Specifically, focusing on the generalized target shift scenario, where both the label distribution $P_Y$ and the class-conditional distribution $P_{X|Y}$ can change, we show that the Denoising Conditional Invariant Component (DCIC) framework provably ensures (1) extracting invariant representations given examples with noisy labels in the source domain and unlabeled examples in the target domain, and (2) estimating the label distribution in the target domain without bias. Experimental results on both synthetic and real-world data verify the effectiveness of the proposed method.

## 1. Introduction

In the classical domain adaptation setting, given raw features $\{x^T_1, \dots, x^T_n\}$ from a target domain, we aim to learn
Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

a function to predict the labels $\{y^T_1, \dots, y^T_n\}$ using labeled data $\{(x^S_1, y^S_1), \dots, (x^S_m, y^S_m)\}$ from a different but related source domain (Wu et al., 2019). Let $X$ and $Y$ be the variables of features and labels, respectively. In contrast to standard supervised learning, the joint distributions $P^S_{XY}$ and $P^T_{XY}$ are different. For example, in medical data analysis, health record data collected from patients of different age groups or hospital locations often vary (Purushotham et al., 2017). Transferring invariant knowledge from one domain (e.g., an age group or a location) with a large set of labeled examples to another with unlabeled data is desirable (Raghu et al., 2019), since it is often laborious to obtain high-quality labels for clinical data (Dubois et al., 2017).

According to the assumptions about how the joint distribution $P_{XY}$ shifts across domains, several domain adaptation scenarios have been studied. (1) Covariate shift assumes that the marginal distribution $P_X$ changes but the conditional distribution $P_{Y|X}$ stays the same. In this situation, methods have been proposed to correct the shift in $P_X$, for instance, by importance reweighting (Huang et al., 2007) and invariant feature learning (Long et al., 2015; Kumagai et al., 2019; Meyerson & Miikkulainen, 2019; Chen et al., 2019a). (2) Model shift (Wang et al., 2014) assumes that $P_X$ and $P_{Y|X}$ change independently. In this case, it also requires $Y$ to be continuous, the change in $P_{Y|X}$ to be smooth, and some labeled data to be available in the target domain. (3) Target shift (Zhang et al., 2013a; Azizzadenesheli et al., 2019) assumes that $P_Y$ shifts while $P_{X|Y}$ stays the same. In this scenario, $P_X$ and $P_{Y|X}$ change dependently because their changes are both caused by the change in $P_Y$.
(4) Generalized target shift (Zhang et al., 2013a) assumes that $P_{X|Y}$ and $P_Y$ change independently across domains, causing $P_X$ and $P_{Y|X}$ to change dependently. An interpretation of the difference between these scenarios from a causal standpoint has also been provided (Schölkopf et al., 2012a).

Additionally, the aforementioned domain adaptation methods extract invariant features across different domains based on a strong assumption: that the source domain labels are accurate. However, since accurately labeling a training set tends to be expensive, time-consuming, and sometimes impossible, this assumption is often violated in practice. For example, in medical data analysis, due to the subjectivity of domain experts, insufficient discriminative information, and digitalization errors (Sáez et al., 2016), noisy labels are often inevitable. In computer vision, to reduce expensive human supervision, we often prefer directly transferring knowledge from easily obtainable but imperfectly labeled source data, such as webly-labeled or machine-labeled data, to target data (Xu et al., 2016; Lee et al., 2018).

Therefore, in this paper, we consider the domain adaptation setting in which the observed labels in the source domain are noisy. As such, we have no access to the true source distribution. One may think that this issue is easy to solve by combining existing label-noise learning methods with domain adaptation methods, for example, by simply applying a label-noise robust classifier after extracting invariant features across domains with an existing domain adaptation method. However, in many settings, label noise degrades invariant feature extraction, and the unlabeled data in the target domain are also helpful for denoising. A simple combination is therefore inefficient.
As expected, except for the covariate shift scenario, in which correcting the shift in $P_X$ does not require label information, we can show that label noise can adversely affect most existing domain adaptation methods. Take target shift as an example: by assuming $P_{X|Y}$ is invariant across domains, the shift in $P_Y$ can be corrected by estimating the class ratio between $P^T_Y$ and $P^S_Y$ via a mixture proportion estimation problem (Zhang et al., 2013a; Iyer et al., 2014). However, when labels in the source domain are corrupted, the information of $P_{X|Y}$ is unknown, and it is then unclear whether the class ratio $P^T_Y / P^S_Y$ can be estimated. Another example is generalized target shift. In this scenario, the estimated $P^T_Y / P^S_Y$ can be incorrect. Further, invariant features are often learned by matching distributions across domains, which heavily relies on the estimate of $P^T_Y / P^S_Y$. As a result, label noise can lead to biased feature learning through an incorrect estimate of $P^T_Y / P^S_Y$. Label noise also affects learning under model shift, but we do not consider this case because we are concerned with discrete labels and the setting where no labels exist in the target domain.

To address this issue, we propose a label-noise robust domain adaptation method for the generalized target shift scenario. To deal with label noise, we propose a novel method to denoise conditional invariant components. Our method can provably identify the changes in the distribution $P_Y$ and extract conditionally invariant representations by reducing the side effect of label noise using both source and target data. Specifically, we construct a new distribution $P^{new}_X$ which is marginalized from the weighted noisy source distribution $P^S_{\rho,XY}$. Here, we denote by $P_\rho$ the distributions associated with label noise. By matching $P^{new}_X$ and $P^T_X$, the conditional invariant components and $P^T_Y$ are identifiable from the noisy source data and unlabeled target data.
Moreover, within our denoising conditional invariant component framework, we can also theoretically ensure the convergence of the estimate of the label distribution in the target domain. To verify the effectiveness of our method, we conduct comprehensive experiments on both synthetic and real-world data. The performance is evaluated on classification problems. For a fair comparison, after extracting invariant features using domain adaptation methods, we train the robust classifier by employing the forward method of (Patrini et al., 2017). Compared with state-of-the-art domain adaptation methods, our method achieves superior performance.

## 2. Related Work

**Classification with Label Noise.** Learning with noisy labels in classification has been widely studied (Long & Servedio, 2008; Van Rooyen et al., 2015). These methods can be coarsely grouped into three categories: unbiased losses or risk minimizers (Natarajan et al., 2013; Xu et al., 2019a; Sukhbaatar et al., 2014; Patrini et al., 2017; Han et al., 2018a), bootstrapping losses (Arazo et al., 2019), and label noise reweighting and cleansing (Jiang et al., 2018; Han et al., 2018b; Chen et al., 2019b; Thulasidasan et al., 2019; Nguyen et al., 2020). Learning with complementary labels (Xu et al., 2019b; Yu et al., 2018b; Ishida et al., 2017; L. Feng & Sugiyama, 2020; Y.-T. Chou & Sugiyama, 2020) can also be viewed as a special case of learning with label noise; such methods often exploit similar ideas when designing robust models. Depending on whether label noise is explicitly modeled with a transition matrix, label-noise robust methods can also be classified into transition-matrix-based methods (B. Han & Sugiyama, 2020; Xia et al., 2019; 2020) and transition-matrix-free methods (Yang et al., 2019; Han et al., 2018c; Cheng et al., 2020; Liu & Guo, 2020; Wu et al., 2020). Our method belongs to the first category.
However, the problem considered here is more challenging because the clean source domain distribution is not assumed to be identical to the target domain distribution. In contrast to classification with label noise, our method can learn invariant features across different domains, where both $P_Y$ and $P_{X|Y}$ may change and the labels of the source data are corrupted. Reports on general results in this setting are scarce.

**Traditional Generalized Target Shift Methods.** Existing generalized target shift methods assume that there exists a transformation $\tau$, e.g., a location-scale transformation (Zhang et al., 2013a; Gong et al., 2016), such that the conditional distribution $P_{\tau(X)|Y}$ is invariant across domains. In this paper, we also assume that the conditional invariant components (CICs) exist. We aim to find a transformation $\tau$ such that $P^T(\tau(X)|Y) = P^S(\tau(X)|Y)$ as in (Gong et al., 2016) and to estimate $P^T(Y)$. However, we are given only samples drawn from the distribution $P^T_X$ and the noisy distribution $P^S_{\rho,XY}$, which makes the problem challenging.

Figure 1. Possible situations of domain adaptation with label noise. $V^1_s$ and $V^2_s$ are independent domain-specific selection variables, leading to changing $P_{XY}$ across domains. (a) Model shift: $V^1_s$ and $V^2_s$ change $P_X$ and $P_{Y|X}$, respectively. (b) Generalized target shift: $V^1_s$ and $V^2_s$ change $P_Y$ and $P_{X|Y}$, respectively. In the first scenario, $X$ is a cause of $Y$, whilst in the second scenario, $Y$ is a cause of $X$. If $V^2_s$ is not present, (a) reduces to covariate shift and (b) reduces to target shift. In our setting, the true labels $Y$ in the source domain are unobservable; we only observe noisy labels $\hat{Y}$.

Note that our work is not a simple combination of traditional generalized target shift methods and robust classifiers.
As aforementioned, a simple combination of domain adaptation and a label-noise robust classifier overlooks the fact that the learning of invariant features can be affected by label noise, which thus produces biased results. In the setting where only noisy source data and unlabeled target data are available, learning $\tau$ becomes quite challenging: without clean labels $Y$ in both domains, no direct information is available to ensure the identity of the conditional distributions $P(\tau(X)|Y)$, so $\tau$ is hard to learn. Moreover, it is challenging to estimate $P^T(Y)$, as briefly discussed in the introduction. Therefore, we propose a novel denoising conditional invariant component framework. It is able to identify $P^T(Y)$ and the conditional invariant components $\tau(X)$ from the noisy source data and unlabeled target data. In this paper, simple combinations of domain adaptation methods with robust classifiers are included as baselines in our experiments. Our method strongly outperforms these baselines, verifying the superiority of the proposed method in extracting invariant features across different domains.

## 3. The Effects of Label Noise

In this section, we examine the effects of label noise in four different domain adaptation scenarios, namely 1) covariate shift, 2) model shift, 3) target shift, and 4) generalized target shift. From a causal perspective, 1) and 2) assume that $X$ causes $Y$, indicating that $P_X$ and $P_{Y|X}$ contain no information about each other (Schölkopf et al., 2012b). In domain adaptation, this causal relation implies that changes in $P_X$ are independent of changes in $P_{Y|X}$. If the change in $P_{Y|X}$ is large, then it is difficult to correct the shift in $P_{Y|X}$ because we often have no or scarce labels in the target domain. On the contrary, 3) and 4) assume that $Y$ is the cause of $X$, implying that changes in $P_Y$ and $P_{X|Y}$ are independent, while changes in $P_X$ and $P_{Y|X}$ depend on each other.
Figure 1 represents the causal relations between variables in domain adaptation using the selection diagram defined in (Pearl & Bareinboim, 2011). Here, although the noisy label $\hat{Y}$ is usually generated after $X$ is observed, we exploit the causal model $Y \to \hat{Y}$ according to the assumption that flip rates are independent of features, which is widely employed in the label noise setting (Natarajan et al., 2013; Patrini et al., 2017; Scott, 2015). The effects of label noise in the different scenarios are summarized as follows.

**Covariate shift.** In covariate shift (Huang et al., 2007; Zhang et al., 2013b), label noise has no effect on the correction of the shift in $P_X$. However, after correcting the shift in $P_X$, one needs to take the effects of label noise into account when training a classifier on the source domain (Natarajan et al., 2013; Liu & Tao, 2016). This problem can be efficiently solved by a simple combination of label-noise learning and domain adaptation.

**Model shift.** In model shift (Wang et al., 2014), since $P_X$ and $P_{Y|X}$ change independently, we can correct them separately. Similar to covariate shift, correcting $P_X$ is not affected by label noise. However, correcting the shift in $P_{Y|X}$ requires matching $P^S_{Y|X}$ and $P^T_{Y|X}$, which can be seriously harmed by label noise. In this scenario, since a small number of clean labels are assumed to be available in the target domain, $P_{Y|X}$ is often assumed to change smoothly across domains to reduce the estimation error. The smoothness constraint can reduce the effects of label noise to some extent if one directly matches $P^S_{\rho,Y|X}$ and $P^T_{Y|X}$.

**Target shift.** In target shift (Iyer et al., 2014; Zhang et al., 2013a; Jiaxian Guo & Tao, 2020), it is required that $P^S_{X|Y} = P^T_{X|Y}$. The changes in $P_Y$ are often corrected by matching the marginal distribution of the reweighted source domain, $P^{new}_X = \sum_{i=1}^{c} P^S_{X|Y=i}\, P^S_{Y=i}\, \beta(Y=i)$, and the target domain $P^T_X$, where $\beta(Y=i) = P^T_{Y=i}/P^S_{Y=i}$ and $c$ is the number of classes.
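In the clean target-shift case, this reweighting idea can be illustrated with a minimal moment-matching sketch: when the class-conditional distributions are invariant, the target mean is a mixture of the class-conditional means, and for two classes the target prior can be recovered from a single linear equation. This is only an illustration of mixture proportion estimation (matching means rather than full kernel mean embeddings), with made-up one-dimensional Gaussian components.

```python
import random

def estimate_target_prior(mu1, mu2, target_mean):
    """Two-class target shift: solve w*mu1 + (1-w)*mu2 = target_mean for w."""
    return (target_mean - mu2) / (mu1 - mu2)

random.seed(0)
mu1, mu2 = -1.0, 2.0          # invariant class-conditional means (illustrative)
w_true = 0.3                  # target prior of class 1
# Draw a large target sample from the mixture w*N(mu1,1) + (1-w)*N(mu2,1).
sample = [random.gauss(mu1, 1.0) if random.random() < w_true else random.gauss(mu2, 1.0)
          for _ in range(100000)]
target_mean = sum(sample) / len(sample)
w_hat = estimate_target_prior(mu1, mu2, target_mean)
print(round(w_hat, 2))
```

With clean labels the recovered prior matches the true one up to sampling error; the next paragraph shows why this breaks once the source labels are noisy.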
In the presence of label noise, however, we only have access to $P^S_{\rho,X|Y}$ and $P^S_{\rho,Y}$ in the source domain, and the estimate of $P^T_Y$ can be incorrect. Take a binary problem as an example: let $P^T_X = \omega_{\rho 1} P^S_{\rho,X|Y=1} + \omega_{\rho 2} P^S_{\rho,X|Y=2}$, $P^T_X = \omega_1 P^S_{X|Y=1} + \omega_2 P^S_{X|Y=2}$, and $\pi_{ij} = P^S(Y=j\,|\,\hat{Y}=i)$, $i, j \in \{1, 2\}$. Here, $\omega_i$ and $\omega_{\rho i}$ represent $P^T_{Y=i}$ and $P^S_{\rho,Y=i}$ ($i = 1, 2$), respectively. Then,

**Proposition 1.** We have $\omega_{\rho i} = \omega_i$, $i = 1, 2$, only when $\pi_{12}\omega_1 = \pi_{21}\omega_2$.

Here, $P^S_{\rho,X|Y=1} = \pi_{11} P^S_{X|Y=1} + \pi_{12} P^S_{X|Y=2}$ and $P^S_{\rho,X|Y=2} = \pi_{21} P^S_{X|Y=1} + \pi_{22} P^S_{X|Y=2}$ are known as mutually contaminated distributions (Menon et al., 2015). Note that $\pi_{ij}$ and the transition probabilities $P(\hat{Y}=j|Y=i)$, $i, j \in \{1, \dots, c\}$, are related via Bayes' rule. According to Proposition 1, in the special case where the label distribution of the target domain is balanced and the label noise is symmetric, label noise does not affect the estimation of $P^T_Y$. But in most cases, $\omega_i \neq \omega_{\rho i}$. This indicates that we cannot directly estimate $P^T_Y$ from the noisy source data and unlabeled target data. A detailed proof of Proposition 1 can be found in the Supplementary Material.

**Generalized target shift.** In generalized target shift (Zhang et al., 2013a; Gong et al., 2016), $P_{X|Y}$ also changes across domains, but it changes independently of $P_Y$. A widely employed approach is to learn conditional invariant components $X'$ that satisfy $P^S_{X'|Y} = P^T_{X'|Y}$. Under this assumption, many works jointly learn $X'$ and $P^T(Y)$ by matching $P^{new}_{X'} = \sum_{i=1}^{c} P_{X'|Y=i}\, P^S_{Y=i}\, \beta(Y=i)$ and $P^T_{X'}$, which naturally requires the information of $P^S_{XY}$ and $P^T_X$. However, in the label noise setting, similarly to target shift, the estimates of the invariant components and $P^T_Y$ are very likely to be inaccurate if we directly use the noisy source distribution $P^S_{\rho,XY}$ to correct the distribution shift.
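Proposition 1 can be checked numerically. Writing the target mixture in the basis of the clean class-conditional distributions gives the linear system $\omega_{\rho 1}\pi_{1j} + \omega_{\rho 2}\pi_{2j} = \omega_j$, $j = 1, 2$. The sketch below (illustrative values only) solves this 2×2 system by Cramer's rule and confirms that $\omega_\rho = \omega$ exactly when $\pi_{12}\omega_1 = \pi_{21}\omega_2$, and differs otherwise.

```python
def noisy_weights(pi, omega):
    """Solve omega_r1*pi[0][j] + omega_r2*pi[1][j] = omega[j] (j=0,1) by Cramer's rule."""
    (p11, p12), (p21, p22) = pi
    det = p11 * p22 - p21 * p12
    w1, w2 = omega
    wr1 = (w1 * p22 - p21 * w2) / det
    wr2 = (p11 * w2 - w1 * p12) / det
    return wr1, wr2

pi = ((0.8, 0.2), (0.2, 0.8))          # symmetric contamination: pi12 = pi21 = 0.2
balanced = noisy_weights(pi, (0.5, 0.5))  # pi12*w1 == pi21*w2 holds -> unaffected
skewed = noisy_weights(pi, (0.7, 0.3))    # condition violated -> biased estimate
print(balanced, skewed)
```

For the balanced case the recovered weights equal the true priors (0.5, 0.5); for the skewed case the first weight inflates to about 0.83, illustrating the bias that motivates denoising.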
Specifically, even if we assume that $X'$ is successfully learned, the estimate of $P^T_Y$ may be incorrect, as in target shift. A wrong estimate of $P^T_Y$ can in turn result in biased learning of the invariant representations, as in (Gong et al., 2016).

In conclusion, label noise is harmful for extracting invariant features and correcting distribution shift in most domain adaptation scenarios. We aim to reduce these adverse effects of label noise in the following sections.

## 4. Label-Noise Robust Domain Adaptation

Here, we study a new domain adaptation setting in which (1) both $P_{X|Y}$ and $P_Y$ change across domains; and (2) we have access only to noisy observations $\{(x^S_1, \hat{y}^S_1), \dots, (x^S_m, \hat{y}^S_m)\}$ in the source domain and unlabeled data $\{x^T_1, \dots, x^T_n\}$ in the target domain. Here, $\hat{y}$ refers to a noisy label, and we consider class-conditional label noise (Natarajan et al., 2013). The label noise is stochastically modeled via transition probabilities $P(\hat{Y}=j|Y=i)$, i.e., the flip rate from clean label $i$ to noisy label $j$. All these transition probabilities are summarized in a transition matrix $Q$, where $Q_{ij} = P(\hat{Y}=j|Y=i)$. Class-conditional label noise is the predominant noise setting adopted in the label noise community; it has been widely used and proved effective for evaluating label noise methods (Natarajan et al., 2013; Chen et al., 2019b).

In this section, we first study how to provably identify invariant features across different domains and correct the distribution shift in the generalized target shift scenario with label noise. Then, an importance reweighting framework is introduced for correcting classifiers. Our end-to-end deep domain adaptation model is finally presented.

### 4.1. Denoising Conditional Invariant Components

In the label noise setting, learning the invariant features and $P^T_Y$ is challenging because we can only observe the noisy labels and have no clean labels $Y$ in the source domain.
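The class-conditional noise model above is straightforward to simulate: each clean label $i$ is flipped to $j$ with probability $Q_{ij}$, independently of the features. A minimal sketch with an assumed two-class $Q$:

```python
import random

def corrupt_labels(labels, Q, rng):
    """Flip each clean label i to j with probability Q[i][j] (class-conditional noise)."""
    classes = range(len(Q))
    return [rng.choices(classes, weights=Q[y])[0] for y in labels]

rng = random.Random(0)
Q = [[0.9, 0.1],   # P(noisy=0 | clean=0) = 0.9, P(noisy=1 | clean=0) = 0.1
     [0.2, 0.8]]   # P(noisy=0 | clean=1) = 0.2, P(noisy=1 | clean=1) = 0.8
clean = [0] * 10000 + [1] * 10000
noisy = corrupt_labels(clean, Q, rng)
flip0 = sum(1 for y, z in zip(clean, noisy) if y == 0 and z == 1) / 10000
flip1 = sum(1 for y, z in zip(clean, noisy) if y == 1 and z == 0) / 10000
print(round(flip0, 2), round(flip1, 2))
```

The empirical flip frequencies concentrate around the off-diagonal entries of $Q$, matching the transition-matrix model the method relies on.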
To address this issue and make the problem tractable, we first introduce the conditional invariant components. That is, we assume that for $d$-dimensional data $X$ there exists a transformation $\tau: \mathbb{R}^d \to \mathbb{R}^{d'}$ satisfying

$$P^T_{\tau(X)|Y} = P^S_{\tau(X)|Y}, \qquad (1)$$

where $X' = \tau(X) \in \mathbb{R}^{d'}$ are known as the conditional invariant components (CICs) (Gong et al., 2016) across domains.

Since label noise makes existing domain adaptation methods ineffective, we propose a novel method to denoise the conditional invariant components. We find that if the label noise model is available, a unique relationship between $P^S_{\rho,X'Y}$ and $P^T_{X'}$ can be built, which in turn is a clue for identifying $X'$. We observe that label noise does not affect the distribution of $X'$. Intuitively, then, if we marginalize out the variable $\hat{Y}$ of the noisy labels, we may achieve Eq. (1) by matching the marginal distribution $P_{X'}$; but some nontrivial strategies are needed to make this possible. Specifically, we first construct a new distribution $P^{new}_{X'}$, which is marginalized from the reweighted distribution $P^S_{\rho,X'Y}$ as follows:

$$P^{new}_{X'} = \sum_{y'} \beta_\rho(\hat{Y}=y')\, P^S_\rho(X', \hat{Y}=y') = \sum_{y}\sum_{y'} \beta_\rho(\hat{Y}=y')\, P^S_\rho(X', Y=y, \hat{Y}=y'), \qquad (2)$$

where $\beta_\rho$ are the weights for the noisy labels. Note that, in the rest of this paper, when no ambiguity occurs, we use $Y$ as the variable for both clean and noisy labels; otherwise, $Y$ and $\hat{Y}$ are used as the variables for clean and noisy labels, respectively. Then, under mild conditions, by matching the distribution $P^T_{X'}$ with the new distribution $P^{new}_{X'}$, we can provably identify the invariant components $\tau(X)$:

**Theorem 1.** Suppose the transformation $\tau$ satisfies that $P(\tau(X)|Y=i)$, $i \in \{1, \dots, c\}$, are linearly independent, and that the elements in the set $\{v_i\, P^S(\tau(X)|Y=i) + \lambda_i\, P^T(\tau(X)|Y=i);\ i \in \{1, \dots, c\};\ \forall v_i, \lambda_i\ (v_i^2 + \lambda_i^2 \neq 0)\}$ are linearly independent.
Then, if $P^{new}_{X'} = P^T_{X'}$, we have $P^T_{X'|Y} = P^S_{X'|Y}$ and $\beta(Y=y) = \sum_{y'} P^S(\hat{Y}=y'|Y=y)\, \beta_\rho(\hat{Y}=y')$ for all $y, y' \in \{1, \dots, c\}$, where $\beta(Y=y) = P^T(Y=y)/P^S(Y=y)$.

Please see the proof of Theorem 1 in the Supplementary Material. Note that the linear independence property is a weak assumption which has been widely used as a basic condition for class ratio estimation (Gong et al., 2016).

Let $u = [\beta(Y=1), \dots, \beta(Y=c)]^\top$ and $u_\rho = [\beta_\rho(Y=1), \dots, \beta_\rho(Y=c)]^\top$. According to Theorem 1, we have $u = Q u_\rho$. In the label noise setting, $Q$ is usually assumed to be diagonally dominant and invertible. Then the relationship between $\beta_\rho$ and $\beta$ is uniquely determined, as is the relationship between $P^S_{\rho,X'Y}$ and $P^T_{X'}$. In this case, if $Q$ is known and the two marginal distributions are successfully matched, we can (1) identify the conditional invariant components and (2) learn $\beta_\rho$, which means the change in the distribution $P_Y$ is also identifiable. In practice, the transition matrix $Q$ is not available, but we can estimate it with the methods in (Liu & Tao, 2016; Patrini et al., 2017).

In Theorem 1, we focus on the linear independence assumption on $P_{X'|Y}$. In the following section, we exploit $\beta$ and $Q$ to correct $\beta_\rho$ so that we can correct the distribution shift directly on unbiased estimators of the clean distributions. It is also interesting to note that this theorem indicates that the learning of conditional invariant components is not affected by label noise. Let $\pi$ be the matrix with $\pi_{ij} = P(Y=j|\hat{Y}=i)$; again, $\pi$ and $Q$ are related by Bayes' rule. If $Q$ is invertible, then it is easy to see that $\pi$ is also invertible. In this condition, if we assume $P_{X'|Y=i}$, $i \in \{1, \dots, c\}$, are linearly independent, then $P_{\rho,X'|\hat{Y}=i}$, $i \in \{1, \dots, c\}$, are also linearly independent. According to Theorem 1 in (Gong et al., 2016), the conditional invariant components can then be identified by correcting the changes in $\beta_\rho(\hat{Y}=y)\, P_\rho(X', \hat{Y}=y)$.
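The relation $u = Q u_\rho$ can be checked numerically: given clean class ratios $u$ and an invertible $Q$, the noisy-label weights are $u_\rho = Q^{-1} u$, and multiplying back by $Q$ recovers $u$. A minimal two-class sketch with assumed values:

```python
def inv2(Q):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

Q = [[0.9, 0.1], [0.2, 0.8]]      # diagonally dominant, hence invertible
u = [1.4, 0.6]                    # clean class ratios beta(Y=y) = P^T(Y=y)/P^S(Y=y)
u_rho = matvec(inv2(Q), u)        # weights to place on the *noisy* labels
u_back = matvec(Q, u_rho)         # Q u_rho recovers u
print([round(x, 6) for x in u_back])
```

The round trip returns the original ratios, confirming that once $Q$ is known (or estimated), the change in $P_Y$ is recoverable from the noisy-label weights.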
That is to say, we provably find that the CIC method in (Gong et al., 2016) is robust to label noise when identifying conditional invariant components. However, this conclusion may not hold empirically: in our experiments, we find that by correcting $\beta_\rho$ to obtain an unbiased estimator of the clean distributions, the proposed denoising maximum mean discrepancy (MMD) loss performs better. The modified MMD loss is presented as follows.

**Denoising MMD Loss.** To enforce the matching between $P^{new}_{X'}$ and $P^T_{X'}$, we employ kernel mean matching of the two distributions and minimize the squared maximum mean discrepancy (MMD):

$$\big\| \mu_{P^{new}_{X'}}[\psi(X')] - \mu_{P^T_{X'}}[\psi(X')] \big\|^2 = \big\| \mathbb{E}_{X' \sim P^{new}_{X'}}[\psi(X')] - \mathbb{E}_{X' \sim P^T_{X'}}[\psi(X')] \big\|^2, \qquad (3)$$

where $\psi$ is a kernel mapping. According to Eq. (2), we have $\mathbb{E}_{X' \sim P^{new}_{X'}}[\psi(X')] = \mathbb{E}_{(X',Y) \sim P^S_{\rho,X'Y}}[\beta_\rho(Y)\psi(X')]$. Therefore, minimizing Eq. (3) is equivalent to minimizing $\big\| \mathbb{E}_{(X',Y) \sim P^S_{\rho,X'Y}}[\beta_\rho(Y)\psi(X')] - \mathbb{E}_{X' \sim P^T_{X'}}[\psi(X')] \big\|^2$. In practice, we can only observe the corruptly labeled source data $\{(x^S_1, \hat{y}^S_1), \dots, (x^S_m, \hat{y}^S_m)\}$ and the unlabeled target data $\{x^T_1, \dots, x^T_n\}$, so we approximate the expected kernel mean values by their empirical counterparts:

$$\Big\| \frac{1}{m}\psi(x'^S)\beta_\rho(\hat{y}^S) - \frac{1}{n}\psi(x'^T)\mathbf{1} \Big\|^2, \qquad (4)$$

where $\beta_\rho(\hat{y}^S) = [\beta_\rho(\hat{y}_1), \dots, \beta_\rho(\hat{y}_m)]^\top$ and $x'$ denotes the matrix of invariant representations. However, Eq. (4) is not explicitly formulated w.r.t. $P^T_Y$. If we directly optimize Eq. (4) w.r.t. $\beta_\rho(\hat{y}^S)$, the result is an incorrect $\beta_\rho$ that violates the constraint that $\beta_\rho(\hat{y})$ must be the same for the same $\hat{y}$, and it then becomes impossible to identify $P^T_Y$. We therefore reparameterize the formulation using the relationship between $\beta_\rho$ and $P^T_Y$ in Theorem 1, i.e., $\beta_\rho(\hat{Y}=i) = \sum_{j=1}^{c} Q^{-1}_{ij}\, \frac{P^T(Y=j)}{P^S(Y=j)}$. It is also easy to derive that $[P^S(Y=1), \dots, P^S(Y=c)]\, Q = [P^S_\rho(Y=1), \dots, P^S_\rho(Y=c)]$. Given the estimated $\hat{Q}$ and $[\hat{P}^S_\rho(Y=1), \dots, \hat{P}^S_\rho(Y=c)]^\top$, we can construct the vectors $g_i = \big[\hat{Q}^{-1}_{i1}/\hat{P}^S(Y=1), \dots, \hat{Q}^{-1}_{ic}/\hat{P}^S(Y=c)\big]$, $i \in \{1, \dots, c\}$.
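The empirical objective in Eq. (4) reduces, via the kernel trick, to sums of kernel evaluations: with per-example weights $w_k = \beta_\rho(\hat{y}^S_k)$, the squared MMD is $\frac{1}{m^2} w^\top K_S w - \frac{2}{mn}\mathbf{1}^\top K_{T,S} w + \frac{1}{n^2}\mathbf{1}^\top K_T \mathbf{1}$. A minimal pure-Python sketch with a Gaussian kernel on one-dimensional toy features (all data and weights below are illustrative, not learned):

```python
import math

def gauss(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def weighted_mmd2(xs, w, xt, sigma=1.0):
    """Squared MMD between the w-reweighted source mean embedding and the target one."""
    m, n = len(xs), len(xt)
    t_ss = sum(w[i] * w[j] * gauss(xs[i], xs[j], sigma) for i in range(m) for j in range(m)) / m**2
    t_ts = sum(gauss(xt[i], xs[j], sigma) * w[j] for i in range(n) for j in range(m)) / (m * n)
    t_tt = sum(gauss(xt[i], xt[j], sigma) for i in range(n) for j in range(n)) / n**2
    return t_ss - 2 * t_ts + t_tt

xs = [0.0, 0.1, 2.0, 2.2]        # toy source features, two clusters (two "classes")
xt = [0.05, 2.1, 2.05, 1.9]      # toy target: the second cluster is over-represented
uniform = [1.0] * 4
upweight = [0.5, 0.5, 1.5, 1.5]  # weights favoring the cluster dominant in the target
print(weighted_mmd2(xs, uniform, xt) > weighted_mmd2(xs, upweight, xt))
```

Reweighting the source toward the target's class proportions shrinks the MMD, which is precisely the signal that Eq. (5) exploits to recover $P^T(Y)$.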
If $\hat{y}_k = i$, $k \in \{1, \dots, m\}$, define the matrix $G \in \mathbb{R}^{m \times c}$ whose $k$-th row is $g_i$, and let $\beta_\rho(\hat{y}^S) = G\alpha$. Then $\alpha$ is an estimate of $[P^T(Y=1), \dots, P^T(Y=c)]^\top$. The denoising MMD loss can now be reparameterized as

$$\Big\| \frac{1}{m}\psi(x'^S)G\alpha - \frac{1}{n}\psi(x'^T)\mathbf{1} \Big\|^2 = \frac{1}{m^2}\alpha^\top G^\top K_S G\alpha - \frac{2}{mn}\mathbf{1}^\top K_{T,S} G\alpha + \frac{1}{n^2}\mathbf{1}^\top K_T \mathbf{1}, \qquad (5)$$

where $K_S$ and $K_T$ are the kernel matrices of $x'^S$ and $x'^T$, respectively, and $K_{T,S}$ is the cross kernel matrix. In this paper the Gaussian kernel $k(x_i, x_j) = \exp\big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\big)$ is applied, where $\sigma$ is the bandwidth. Therefore, according to Theorem 1, optimizing the denoising MMD loss in Eq. (5) enables us to identify the conditional invariant components and $P^T(Y)$.

**A New Perspective on the Denoising MMD Loss.** Here, we discuss why using $\beta$ and $Q$ to correct $\beta_\rho$ is helpful. By correcting $\beta_\rho$ with $\beta$ and $Q$, we actually provide an unbiased estimator of $\sum_y \beta(Y=y)\, P^S(X', Y=y)$. The proof is straightforward: we can easily show that $\pi Q$ is the identity matrix when $Q$ is invertible, and that $u_\rho = \pi u$. Replacing $\beta_\rho(\hat{Y})$ with $\beta(Y)$ via this relationship, we obtain $P^{new}_{X'} = \sum_y \beta(Y=y)\, P^S(X', Y=y)$. That is, by correcting $\beta_\rho$, we build a direct relationship between $P^T_{X'}$ and $P^S_{X'|Y}$. This is important because it enables us to directly correct the changes in $P(Y=y)P(X'|Y=y)$ and extract $P^T_Y$. Even though $\beta_\rho$ is provably identifiable by Theorem 1 when $Q$ is invertible, the learning process is more difficult without this correction, since the mixed noisy data are closer to each other, especially when only finitely many examples are given. This is why our denoising MMD loss works better.

### 4.2. Importance Reweighting

After adapting the invariant features, we can now correct the classifiers. We aim to learn a hypothesis function $f: \mathbb{R}^{d'} \to \mathbb{R}^c$ from the noisy source data that generalizes well on the target data. Ideally, $f$ minimizes the expected loss $\mathbb{E}_{(X',Y) \sim P^T_{X'Y}}[\ell(f(X'), Y)]$, where $\ell$ is the loss function and $X'$ are the conditional invariant components.
In practice, we assume that $f$ predicts $P^T(Y|X')$ (Reid & Williamson, 2010; Patrini et al., 2017) and that $\arg\max_{i \in \{1,\dots,c\}} f_i$ predicts the label, where $f_i$ is the $i$-th entry of $f$. To facilitate the learning of $f$, we first imagine that the target domain has the same label noise model as the source domain. Note that this does not imply that label noise really exists in the target domain, because in our setting we have no label information for the target data at all. The minimizer $f^*_\rho = \arg\min_f \int \ell(f(X'), Y)\, P^T_\rho(X', Y)\, dX'\, dY$ is then likewise assumed to predict $P^T_\rho(Y|X')$. If the classifier $f^*_\rho$ is found and $Q$ is invertible, we can obtain $f^*$ from the relationship

$$[P^T(Y=1|X'), \dots, P^T(Y=c|X')]\, Q = [P^T_\rho(Y=1|X'), \dots, P^T_\rho(Y=c|X')]. \qquad (6)$$

Thus, the problem reduces to learning $f^*_\rho$, which can be obtained by exploiting the importance reweighting strategy:

$$f^*_\rho = \arg\min_f \int \ell(f(X'), Y)\, P^T_\rho(X', Y)\, dX'\, dY = \arg\min_f \int \frac{P^T_\rho(X', Y)}{P^S_\rho(X', Y)}\, \ell(f(X'), Y)\, P^S_\rho(X', Y)\, dX'\, dY.$$

Since $P^T_\rho(X', Y)$ is constructed from $P^T(X, Y)$ using the same transition matrix $Q$, and $P^T(X'|Y) = P^S(X'|Y)$, we easily have $P^T_\rho(X'|Y) = P^S_\rho(X'|Y)$ and thus

$$f^*_\rho = \arg\min_f \int \frac{P^T_\rho(Y)}{P^S_\rho(Y)}\, \ell(f(X'), Y)\, P^S_\rho(X', Y)\, dX'\, dY = \arg\min_f \int \gamma(Y)\, \ell(f(X'), Y)\, P^S_\rho(X', Y)\, dX'\, dY,$$

where $\gamma(Y) = \frac{P^T_\rho(Y)}{P^S_\rho(Y)}$. In practice, only the training sample is observable, so we minimize the empirical loss

$$\frac{1}{m}\sum_{i=1}^{m} \gamma(\hat{y}^S_i)\, \ell(f(x'^S_i), \hat{y}^S_i) \qquad (7)$$

to find the approximate classifier $f_\rho$. Instead of separately finding $f^*_\rho$ by minimizing Eq. (7) and then transforming $f^*_\rho$ into $f^*$ according to Eq. (6), in this paper we employ the forward strategy proposed in (Patrini et al., 2017); that is, we directly minimize the risk

$$\hat{R} = \frac{1}{m}\sum_{i=1}^{m} \gamma(\hat{y}^S_i)\, \ell(Q^\top f(x'^S_i), \hat{y}^S_i). \qquad (8)$$

By minimizing the risk $\hat{R}$, $Q^\top f(x'^S_i)$ can approximately predict $P^T_\rho(Y|X')$. Then, according to Eq.
(6), $f(x'^S_i)$ can then approximately predict $P^T(Y|X')$. Note that, in practice, the ratio $\gamma(Y)$ is also unknown. However, $P^S_\rho(Y)$ can be empirically estimated from the noisy source data, $P^T(Y)$ is estimated by our denoising MMD loss, and $P^T_\rho(Y)$ can then be computed from a relationship similar to Eq. (6). In this way, $\gamma(Y)$ can be obtained.

### 4.3. The Overall Models

To extract the conditional invariant components, the transformation $\tau$ ranges from linear to non-linear depending on the complexity of the input data space. Since the linear model is similar apart from being a two-stage procedure, we mainly present our end-to-end deep learning model. We modify a conventional deep neural network for classification, e.g., AlexNet (Krizhevsky et al., 2012), in two aspects: (1) since the domain discrepancy becomes larger for features in higher-level layers (Long et al., 2015; 2017), we impose the denoising MMD loss on a higher-level layer to extract the invariant representations; (2) to learn a classifier robust to label noise, we add the forward procedure (Patrini et al., 2017) before the cross-entropy (CE) loss as in Eq. (8). Descriptions of the linear model and the structure of the deep model can be found in the Supplementary Material.

Let $h^l$ be the responses of the $l$-th hidden layer, $W_{1:l}$ be the parameters of the 1st to $l$-th layers, and $L$ be the total number of layers in our deep model. Suppose that we impose the denoising MMD loss on the features of the $l$-th layer, i.e., $\tau(x_i) = h^l_i$. Then the denoising MMD loss is

$$\hat{D}(W_{1:l}, \alpha) = \Big\| \frac{1}{m}\psi(h^{l,S})G\alpha - \frac{1}{n}\psi(h^{l,T})\mathbf{1} \Big\|^2, \qquad (9)$$

where $h^l$ is the matrix of the responses of the $l$-th layer. Denote by $f(x_k)$ the softmax output w.r.t. the input $x_k$. According to Eq. (8), the classification loss is

$$\hat{R}(W_{1:L}) = \frac{1}{m}\sum_{k=1}^{m} \gamma(\hat{y}^S_k)\, \mathrm{CE}(Q^\top f(x^S_k), \hat{y}^S_k), \qquad (10)$$

where $\gamma(\hat{y}^S_k) = \frac{\alpha^\top Q_{:i}}{P^S_\rho(Y=i)}$ if $\hat{y}^S_k = i$, and $Q_{:i}$ denotes the $i$-th column of $Q$.
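Eq. (10) is easy to express concretely: the softmax output $f(x)$ is pushed through $Q^\top$ to obtain noisy-label posteriors, and the cross-entropy on the noisy label is scaled by $\gamma$. A minimal pure-Python sketch (the probabilities, $Q$, and priors below are made up for illustration):

```python
import math

def forward_weighted_ce(probs, noisy_labels, Q, alpha, p_rho):
    """Eq. (10): (1/m) * sum_k gamma(y_k) * CE(Q^T f(x_k), y_k)."""
    c = len(Q)
    m = len(probs)
    total = 0.0
    for p, y in zip(probs, noisy_labels):
        # Noisy posterior: [Q^T p]_j = sum_i Q[i][j] * p[i].
        q_post = [sum(Q[i][j] * p[i] for i in range(c)) for j in range(c)]
        # gamma(y) = alpha^T Q_{:,y} / P_rho^S(Y=y).
        gamma = sum(alpha[j] * Q[j][y] for j in range(c)) / p_rho[y]
        total += gamma * (-math.log(q_post[y]))
    return total / m

probs = [[0.7, 0.3], [0.2, 0.8]]     # softmax outputs f(x_k), illustrative
noisy = [0, 1]
Q_id = [[1.0, 0.0], [0.0, 1.0]]      # Q = I: reduces to plain (unit-weight) CE
loss = forward_weighted_ce(probs, noisy, Q_id, alpha=[0.5, 0.5], p_rho=[0.5, 0.5])
print(round(loss, 4))  # -> 0.2899, i.e. (-ln 0.7 - ln 0.8) / 2
```

With a non-identity $Q$, the same function implements the forward correction: the network's clean posteriors are mapped to noisy-label posteriors before the loss is taken, so minimizing it drives $f$ toward the clean $P^T(Y|X')$.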
Together with a regularization term $\Omega(W_{1:L})$ (e.g., the $\ell_2$ norm) on the parameters, our final model becomes

$$\min_{W_{1:L},\, \alpha}\ \hat{R}(W_{1:L}) + \lambda_1 \hat{D}(W_{1:l_1}, \alpha) + \lambda_2 \Omega(W_{1:L}), \quad \text{s.t.}\ \sum_{i=1}^{c} \alpha_i = 1,\ \alpha_i \geq 0,\ i \in \{1, \dots, c\}, \qquad (11)$$

where $\lambda_1$ and $\lambda_2$ are the tradeoff parameters for the denoising MMD loss and the regularization, respectively. Again, by minimizing Eq. (11), if $Q^\top f(X)$ approximates $P^T_\rho(Y|X)$, then $f(X)$ approximates $P^T(Y|X)$, and we can successfully learn the classifier for the target data.

### 4.4. Convergence Analysis

In this subsection, we study the convergence rates of the estimates of the true label noise rates and the optimal class priors. Noise rate estimation can be viewed as a mixture proportion estimation problem (Yu et al., 2018a; Ramaswamy et al., 2016; Yao et al., 2020). The convergence rate for the label noise rates has been well studied under the anchor set condition: for any $y$ there exists $x$ in the domain of $X$ such that $P(Y=y|X=x) = 1$ and $P(Y=y'|X=x) = 0$ for $y' \neq y$, which is likely to hold in practice. For example, estimators with convergence guarantees have been proposed in (Liu & Tao, 2016). Recently, (Ramaswamy et al., 2016) exploited the anchor set condition in Hilbert space and designed estimators that converge to the true label noise rates at a rate of $O(m^{-1/2})$. Work based on a weaker assumption, i.e., the linear independence assumption, has also been proposed to estimate label noise with a fast convergence guarantee (Yu et al., 2018a). Therefore, we mainly focus on the convergence analysis of estimating the class ratios.

To analyze the convergence rate of the estimated class prior $\hat{\alpha}$ to the optimal $\alpha^*$ in the presence of label noise, we treat the training samples $\{(x^S_1, \hat{y}^S_1), \dots, (x^S_m, \hat{y}^S_m)\}$ and $\{x^T_1, \dots, x^T_n\}$ as i.i.d. variables, respectively, slightly abuse $W$ to denote the parameters of the transformation $\tau$, and let $D(W, \alpha) = \big\| \mathbb{E}\big[\frac{1}{m}\psi(x'^S)G\alpha\big] - \mathbb{E}\big[\frac{1}{n}\psi(x'^T)\mathbf{1}\big] \big\|^2$.
We analyze the convergence rate by deriving an upper bound on D(W, α̂) − D(W, α*) for fixed Q and W.

Theorem 2. Given the learned Q̂ and Ŵ, let the induced RKHS be universal and bounded such that $\|\psi(\tau(x))\| \le \Lambda_{\hat W}$ for all x in the source and target domains, and let the entries of G be bounded such that $|G_{ij}| \le \Lambda_{\hat Q}$ for all i ∈ {1, ..., m}, j ∈ {1, ..., c}. Then, for any δ > 0, with probability at least 1 − δ, we have

$D(\hat{W}, \hat{\alpha}) - D(\hat{W}, \alpha^*) \le 8(\Lambda_{\hat Q} + 1)^2 \Lambda_{\hat W}^2 \sqrt{\frac{c}{m} + \frac{c}{n} + \cdots}$

See the proof of Theorem 2 in the Supplementary Material. Although the bound in Theorem 2 involves two fixed parameters, the result is informative if Q and W are given or Q̂ and Ŵ quickly converge to Q* and W*, respectively. From the previous analyses, we know that fast convergence rates for estimating the label noise rates are guaranteed. However, the convergence of Ŵ to W* is not guaranteed because the objective function is non-convex w.r.t. W. How to identify the transferable components τ(X) should be studied further.

[Figure 2. The estimation error of β. (a), (b), and (c) present the estimation errors with increasing class ratio β(Y = 1), increasing flip rate ρ, and increasing sample size n, respectively.]

[Figure 3. The effectiveness of invariant component extraction. (a), (b), and (c) present the classification error of DIP, TCA, CIC, and DCIC with increasing flip rate ρ when β1 = 1.4, 1.6, and 1.8, respectively.]

5. Experiments

To show the robustness of our method to label noise, we conduct comprehensive evaluations on both simulated and real data.
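The √(c/m + c/n) behavior in Theorem 2 can be illustrated numerically: with a linear feature map ψ(x) = x, the empirical discrepancy between mean embeddings concentrates around its population value as the sample sizes grow. This is a toy check under assumptions of our own (linear kernel, identical Gaussian domains), not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd_linear(xs, xt):
    """Squared distance between empirical mean embeddings
    under the linear kernel psi(x) = x."""
    return float(np.sum((xs.mean(axis=0) - xt.mean(axis=0)) ** 2))

# Both samples come from N(0, I_2), so the population discrepancy is 0
# and the estimate itself is the estimation error.
errors = {}
for n in (100, 10000):
    xs = rng.standard_normal((n, 2))
    xt = rng.standard_normal((n, 2))
    errors[n] = mmd_linear(xs, xt)
```

The error at n = 10000 is two orders of magnitude below the typical error at n = 100, consistent with the root-n rate.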
We first compare our method, denoising conditional invariant components (DCIC), with CIC (Gong et al., 2016) on identifying the changes in P_Y given noisy observations. The effectiveness of our method is then verified on both synthetic and real data. We compare DCIC with domain invariant projection (DIP) (Baktashmotlagh et al., 2013), transfer component analysis (TCA) (Pan et al., 2011), Deep Adaptation Networks (DAN) (Long et al., 2015), and CIC (Gong et al., 2016). In our experiments, the bandwidth σ of the Gaussian kernel is set to the median of the pairwise distances between all invariant (resp. raw) features for the deep (resp. linear) model.

5.1. Synthetic Data

We use the linear model to verify the effectiveness of DCIC in two situations: (a) the estimation of the class ratio β in the target shift (TarS) scenario given the true flip rates (i.e., transition probabilities); and (b) the evaluation of the extracted invariant components in the generalized target shift (GeTarS) scenario, with various class ratios and different label flip rates. In all experiments, the flip rates are estimated using the method proposed in (Liu & Tao, 2016). We repeat each experiment 20 times and report the average scores.

We generate the binary classification training and test data from a 2-dimensional mixture of Gaussians (Gong et al., 2016), i.e., $x \sim \sum_{i=1}^{2} \pi_i N(\theta_i, \Sigma_i)$, where the mean parameters θ_{ij}, j = 1, 2 are sampled from the uniform distribution U(−0.25, 0.25) and the covariance matrices Σ_i are sampled from the Wishart distribution W(2I_2, 7). The class labels are the cluster indices. Under TarS, P_{X|Y} remains the same; we only change the class priors across domains. Under GeTarS, we apply location and scale transformations to the features to generate the target domain data. To obtain noisy observations, we randomly flip the clean labels in the source domain with the same transition probability ρ.
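The synthetic setup above can be sketched as follows. This is a minimal reproduction under stated assumptions: the Wishart draw W(2I_2, 7) is implemented from its definition Σ = XᵀX with the 7 rows of X drawn from N(0, 2I_2), and the symmetric flip is applied to binary labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, priors, thetas, sigmas):
    """Draw n points from the 2-class, 2-D Gaussian mixture;
    the labels are the component indices."""
    y = rng.choice(2, size=n, p=priors)
    x = np.stack([rng.multivariate_normal(thetas[c], sigmas[c]) for c in y])
    return x, y

def flip_labels(y, rho):
    """Symmetric label noise: flip each binary label with probability rho."""
    flip = rng.random(len(y)) < rho
    return np.where(flip, 1 - y, y)

# Means ~ U(-0.25, 0.25); covariances ~ W(2*I_2, 7) via Sigma = X^T X,
# rows of X ~ N(0, 2*I_2), matching the paper's generative setup.
thetas = rng.uniform(-0.25, 0.25, size=(2, 2))
sigmas = []
for _ in range(2):
    X = rng.standard_normal((7, 2)) * np.sqrt(2.0)
    sigmas.append(X.T @ X)

x_S, y_S = sample_domain(500, [0.5, 0.5], thetas, sigmas)
y_noisy = flip_labels(y_S, rho=0.4)
```

Changing the `priors` argument between the source and target calls simulates TarS, and applying an affine map to the target features simulates GeTarS.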
First, we verify that, with corrupted labels, the proposed DCIC can recover the correct class ratio under TarS. We set the source class prior P^S(Y = 1) to 0.5. The target class prior P^T(Y = 1) varies from 0.1 to 0.9 with step 0.1, so the corresponding class ratio β(Y = 1) = P^T(Y = 1)/P^S(Y = 1) varies from 0.2 to 1.8 with step 0.2. We then compare the proposed method with CIC (Gong et al., 2016) on finding the true class ratio β with noisy labels in the source domain. We evaluate the performance using the class ratio estimation error ‖β_est − β*‖/‖β*‖, where β_est is the estimated class ratio vector.

Figure 2(a) shows that DCIC finds solutions close to the true β for various class ratios. In this experiment, given large label noise (ρ = 0.4), the β estimated by CIC is close to the true one only when β*(Y = 1) is close to 0, 1, or 2. The estimation of CIC is accurate at β*(Y = 1) = 1 because we set the class prior P^S(Y = 1) to 0.5 in the clean source domain, which happens to make P^S_ρ(Y) = P^S(Y). If P^S(Y = 1) ≠ 0.5, then P^S_ρ(Y) ≠ P^S(Y) and the estimated β will be wrong (see Section 3). CIC also gives accurate results when β*(Y = 1) is close to 0 or 2 because the target domain then collapses to a single class, rendering the estimates trivially correct. Figure 2(b) shows the superiority of the proposed method over CIC at different levels of label noise: when ρ > 0.1, CIC finds incorrect solutions, whereas our method finds a good solution even when ρ is close to 0.5. Figure 2(c) shows that the estimate of β improves as the sample size grows.

Second, under GeTarS, we evaluate whether our method can discover the invariant representations given the noisy source data and unlabeled target data. In these experiments, we fix the sample size to 500 and the class prior P^S(Y = 1) to 0.5. We use classification accuracy to measure the performance. The results in Figure 3 show that our method is more robust to label noise than DIP, TCA, and CIC.

5.2. Real Data

MNIST-USPS. The USPS dataset is a handwritten digit dataset with ten classes (0-9), containing 7,291 training images and 2,007 test images of size 16×16, which we rescale to 28×28. MNIST shares the same 10 digit classes and consists of 60,000 training images and 10,000 test images of size 28×28. In our experiments, these two datasets are resampled to construct domain adaptation datasets in which the class priors P_Y vary across domains. For MNIST, the class priors are unbalanced: 0.04 for each of the first 5 classes and 0.16 for each of the remaining 5 classes. For USPS, the class priors are balanced, i.e., 0.1 for each class. According to these class priors, we sample 5,000 images from each of MNIST and USPS to construct the new dataset mnist2usps; switching the source/target pair yields another dataset, usps2mnist. As in (Patrini et al., 2017), in the source data, noise flips between similar digits, 2 → 7, 3 → 8, 5 ↔ 6, and 7 → 1, with transition probability ρ = 0.2 or 0.4. After the noisy data are obtained, we hold out 10 percent of the source data as a validation set.

The LeNet (LeCun et al., 1998) structure from Caffe's (Jia et al., 2014) MNIST tutorial is employed to train the model from scratch. Our denoising MMD loss is imposed on the first fully connected layer. In all experiments, l2 regularization is applied and we set λ1 = 1 and λ2 = 1e−4. The batch sizes for both source and target data are set to 100. The initial learning rate is r0 = 0.01 and is decayed exponentially according to r0(1 + 0.0001t)^{−0.75}, where t is the index of the current iteration. Each experiment is repeated 5 times. Here, DCIC is compared with the baseline of training with source data only (SO), DAN, and CIC. These methods are integrated with the forward procedure of (Patrini et al., 2017) to reduce the effects of label noise. They are denoted as methods with Forward Q (resp.
Q̂) given the true (resp. estimated) transition matrix. Note that DAN has verified that adapting more layers and using MK-MMD are more helpful; here, we use single-layer adaptation and the modified vanilla MMD to compare with the baselines, which further verifies the effectiveness of our method. We are also aware that DAN targets the covariate shift problem, so we extend it to CIC to address generalized target shift. CIC is likewise added on the first fully connected layer, and the vanilla MMD loss is used. Further, the CIC exploited here is not the original one in (Gong et al., 2016) but an extension of DAN with the idea from (Gong et al., 2016).

The results are shown in Table 1. When label noise is present, CIC-based methods cannot correctly estimate the class ratios, which adversely affects the identification of the invariant components; they thus perform worse than the DAN-based methods in some cases. The latter, however, ignore the change of P_Y across domains. In contrast, our method often gives better estimates of the class ratios and can effectively identify the invariant components, which leads to higher performance.

Table 1. Classification accuracies (%) and their standard deviations for the USPS and MNIST datasets.

Method          | mnist→usps (ρ=0.4) | usps→mnist (ρ=0.4) | mnist→usps (ρ=0.2) | usps→mnist (ρ=0.2)
SO+Forward Q    | 58.12±0.32 | 61.02±0.90 | 59.27±1.51 | 65.90±0.65
SO+Forward Q̂    | 54.93±2.23 | 60.80±0.49 | 56.97±1.36 | 65.51±3.07
DAN+Forward Q   | 59.34±5.43 | 64.68±1.07 | 62.82±1.15 | 67.05±0.77
DAN+Forward Q̂   | 54.76±1.62 | 63.87±0.84 | 61.28±1.44 | 65.70±1.24
CIC             | 65.23±2.63 | 58.09±2.17 | 66.70±1.31 | 61.02±3.96
CIC+Forward Q   | 65.37±2.49 | 63.35±4.43 | 66.84±3.62 | 68.45±0.91
CIC+Forward Q̂   | 64.18±1.49 | 62.78±2.92 | 63.42±0.99 | 67.99±1.30
DCIC+Forward Q  | 69.94±2.25 | 68.77±2.34 | 72.33±2.15 | 70.80±1.59
DCIC+Forward Q̂  | 68.50±0.37 | 66.78±1.53 | 69.29±4.07 | 70.47±2.29

Table 2. Classification accuracies (%) and their standard deviations for the VLCS dataset.

Method          | VLS2C | LCS2V | VLC2S | VCS2L
SO+Forward Q    | 85.88±2.17 | 62.07±0.86 | 59.40±1.37 | 49.34±1.39
SO+Forward Q̂    | 78.62±4.36 | 59.49±0.50 | 57.09±1.81 | 49.14±1.39
DAN+Forward Q   | 87.66±2.37 | 64.37±2.07 | 59.54±0.83 | 51.07±1.26
DAN+Forward Q̂   | 84.69±0.24 | 58.64±1.91 | 57.51±1.25 | 50.41±1.20
CIC             | 75.15±6.23 | 54.69±0.96 | 53.61±2.35 | 49.30±0.48
CIC+Forward Q   | 86.83±2.53 | 64.22±0.27 | 60.36±0.36 | 51.76±0.82
CIC+Forward Q̂   | 85.69±1.76 | 59.80±0.47 | 57.65±0.60 | 50.33±0.31
DCIC+Forward Q  | 91.60±0.51 | 65.67±0.37 | 61.79±0.77 | 52.47±0.50
DCIC+Forward Q̂  | 87.28±1.18 | 63.35±0.37 | 58.88±0.74 | 51.60±1.48

VLCS. The VLCS dataset (Torralba & Efros, 2011) consists of images from five common classes, bird, car, chair, dog, and person, drawn from the Pascal VOC 2007 (V), LabelMe (L), Caltech (C), and SUN09 (S) datasets, respectively. For these four datasets, we first randomly select at most 300 images per class to construct the new datasets. Then, we construct the domain adaptation datasets using the leave-one-domain-out evaluation strategy. For example, in VLS2C, the source data is the combination of the new Pascal VOC 2007, LabelMe, and SUN09 datasets, and the target data is the new Caltech. In each source dataset, labels flip from person to car, chair to person, and dog to person with probability ρ = 0.4. We hold out 30% of the source data as a validation set. Each experiment is repeated 5 times.
In these experiments, the model is fine-tuned from the pretrained AlexNet (Krizhevsky et al., 2012), with the parameters in the conv1-conv3 layers frozen. We impose our denoising MMD loss on the fc7 layer. As discussed in DAN, we focus on high-level features because the transferability gap grows from low-level to high-level features, and the gap becomes large for the high-level ones. Further, high-level semantic features are more prone to be affected by wrong labels. However, if label noise affects the low-level features, correcting the low-level features directly could be more powerful. The batch sizes for both source and target data are 32. The initial learning rate is 0.001 and is decayed exponentially according to 0.001(1 + 0.002t)^{−0.75}.

The results are shown in Table 2. Our proposed method again improves upon the compared baselines, which indicates the effectiveness of the proposed model in correcting the shift across domains even when label noise is present.

6. Conclusion

We have studied domain adaptation with label noise. We found that label noise is detrimental to the performance of existing domain adaptation methods. In particular, when the label is the cause of the features, the estimates of the target domain class distribution and the conditional invariant representations can be unreliable. To alleviate the effects of label noise on domain adaptation, we have proposed the novel denoising MMD loss to improve the estimation of both the target domain label distribution and the conditional invariant components from the noisy source data and the unlabeled target data. We have provided both theoretical and empirical studies to demonstrate the effectiveness of our method.

Acknowledgements

This research was supported in part by Australian Research Council Projects DE-190101473, FL-170100117, DP-180103424, IH-180100002, IC-190100031, and LE-200100049.
This work was partially supported by NIH Award Number 1R01HL141813-01, NSF 1839332 Tripod+X, and SAP SE. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. We are also grateful for the computational resources provided by Pittsburgh Supercomputing grant number TG-ASC170024. Finally, we thank the anonymous reviewers for their constructive comments.

References

Arazo, E., Ortego, D., Albert, P., O'Connor, N. E., and McGuinness, K. Unsupervised label noise modeling and loss correction. In ICML, 2019.
Azizzadenesheli, K., Liu, A., Yang, F., and Anandkumar, A. Regularized learning for domain adaptation under label shifts. In ICLR, 2019.
Han, B., Niu, G., Yu, X., Yao, Q., Xu, M., Tsang, I. W., and Sugiyama, M. SIGUA: Forgetting may make learning with noisy labels more robust. In ICML, 2020.
Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In CVPR, pp. 769-776, 2013.
Chen, C., Xie, W., Huang, W., Rong, Y., Ding, X., Huang, Y., Xu, T., and Huang, J. Progressive feature alignment for unsupervised domain adaptation. In CVPR, 2019a.
Chen, P., Liao, B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, 2019b.
Cheng, J., Liu, T., Ramamohanarao, K., and Tao, D. Learning with bounded instance- and label-dependent label noise. In ICML, 2020.
Dubois, S., Romano, N., Jung, K., Shah, N., and Kale, D. C. The effectiveness of transfer learning in electronic health records data. In ICLR Workshop Track, 2017.
Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Schölkopf, B. Domain adaptation with conditional transferable components. In ICML, pp. 2839-2848, 2016.
Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NIPS, pp. 5836-5846, 2018a.
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NIPS, pp. 8527-8537, 2018b.
Huang, J., Gretton, A., Borgwardt, K. M., Schölkopf, B., and Smola, A. J. Correcting sample selection bias by unlabeled data. In NIPS, pp. 601-608, 2007.
Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In NIPS, pp. 5639-5649, 2017.
Iyer, A., Nath, J. S., and Sarawagi, S. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, pp. 530-538, 2014.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia (ACMMM), pp. 675-678, 2014.
Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
Guo, J., Gong, M., Liu, T., Zhang, K., and Tao, D. LTF: A label transformation framework for correcting label shift. In ICML, 2020.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105, 2012.
Kumagai, A., Iwata, T., and Fujiwara, Y. Transfer anomaly detection by inferring latent domain representations. In NIPS, pp. 2467-2477, 2019.
Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning from multiple complementary labels. In ICML, 2020.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Lee, K.-H., He, X., Zhang, L., and Yang, L. CleanNet: Transfer learning for scalable image classifier training with label noise. In CVPR, 2018.
Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447-461, 2016.
Liu, Y. and Guo, H. Peer loss functions: Learning from noisy labels without knowing noise rates. In ICML, 2020.
Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In ICML, pp. 97-105, 2015.
Long, M., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In ICML, 2017.
Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. In ICML, pp. 608-615, 2008.
Menon, A., Van Rooyen, B., Ong, C. S., and Williamson, B. Learning from corrupted binary labels via class-probability estimation. In ICML, pp. 125-134, 2015.
Meyerson, E. and Miikkulainen, R. Modular universal reparameterization: Deep multi-task learning across diverse domains. In NIPS, pp. 7901-7912, 2019.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NIPS, pp. 1196-1204, 2013.
Nguyen, D. T., Mummadi, C. K., Ngo, T. P. N., Nguyen, T. H. P., Beggel, L., and Brox, T. SELF: Learning to filter noisy labels with self-ensembling. In ICLR, 2020.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2011.
Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
Pearl, J. and Bareinboim, E. Transportability of causal and statistical relations: A formal approach. In AISTATS, pp. 247-254, 2011.
Purushotham, S., Carvalho, W., Nilanon, T., and Liu, Y. Variational recurrent adversarial deep domain adaptation. In ICLR, 2017.
Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding transfer learning for medical imaging. In NIPS, pp. 3342-3352, 2019.
Ramaswamy, H., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embeddings of distributions. In ICML, pp. 2052-2060, 2016.
Reid, M. D. and Williamson, R. C. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387-2422, 2010.
Sáez, J. A., Krawczyk, B., and Woźniak, M. On the influence of class noise in medical data classification: Treatment using noise filtering methods. Applied Artificial Intelligence, 30(6):590-609, 2016.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In ICML, 2012a.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012b.
Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pp. 838-846, 2015.
Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., and Fergus, R. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
Thulasidasan, S., Bhattacharya, T., Bilmes, J., Chennupati, G., and Mohd-Yusof, J. Combating label noise in deep learning using abstention. In ICML, 2019.
Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, pp. 1521-1528, 2011.
Van Rooyen, B., Menon, A., and Williamson, R. C. Learning with symmetric label noise: The importance of being unhinged. In NIPS, pp. 10-18, 2015.
Wang, X., Huang, T.-K., and Schneider, J. G. Active transfer learning under model shift. In ICML, pp. 1305-1313, 2014.
Wu, S., Xia, X., Liu, T., Han, B., Gong, M., Wang, N., Liu, H., and Niu, G. Class2Simi: A new perspective on learning with label noise. arXiv preprint arXiv:2006.07831, 2020.
Wu, Y., Winston, E., Kaushik, D., and Lipton, Z. Domain adaptation with asymmetrically-relaxed distribution alignment. In ICML, 2019.
Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NIPS, pp. 6838-6849, 2019.
Xia, X., Liu, T., Han, B., Wang, N., Gong, M., Liu, H., Niu, G., Tao, D., and Sugiyama, M. Parts-dependent label noise: Towards instance-dependent label noise. arXiv preprint arXiv:2006.07836, 2020.
Xu, Y., Cao, P., Kong, Y., and Wang, Y. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. In NIPS, pp. 6222-6233, 2019a.
Xu, Y., Gong, M., Chen, J., Liu, T., Zhang, K., and Batmanghelich, K. Generative-discriminative complementary learning. arXiv preprint arXiv:1904.01612, 2019b.
Xu, Z., Huang, S., Zhang, Y., and Tao, D. Webly-supervised fine-grained visual categorization via deep domain adaptation. TPAMI, 40(5):1100-1113, 2016.
Chou, Y.-T., Niu, G., Lin, H.-T., and Sugiyama, M. Unbiased risk estimators can mislead: A case study of learning with complementary labels. In ICML, 2020.
Yang, H., Yao, Q., Han, B., and Niu, G. Searching to exploit memorization effect in learning from corrupted labels. arXiv preprint arXiv:1911.02377, 2019.
Yao, Y., Liu, T., Han, B., Gong, M., Niu, G., Sugiyama, M., and Tao, D. Towards mixture proportion estimation without irreducibility. arXiv preprint arXiv:2002.03673, 2020.
Yu, X., Liu, T., Gong, M., Batmanghelich, K., and Tao, D. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, 2018a.
Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, pp. 68-83, 2018b.
Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In ICML, pp. 819-827, 2013a.
Zhang, K., Zheng, V., Wang, Q., Kwok, J., Yang, Q., and Marsic, I. Covariate shift in Hilbert space: A solution via surrogate kernels. In ICML, pp. 388-395, 2013b.