The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Adversarial-Learned Loss for Domain Adaptation

Minghao Chen, Shuai Zhao, Haifeng Liu, Deng Cai
State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, China
Fabu Inc., Hangzhou, China
Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Hangzhou, China
{minghaochen01, zhaoshuaimcc}@gmail.com, {haifengliu, dcai}@zju.edu.cn

Abstract

Recently, remarkable progress has been made in learning transferable representations across domains. Previous work on domain adaptation is mainly based on two techniques: domain-adversarial learning and self-training. However, domain-adversarial learning only aligns feature distributions between domains and does not consider whether the target features are discriminative. Self-training, on the other hand, utilizes the model predictions to enhance the discriminability of target features, but it cannot explicitly align the domain distributions. To combine the strengths of these two methods, we propose a novel method called Adversarial-Learned Loss for Domain Adaptation (ALDA). We first analyze the pseudo-label method, a typical self-training method. There is, however, a gap between pseudo-labels and the ground truth, which can cause incorrect training. We therefore introduce the confusion matrix, learned adversarially in ALDA, to reduce this gap and align the feature distributions. Finally, a new loss function is automatically constructed from the learned confusion matrix and serves as the loss for unlabeled target samples. ALDA outperforms state-of-the-art approaches on four standard domain adaptation datasets. Our code is available at https://github.com/ZJULearning/ALDA.

Introduction

In recent years, deep learning has made impressive progress on classification tasks.
The success of deep neural networks is based on large-scale datasets with a tremendous number of labeled samples (Deng et al. 2009). However, in many practical situations, large numbers of labeled samples are inaccessible. Deep neural networks pre-trained on existing datasets cannot generalize well to new data with different appearance characteristics. Essentially, the difference in data distribution between domains makes it difficult to transfer knowledge from the source to the target domain. This transfer problem is known as domain shift (Torralba and Efros 2011).

Copyright (c) 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Illustration of the proposed adversarial-learned loss (ALDA). There is a gap between the pseudo-label predicted by the model and the ground truth, which is unavailable on the target domain. We employ a discriminator network to produce a confusion matrix that corrects the pseudo-label, which then serves as the training label for the target sample.

Unsupervised domain adaptation (UDA) tackles the above domain shift problem while transferring the model from a labeled source domain to an unlabeled target domain. The common idea of UDA is to make the features extracted by neural networks similar across domains (Long et al. 2015; Ganin et al. 2016). In particular, domain-adversarial learning methods (Ganin et al. 2016; Tzeng et al. 2017) train a domain discriminator to distinguish whether a feature comes from the source domain or the target domain. To fool the discriminator, the feature generator has to output similar source and target feature distributions. However, it is challenging for this type of UDA method to learn discriminative features on the target domain (Saito et al. 2018; Xie et al. 2018), because it overlooks whether the aligned target features can be discriminated by the classifier.
Recently, self-training based methods (French, Mackiewicz, and Fisher 2018; Zou et al. 2018; Chen, Xue, and Cai 2019) have become another solution for UDA and achieve state-of-the-art performance on multiple tasks. A typical form of self-training generates pseudo-labels for target samples with large prediction probability and trains the model with these pseudo-labels. In this way, the features contributing to the target classification are enhanced. However, the alignment between the source and target feature distributions is implicit and has no theoretical guarantee. With unmatched target features, self-training based methods can suffer a drop in performance in the case of shallow networks (Zou et al. 2018; Saito et al. 2019).

In conclusion, domain-adversarial learning is able to align the feature distributions with a theoretical guarantee, while self-training can learn discriminative target features. Ideally, a method would combine the advantages of both. To achieve this goal, we first analyze the loss function of self-training with pseudo-labels (Zou et al. 2018) on the unlabeled target domain. Previous works on learning from noisy labels (Sukhbaatar and Fergus 2014; Zhang and Sabuncu 2018) propose accounting for noisy labels with a confusion matrix. Following their analysis, we reveal that the loss function using pseudo-labels (Zou et al. 2018) differs from the loss function learned with the ground truth by a confusion matrix. Concretely, the commonly used cross-entropy loss becomes:

-\sum_{k=1}^{K} p(y = k \mid x) \log p(\hat{y} = k \mid x) = -\sum_{k=1}^{K} \sum_{l=1}^{K} p(y = k \mid \hat{y} = l, x)\, p(\hat{y} = l \mid x) \log p(\hat{y} = k \mid x),

where K is the number of categories, y is the ground-truth label for the sample x, \hat{y} is the model prediction, i.e., the pseudo-label, and p(y = k \mid \hat{y} = l, x) is the (k, l)-th component of the confusion matrix. If the confusion matrix can be estimated correctly, we can reduce the noise in pseudo-labels and boost the training of target samples.
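This decomposition can be checked numerically. The sketch below is our own illustration with toy values (not the authors' code); `eta` plays the role of the confusion matrix, with column l holding p(y = k | y-hat = l, x):

```python
import numpy as np

# Toy numerical check of the identity above: the cross-entropy under the true
# label distribution equals the pseudo-label cross-entropy mixed through the
# confusion matrix eta_kl = p(y = k | y_hat = l, x).
K = 3
rng = np.random.default_rng(0)

p_hat = rng.random(K); p_hat /= p_hat.sum()        # p(y_hat = l | x): model prediction
eta = rng.random((K, K)); eta /= eta.sum(axis=0)   # columns sum to 1: p(y = k | y_hat = l, x)
p_true = eta @ p_hat                               # p(y = k | x) by marginalizing over y_hat

log_pred = np.log(p_hat)                           # log p(y_hat = k | x)
lhs = -np.sum(p_true * log_pred)                   # left side: CE with the ground truth
rhs = -np.sum(eta * np.outer(log_pred, p_hat))     # right side: double sum over k, l
# lhs == rhs up to floating-point error
```

The identity holds for any column-stochastic `eta`, which is why estimating the confusion matrix well is enough to recover the ideal loss.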
In this paper, we propose a novel method called Adversarial-Learned Loss for Domain Adaptation (ALDA). As illustrated in Fig. 1, we generate the confusion matrix with a discriminator network. After multiplication with the confusion matrix, the pseudo-label vector turns into a corrected label vector, which serves as the training label on the target domain. As there is no direct way to optimize the confusion matrix, we learn it with noise-correcting domain discrimination. Specifically, the domain discriminator has to produce different corrected labels for different domains, while the feature generator aims to confuse the domain discriminator. This adversarial process finally leads to a proper confusion matrix on the target domain.

The main contributions of this paper are as follows:
- We analyze the noise in pseudo-labels with the confusion matrix and propose our Adversarial-Learned Loss for Domain Adaptation (ALDA) method, which uses adversarial learning to estimate the confusion matrix.
- We theoretically prove that ALDA can align the feature distributions between domains and correct the target prediction of the classifier. In this way, ALDA takes the strengths of both domain-adversarial learning and self-training based methods.
- ALDA outperforms state-of-the-art methods on four standard unsupervised domain adaptation datasets.

Related Work

Unsupervised Domain Adaptation. With the success of deep learning, unsupervised domain adaptation (UDA) (Tzeng et al. 2014; Long et al. 2015; 2017b; Ganin et al. 2016) has been embedded into deep neural networks to transfer knowledge between a labeled source domain and an unlabeled target domain. It has been revealed that the accuracy of the classifier on the target domain is bounded by the accuracy on the source domain plus the domain discrepancy (Ben-David et al. 2010). Therefore, the major line of current UDA study is to align the distributions of the source and target domains.
The distribution divergence between domains can be measured by Maximum Mean Discrepancy (MMD) (Tzeng et al. 2014; Long et al. 2015) or second-order statistics (Sun and Saenko 2016).

Domain-adversarial Methods. Domain-adversarial learning-based methods (Ganin et al. 2016; Tzeng et al. 2017) utilize a domain discriminator to represent the domain discrepancy. These methods play a minimax game: the discriminator is trained to distinguish whether a feature comes from a source or a target sample, while the feature generator has to confuse the discriminator. However, due to practical issues, e.g., mode collapse (Che et al. 2017), domain-adversarial learning may fail to match multi-modal distributions. Recently, by incorporating the prediction of the classifier (Long et al. 2017a; Hong et al. 2018), the discriminator can match the distributions of each category, which significantly enhances the final classification results.

Self-training Methods. Semi-supervised learning (Lee 2013; Grandvalet and Bengio 2004; Tarvainen and Valpola 2017) is a task similar to domain adaptation, as it also deals with labeled and unlabeled samples. Under the data manifold assumption, some methods train the model based on its own predictions to smooth the decision boundary around the data. In particular, Grandvalet and Bengio (2004) minimize the prediction entropy as a regularizer for unlabeled samples. The pseudo-label method (Lee 2013) selects high-confidence predictions as training targets for unlabeled samples. The Mean Teacher method (Tarvainen and Valpola 2017) sets the exponential moving average of the model as the teacher model and lets the teacher's predictions guide the original model. Recently, many works have applied the above self-training based methods to unsupervised domain adaptation (Zou et al. 2018; Chen, Xue, and Cai 2019; French, Mackiewicz, and Fisher 2018).
These UDA methods implicitly encourage class-wise feature alignment between domains and achieve surprisingly good results on multiple UDA tasks.

Methods

Preliminaries

For unsupervised domain adaptation, we have a labeled source domain D_S = {(x_s^i, y_s^i)}_{i=1}^{n_s} and an unlabeled target domain D_T = {x_t^j}_{j=1}^{n_t}. We train a generator network G to extract high-level features from the data x_s or x_t, and a classifier network C to perform the K-class classification task in the feature space. The classifier C outputs probability vectors p_s, p_t \in R^K, indicating the prediction probabilities for x_s and x_t respectively.

Figure 2: Illustration of noise-correcting domain discrimination (K = 3). The confusion matrix \eta is class-wise uniform with the vector \xi generated by the discriminator D. The corrected pseudo-label c is generated by multiplying the confusion matrix \eta with the pseudo-label vector \hat{y}. For a source sample, the target of c is the ground truth y_s; for a target sample, the target is the opposite distribution. The generator G is designed to confuse these targets. Therefore, we add a gradient reverse layer (GRL) (Ganin et al. 2016) to achieve the minimax optimization.

In this paper, we consider providing a proper loss function on the target domain. Theoretically, the ideal loss function is the loss with the ground truth y_t:

L_T(x_t, L) = \sum_{k=1}^{K} p(y_t = k \mid x_t) L(p_t, k),   (1)

where L is a basic loss function, e.g., cross entropy (CE) or mean absolute error (MAE). However, the target ground truth y_t is unavailable in the UDA setting. The pseudo-label method (Lee 2013; Zou et al. 2018) substitutes y_t with the model prediction: \hat{y}_t = argmax_k p_t^k, if max_k p_t^k > \delta, where \delta is a threshold. As mentioned in the introduction, we analyze the difference between the ideal loss and the loss with pseudo-labels:

L_T(x_t, L) = \sum_{k=1}^{K} p(y_t = k \mid x_t) L(p_t, k)   (2)
            = \sum_{k=1}^{K} \sum_{l=1}^{K} p(y_t = k \mid \hat{y}_t = l, x_t)\, p(\hat{y}_t = l \mid x_t) L(p_t, k)   (3)
            = \sum_{k=1}^{K} \sum_{l=1}^{K} \eta(x_t)_{kl}\, p(\hat{y}_t = l \mid x_t) L(p_t, k),   (4)

where \eta(x_t) is the confusion matrix.
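The pseudo-label rule above (keep the arg-max class only when its confidence exceeds the threshold \delta) can be sketched as follows; this is our own illustrative code with toy values, not the paper's implementation:

```python
import numpy as np

# Hedged sketch of the pseudo-label selection rule: y_hat = argmax_k p_t^k,
# kept only if max_k p_t^k > delta; other samples are ignored in training.
def select_pseudo_labels(probs: np.ndarray, delta: float = 0.9):
    conf = probs.max(axis=1)              # max_k p_t^k
    y_hat = probs.argmax(axis=1)          # argmax_k p_t^k
    mask = conf > delta                   # samples that pass the threshold
    return y_hat, mask

probs = np.array([[0.95, 0.03, 0.02],     # confident -> kept
                  [0.40, 0.35, 0.25]])    # ambiguous -> ignored
y_hat, mask = select_pseudo_labels(probs, delta=0.9)
# y_hat -> [0, 0]; mask -> [True, False]
```

The mask is exactly where the noise analysis of Eqs. (2)-(4) matters: the kept pseudo-labels are still noisy, which the confusion matrix is meant to model.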
The confusion matrix is unknown on the unlabeled target domain. For brevity, we define c(x_t)_k = \sum_l \eta(x_t)_{kl}\, p(\hat{y}_t = l \mid x_t) and name c(x_t) the corrected label vector.

In previous works studying noisy labels (Zhang and Sabuncu 2018), it is commonly assumed that the confusion matrix is conditionally independent of the input x_t and uniform with noise rate \alpha. The unhinged loss has been proved to be robust to such uniform noise (van Rooyen, Menon, and Williamson 2015; Ghosh, Kumar, and Sastry 2017):

L_unh(p, k) = 1 - p^k.   (5)

However, these assumptions cannot hold in the case of pseudo-labels, which makes the problem more intractable.

Adversarial-Learned Loss

The general idea of our method is that if we can adequately estimate the confusion matrix \eta(x_t)_{kl}, the noise in pseudo-labels will be corrected and we can approximately optimize the ideal loss function on the target domain. Firstly, to simplify the noisy-label problem, we assume that the noise is class-wise uniform with a vector \xi(x_t).

Definition 1. Noise is class-wise uniform with vector \xi(x_t) \in R^K if \eta(x_t)_{kl} = \xi(x_t)_k for k = l, and \eta(x_t)_{kl} = (1 - \xi(x_t)_l) / (K - 1) for k \neq l.

In this work, we propose to use an extra neural network, called the noise-correcting domain discriminator, to learn the vector \xi(x_t).

Noise-correcting Domain Discrimination

As shown in Fig. 2, the noise-correcting domain discriminator D is a multi-layer neural network, which takes the deep feature G(x) as input and outputs a multi-class score vector D(G(x)) \in R^K. After a sigmoid layer, the discriminator produces the noise vector \xi(x) = \sigma(D(G(x))). Each component of \xi(x) denotes the probability that the pseudo-label is the same as the correct label: \xi(x)_k = p(y = k \mid \hat{y} = k, x). We adopt the idea of domain-adversarial learning (Ganin et al. 2016), making the discriminator and the generator play a minimax game.
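Definition 1 and the corrected label vector can be sketched as follows. This is our own code (names like `confusion_from_xi` are illustrative, not from the paper); it builds the class-wise uniform confusion matrix \eta from a vector \xi and forms c = \eta \hat{y}:

```python
import numpy as np

# Hedged sketch of Definition 1: eta_kk = xi_k on the diagonal and
# eta_kl = (1 - xi_l) / (K - 1) off the diagonal, so each column of eta is a
# probability distribution.  Then c_k = sum_l eta_kl * p(y_hat = l | x).
def confusion_from_xi(xi: np.ndarray) -> np.ndarray:
    K = xi.shape[0]
    eta = np.tile((1.0 - xi) / (K - 1), (K, 1))   # off-diagonal entries, column-wise
    np.fill_diagonal(eta, xi)                     # diagonal entries eta_kk = xi_k
    return eta

xi = np.array([0.7, 0.7, 0.7])                    # illustrative noise vector xi(x_t)
p_hat = np.array([1.0, 0.0, 0.0])                 # one-hot pseudo-label vector
c = confusion_from_xi(xi) @ p_hat                 # corrected label vector c(x_t)
# c -> [0.7, 0.15, 0.15]: mass is spread from the pseudo-label to other classes
```

A single scalar per class (\xi_k) thus parameterizes the whole K x K matrix, which is what makes it learnable by a discriminator with a K-dimensional output.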
Instead of letting the discriminator perform a domain classification task, we let it generate different noise vectors for the source and target domains. As illustrated in Fig. 2, for the source feature G(x_s), the discriminator aims to minimize the discrepancy between the corrected label vector c(x_s) and the one-hot ground truth \bar{y}_s = one_hot(y_s). The adversarial loss for the source data is:

L_Adv(x_s, y_s) = L_BCE(c(x_s), \bar{y}_s)   (6)
                = -\sum_k [ \bar{y}_{sk} \log c(x_s)_k + (1 - \bar{y}_{sk}) \log(1 - c(x_s)_k) ].   (7)

For the target feature G(x_t), the discriminator does the opposite: it corrects pseudo-labels toward the opposite distribution u(\hat{y}_t) \in R^K, in which u(\hat{y}_t)_k = 0 for k = \hat{y}_t and u(\hat{y}_t)_k = 1 / (K - 1) for k \neq \hat{y}_t. The adversarial loss for the target data is:

L_Adv(x_t) = L_BCE(c(x_t), u(\hat{y}_t)).   (8)

The total adversarial loss becomes:

L_Adv(x_s, y_s, x_t) = L_Adv(x_s, y_s) + L_Adv(x_t).   (9)

The discriminator D minimizes this loss function to distinguish between source and target features. The generator G, on the other hand, has to fool the discriminator by maximizing the loss. Compared to common domain-adversarial learning, this adversarial loss takes the classifier prediction and the label information into consideration. In this way, our noise-correcting domain discriminator can achieve class-wise feature alignment.

Regularization Term

As revealed in works on generative adversarial networks (GANs) (Mao et al. 2017), the training process of adversarial learning can be unstable. Following (Odena, Olah, and Shlens 2016), we add a classification task on the source domain to the discriminator to make its training more stable. Consequently, the discriminator not only has to distinguish the source and target domains but also to correctly classify the source samples.
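The two adversarial targets of Eqs. (6)-(8) can be sketched numerically. This is our own illustration (toy values, our variable names): binary cross-entropy pulls the corrected label c toward the one-hot ground truth on source samples and toward the "opposite" distribution u on target samples:

```python
import numpy as np

# Hedged sketch of the adversarial targets: L_BCE(c, one_hot(y_s)) on the
# source and L_BCE(c, u(y_hat_t)) on the target, where u puts zero mass on the
# pseudo-label and 1/(K-1) everywhere else.
def bce(c: np.ndarray, target: np.ndarray) -> float:
    eps = 1e-12                                    # numerical safety only
    return float(-np.sum(target * np.log(c + eps)
                         + (1 - target) * np.log(1 - c + eps)))

def opposite_distribution(y_hat: int, K: int) -> np.ndarray:
    u = np.full(K, 1.0 / (K - 1))                  # u_k = 1/(K-1) for k != y_hat
    u[y_hat] = 0.0                                 # u_k = 0 at the pseudo-label
    return u

c = np.array([0.7, 0.15, 0.15])                    # a corrected label vector
source_loss = bce(c, np.array([1.0, 0.0, 0.0]))    # source target: one-hot class 0
target_loss = bce(c, opposite_distribution(0, 3))  # target target: u(y_hat_t = 0)
# since c already concentrates on class 0, source_loss < target_loss:
# this c "looks source-like" to the discriminator
```

The discriminator lowers whichever loss matches the sample's true domain, while the generator (through the GRL) pushes in the opposite direction, which is what aligns the domains class-wise.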
To embed the classification task into training, we add a regularization term to the loss of the discriminator:

L_Reg(x_s, y_s) = L_CE(p_D^{(x_s)}, y_s),   (10)

where p_D^{(x_s)} = softmax(D(G(x_s))) and L_CE is the cross-entropy loss. The final loss function for the discriminator then becomes:

min_D E_{(x_s, y_s), x_t} [ L_Adv(x_s, y_s, x_t) + L_Reg(x_s, y_s) ].   (11)

Corrected Loss Function

After the adversarial learning of the confusion matrix \eta(x_t), we can construct a proper loss function for the target samples. As the unhinged loss (Eq. 5) is robust to the uniform part of the noise, we choose the unhinged loss L_unh as the basic loss function L:

L_T(x_t, L_unh) = \sum_{k,l} \eta(x_t)_{kl}\, p(\hat{y}_t = l \mid x_t) L_unh(p_t, k)   (12)
                = \sum_k c(x_t)_k L_unh(p_t, k).   (13)

Together with the supervised loss on the source domain, the losses for the classifier and the generator become:

min_C E_{(x_s, y_s), x_t} [ L_CE(p_s, y_s) + \lambda L_T(x_t, L_unh) ],   (14)
min_G E_{(x_s, y_s), x_t} [ L_CE(p_s, y_s) + \lambda L_T(x_t, L_unh) - \lambda L_Adv(x_s, y_s, x_t) ],   (15)

where \lambda \in [0, 1] is a trade-off parameter.

Theoretical Insight

In the feature space F generated by the generator G, the source and target feature distributions are P_s = {G(x_s) | x_s \in D_s} and P_t = {G(x_t) | x_t \in D_t} respectively. If we assume that both distributions are continuous with densities, then for a feature vector f \in F, the probabilities that it belongs to the source and target distributions are P_s(f) and P_t(f) respectively.

Theorem 1. When the noise-correcting domain discrimination

max_G min_D E_{(x_s, y_s), x_t} L_Adv(x_s, y_s, x_t)   (16)

achieves its optimal point (D*, G*), the feature distributions generated by G* are aligned: P_s = P_t.

Proof. The proof is given in the supplemental material.

As a result, noise-correcting domain discrimination can align the feature distributions of the source and target domains. According to the theory of Ben-David et al. (2010), the expected error on the target samples can be bounded by the expected error on the source domain plus the feature discrepancy between domains.
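The minimax in Eqs. (11) and (15) is implemented with the gradient reverse layer (GRL) mentioned in the Fig. 2 description. The minimal sketch below is our own version in the spirit of Ganin et al. (2016), not the authors' code: forward is the identity, and the backward pass multiplies the gradient by -\lambda, so minimizing the discriminator loss through the GRL makes the generator maximize it in the same backward pass:

```python
import torch

# Hedged minimal GRL sketch: identity forward, gradient scaled by -lambda
# on the way back, folding min_D / max_G into a single optimization step.
class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed and scaled; lambd gets no gradient.
        return -ctx.lambd * grad_output, None

x = torch.ones(3, requires_grad=True)
y = GradReverse.apply(x, 0.5)
y.sum().backward()
# d/dx of sum(x) would normally be all ones; through the GRL it is all -0.5
```

In training, the generator's features pass through this layer before entering the discriminator, so a single `loss.backward()` updates D to minimize L_Adv and G to maximize it.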
Therefore, the target expected error of our noise-correcting domain discrimination is theoretically bounded. Furthermore, we can prove that by optimizing the corrected loss function, the noise in pseudo-labels is reduced.

Theorem 2. When the optimal point (D*, G*) of Theorem 1 is achieved, if there is an optimal labeling function y*(f_s) = y_s, \forall f_s \in P_s, in the feature space F, then \forall x_t \in P_t and f_t = G*(x_t), we have:

c(x_t) = h_{y*(f_t)} if \hat{y}_t = y*(f_t), and c(x_t) = u(\hat{y}_t) otherwise,

where c(x_t) = h_{y*(f_t)} denotes that c(x_t)_k = 1/2 for k = \hat{y}_t and c(x_t)_k = 1/(2K - 2) otherwise.

Proof. The proof is given in the supplemental material.

As Theorem 2 shows, when we optimize the target loss L_T(x_t, L) = \sum_k c(x_t)_k L(p_t, k), the loss on the pseudo-label, L(p_t, \hat{y}_t), is enhanced when \hat{y}_t = y*(f_t) (c(x_t)_{\hat{y}_t} = 1/2) and suppressed otherwise (c(x_t)_{\hat{y}_t} = 0). In this way, the training of the classifier is corrected by the discriminator on the target domain and is more efficient than the original pseudo-label method.

Table 1: Accuracy (%) of different unsupervised domain adaptation methods on Office-31 (ResNet-50).

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
| ResNet-50 (He et al. 2016) | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| DANN (Ganin et al. 2016) | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| ADDA (Tzeng et al. 2017) | 86.2±0.5 | 96.2±0.3 | 98.4±0.3 | 77.8±0.3 | 69.5±0.4 | 68.9±0.5 | 82.9 |
| JAN (Long et al. 2017b) | 85.4±0.3 | 97.4±0.2 | 99.8±0.2 | 84.7±0.3 | 68.6±0.3 | 70.0±0.4 | 84.3 |
| MADA (Pei et al. 2018) | 90.0±0.1 | 97.4±0.1 | 99.6±0.1 | 87.8±0.2 | 70.3±0.3 | 66.4±0.3 | 85.2 |
| CBST (Zou et al. 2018) | 87.8±0.8 | 98.5±0.1 | 100.0±0.0 | 86.5±1.0 | 71.2±0.4 | 70.9±0.7 | 85.8 |
| CAN (Zhang et al. 2018) | 92.5 | 98.8 | 100.0 | 90.1 | 72.1 | 69.9 | 87.2 |
| CDAN+E (Long et al. 2017a) | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7 |
| MCS (Liang et al. 2019) | - | - | - | - | - | - | 87.8 |
| ALDA | 95.6±0.5 | 97.7±0.1 | 100.0±0.0 | 94.0±0.4 | 72.2±0.4 | 72.5±0.2 | 88.7 |

Table 2: Accuracy (%) of different unsupervised domain adaptation methods on Office-Home (ResNet-50).

| Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg |
| ResNet-50 (He et al. 2016) | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DANN (Ganin et al. 2016) | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| JAN (Long et al. 2017b) | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| CDAN+E (Long et al. 2017a) | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| TAT (Liu et al. 2019) | 51.6 | 69.5 | 75.4 | 59.4 | 69.5 | 68.6 | 59.5 | 50.5 | 76.8 | 70.9 | 56.6 | 81.6 | 65.8 |
| ALDA | 53.7 | 70.1 | 76.4 | 60.2 | 72.6 | 71.5 | 56.8 | 51.9 | 77.1 | 70.2 | 56.3 | 82.1 | 66.6 |

Experiments

We evaluate the proposed adversarial-learned loss for domain adaptation (ALDA) against state-of-the-art approaches on four standard unsupervised domain adaptation datasets: digits, Office-31, Office-Home, and VisDA-2017.

Datasets

Digits. Following the evaluation protocol of (Long et al. 2017a), we experiment on three adaptation scenarios: USPS to MNIST (U→M), MNIST to USPS (M→U), and SVHN to MNIST (S→M). MNIST (LeCun 1998) contains 60,000 images of handwritten digits and USPS (Hull 1994) contains 7,291 images. Street View House Numbers (SVHN) (Netzer et al. 2011) consists of 73,257 images of digits and numbers in natural scenes. We report evaluation results on the test sets of MNIST and USPS.

Office-31 (Saenko and Kulis 2010) is a commonly used dataset for unsupervised domain adaptation, which contains 4,652 images in 31 categories collected from three domains: Amazon (A), Webcam (W), and DSLR (D). We evaluate all methods across six domain adaptation tasks: A→W, D→W, W→D, A→D, D→A, and W→A.

Office-Home (Venkateswara et al.
2017) is a more difficult domain adaptation dataset than Office-31, including 15,500 images from four different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw). For each domain, the dataset contains images of 65 object categories that are common in office and home scenarios. We evaluate all methods in 12 adaptation scenarios.

VisDA-2017 (Peng et al. 2017) is a large-scale dataset and challenge for unsupervised domain adaptation from simulation to real imagery. The dataset contains 152,397 synthetic images as the source domain and 55,388 real-world images as the target domain. 12 object categories are shared by the two domains. Following previous works (Saito et al. 2018; Long et al. 2017a), we evaluate all methods on the validation set of VisDA.

Implementation Details

For the digits datasets, we adopt the generator and classifier networks used in (French, Mackiewicz, and Fisher 2018) and optimize the model with the Adam optimizer (Kingma and Ba 2015) with learning rate 1e-3. For the other three datasets, we employ ResNet-50 (He et al. 2016), pretrained on ImageNet (Deng et al. 2009), as the generator network. Our discriminator consists of three fully connected layers with dropout, the same as in other works (Ganin et al. 2016; Long et al. 2017a). As we train the classifier and discriminator from scratch, we set their learning rates to 10 times that of the generator. We train the model with the Stochastic Gradient Descent (SGD) optimizer with momentum 0.9. We schedule the learning rate with the strategy of (Ganin et al. 2016): \eta_p = \eta_0 / (1 + \alpha q)^\beta, where q is the training progress changing linearly from 0 to 1, \eta_0 = 0.01, \alpha = 10, \beta = 0.75. We implement the algorithms using PyTorch (Paszke et al. 2017).

There are two hyper-parameters in our method: the threshold \delta for pseudo-labels and the trade-off \lambda. If the prediction of a target sample is below the threshold, we ignore the sample in training. We set \delta to 0.6 for digit adaptation and 0.9 for the Office-31, Office-Home, and VisDA datasets. In all experiments, \lambda is gradually increased from 0 to 1 by \lambda = 2 / (1 + exp(-10 q)) - 1, as in (Long et al. 2017a).

Table 3: Accuracy (%) of different unsupervised domain adaptation methods on VisDA-2017.

| Method | Backbone | plane | bcycl | bus | car | horse | knife | mcycl | person | plant | sktbrd | train | truck | Avg |
| Source-only | ResNet-101 | 55.1 | 53.3 | 61.9 | 59.1 | 80.6 | 17.9 | 79.7 | 31.2 | 81.0 | 26.5 | 73.5 | 8.5 | 52.4 |
| DANN (Ganin et al. 2016) | ResNet-101 | 81.9 | 77.7 | 82.8 | 44.3 | 81.2 | 29.5 | 65.1 | 28.6 | 51.9 | 54.6 | 82.8 | 7.8 | 57.4 |
| MCD (Saito et al. 2018) | ResNet-101 | 87.0 | 60.9 | 83.7 | 64.0 | 88.9 | 79.6 | 84.7 | 76.9 | 88.6 | 40.3 | 83.0 | 25.8 | 71.9 |
| CBST (Zou et al. 2018) | ResNet-101 | 87.2 | 78.8 | 56.5 | 55.4 | 85.1 | 79.2 | 83.8 | 77.7 | 82.8 | 88.8 | 69.0 | 72.0 | 76.4 |
| ALDA | ResNet-101 | 93.8 | 74.1 | 82.4 | 69.4 | 90.6 | 87.2 | 89.0 | 67.6 | 93.4 | 76.1 | 87.7 | 22.2 | 77.8 |
| Source-only | ResNet-50 | 74.6 | 26.8 | 56.0 | 53.5 | 58.0 | 26.2 | 76.5 | 17.6 | 81.7 | 34.8 | 80.3 | 27.2 | 51.1 |
| CDAN+E (Long et al. 2017a) | ResNet-50 | - | - | - | - | - | - | - | - | - | - | - | - | 70.0 |
| ALDA | ResNet-50 | 87.0 | 61.3 | 78.7 | 67.9 | 83.7 | 89.4 | 89.5 | 71.0 | 95.4 | 71.9 | 89.6 | 33.1 | 76.5 |

Table 4: Accuracy (%) of different unsupervised domain adaptation methods on the digits datasets. We use the base model of (French, Mackiewicz, and Fisher 2018).

| Method | U→M | M→U | S→M | Avg |
| Source-only | 77.5±0.8 | 82.0±1.2 | 66.5±1.9 | 75.3 |
| DANN (Ganin et al. 2016) | 74.0 | 91.1 | 73.9 | 79.7 |
| ADDA (Tzeng et al. 2017) | 90.1 | 89.4 | 76.0 | 85.2 |
| CDAN+E (Long et al. 2017a) | 98.0 | 95.6 | 89.2 | 94.3 |
| MT+CT (French, Mackiewicz, and Fisher 2018) | 92.3±8.6 | 88.1±0.34 | 93.3±5.8 | 91.2 |
| MCD (Saito et al. 2018) | 94.1±0.3 | 96.5±0.3 | 96.2±0.4 | 95.6 |
| MCS (Liang et al. 2019) | 98.2 | 97.8 | 91.7 | 95.9 |
| ALDA (δ = 0.9) | 98.1±0.2 | 94.8±0.1 | 95.6±0.6 | 96.2 |
| ALDA (δ = 0.8) | 98.2±0.1 | 95.4±0.4 | 97.5±0.3 | 97.0 |
| ALDA (δ = 0.6) | 98.6±0.1 | 95.6±0.3 | 98.7±0.2 | 97.6 |
| ALDA (δ = 0.0) | 98.4±0.2 | 95.0±0.1 | 97.0±0.2 | 96.8 |
| Target-only | 99.5±0.0 | 97.3±0.2 | 99.6±0.1 | 98.8 |

Results

Image Results. Table 1 reports the results with ResNet-50 on Office-31. ALDA significantly outperforms state-of-the-art methods.
Because ALDA combines with self-training to learn discriminative features, it achieves better results than domain-adversarial learning-based methods, e.g., DANN, JAN, and MADA. Similar to ALDA, CDAN+E also takes the classification prediction into the discrimination and uses the entropy of the prediction as an importance weight. However, ALDA outperforms CDAN+E on hard transfer tasks, e.g., A→W, A→D, D→A, and W→A. These outstanding results show the importance of combining domain-adversarial learning and self-training based methods properly.

Table 2 summarizes the results with ResNet-50 on Office-Home. On this more difficult adaptation dataset, ALDA still exceeds the most advanced methods. Compared to Office-31, Office-Home has more categories and a larger appearance gap between domains. A larger number of categories means more components in the discriminator output \xi of ALDA, which results in a stronger capacity for class-wise domain discrimination.

Table 3 shows the quantitative results with ResNet-50 and ResNet-101 on the VisDA classification dataset. Even based only on ResNet-50, our ALDA performs better than the other domain adaptation methods.

Digits Results. Table 4 summarizes the experimental results for digits adaptation compared with state-of-the-art methods. For fair comparison, we only resize and normalize the images and do not apply any additional data augmentation as in (French, Mackiewicz, and Fisher 2018). We conduct each experiment three times and report the average results and variance. As the table shows, ALDA outperforms the most advanced distribution alignment methods, e.g., DANN, MCD, and CDAN, as well as self-training based methods, e.g., Mean Teacher with a confidence threshold (MT+CT). ALDA also reduces the performance gap between UDA and supervised learning on the target domain by a large margin.

In Table 4, we also investigate the effect of the threshold \delta for pseudo-labels on the digits datasets.
As we decrease the threshold \delta from 0.9 to 0.6, performance improves. This is because the digits datasets are relatively easy to transfer and do not require high thresholds to obtain high-precision pseudo-labels. A lower threshold brings more target samples into training, which promotes the training of samples with low prediction confidence. For the digits datasets, ALDA with \delta = 0.6 achieves the best result.

Analysis

In Table 5, we perform an ablation study on Office-31 to investigate the effect of the different components of ALDA. Firstly, we apply self-training (Zou et al. 2018) to unsupervised domain adaptation, denoted as "ST". "DANN+ST" denotes directly combining domain-adversarial learning with self-training. However, the performance of "DANN+ST" is inferior to "ALDA", proving the importance of properly combining these two methods.

To investigate the effect of the regularization term L_Reg in Eq. 10, we remove the L_Reg term from the final loss of the discriminator, denoted as "ALDA w/o L_Reg". The results show that without L_Reg, the performance of ALDA drops dramatically. This is because the regularization term enhances the stability of the adversarial process.

To investigate the effect of the corrected target loss L_T in Eq. 13, we remove L_T and keep only the noise-correcting domain discrimination, denoted as "ALDA w/o L_T".

Table 5: Ablation study on Office-31 (ResNet-50). "ST" denotes self-training with pseudo-labels (Zou et al. 2018).

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
| ResNet-50 (He et al. 2016) | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| DANN (Ganin et al. 2016) | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| ST | 89.0 | 99.0 | 100.0 | 86.3 | 67.5 | 63.0 | 84.1 |
| DANN+ST | 91.8 | 98.4 | 100.0 | 89.1 | 68.8 | 68.7 | 86.1 |
| ALDA w/o L_Reg | 93.8 | 98.7 | 100.0 | 91.5 | 70.4 | 67.3 | 87.0 |
| ALDA w/o L_T | 95.0 | 97.5 | 100.0 | 94.0 | 70.8 | 69.0 | 87.7 |
| ALDA+ST w/o L_T | 94.8 | 98.0 | 100.0 | 95.4 | 71.0 | 65.9 | 87.8 |
| ALDA w/ L_T(x, L_CE) | 95.1 | 97.6 | 100.0 | 92.7 | 69.4 | 70.5 | 87.6 |
| ALDA | 95.6±0.5 | 97.7±0.1 | 100.0±0.0 | 94.0±0.4 | 72.2±0.4 | 72.5±0.2 | 88.7 |

As Table 5 shows, "ALDA w/o L_T" achieves competitive results but is inferior to "ALDA". This shows the superiority of our noise-correcting domain discrimination and the importance of combining domain discrimination with corrected pseudo-labels to enhance performance.

Additionally, we replace the corrected target loss L_T with the uncorrected target loss, i.e., self-training with pseudo-labels, denoted as "ALDA+ST w/o L_T". However, "ALDA+ST w/o L_T" does not improve the performance, which manifests the importance of correcting pseudo-labels.

As mentioned before, the unhinged loss has been proved robust to the uniform part of the noise. To verify the effect of choosing the unhinged loss L_unh as the basic loss function, we substitute the cross-entropy loss L_CE for the unhinged loss in the target loss L_T(x, L), denoted as "ALDA w/ L_T(x, L_CE)". The results in Table 5 demonstrate that the cross-entropy loss performs worse than the unhinged loss in ALDA. The unhinged loss removes the uniform part of the noise, which facilitates the noise-correcting process.

Visualization

Figure 3: t-SNE of (a) ResNet-50, (b) self-training, (c) DANN, (d) ALDA for A→W adaptation (red: A; blue: W).

We use t-SNE (van der Maaten and Hinton 2008) to visualize the features extracted by ResNet-50, self-training, DANN, and ALDA for the A→W adaptation (31 classes) in Fig. 3.
When using ResNet-50 only, the target feature distribution is not aligned with the source. Although self-training and DANN can align the distributions of the source and target domains, their target clusters are not fully matched with the source clusters. For ALDA, the target clusters are closely matched with the corresponding source clusters, which demonstrates that the target features extracted by ALDA are well aligned and discriminative.

Conclusion

In this paper, we propose Adversarial-Learned Loss for Domain Adaptation (ALDA) to combine the strengths of domain-adversarial learning and self-training. We first introduce the confusion matrix to represent the noise in pseudo-labels. As the confusion matrix is unknown, we employ noise-correcting domain discrimination to learn it. The target classifier is then optimized with the corrected loss function. ALDA is theoretically and experimentally shown to be effective for unsupervised domain adaptation and achieves state-of-the-art performance on four standard datasets.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2018AAA0101400) and in part by the National Natural Science Foundation of China (Grant Nos. 61936006, 61973271).

References

Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1):151-175.
Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2017. Mode regularized generative adversarial networks. In ICLR.
Chen, M.; Xue, H.; and Cai, D. 2019. Domain adaptation for semantic segmentation with maximum squares loss. In ICCV.
Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; and Li, F. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
French, G.; Mackiewicz, M.; and Fisher, M. 2018. Self-ensembling for visual domain adaptation. In ICLR.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. S. 2016. Domain-adversarial training of neural networks. JMLR 17:2096–2030.

Ghosh, A.; Kumar, H.; and Sastry, P. S. 2017. Robust loss functions under label noise for deep neural networks. In AAAI.

Grandvalet, Y., and Bengio, Y. 2004. Semi-supervised learning by entropy minimization. In NIPS.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Hong, W.; Wang, Z.; Yang, M.; and Yuan, J. 2018. Conditional generative adversarial network for structured domain adaptation. In CVPR.

Hull, J. J. 1994. A database for handwritten text recognition research. PAMI 16(5):550–554.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.

LeCun, Y. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Lee, D.-H. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML.

Liang, J.; He, R.; Sun, Z.; and Tan, T. 2019. Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation. In CVPR.

Liu, H.; Long, M.; Wang, J.; and Jordan, M. 2019. Transferable adversarial training: A general approach to adapting deep classifiers. In ICML.

Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML.

Long, M.; Cao, Z.; Wang, J.; and Jordan, M. I. 2017a. Conditional adversarial domain adaptation. In NeurIPS.

Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2017b. Deep transfer learning with joint adaptation networks. In ICML.

Mao, X.; Li, Q.; Xie, H.; Lau, R. Y. K.; Wang, Z.; and Smolley, S. P. 2017. Least squares generative adversarial networks. In ICCV.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS.
Odena, A.; Olah, C.; and Shlens, J. 2016. Conditional image synthesis with auxiliary classifier GANs. In ICML.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Pei, Z.; Cao, Z.; Long, M.; and Wang, J. 2018. Multi-adversarial domain adaptation. In AAAI.

Peng, X.; Usman, B.; Kaushik, N.; Hoffman, J.; Wang, D.; and Saenko, K. 2017. VisDA: The visual domain adaptation challenge. arXiv:1710.06924.

Saenko, K., and Kulis, B. 2010. Adapting visual category models to new domains. In ECCV.

Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.

Saito, K.; Kim, D.; Sclaroff, S.; Darrell, T.; and Saenko, K. 2019. Semi-supervised domain adaptation via minimax entropy. arXiv:1904.06487.

Sukhbaatar, S., and Fergus, R. 2014. Learning from noisy labels with deep neural networks. In ICLR.

Sun, B., and Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV Workshops.

Tarvainen, A., and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS.

Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR.

Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. CoRR abs/1412.3474.

Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR.

van der Maaten, L., and Hinton, G. E. 2008. Visualizing data using t-SNE. JMLR 9:2579–2605.

van Rooyen, B.; Menon, A. K.; and Williamson, R. C. 2015. Learning with symmetric label noise: The importance of being unhinged. In NIPS.

Venkateswara, H.; Eusebio, J.; Chakraborty, S.; and Panchanathan, S. 2017. Deep hashing network for unsupervised domain adaptation. In CVPR.
Xie, S.; Zheng, Z.; Chen, L.; and Chen, C. 2018. Learning semantic representations for unsupervised domain adaptation. In ICML.

Zhang, Z., and Sabuncu, M. R. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. In NIPS.

Zhang, W.; Ouyang, W.; Li, W.; and Xu, D. 2018. Collaborative and adversarial network for unsupervised domain adaptation. In CVPR.

Zou, Y.; Yu, Z.; Kumar, B. V. K. V.; and Wang, J. 2018. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV.