The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

# Multi-Source Distilling Domain Adaptation

Sicheng Zhao,1# Guangzhi Wang,2# Shanghang Zhang,1# Yang Gu,2 Yaxian Li,2,3 Zhichao Song,2 Pengfei Xu,2 Runbo Hu,2 Hua Chai,2 Kurt Keutzer1

1University of California, Berkeley, USA; 2Didi Chuxing, China; 3Renmin University of China, China
{schzhao, gzwang98, shzhang.pku}@gmail.com, liyaxian@ruc.edu.cn, {guyangdavid, songzhichao, xupengfeipf, hurunbo, chaihua}@didiglobal.com, keutzer@berkeley.edu

## Abstract

Deep neural networks suffer from performance decay when there is domain shift between the labeled source domain and the unlabeled target domain, which motivates research on domain adaptation (DA). Conventional DA methods usually assume that the labeled data is sampled from a single source distribution. In practice, however, labeled data may be collected from multiple sources, and naive application of single-source DA algorithms may lead to suboptimal solutions. In this paper, we propose a novel multi-source distilling domain adaptation (MDDA) network, which not only considers the different distances between the multiple sources and the target, but also investigates the different similarities of the source samples to the target ones. Specifically, the proposed MDDA includes four stages: (1) pre-train the source classifiers separately using the training data from each source; (2) adversarially map the target into the feature space of each source respectively by minimizing the empirical Wasserstein distance between source and target; (3) select the source training samples that are closer to the target to fine-tune the source classifiers; and (4) classify each encoded target feature with the corresponding source classifier, and aggregate the different predictions using respective domain weights, which correspond to the discrepancy between each source and the target. Extensive experiments are conducted on public DA benchmarks, and the results demonstrate that the proposed MDDA significantly outperforms the state-of-the-art approaches. Our source code is released at: https://github.com/daoyuan98/MDDA.

## Introduction

One key element of the significant success of deep learning algorithms is the availability of large-scale labeled data (He et al. 2016). However, in many practical applications, only limited or even no training data is provided. On the one hand, it is prohibitively labor-intensive and expensive to obtain abundant labeled data. On the other hand, visual data possess variance in nature, which fundamentally limits the scalability and applicability of supervised learning models for handling new scenarios with few labeled examples (Ni, Zhang, and Xie 2019). In such cases, conventional deep learning approaches suffer from performance decay.

Corresponding author. #Equal contribution. Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Illustration of MDDA, which explores the relationships among different sources and the target. We employ a discriminator D to measure the similarity ω between each source and the target in an adversarial manner. The samples that are closer to the target are selected to distill the source classifier C. The predictions of the different distilled source classifiers are aggregated based on the domain similarity to obtain the final prediction of the target samples.
Directly transferring models trained on labeled source domains to unlabeled target domains may result in unsatisfactory performance because of the presence of domain shift (Torralba and Efros 2011), which calls for domain adaptation (DA) methods (Bousmalis et al. 2016; Zhao et al. 2018b; Hoffman et al. 2018). Unsupervised DA (UDA) addresses such problems by establishing knowledge transfer from a labeled source domain to an unlabeled target domain, and by exploring domain-invariant structures and representations to bridge the domain gap (Netzer et al. 2011). Both theoretical results (Ben-David et al. 2010; Gopalan, Li, and Chellappa 2014; Tzeng et al. 2017) and algorithms for domain adaptation (Pan and Yang 2010; Long et al. 2015; Hoffman et al. 2018; Zhao et al. 2019b) have been proposed recently.

Figure 2: The framework of the proposed multi-source distilling domain adaptation (MDDA) network. Dashed rectangles and trapezoids indicate fixed network parameters. F, C, and D are short for feature extractor, classifier, and domain discriminator, respectively. For simplicity, we just show the ith and kth source domains. The proposed MDDA consists of four stages, shown from left to right: source classifier pre-training, adversarial discriminative adaptation, source distilling, and aggregated target prediction. Best viewed in color.

Though these methods make progress on DA, most of them focus on the single-source setting (Sun et al. 2011; Ganin et al. 2016) and fail to consider a more practical scenario in which there are multiple labeled source domains with different distributions. Naive application of single-source DA algorithms may lead to suboptimal solutions (Shen et al. 2017), which calls for effective multi-source domain adaptation (MDA) techniques.

Recently, some deep MDA approaches have been proposed (Zhao et al. 2018a; Xu et al. 2018; Li et al. 2018; Peng et al. 2019; Zhao et al. 2019a), but most of them suffer from the following limitations. (1) They sacrifice the discriminative property of the extracted features for the desired task learner in order to learn domain-invariant features. (2) They treat the multiple sources equally and fail to consider the different discrepancies between the sources and the target, as illustrated in Figure 1. Such treatment may lead to suboptimal performance when some sources are very different from the target (Zhao et al. 2018a). (3) They treat the different samples from each source equally, without distilling the source data, even though different samples from the same source domain may have different similarities to the target. (4) The adversarial-learning-based methods suffer from the vanishing gradient problem when the domain classifier network can perfectly distinguish target representations from the source ones. In this paper, we propose a novel multi-source distilling domain adaptation (MDDA) network to address the above challenges by thoroughly exploring the relationships among different sources and the target.
As shown in Figure 2, MDDA can be divided into four stages. (1) We first pre-train the source classifiers separately using the training data from each source. (2) We fix the feature extractor of each source and adversarially map the target into the feature space of each source respectively by minimizing the empirical Wasserstein distance between the source and the target (Arjovsky, Chintala, and Bottou 2017), which provides more stable gradients even when the target and source distributions are non-overlapping. (3) We select the source training samples that are closer to the target to fine-tune the source classifiers. (4) We build the target predictor by aggregating the source predictions based on the source domain weights, which correspond to the discrepancy between each source and the target. We propose a mechanism to automatically choose a weighting strategy over source domains that emphasizes more relevant sources and suppresses the irrelevant ones, and we aggregate the multiple source classifiers based on these weights. With the above four stages, the proposed MDDA can extract features that are both discriminative for the learning task and indiscriminate with respect to the shift among the multiple source and target domains.

The main contributions of this paper are summarized as follows:

- We propose MDDA to explore the relationships among different sources and the target, and achieve more accurate inference on the target by fine-tuning and aggregating the source classifiers based on these relationships.
- Compared to (Xu et al. 2018), which symmetrically maps the multiple sources and the target into the same space, MDDA learns more discriminative target representations and avoids the oscillation caused by the simultaneous changing of the multi-source and target distributions, by using separate feature extractors that asymmetrically map the target to the feature space of each source in an adversarial manner. Wasserstein distance is used in the adversarial training to achieve more stable gradients even when the target and source distributions are non-overlapping.
- We propose a source distilling mechanism to select the source training samples that are closer to the target and fine-tune the source classifiers with these samples.
- We propose a novel mechanism to automatically choose a weighting strategy over source domains that emphasizes more relevant sources and suppresses the irrelevant ones, and aggregates the multiple source classifiers based on these weights to build a more accurate target predictor.
- We extensively evaluate MDDA on public benchmarks, achieving state-of-the-art performance and verifying the efficacy of MDDA.

## Related Work

### Single-source UDA

The emphasis of recent single-source UDA (SUDA) methods has shifted to deep learning architectures in an end-to-end fashion. Most deep SUDA methods employ a conjoined architecture with two streams that respectively represent the models for the source domain and the target domain (Zhuo et al. 2017). Generally, these methods are trained jointly with a traditional task loss based on the labeled source data and another loss that tackles the domain shift problem, such as a discrepancy loss, adversarial loss, or reconstruction loss. Discrepancy-based methods explicitly measure the discrepancy between the source and target domains of the two network streams, such as the multiple kernel variant of maximum mean discrepancy (Long et al. 2015), correlation alignment (CORAL) (Sun, Feng, and Saenko 2016; 2017; Zhuo et al. 2017), and contrastive domain discrepancy (Kang et al. 2019).
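To make the discrepancy-loss idea concrete, below is a minimal sketch of the CORAL objective of (Sun, Feng, and Saenko 2016), which penalizes the squared Frobenius distance between source and target feature covariances. The function name and the batch-based covariance estimate are our own illustration, not code from the cited work.

```python
import torch

def coral_loss(source_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius distance between source/target feature covariances.

    source_feat, target_feat: (batch, d) feature matrices from the two streams.
    """
    d = source_feat.size(1)

    def covariance(f: torch.Tensor) -> torch.Tensor:
        f = f - f.mean(dim=0, keepdim=True)   # center the features
        return f.t() @ f / (f.size(0) - 1)    # (d, d) sample covariance

    c_s, c_t = covariance(source_feat), covariance(target_feat)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)  # 1/(4 d^2) scaling as in the CORAL paper
```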
Adversarial generative models combine a domain-discriminative model with a generative component that generates fake source or target data, generally based on GAN (Goodfellow et al. 2014) and its variants, such as CoGAN (Liu and Tuzel 2016), SimGAN (Shrivastava et al. 2017), CycleGAN (Zhu et al. 2017; Zhao et al. 2019b), and CyCADA (Hoffman et al. 2018). Adversarial discriminative models usually employ an adversarial objective with respect to a domain discriminator to encourage domain confusion (Ganin et al. 2016; Tzeng et al. 2017; Chen et al. 2017; Shen et al. 2017; Tsai et al. 2018; Huang, Huang, and Krahenbuhl 2018). Most of these methods suffer from low accuracy when directly applied to the MDA problem.

### Multi-source DA

MDA assumes that training data are collected from multiple sources (Sun, Shi, and Wu 2015; Zhao et al. 2019a). There are some theoretical analyses (Ben-David et al. 2010; Hoffman, Mohri, and Zhang 2018) that support existing MDA algorithms. The early MDA methods mainly focus on shallow models and fall into two categories (Sun, Shi, and Wu 2015): feature representation approaches (Sun et al. 2011; Duan, Xu, and Chang 2012; Chattopadhyay et al. 2012; Duan, Xu, and Tsang 2012) and combinations of pre-learned classifiers (Xu and Sun 2012; Sun and Shi 2013). Some novel shallow MDA methods aim to deal with special cases, such as incomplete MDA (Ding, Shao, and Fu 2018) and target shift (Redko et al. 2019). Recently, some representative deep-learning-based MDA methods have been proposed, such as the multi-source domain adversarial network (MDAN) (Zhao et al. 2018a), deep cocktail network (DCTN) (Xu et al. 2018), and moment matching network (MMN) (Peng et al. 2019). All these MDA methods employ a shared feature extractor network to symmetrically map the multiple sources and the target into the same space. For each source-target pair in MDAN and DCTN, a discriminator is trained to distinguish the source and target features. MDAN concatenates all extracted source features and labels into one domain to train a single task classifier, while DCTN trains a classifier for each source domain and combines the predictions of the different classifiers for a target image using perplexity scores as weights. MMN transfers the learned knowledge from multiple sources to the target by dynamically aligning moments of their feature distributions; the final prediction for a target image is averaged uniformly over the classifiers from the different source domains. Different from these works, we employ an unshared feature extractor to obtain the feature representation for each source, match the target feature to each source feature space asymmetrically, distill the pre-trained classifiers with selected representative samples, and combine the predictions of the different classifiers using a novel weighting strategy.

## Problem Definition

Suppose we have $M$ source domains $S_1, S_2, \ldots, S_M$ and one target domain $T$. In the unsupervised domain adaptation (UDA) scenario, $S_1, S_2, \ldots, S_M$ are labeled and $T$ is fully unlabeled. For the $i$th source domain $S_i$, the observed images and corresponding labels drawn from the source distribution $p_i(x, y)$ are $X_i = \{x_i^j\}_{j=1}^{N_i}$ and $Y_i = \{y_i^j\}_{j=1}^{N_i}$, where $N_i$ is the number of source images. The target images drawn from the target distribution $p_T(x, y)$ are $X_T = \{x_T^j\}_{j=1}^{N_T}$ without label observation, where $N_T$ is the number of target images. Unless otherwise specified, we make two assumptions.
(1) Homogeneity, i.e. $x_i^j \in \mathbb{R}^d$ and $x_T^j \in \mathbb{R}^d$, which indicates that the data from different domains are observed in the same feature space but exhibit different distributions. (2) Closed set, i.e. $y_i^j \in \mathcal{Y}$ and $y_T^j \in \mathcal{Y}$, where $\mathcal{Y}$ is the class label space, indicating that all the domains share the same categories. Our goal is to learn an adaptation model that can correctly predict a sample from the target domain based on $\{(X_i, Y_i)\}_{i=1}^{M}$ and $X_T$. Please note that our method can be easily extended to tackle heterogeneous DA (Li et al. 2014; Hubert Tsai, Yeh, and Frank Wang 2016) by changing the network structure of the target feature extractor, open set DA (Panareda Busto and Gall 2017) by adding an unknown class, or category shift DA (Xu et al. 2018) by reweighing the predictions of only those domains that contain the specified category. We leave such studies to future work.

## Multi-source Distilling Domain Adaptation

In this section, we introduce the proposed multi-source distilling domain adaptation (MDDA) network. MDDA is a novel approach that overcomes the limitations of existing methods for multi-source domain adaptation by thoroughly exploring the relationships among different sources and the target. It achieves more accurate inference on the target by fine-tuning and aggregating the source classifiers based on these relationships. As shown in Figure 2, MDDA can be divided into four stages. We first pre-train the source classifiers separately with the training data from each source. Then, we fix the feature extractor of each source and map the target into the feature space of each source adversarially by minimizing the estimated Wasserstein distance between the source and the target. MDDA learns more discriminative target representations and avoids the oscillation caused by the simultaneous changing of the multi-source and target distributions by using separate feature extractors that asymmetrically map the target to the feature space of each source in an adversarial manner. In the third stage, the source samples closer to the target are selected to fine-tune the source classifiers. Finally, we build the target predictor by aggregating the source predictions based on the discrepancy between each source and the target. We propose a novel mechanism to automatically choose a weighting strategy over source domains that emphasizes more relevant sources and suppresses the irrelevant ones. With the above four stages, MDDA extracts features that are both discriminative for the learning task and indiscriminate with respect to the shift among the multiple source and target domains. We explain each stage in the following subsections.

### Source Classifier Pre-training

To extract more task-discriminative features and learn accurate classifiers, we pre-train a feature extractor $F_i$ and classifier $C_i$ for each labeled source domain $S_i$, with unshared weights between different domains. Taking an $N$-class classification task as an example, $F_i$ and $C_i$ are optimized by minimizing the following cross-entropy loss:

$$\mathcal{L}_{cls}(F_i, C_i) = -\mathbb{E}_{(x_i, y_i) \sim p_i} \sum_{n=1}^{N} \mathbb{1}_{[n = y_i]} \log\big(\sigma(C_i(F_i(x_i)))\big), \quad (1)$$

where $\sigma$ is the softmax function and $\mathbb{1}$ is an indicator function. Compared with a shared feature extractor network that extracts domain-invariant features among different source domains (Zhao et al. 2018a; Xu et al. 2018), the unshared feature extractor network can obtain discriminative feature representations and an accurate classifier for each source domain. When aggregating the multiple predictions based on the source classifiers and the matched target features in the later stage, the final target prediction is thus better boosted.
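As a concrete reference for stage 1, here is a minimal PyTorch sketch of per-source pre-training with the cross-entropy loss of Eq. (1). The implementation details later specify only three convolutional and two fully connected layers for the Digits-five encoder and one fully connected layer for the classifier; the channel widths, the assumed 32x32 RGB input, and the optimizer settings are our own assumptions.

```python
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Per-source feature extractor F_i: three conv + two FC layers (Digits-five setting).

    Layer counts follow the paper's implementation details; widths and the
    assumed 32x32 RGB input are illustrative choices.
    """
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 256), nn.ReLU(),  # 32x32 input -> 4x4 spatial map
            nn.Linear(256, feat_dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x))

def pretrain_source(encoder: nn.Module, classifier: nn.Module, loader,
                    epochs: int = 10, lr: float = 1e-3) -> None:
    """Stage 1: minimize the cross-entropy loss of Eq. (1) on one source domain."""
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()  # softmax + negative log-likelihood, matching Eq. (1)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            ce(classifier(encoder(x)), y).backward()
            opt.step()

# Usage: one unshared encoder/classifier pair per source domain, e.g.
# F_i, C_i = SourceEncoder(), nn.Linear(128, 10); pretrain_source(F_i, C_i, source_loader_i)
```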
### Adversarial Discriminative Adaptation

After the pre-training stage, we learn a separate target encoder $F_i^T$ that maps the target into the feature space of source $S_i$. A discriminator $D_i$ is trained adversarially to distinguish the encoded target features from $F_i^T$ and the encoded source features from the pre-trained $F_i$ by maximizing the estimated Wasserstein distance between them, while $F_i^T$ tries to confuse $D_i$, i.e. to minimize the Wasserstein distance. Similar to GAN (Goodfellow et al. 2014), we model this as a two-player minimax game. Following (Arjovsky, Chintala, and Bottou 2017), we suppose the discriminators $\{D_i\}$ are all 1-Lipschitz; we can then optimize $D_i$ by maximizing the Wasserstein distance

$$\mathcal{L}_{wd_D}(D_i) = \mathbb{E}_{x_i \sim p_i}\big[D_i(F_i(x_i))\big] - \mathbb{E}_{x_T \sim p_T}\big[D_i(F_i^T(x_T))\big], \quad (2)$$

while $F_i^T$ is obtained by minimizing

$$\mathcal{L}_{wd_F}(F_i^T) = -\mathbb{E}_{x_T \sim p_T}\big[D_i(F_i^T(x_T))\big]. \quad (3)$$

In this way, the target encoder $F_i^T$ tries to confuse the discriminator $D_i$ by minimizing the Wasserstein distance between the encoded target features and the source ones. To enforce the Lipschitz constraint, we add a gradient penalty on the parameters of each discriminator $D_i$, as in (Gulrajani et al. 2017):

$$\mathcal{L}_{grad}(D_i) = \big(\|\nabla_{\hat{x}} D_i(\hat{x})\|_2 - 1\big)^2, \quad (4)$$

where $\hat{x}$ is a feature set that contains not only the source and target features but also random points along the straight lines between source and target feature pairs (Gulrajani et al. 2017). $D_i$ can then be optimized by

$$\max_{D_i} \; \mathcal{L}_{wd_D}(D_i) - \alpha \mathcal{L}_{grad}(D_i), \quad (5)$$

where $\alpha$ is a balancing coefficient whose value can be set empirically.
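A hedged PyTorch sketch of this stage follows, using mini-batch estimates of Eqs. (2)-(5). The optimizer choice, the number of critic steps per encoder step, and the loader behavior (labeled source batches, unlabeled target batches, both cycled indefinitely) are our assumptions, not details specified by the paper.

```python
import itertools
import torch

def gradient_penalty(critic, feat_s, feat_t):
    """Eq. (4) on random interpolates between source/target feature pairs (Gulrajani et al. 2017)."""
    eps = torch.rand(feat_s.size(0), 1, device=feat_s.device)
    interp = (eps * feat_s + (1 - eps) * feat_t).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

def adapt_target_encoder(F_i, F_T_i, D_i, src_loader, tgt_loader,
                         alpha=10.0, n_critic=5, steps=10000, lr=1e-4):
    """Stage 2: align target features with source S_i; F_i stays frozen (Eqs. 2-5)."""
    opt_d = torch.optim.Adam(D_i.parameters(), lr=lr)
    opt_f = torch.optim.Adam(F_T_i.parameters(), lr=lr)
    src = itertools.cycle(src_loader)   # assumed to yield (x, y) source batches
    tgt = itertools.cycle(tgt_loader)   # assumed to yield unlabeled target batches x
    for _ in range(steps):
        for _ in range(n_critic):               # several critic updates per encoder update
            x_s, _ = next(src)
            f_s = F_i(x_s).detach()             # pre-trained source encoder is fixed
            f_t = F_T_i(next(tgt)).detach()     # detach: this step trains only D_i
            # Maximize Eq. (2) minus alpha * Eq. (4), i.e. minimize the negation of Eq. (5).
            wd = D_i(f_s).mean() - D_i(f_t).mean()
            loss_d = -wd + alpha * gradient_penalty(D_i, f_s, f_t)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Encoder step, Eq. (3): pull the target features toward the source cluster.
        loss_f = -D_i(F_T_i(next(tgt))).mean()
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```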
### Source Distilling

We further dig into each source domain and select the source training samples that are closer to the target, based on the estimated Wasserstein distance, to fine-tune the source classifiers. This source distilling mechanism utilizes more relevant training data and further improves the target performance of the aggregated source classifiers. We select the source samples based on the estimated Wasserstein distance, since it represents the divergence between the source data and the target data. For each source sample $x_i^j$ in the $i$th source domain, we calculate the Wasserstein distance between the sample and the target domain:

$$\tau_i^j = \Big\| D_i(F_i(x_i^j)) - \frac{1}{N_T} \sum_{k=1}^{N_T} D_i(F_i^T(x_T^k)) \Big\|. \quad (6)$$

For each source sample $x_i^j$, $\tau_i^j$ reflects its distance to the target domain: the smaller the value of $\tau_i^j$, the closer the sample is to the target domain. Therefore, in each source domain $X_i$, we select the $N_i/2$ source samples $\hat{p}_i = \{\hat{x}_i^j, \hat{y}_i^j\}_{j=1}^{N_i/2}$ with the smallest $\tau_i^j$ values. With these selected source data, we fine-tune $C_i$ by minimizing the following objective:

$$\mathcal{L}_{distill}(C_i) = -\mathbb{E}_{(\hat{x}_i, \hat{y}_i) \sim \hat{p}_i} \sum_{n=1}^{N} \mathbb{1}_{[n = \hat{y}_i]} \log\big(\sigma(C_i(F_i(\hat{x}_i)))\big). \quad (7)$$

### Aggregated Target Prediction

In the testing stage, the goal is to accurately classify a given target image $x_T$. Corresponding to each source domain, we extract the features $F_i^T(x_T)$ of the target image with the learned target encoder from stage 2, and obtain the source-specific prediction $C_i'(F_i^T(x_T))$ using the distilled source classifier. Next, we combine the different predictions from each source classifier to obtain the final prediction:

$$\mathrm{Result}(x_T) = \sum_{i=1}^{M} \omega_i \, C_i'(F_i^T(x_T)). \quad (8)$$

The key problem here is how to select the weights $\omega_i$ for the predictions from the different source classifiers. We design a novel weighting strategy based on the discrepancy between each source and the target to emphasize more relevant sources and suppress the irrelevant ones. We assume that after training in stage 2, the estimated Wasserstein distance $\mathcal{L}_{wd_{D_i}}$ between each source $S_i$ and the target $T$ follows a standard Gaussian distribution $\mathcal{N}(0, 1)$, and the weight of each domain is computed from this distance accordingly, so that sources with smaller Wasserstein distances to the target receive larger normalized weights.
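To make stages 3 and 4 concrete, here is a minimal PyTorch sketch, assuming the critic D_i outputs one scalar per feature vector. The exact closed form of the domain weight is not recoverable from this copy of the paper; consistent with the N(0, 1) assumption above, the sketch uses the standard normal density of each source's estimated Wasserstein distance, normalized over sources, as one plausible instantiation. That particular formula is our assumption, not the paper's verified equation.

```python
import math
import torch

@torch.no_grad()
def sample_distances(D_i, F_i, F_T_i, source_x, target_x):
    """Eq. (6): distance of each source sample's critic score to the mean target score."""
    target_mean = D_i(F_T_i(target_x)).mean()
    return (D_i(F_i(source_x)).flatten() - target_mean).abs()  # tau_i^j per sample

def distill_indices(tau: torch.Tensor) -> torch.Tensor:
    """Stage 3: keep the N_i/2 source samples closest to the target (smallest tau);
    these are then used to fine-tune C_i with the cross-entropy loss of Eq. (7)."""
    return torch.argsort(tau)[: tau.numel() // 2]

def domain_weights(wd_per_source):
    """Stage 4: one weight per source from its estimated Wasserstein distance.

    Assumed form: standard normal density of the distance, normalized over
    sources, so closer sources (smaller distance) receive larger weights.
    """
    dens = [math.exp(-0.5 * w * w) for w in wd_per_source]
    total = sum(dens)
    return [d / total for d in dens]

@torch.no_grad()
def aggregate_prediction(x_t, target_encoders, distilled_classifiers, weights):
    """Eq. (8): weighted sum of the distilled per-source predictions."""
    preds = [w * C(F(x_t)) for w, C, F in zip(weights, distilled_classifiers, target_encoders)]
    return torch.stack(preds).sum(dim=0)
```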
## Experiments

We evaluate the proposed MDDA model on the multi-source domain adaptation task in visual classification applications, including digit recognition and object classification.

### Experimental Settings

**Benchmarks.** Digits-five includes 5 digit image datasets sampled from different domains, including handwritten mt (MNIST) (LeCun et al. 1998), combined mm (MNIST-M) (Ganin and Lempitsky 2015), street image sv (SVHN) (Netzer et al. 2011), synthetic sy (Synthetic Digits) (Ganin and Lempitsky 2015), and handwritten up (USPS) (Hull 1994). Following (Xu et al. 2018; Peng et al. 2019), we sample 25,000 images for training and 9,000 for testing in mt, mm, sv, and sy, and select the entire 9,298 images in up as a domain. Office-31 (Saenko et al. 2010) contains 4,110 images within 31 categories, collected from office environments in 3 image domains: A (Amazon), downloaded from amazon.com, and W (Webcam) and D (DSLR), taken by a web camera and a digital SLR camera, respectively.

**Baselines.** To compare MDDA with the state-of-the-art approaches for MDA, we select the following methods as baselines. (1) Source-only: train on the source domains and test on the target domain directly; this can be viewed as a lower bound of DA. (2) Single-source DA: perform multi-source DA via single-source DA, including conventional models, i.e. TCA (Pan et al. 2011) and GFK (Gong et al. 2012), and deep methods, i.e. DDC (Tzeng et al. 2015), DRCN (Ghifary et al. 2016), RevGrad (Ganin and Lempitsky 2015), DAN (Long et al. 2015), RTN (Long et al. 2016), CORAL (Sun, Feng, and Saenko 2016), DANN (Ganin et al. 2016), and ADDA (Tzeng et al. 2017). (3) Multi-source DA: extend single-source DA methods to multi-source settings, including DCTN (Xu et al. 2018) and MDAN (Zhao et al. 2018a). For the source-only and single-source DA standards, we employ two strategies: (1) source-combined, i.e. all source domains are combined into a traditional single source; (2) single-best, i.e. performing adaptation on each single source and selecting the best adaptation result on the target test set.

**Implementation Details.** In the Digits-five experiments, we use three convolutional layers and two fully connected layers as the encoder and one fully connected layer as the classifier. In the Office-31 experiments, we use AlexNet as our backbone; the last layer is used as the classifier and the other layers are used as the encoder. Following (Gulrajani et al. 2017), we set α in Eq. (5) to 10.

### Comparison with the State-of-the-art

The performance comparisons between MDDA and the state-of-the-art approaches, as measured by classification accuracy, are shown in Table 1 and Table 2 on the Digits-five and Office-31 datasets, respectively. From the results, we have the following observations.

(1) The source-only method, i.e. directly transferring the models trained on the source domains to the target domain, performs the worst in most adaptation settings. Due to the presence of domain shift, the joint probability distributions of observed images and class labels greatly differ in the source and target domains. This results in the model's low transferability from the source domains to the target domain. Further, even with more training samples, the Combined setting does not guarantee better performance than the Single-best one. This is because domain shift also exists across different source domains, which may confuse the classifier. For example, if one source domain is very similar to the target, such as sv and sy, and the other source domains are quite different, simple combination would enlarge the domain shift between the Single-best and the target. This observation demonstrates the necessity of designing DA algorithms to address the domain shift problem.

(2) Almost all adaptation methods outperform the source-only methods, demonstrating the effectiveness of DA in image classification. Comparing Single-best DA and Source-combined DA, it is clear that on average Source-combined DA performs better, which differs from the source-only scenario. This is because, after adaptation, domain-invariant representations are learned for the samples of different domains. Therefore, Source-combined DA works better with the help of more training data.

(3) Generally, multi-source DA performs better than the other adaptation standards. This is clearer when comparing methods that employ similar adaptation architectures, such as our MDDA vs. ADDA (Tzeng et al. 2017) and MDAN (Zhao et al. 2018a) vs. DANN (Ganin et al. 2016). Not only the domain shift between the sources and the target, but also the shift across the different source domains is bridged in multi-source DA, which boosts the adaptation by exploring the complementarity of different sources.

(4) The proposed MDDA model performs better than state-of-the-art multi-source methods in most cases. On one hand, the performance improvements of MDDA over the best Source-combined method are 3.1% and 0.5% on the Digits-five and Office-31 datasets, respectively. On the other hand, MDDA achieves 3.3% and 4.8% performance improvements over DCTN (Xu et al. 2018) and MDAN (Zhao et al. 2018a) on Digits-five, and 0.4% and 0.9% on Office-31. These results demonstrate that the proposed MDDA model can achieve superior performance relative to state-of-the-art approaches. The performance improvements benefit from the advantages of MDDA. First, the unshared weights enable learning the best feature extractor and classifier for each source domain, which boosts the performance during aggregation. Second, the novel weighting strategy based on the Wasserstein distance can better emphasize the domains that are closer to the target. Finally, for each source domain, selected samples are distilled to fine-tune the source classifier, which further adapts it to the target features.

Table 1: Classification accuracy (%) on the Digits-five dataset for multi-source unsupervised domain adaptation. The best result in each column is emphasized in bold. Our method achieves 88.1% average accuracy, significantly outperforming the state-of-the-art approaches.
| Standards | Models | mm | mt | up | sv | sy | Avg |
|---|---|---|---|---|---|---|---|
| Source-only | Combined | 63.7 | 92.3 | 87.2 | 66.3 | 84.8 | 78.9 |
| Source-only | Single-best | 59.2 | 97.2 | 84.7 | 77.7 | 85.2 | 80.8 |
| Single-best DA | DAN (2015) | 63.8 | 96.3 | **94.2** | 62.5 | 85.4 | 80.4 |
| Single-best DA | CORAL (2016) | 62.5 | 97.2 | 93.5 | 64.4 | 82.8 | 80.1 |
| Single-best DA | DANN (2016) | 71.3 | 97.6 | 92.3 | 63.5 | 85.3 | 82.0 |
| Single-best DA | ADDA (2017) | 71.6 | 97.9 | 92.8 | 75.5 | 86.5 | 84.9 |
| Source-combined DA | DAN (2015) | 67.9 | 97.5 | 93.5 | 67.8 | 86.9 | 82.7 |
| Source-combined DA | DANN (2016) | 70.8 | 97.9 | 93.5 | 68.5 | 87.4 | 83.6 |
| Source-combined DA | ADDA (2017) | 72.3 | 97.9 | 93.1 | 75.0 | 86.7 | 85.0 |
| Multi-source DA | DCTN (2018) | 70.5 | 96.2 | 92.8 | 77.6 | 86.8 | 84.8 |
| Multi-source DA | MDAN (2018a) | 69.5 | 98.0 | 92.5 | 69.2 | 87.4 | 83.3 |
| Multi-source DA | MDDA (ours) | **78.6** | **98.8** | 93.9 | **79.3** | **89.7** | **88.1** |

Figure 3: The t-SNE (Maaten and Hinton 2008) visualization of the Digits-five features for (a) mt→up and (b) mm→sy. In each pair, the first image shows features extracted by the last layer of the source-domain encoder from the source and target samples, and in the second image the target-domain features are extracted by the last layer of the adapted encoder.

Table 2: Classification accuracy (%) on the Office-31 dataset for multi-source unsupervised domain adaptation. The best result in each column is emphasized in bold. Our method achieves 84.2% average accuracy, reaching the state-of-the-art performance.

| Standards | Models | D | W | A | Avg |
|---|---|---|---|---|---|
| Source-only | Combined | 97.1 | 92.0 | 51.6 | 80.2 |
| Source-only | Single-best | 99.0 | 95.3 | 50.2 | 81.5 |
| Single-best DA | TCA (2011) | 95.2 | 93.2 | 51.6 | 80.0 |
| Single-best DA | GFK (2012) | 95.0 | 95.6 | 52.4 | 81.0 |
| Single-best DA | DDC (2015) | 98.5 | 95.0 | 52.2 | 81.9 |
| Single-best DA | DRCN (2016) | 99.0 | 96.4 | 56.0 | 83.8 |
| Single-best DA | RevGrad (2015) | 99.2 | 96.4 | 53.4 | 83.0 |
| Single-best DA | DAN (2015) | 99.0 | 96.0 | 54.0 | 83.0 |
| Single-best DA | RTN (2016) | **99.6** | 96.8 | 51.0 | 82.5 |
| Single-best DA | ADDA (2017) | 99.4 | 95.3 | 54.6 | 83.1 |
| Source-combined DA | RevGrad (2015) | 98.8 | 96.2 | 54.6 | 83.2 |
| Source-combined DA | DAN (2015) | 98.8 | 96.2 | 54.9 | 83.3 |
| Source-combined DA | ADDA (2017) | 99.2 | 96.0 | 55.9 | 83.7 |
| Multi-source DA | DCTN (2018) | **99.6** | 96.9 | 54.9 | 83.8 |
| Multi-source DA | MDAN (2018a) | 99.2 | 95.4 | 55.2 | 83.3 |
| Multi-source DA | MDDA (ours) | 99.2 | **97.1** | **56.2** | **84.2** |

Table 3: Ablation study of different weighting strategies in the proposed MDDA model on the Digits-five dataset for multi-source unsupervised domain adaptation.

| Weighting | mm | mt | up | sv | sy | Avg |
|---|---|---|---|---|---|---|
| Uniform | 74.3 | 95.8 | 93.7 | 64.2 | 79.3 | 81.5 |
| Ours | 78.6 | 98.8 | 93.9 | 79.3 | 89.7 | 88.1 |

Table 4: Ablation study of different weighting strategies in the proposed MDDA model on the Office-31 dataset for multi-source unsupervised domain adaptation.

| Weighting | D | W | A | Avg |
|---|---|---|---|---|
| Uniform | 98.4 | 95.2 | 55.7 | 83.1 |
| Ours | 99.2 | 97.1 | 56.2 | 84.2 |

### Interpretability and Ablation Study

**Feature Visualization.** To show the adaptation ability of the proposed MDDA model, we visualize the features before and after adversarial adaptation with the t-SNE embedding (Maaten and Hinton 2008) on the tasks mt→up and mm→sy. As illustrated in Figure 3, we have two observations: (1) the target features become more dense after adversarial adaptation; (2) the target domain fits the source domain more tightly after adversarial adaptation, which demonstrates that MDDA can align the distributions between the source and target domains.

Table 5: Ablation study of whether to distill the source classifiers in the proposed MDDA model on the Digits-five dataset for multi-source unsupervised domain adaptation.

| Distilling | mm | mt | up | sv | sy | Avg |
|---|---|---|---|---|---|---|
| w/o | 78.4 | 98.8 | 93.2 | 79.1 | 89.6 | 87.8 |
| w | 78.6 | 98.8 | 93.9 | 79.3 | 89.7 | 88.1 |

Table 6: Ablation study of whether to distill the source classifiers in the proposed MDDA model on the Office-31 dataset for multi-source unsupervised domain adaptation.

| Distilling | D | W | A | Avg |
|---|---|---|---|---|
| w/o | 99.2 | 96.0 | 55.8 | 83.7 |
| w | 99.2 | 97.1 | 56.2 | 84.2 |

Table 7: An example of detailed distilling results from each source to the target sy on the Digits-five dataset.
| Source | mt | mm | sv | up | Avg |
|---|---|---|---|---|---|
| w/o | 52.0 | 70.8 | 89.4 | 38.6 | 62.7 |
| w | 54.5 | 71.0 | 89.5 | 40.8 | 64.0 |

**Ablation Study.** The proposed MDDA model contains two major components: source distilling for fine-tuning the source classifiers and a novel weighting strategy for aggregating the target prediction. We conduct an ablation study to further verify their effectiveness by changing one component while fixing the other. We compare the proposed weighting strategy with a straightforward baseline: uniform weights. The results on the Digits-five and Office-31 datasets are shown in Table 3 and Table 4, respectively. From the results, we can observe that the proposed weighting strategy outperforms the uniform weights. This is reasonable because uniform weights do not reveal the importance of the different sources, which might have different similarities to the target. By considering the relative similarity of the different sources to the target based on the Wasserstein distance, the proposed MDDA achieves 6.6% and 1.1% improvements on the Digits-five and Office-31 datasets, respectively. These observations demonstrate the effectiveness of the proposed weighting strategy.

Table 5 and Table 6 show the comparison between fine-tuning and not fine-tuning the source classifiers with the distilled source samples on the Digits-five and Office-31 datasets, respectively. It is clear that without distilling, the adaptation performance drops in most cases. For example, we achieve 0.3% and 0.5% average accuracy improvements from source distilling on the Digits-five and Office-31 datasets. This confirms the validity of distilling the sources, since the selected source samples are more similar to the target ones and the fine-tuned classifier can enhance the transferability.

To better demonstrate the effectiveness of source distilling, we give an example of the Wasserstein-distance-based ADDA method before and after distilling on the Digits-five dataset, with sy set as the target domain and the others as source domains. As shown in Table 7, we find that the performance gains of source distilling vary across different sources. For the sources with larger domain discrepancies to the target, e.g. mt to sy and up to sy, source distilling yields a higher improvement (2.5% and 2.1%, respectively), while the improvement is less obvious for the sources with smaller discrepancies to the target, e.g. sv to sy (0.1%) and mm to sy (0.2%). This is reasonable because when a source domain is far away from the target, the distilled samples can pull the classifier closer to the target domain; if the source is already very similar to the target, the influence of the distilled samples is less pronounced.

**Model Interpretability.** To show the interpretability of our model, we use the heat maps generated by the Grad-CAM algorithm (Selvaraju et al. 2017) to visualize the attention before and after our proposed domain adaptation method.

Figure 4: Comparison of the attention maps before and after adversarial training on the Office-31 dataset. From left to right: (a) original image; (b) attention map before adversarial training; (c) image with attention map before adversarial training; (d) attention map after adversarial training; (e) image with attention map after adversarial training. Brighter regions indicate more attention. The comparison shows that the attention shifts to more discriminative regions of the image after adversarial training. Best viewed in color.
As illustrated in Figure 4, we observe that after domain adaptation, the attention generated by our model focuses better on the more discriminative regions, which indicates that our model pays more attention to the discriminative regions of the objects for classification even when the background or viewpoint changes. This observation verifies that our model learns features that are more invariant across domains while remaining discriminative for the desired learning task (i.e. image classification). For example, the ring binder in the first row shows that before adaptation, the model focuses on a region in the background instead of the central target object. After our domain adaptation, however, the model correctly focuses on the ring binder and is thus more discriminative for the classification. Similar observations can be found in the second and third rows. In the last row, we find that the attention is enhanced on the discriminative regions of the object (the laptop) after our domain adaptation.

## Conclusion

In this paper, we have proposed an effective multi-source domain adaptation approach, MDDA. The separately pre-trained feature extractor and classifier for each source domain sufficiently explore the discriminability of the labeled source data. The adversarial discriminative adaptation and the source distilling aim to match the target feature distribution to the source ones and to fine-tune the pre-trained classifiers. A novel weighting strategy is designed to jointly combine the predictions from the different source classifiers. Extensive experiments conducted on the Digits-five and Office-31 benchmarks demonstrate that MDDA achieves 3.3% and 0.4% performance improvements over the state-of-the-art multi-source domain adaptation approach (i.e. DCTN) for digit and object classification. In future studies, we plan to extend the MDDA model to more challenging vision tasks, such as scene segmentation. We also aim to investigate methods that combine generative and discriminative pipelines for multi-source domain adaptation.

## Acknowledgments

This work is supported by Berkeley DeepDrive and the National Natural Science Foundation of China (No. 61701273).

## References

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv:1701.07875.

Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning.

Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.

Chattopadhyay, R.; Sun, Q.; Fan, W.; Davidson, I.; Panchanathan, S.; and Ye, J. 2012. Multisource domain adaptation and its application to early detection of fatigue. ACM TKDD.

Chen, Y.-H.; Chen, W.-Y.; Chen, Y.-T.; Tsai, B.-C.; Frank Wang, Y.-C.; and Sun, M. 2017. No more discrimination: Cross city adaptation of road scene segmenters. In ICCV.

Ding, Z.; Shao, M.; and Fu, Y. 2018. Incomplete multisource transfer learning. IEEE TNNLS.

Duan, L.; Xu, D.; and Chang, S.-F. 2012. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In CVPR.

Duan, L.; Xu, D.; and Tsang, I. W.-H. 2012. Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE TNNLS.

Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. JMLR.

Ghifary, M.; Kleijn, W. B.; Zhang, M.; Balduzzi, D.; and Li, W. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV.

Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In CVPR.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.

Gopalan, R.; Li, R.; and Chellappa, R. 2014. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE TPAMI.

Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved training of Wasserstein GANs. In NIPS.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML.

Hoffman, J.; Mohri, M.; and Zhang, N. 2018. Algorithms and theory for multiple-source adaptation. In NeurIPS.

Huang, H.; Huang, Q.; and Krahenbuhl, P. 2018. Domain transfer through deep activation matching. In ECCV.

Hubert Tsai, Y.-H.; Yeh, Y.-R.; and Frank Wang, Y.-C. 2016. Learning cross-domain landmarks for heterogeneous domain adaptation. In CVPR.

Hull, J. J. 1994. A database for handwritten text recognition research. IEEE TPAMI.

Kang, G.; Jiang, L.; Yang, Y.; and Hauptmann, A. G. 2019. Contrastive adaptation network for unsupervised domain adaptation. In CVPR.

LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P.; et al. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.

Li, W.; Duan, L.; Xu, D.; and Tsang, I. W. 2014. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE TPAMI.

Li, Y.; Murias, M.; Major, S.; Dawson, G.; and Carlson, D. E. 2018. Extracting relationships by multi-domain matching. In NeurIPS.

Liu, M.-Y., and Tuzel, O. 2016. Coupled generative adversarial networks. In NIPS.

Long, M.; Cao, Y.; Wang, J.; and Jordan, M. 2015. Learning transferable features with deep adaptation networks. In ICML.

Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2016. Unsupervised domain adaptation with residual transfer networks. In NIPS.

Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. JMLR.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshops.

Ni, J.; Zhang, S.; and Xie, H. 2019. Dual adversarial semantics-consistent network for generalized zero-shot learning. In NeurIPS.

Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE TKDE.

Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE TNN.

Panareda Busto, P., and Gall, J. 2017. Open set domain adaptation. In ICCV.

Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; and Wang, B. 2019. Moment matching for multi-source domain adaptation. In ICCV.

Redko, I.; Courty, N.; Flamary, R.; and Tuia, D. 2019. Optimal transport for multi-source domain adaptation under target shift. In AISTATS.

Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In ECCV.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV.
Shen, J.; Qu, Y.; Zhang, W.; and Yu, Y. 2017. Wasserstein distance guided representation learning for domain adaptation. arXiv:1707.01217.

Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; and Webb, R. 2017. Learning from simulated and unsupervised images through adversarial training. In CVPR.

Sun, S.-L., and Shi, H.-L. 2013. Bayesian multi-source domain adaptation. In ICMLC.

Sun, Q.; Chattopadhyay, R.; Panchanathan, S.; and Ye, J. 2011. A two-stage weighting framework for multi-source domain adaptation. In NIPS.

Sun, B.; Feng, J.; and Saenko, K. 2016. Return of frustratingly easy domain adaptation. In AAAI.

Sun, B.; Feng, J.; and Saenko, K. 2017. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications.

Sun, S.; Shi, H.; and Wu, Y. 2015. A survey of multi-source domain adaptation. Information Fusion.

Torralba, A., and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR.

Tsai, Y.-H.; Hung, W.-C.; Schulter, S.; Sohn, K.; Yang, M.-H.; and Chandraker, M. 2018. Learning to adapt structured output space for semantic segmentation. In CVPR.

Tzeng, E.; Hoffman, J.; Darrell, T.; and Saenko, K. 2015. Simultaneous deep transfer across domains and tasks. In ICCV.

Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR.

Xu, Z., and Sun, S. 2012. Multi-source transfer learning with multi-view adaboost. In ICONIP.

Xu, R.; Chen, Z.; Zuo, W.; Yan, J.; and Lin, L. 2018. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In CVPR.

Zhao, H.; Zhang, S.; Wu, G.; Moura, J. M.; Costeira, J. P.; and Gordon, G. J. 2018a. Adversarial multiple source domain adaptation. In NeurIPS.

Zhao, S.; Zhao, X.; Ding, G.; and Keutzer, K. 2018b. EmotionGAN: Unsupervised domain adaptation for learning discrete probability distributions of image emotions. In ACM MM.

Zhao, S.; Li, B.; Yue, X.; Gu, Y.; Xu, P.; Hu, R.; Chai, H.; and Keutzer, K. 2019a. Multi-source domain adaptation for semantic segmentation. In NeurIPS.

Zhao, S.; Lin, C.; Xu, P.; Zhao, S.; Guo, Y.; Krishna, R.; Ding, G.; and Keutzer, K. 2019b. CycleEmotionGAN: Emotional semantic consistency preserved CycleGAN for adapting image emotions. In AAAI.

Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

Zhuo, J.; Wang, S.; Zhang, W.; and Huang, Q. 2017. Deep unsupervised convolutional domain adaptation. In ACM MM.