# adversarial_domain_adaptation_with_domain_mixup__707ed677.pdf

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Adversarial Domain Adaptation with Domain Mixup

Minghao Xu,1 Jian Zhang,1 Bingbing Ni,1,2 Teng Li,3 Chengjie Wang,4 Qi Tian,5 Wenjun Zhang1
1Shanghai Jiao Tong University, China, 2Huawei HiSilicon, 3Anhui University, 4Youtu Lab, Tencent, 5Huawei Noah's Ark Lab
{xuminghao118, stevenash0822, nibingbing, zhangwenjun}@sjtu.edu.cn, nibingbing@hisilicon.com, tenglwy@gmail.com, jasoncjwang@tencent.com, tian.qi1@huawei.com
*The corresponding author is Bingbing Ni.

## Abstract

Recent works on domain adaptation reveal the effectiveness of adversarial learning in bridging the discrepancy between source and target domains. However, two common limitations exist in current adversarial-learning-based methods. First, samples from the two domains alone are not sufficient to ensure domain-invariance over most of the latent space. Second, the domain discriminator involved in these methods can only judge real or fake with the guidance of hard labels, while it is more reasonable to use soft scores to evaluate the generated images or features, i.e., to fully utilize the inter-domain information. In this paper, we present adversarial domain adaptation with domain mixup (DM-ADA), which guarantees domain-invariance in a more continuous latent space and guides the domain discriminator in judging samples' difference relative to the source and target domains. Domain mixup is jointly conducted on pixel and feature level to improve the robustness of models. Extensive experiments prove that the proposed approach can achieve superior performance on tasks with various degrees of domain shift and data complexity.

## Introduction

In recent years, with the advent of Convolutional Neural Networks (CNNs), many classification challenges have been tackled with extremely high accuracy. Powerful CNN architectures such as AlexNet (Krizhevsky, Sutskever, and Hinton 2012) and ResNet (He et al. 2016) are capable of efficiently extracting low-level and high-level features with the guidance of labeled data. However, because of domain shift, models trained on a specific domain suffer degraded performance when transferred to another domain. This problem is of vital significance when labeled data are unavailable on the target domain. Thus, how to use unlabeled data from the target domain to bridge the domain discrepancy is the central issue in the field of domain adaptation (Pan and Yang 2010).

Beginning with the work on gradient reversal (Ganin and Lempitsky 2015), a group of domain adaptation methods based on adversarial learning has been proposed. In this subfield, a domain discriminator is introduced to judge the domain attribute at the feature or image level. In order to fool this domain discriminator, the extracted features should be domain-invariant, which is the basic motivation of adversarial domain adaptation.

Figure 1: The mixup samples are obtained by linearly interpolating between source and target domain images with mixup ratio λ. The domain discriminator learns to output soft scores for generated images with the guidance of mixup images and soft labels (mixup ratios). Also, a flexible margin is learned by comparing mixup samples' difference relative to the two domains. (Best viewed in color.)
However, just like most variants of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), the domain discriminator is only guided by hard label information and rarely explores the intrinsic structure of the data distribution. Namely, each data point that shifts from both domains should be judged by a latent soft label, i.e., a probability value, instead of a hard assignment of 1 or 0. In addition, the distribution of domain-invariant latent vectors is fitted using limited patterns of source and target features, i.e., the interaction of features from the two domains has not been considered as a way to enrich feature patterns.

In this paper, inspired by VAE-GAN (Larsen et al. 2016), we develop a generative-adversarial framework to simultaneously train the classification network and generate auxiliary source-like images from learned embeddings of both domains. On the basis of this framework, domain mixup on pixel and feature level is proposed to alleviate the two drawbacks above. (1) As shown in Figure 1, we instruct the domain discriminator to explore the intrinsic structure of the source and target distributions through a triplet loss with a flexible margin, and through domain classification with mixed images and corresponding soft labels (i.e., mixup ratios), which provides abundant intermediate states between the two separate domains. (2) In order to expand the searched range of the latent space, linear interpolations of source and target features are exploited. Since the subsequent nonlinear network can easily ruin the linearly mixed information, the extracted features of mixed images are not used for augmentation directly. This operation leads to a more continuous domain-invariant latent distribution, which benefits performance on the target domain when oscillations of the data distribution occur in the test phase.

We evaluate the image recognition performance of our approach on three benchmarks with different extents of domain shift. Experiments prove the effectiveness of our approach, and we achieve state-of-the-art results in most settings. The contributions of our work are summarized as follows:

- We design an adversarial training framework which maps both domains to a common latent distribution and efficiently transfers knowledge learned on the supervised domain to its unsupervised counterpart.
- Domain mixup on pixel and feature level, accompanied with well-designed soft domain labels, is proposed to improve the generalization ability of models. It promotes the generalization ability of the feature extractor and obtains a domain discriminator that judges samples' difference relative to the two domains with refined scores.
- We extensively evaluate our approach under different settings, and it achieves superior results even when the domain shift is high and the data distribution is complex.

## Related Work

Domain adaptation is a frequently used technique to promote the generalization ability of models trained on a single domain in many computer vision tasks. In this section, we describe existing domain adaptation methods and compare our approach with them.

The transferability of deep neural networks is demonstrated in (Yosinski et al. 2014), and deep learning methods for domain adaptation fall into several categories. Maximum Mean Discrepancy (MMD) (Gretton et al. 2012; Tzeng et al. 2014) is a way to measure the similarity of two distributions. Weighted Domain Adaptation Network (WDAN) (Yan et al. 2017) defines a weighted MMD with class-conditional distributions on both domains.
The multiple kernel version of MMD (MK-MMD) is explored in (Long et al. 2015) to define the distance between two distributions. In addition, specific deep architectures restrict the domain-invariance of the top layers by aligning second-order statistics (Sun and Saenko 2016).

Adversarial training (Ajakan et al. 2013; Ganin et al. 2016) is another way to transfer domain information. RevGrad (Ganin and Lempitsky 2015) designs a two-branch architecture for object classification and domain classification respectively. Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al. 2017) trains separate feature extractors for the source and target domains and produces embeddings that fool the discriminator. Other works improve performance on the target domain by capturing complex multimode structures (Pei et al. 2018; Long et al. 2018), exploring task-specific decision boundaries (Saito et al. 2018b; 2018a; Tran et al. 2019), aligning attention regions (Kang et al. 2018), and applying structure-aware alignment (Ma, Zhang, and Xu 2019). In addition, the label information of the target domain is explored in recent works (Zhang et al. 2018b; Xie et al. 2018; Ma, Zhang, and Xu 2019).

Another group of methods performs adaptation by applying an adversarial loss at the pixel level. Source domain images are adapted as if they were drawn from the target domain using generative adversarial networks in (Bousmalis et al. 2017; Liu and Tuzel 2016), and the generated samples expand the training set. Furthermore, image generation and training of the task-specific classifier are accomplished simultaneously in (Ghifary et al. 2016; Sankaranarayanan et al. 2018). Cycle-consistency is also considered in (Hoffman et al. 2018) to enforce the consistency of relevant semantics.

**Comparison with existing GAN-based approaches.** Although former works use GANs as a means of data augmentation (Bousmalis et al. 2017; Liu and Tuzel 2016) or of producing domain-adaptive gradient information (Ghifary et al. 2016; Sankaranarayanan et al. 2018), they may be trapped by the mismatch between generated data and the assigned hard labels. We further explore domain mixup on pixel and feature level to enhance the robustness of adaptation models. On one hand, pixel-level mixup prompts the domain discriminator to excavate the intrinsic structure of the source and target distributions. On the other hand, feature-level mixup facilitates a more continuous feature distribution in the latent space with low domain shift.

## Adversarial Domain Adaptation with Domain Mixup

In unsupervised domain adaptation, a source domain dataset $(X_s, Y_s) = \{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ with $n_s$ labeled samples and a target domain dataset $X_t = \{x^t_j\}_{j=1}^{n_t}$ with $n_t$ unlabeled samples are available. It is assumed that source samples $x^s$ obey the source distribution $P_s$ and target samples $x^t$ obey the target distribution $P_t$. In addition, both domains share the same label space $Y = \{1, 2, \dots, K\}$, where $K$ is the number of classes.

### Framework of DM-ADA

In this work, a variant of VAE-GAN (Larsen et al. 2016) is applied to the domain adaptation task. Figure 2 presents an overview of the whole framework. There are three kinds of input: source domain images, target domain images, and mixup images obtained by pixel-wise interpolation of source and target images.
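Before the components are described in detail below, the encoder's role can be made concrete with a minimal PyTorch sketch: a VAE-style head that outputs a $(\mu, \sigma)$ embedding per sample, together with the KL regularizer formalized later in Eqs. (5)-(6). The backbone, dimensions, and log-variance parameterization here are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class EncoderHead(nn.Module):
    """VAE-style head: maps backbone features to (mu, sigma).

    A sketch only; the paper's encoder N_e uses task-specific backbones
    (train-from-scratch for digits, AlexNet / ResNet-101 otherwise),
    which are abstracted away here.
    """
    def __init__(self, feat_dim: int, latent_dim: int = 256):
        super().__init__()
        self.fc_mu = nn.Linear(feat_dim, latent_dim)      # mean vector
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)  # log-variance, for numerical stability

    def forward(self, feats: torch.Tensor):
        mu = self.fc_mu(feats)
        sigma = torch.exp(0.5 * self.fc_logvar(feats))    # standard deviation
        return mu, sigma

def kl_to_standard_normal(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), cf. Eqs. (5)-(6)."""
    return 0.5 * (mu.pow(2) + sigma.pow(2) - 2.0 * torch.log(sigma) - 1.0).sum(dim=1).mean()
```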
Just as in a conventional variational autoencoder (Kingma and Welling 2013), an encoder $N_e$ maps inputs from the source and target domains to the standard Gaussian distribution $\mathcal{N}(0, I)$. For every sample, a mean vector $\mu$ and a standard deviation vector $\sigma$ serve as the feature embedding. On the feature level, the feature embeddings of the two domains are also linearly mixed to produce mixup features $(\mu^m, \sigma^m)$. After that, the framework is split into two branches. On one branch, the embedding of the source domain is used for K-way object classification by the classifier $C$. On the other branch, the source and target domains are aligned on the category level by enforcing the decoded images to be source-like and to preserve the class information of the inputs. Details are stated in the following parts.

Figure 2: Illustration of the pipeline in the training phase. First, mixup inputs $x^m$ are obtained by mixing the two domains' inputs $x^s$ and $x^t$. Encoder $N_e$ maps source and target inputs to $(\mu^s, \sigma^s)$ and $(\mu^t, \sigma^t)$ respectively. In the latent space, the feature embeddings of the two domains are mixed to produce mixup features $(\mu^m, \sigma^m)$. After that, the framework is split into two branches. On one branch, classifier $C$ performs K-way object classification. On the other branch, latent codes are decoded by $N_d$, and the min-max game between $N_d$ and $D$ facilitates domain-invariance on the category level. (Best viewed in color.)

**Domain mixup on two levels.** To explore the internal structure of data from the two domains, source domain images $x^s$ and target domain images $x^t$ are linearly interpolated (Zhang et al. 2018a) to produce mixup images $x^m$ and corresponding soft domain labels $l^m_{dom}$ as follows:

$$x^m = \lambda x^s + (1 - \lambda)\, x^t, \quad (1)$$
$$l^m_{dom} = \lambda\, l^s_{dom} + (1 - \lambda)\, l^t_{dom} = \lambda, \quad (2)$$

where $\lambda \in [0, 1]$ is the mixup ratio and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, in which $\alpha$ is constantly set to 2.0 in all experiments. $l^s_{dom}$ and $l^t_{dom}$ denote the domain labels of source and target data, which are manually set to 1 and 0. Inputs of the source and target domains are then embedded to $(\mu^s, \sigma^s)$ and $(\mu^t, \sigma^t)$ in the latent space by a shared encoder $N_e$. In order to yield a more continuous domain-invariant latent distribution, the two domains' embeddings are linearly mixed to produce the mixup feature embedding $(\mu^m, \sigma^m)$:

$$\mu^m = \lambda \mu^s + (1 - \lambda)\, \mu^t, \quad (3)$$
$$\sigma^m = \lambda \sigma^s + (1 - \lambda)\, \sigma^t, \quad (4)$$

where $\lambda$ equals the one used in pixel-level mixup.

**Restricting the encoder with a prior.** Just as in a conventional VAE (Kingma and Welling 2013), the encoder $N_e$ is regularized by a standard Gaussian prior over the latent distribution. The objective is to narrow the Kullback-Leibler divergence between posterior and prior:

$$\min_{N_e} \mathcal{L}_{KL}, \quad (5)$$
$$\mathcal{L}_{KL} = D_{KL}\big(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\big), \quad (6)$$

where $\mu$ and $\sigma$ are the encoded mean and standard deviation of source and target images.

**Supervised training of the classifier.** The classifier $C$ is optimized with the cross-entropy loss defined on the source domain:

$$\min_{N_e, C} \mathcal{L}_C, \quad (7)$$
$$\mathcal{L}_C = -\mathbb{E}_{x^s \sim P_s} \Big[ \sum_{i=1}^{K} y^s_i \log C([\mu^s, \sigma^s]) \Big], \quad (8)$$

where $[\cdot]$ denotes concatenation. It is worth noting that $C$ cannot be replaced by the object classification branch of the discriminator $D$, since the adapted features are passed directly only to $C$, which enhances $C$'s performance on the target domain.
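The two mixup operations of Eqs. (1)-(4) reduce to a few lines. Below is a minimal sketch (the helper name `domain_mixup` is ours); it assumes equal-sized source and target batches and, as in Algorithm 1 later, draws a single λ per iteration that is shared between pixel and feature level.

```python
import torch
from torch.distributions import Beta

def domain_mixup(x_s, x_t, mu_s, sigma_s, mu_t, sigma_t, alpha: float = 2.0):
    """Pixel- and feature-level domain mixup, cf. Eqs. (1)-(4).

    One lambda is drawn from Beta(alpha, alpha) per call and shared
    between pixel space and latent space, as in the paper.
    """
    lam = Beta(alpha, alpha).sample().item()
    x_m = lam * x_s + (1.0 - lam) * x_t                 # Eq. (1): mixup images
    l_dom_m = lam                                       # Eq. (2): soft domain label
    mu_m = lam * mu_s + (1.0 - lam) * mu_t              # Eq. (3): mixup mean
    sigma_m = lam * sigma_s + (1.0 - lam) * sigma_t     # Eq. (4): mixup std
    return x_m, l_dom_m, mu_m, sigma_m
```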
**Decoding latent codes.** Before the generation phase, we first define a one-hot object class label $l_{cls}$ and a one-dimensional uncertainty compensation $l_{comp}$ for both domains and for mixup features as follows:

$$l^s_{cls} = [0, \dots, 1, \dots, 0],\; l^s_{comp} = 0; \quad l^t_{cls} = [0, \dots, 0, \dots, 0],\; l^t_{comp} = 1; \quad l^m_{cls} = [0, \dots, \lambda, \dots, 0],\; l^m_{comp} = 1 - \lambda, \quad (9)$$

where 1 and $\lambda$ are placed at the $y^s$-th position of $l^s_{cls}$ and $l^m_{cls}$ to indicate the known class label of the respective features. For all features derived from the target domain or the mixup procedure, since the class labels remain uncertain, $l_{comp}$ is set as a compensation that normalizes the sum of the vectors $l_{cls}$ and $l_{comp}$ to 1. After that, the decoder $N_d$ produces the auxiliary generated images $x^g$:

$$x^g = N_d([\mu, \sigma, z, l_{cls}, l_{comp}]), \quad (10)$$

where $z$ is a noise vector randomly sampled from the standard Gaussian distribution.

**Adversarial domain alignment.** Compared with previous adversarial-learning-based methods (Ganin and Lempitsky 2015; Tzeng et al. 2017; Pei et al. 2018), we constrain domain-invariance not only on the source and target domains, but also on the intermediate representations between the two domains. The min-max optimization objectives on the different domains are defined as follows:

$$\min_{N_e, N_d} \max_{D} \; \mathcal{L}^s_{adv} + \mathcal{L}^t_{adv} + \mathcal{L}^m_{adv}, \quad (11)$$
$$\mathcal{L}^s_{adv} = \mathbb{E}_{x^s \sim P_s} \big[ \log D_{dom}(x^s) + \log\big(1 - D_{dom}(x^s_g)\big) \big], \quad (12)$$
$$\mathcal{L}^t_{adv} = \mathbb{E}_{x^t \sim P_t} \big[ \log\big(1 - D_{dom}(x^t_g)\big) \big], \quad (13)$$
$$\mathcal{L}^m_{adv} = \mathbb{E}_{x^s \sim P_s,\, x^t \sim P_t} \big[ \log\big(1 - D_{dom}(x^m_g)\big) \big], \quad (14)$$

where $D_{dom}$ is the domain classification branch of $D$. During training, the mixup features are mapped to somewhere in-between the source and target domains on the pixel level, so it is more proper to assign them scores between 0 and 1. The domain classification loss $\mathcal{L}^m_{soft}$ is utilized to guide the domain discriminator to output such soft scores:

$$\min_{D} \mathcal{L}^m_{soft}, \quad (15)$$
$$\mathcal{L}^m_{soft} = -\mathbb{E}_{x^s \sim P_s,\, x^t \sim P_t} \big[ l^m_{dom} \log D_{dom}(x^m) + (1 - l^m_{dom}) \log\big(1 - D_{dom}(x^m)\big) \big]. \quad (16)$$

We further introduce a triplet loss $\mathcal{L}^m_{tri}$ to constrain the mixup samples' distance to the source and target domains, which makes the domain discriminator easier to converge:

$$\min_{D} \mathcal{L}^m_{tri}, \quad (17)$$
$$\mathcal{L}^m_{tri} = \mathbb{E}_{x^s \sim P_s,\, x^t \sim P_t} \big[ \|f_D(a) - f_D(p)\|_2^2 - \|f_D(a) - f_D(n)\|_2^2 + f_{tri}(\lambda) \big]_+, \quad (18)$$

where $f_D$ is the feature extractor of $D$ and $[\cdot]_+ = \max(0, \cdot)$ denotes the hinge function; $(a, p, n) = (x^m, x^s, x^t)$ when $\lambda \geq 0.5$, and $(a, p, n) = (x^m, x^t, x^s)$ otherwise. Considering that samples with a larger source or target component should differ more from the counterpart domain, a flexible margin $f_{tri}(\lambda) = |2\lambda - 1|$ is used.

**Category-level domain alignment.** In order to ensure that features of identical categories from the two domains are mapped nearby in the latent space, classification losses $\mathcal{L}^s_{cls}$ and $\mathcal{L}^t_{cls}$ are introduced to enforce class-consistency between the decoded images and the inputs:

$$\min_{N_e, N_d, D} \; \mathcal{L}^s_{cls} + \mathcal{L}^t_{cls}, \quad (19)$$
$$\mathcal{L}^s_{cls} = -\mathbb{E}_{x^s \sim P_s} \Big[ \sum_{i=1}^{K} y^s_i \log D_{cls}(x^s_g) \Big], \quad (20)$$
$$\mathcal{L}^t_{cls} = -\mathbb{E}_{x^t \sim P_t} \Big[ \sum_{i=1}^{K} \hat{y}^t_i \log D_{cls}(x^t_g) \Big], \quad (21)$$

where $D_{cls}$ is the object classification branch of $D$, and $\hat{y}^t$ is the pseudo label estimated by classifier $C$. To eliminate falsely labeled samples that harm domain adaptation, we filter out samples whose classification confidence is below a threshold $\tau$. Considering that the domain discrepancy is gradually reduced along training, $\tau$ is adaptively adjusted following the strategy in (Zhang et al. 2018b).
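The two discriminator-side losses above are compact in code. The sketch below (with assumed tensor shapes; the `f_D` outputs are taken as 2-D feature matrices, and the helper names are ours) implements the soft-label domain loss of Eq. (16) and the flexible-margin triplet loss of Eqs. (17)-(18).

```python
import torch
import torch.nn.functional as F

def soft_domain_loss(d_out_m: torch.Tensor, lam: float) -> torch.Tensor:
    """Soft-label domain classification on mixup images, cf. Eq. (16).

    d_out_m: D_dom(x^m), probabilities in (0, 1);
    lam: the mixup ratio used to build x^m (the soft target).
    """
    target = torch.full_like(d_out_m, lam)
    return F.binary_cross_entropy(d_out_m, target)

def flexible_triplet_loss(f_m, f_s, f_t, lam: float) -> torch.Tensor:
    """Triplet loss with flexible margin |2*lambda - 1|, cf. Eqs. (17)-(18).

    f_m, f_s, f_t: features from D's extractor f_D for the mixup, source,
    and target batches. The domain with the larger mixup share is the positive.
    """
    margin = abs(2.0 * lam - 1.0)
    anchor, pos, neg = (f_m, f_s, f_t) if lam >= 0.5 else (f_m, f_t, f_s)
    d_ap = (anchor - pos).pow(2).sum(dim=1)   # squared distance to positive
    d_an = (anchor - neg).pow(2).sum(dim=1)   # squared distance to negative
    return F.relu(d_ap - d_an + margin).mean()  # hinge [.]_+
```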
### Training Procedure

The proposed iterative training procedure is summarized in Algorithm 1. In each iteration, the input source and target samples are first mixed on the pixel level to instruct the domain discriminator to output soft labels. After the samples of the two domains are mapped to the latent space, their embeddings are mixed to produce mixup features. The images generated from these feature embeddings are constrained to be source-like and to preserve the inputs' class information, so that the latent distribution is encouraged to be domain-invariant and discriminative.

Algorithm 1: Training procedure of DM-ADA
Input: source domain $X_s$ and $Y_s$, target domain $X_t$, and the number of iterations $N$.
Output: configuration of DM-ADA.
  Initialize $\omega$ and $\phi$; initialize $\theta_{N_e}$, $\theta_{N_d}$, $\theta_C$ and $\theta_D$
  for $n = 1$ to $N$ do
    $(x^s, y^s) \leftarrow \mathrm{RandomSample}(X_s, Y_s)$
    $x^t \leftarrow \mathrm{RandomSample}(X_t)$
    $\lambda \leftarrow \mathrm{RandomSample}(\mathrm{Beta}(\alpha, \alpha))$
    $(x^m, l^m_{dom}) \leftarrow$ Eqs. (1), (2)  # get mixup images
    $(\mu, \sigma) \leftarrow N_e(x)$  # get feature embeddings
    $(\mu^m, \sigma^m) \leftarrow$ Eqs. (3), (4)  # get mixup features
    $x^g \leftarrow N_d([\mu, \sigma, z, l_{cls}, l_{comp}])$  # generate images, Eq. (10)
    Optimize the four subnetworks $D$, $N_d$, $C$ and $N_e$ in turn:
      $\theta_D \xleftarrow{+} \nabla_{\theta_D}\big( \mathcal{L}^s_{cls} + \omega(\mathcal{L}^m_{tri} + \mathcal{L}^m_{soft}) + \phi(\mathcal{L}^s_{adv} + \mathcal{L}^t_{adv} + \mathcal{L}^m_{adv}) \big)$
      $\theta_{N_d} \xleftarrow{+} \nabla_{\theta_{N_d}}\big( \mathcal{L}^s_{cls} - \phi\, \mathcal{L}^s_{adv} \big)$
      $\theta_C \xleftarrow{+} \nabla_{\theta_C}\big( \mathcal{L}_C \big)$
      $\theta_{N_e} \xleftarrow{+} \nabla_{\theta_{N_e}}\big( \mathcal{L}_C + \mathcal{L}^s_{cls} + \mathcal{L}^t_{cls} + \omega \mathcal{L}_{KL} - \phi(\mathcal{L}^t_{adv} + \mathcal{L}^m_{adv}) \big)$
  end for

In all experiments, we set $\alpha$ to 2.0, since domain mixup cannot effectively explore the linear space between the two domains when $\alpha$ is small; more analysis of $\alpha$ can be found in the supplementary material. $\omega$ and $\phi$ are hyper-parameters that trade off among losses with different orders of magnitude. According to the sensitivity analysis below, the adaptation performance of our approach is not very sensitive to the values of $\omega$ and $\phi$, and these hyper-parameters share the same value across different tasks.
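To connect Algorithm 1 with the formulas, the following sketch condenses the forward computations of one iteration under assumed module interfaces ($N_e$ returns a $(\mu, \sigma)$ pair; $N_d$ and $C$ consume concatenated vectors; the helper names are ours). The four alternating parameter updates and the confidence filtering with $\tau$ are omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def class_label_with_comp(y: torch.Tensor, num_classes: int, weight: float) -> torch.Tensor:
    """Build [l_cls, l_comp] of Eq. (9): the class entry is scaled by `weight`
    (1 for source, lambda for mixup, 0 for target) and l_comp makes the
    concatenated vector sum to 1."""
    l_cls = weight * F.one_hot(y, num_classes).float()
    l_comp = torch.full((y.size(0), 1), 1.0 - weight, device=y.device)
    return torch.cat([l_cls, l_comp], dim=1)

def dmada_forward(Ne, Nd, C, x_s, y_s, x_t, num_classes: int, alpha: float = 2.0):
    """Forward pass of one DM-ADA iteration, condensing Algorithm 1.
    Assumes equal-sized source and target batches."""
    lam = Beta(alpha, alpha).sample().item()
    x_m = lam * x_s + (1.0 - lam) * x_t                      # Eq. (1); soft label = lam

    mu_s, sg_s = Ne(x_s)
    mu_t, sg_t = Ne(x_t)
    mu_m = lam * mu_s + (1.0 - lam) * mu_t                   # Eq. (3)
    sg_m = lam * sg_s + (1.0 - lam) * sg_t                   # Eq. (4)

    z = torch.randn_like(mu_s)                               # noise vector of Eq. (10)
    y_zero = torch.zeros_like(y_s)                           # target class is unknown
    lbl_s = class_label_with_comp(y_s, num_classes, 1.0)     # l_comp = 0
    lbl_t = class_label_with_comp(y_zero, num_classes, 0.0)  # all-zero l_cls, l_comp = 1
    lbl_m = class_label_with_comp(y_s, num_classes, lam)     # lam at y_s-th entry, l_comp = 1 - lam

    x_g_s = Nd(torch.cat([mu_s, sg_s, z, lbl_s], dim=1))     # Eq. (10)
    x_g_t = Nd(torch.cat([mu_t, sg_t, z, lbl_t], dim=1))
    x_g_m = Nd(torch.cat([mu_m, sg_m, z, lbl_m], dim=1))

    logits_s = C(torch.cat([mu_s, sg_s], dim=1))             # supervised branch, Eq. (8)
    return lam, x_m, (x_g_s, x_g_t, x_g_m), logits_s
```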
Figure 3: Comparison between with and without domain continuity. Circle & triangle: the two classes; blue & orange: source and target domains; p: a test sample. (Best viewed in color.)

**Pixel-level domain mixup.** The work of (Zhang et al. 2018a) proposes the mixup vicinal distribution as a manner of encouraging the model to behave linearly in-between training examples. Another work (Berthelot et al. 2019) improves the continuity of interpolations in the latent space, benefiting downstream tasks. In adversarial domain adaptation, we likewise lead the domain discriminator to behave linearly between the source and target domains. As a result, the domain discriminator acquires the capacity to accurately judge generated images containing oscillations toward the two domains. In our implementation, such a discriminator is trained with pairs of a linearly mixed image $x^m = \lambda x^s + (1 - \lambda) x^t$ and the corresponding soft label $l^m_{dom} = \lambda$, where $x^m$ simulates an oscillation mode between the two domains and $\lambda$ provides the guidance. Combined with feature-level mixup, pixel-level mixup can further narrow the domain discrepancy, as shown in the ablation study below.

**Feature-level domain mixup.** Existing works attempt to map the source and target domains to a common latent distribution, while limited data cannot guarantee that most of the latent space is domain-invariant. In order to yield a more continuous domain-invariant latent distribution, the mixup features of the two domains are exploited. We use an intuitive example to illustrate the effectiveness of domain continuity for aligning the source and target domains. As shown in Figure 3, a biased test sample p may be misclassified without the constraint of domain continuity. However, by adding the mixup feature embeddings to the training process, the latent codes between the same class of the two domains should also be domain-invariant, which forms the intra-class clusters C1 and C2. Thus the decision boundary is refined, and the biased samples in these clusters can be classified correctly.

## Experiments

In this section, we first introduce the experimental setup. Then, the classification performance on three domain adaptation benchmarks is presented. Finally, an ablation study and a sensitivity analysis are conducted for the proposed approach.

### Experimental Setup

In this part, we describe the network architectures and hyper-parameters for the different tasks. Our approach is implemented with the PyTorch deep learning framework (Paszke et al. 2017).

**Digits experiments.** We construct four subnetworks with train-from-scratch architectures following (Sankaranarayanan et al. 2018). Four Adam optimizers with base learning rate 0.0004 are used to optimize these submodels for 100 epochs. The hyper-parameters $\omega$ and $\phi$ are set to 0.1 and 0.01 respectively, and their values are kept constant in all experiments. All input images of the encoder and discriminator are resized to 32×32.

**Office experiments.** For the encoder, the last layer of AlexNet (Krizhevsky, Sutskever, and Hinton 2012) is replaced with two parallel fully connected layers producing 256-dimensional vectors, and the former layers are initialized with the model pretrained on ImageNet (Russakovsky et al. 2015). The encoder is fine-tuned with base learning rate 0.0001 for 100 epochs, and the base learning rate of the other three submodels is set to 0.001. The inputs of the encoder and discriminator are resized to 227×227 and 64×64 respectively.

**VisDA experiments.** ResNet-101 (He et al. 2016) serves as the base architecture and is initialized with the model pretrained on ImageNet (Russakovsky et al. 2015). The learning rate setting is the same as in the Office experiments, and the results are reported after 20 epochs of training. The inputs of the encoder and discriminator are resized to 224×224 and 64×64 respectively.
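As a concrete example of the digits configuration above (one Adam optimizer per subnetwork, base learning rate 0.0004), the setup might look as follows; the helper name and the default Adam settings are assumptions, not the authors' released code.

```python
import torch

def build_optimizers(Ne, Nd, C, D, lr: float = 4e-4):
    """One Adam optimizer per subnetwork, as in the digits experiments.
    The Office/VisDA settings instead fine-tune the encoder at 1e-4
    and train the other three submodels at 1e-3."""
    return {name: torch.optim.Adam(m.parameters(), lr=lr)
            for name, m in (("Ne", Ne), ("Nd", Nd), ("C", C), ("D", D))}
```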
### Classification on Digits Datasets

**Dataset.** In this set of experiments, three digits datasets are used: MNIST (LeCun et al. 1998), USPS (Hull 1994), and Street View House Numbers (SVHN) (Netzer et al. 2011). Each dataset contains ten classes corresponding to the digits 0 to 9. Four settings are used for evaluation: MN→US (p): sampling 2000 images from MNIST and 1800 images from USPS; MN→US (f) and US→MN: using the full training sets of MNIST and USPS; SV→MN: using the full training sets of SVHN and MNIST.

**Results.** Table 1 presents the results of our approach in comparison with other adaptation approaches on the digits datasets. For the source-only test, we use the same encoder and classifier architectures as in our approach. The reported results are averaged over five independent runs with random initialization. Our approach achieves state-of-the-art performance on all four settings. In particular, it outperforms former GAN-based approaches (Bousmalis et al. 2017; Ghifary et al. 2016; Sankaranarayanan et al. 2018), which illustrates the effectiveness of the proposed architecture in aligning the source and target domains.

Table 1: Classification accuracy (mean±std %) on the target domain over five independent runs on the digits datasets. The best performance is indicated in bold and the second best in italics.

| Method | MN→US (p) | MN→US (f) | US→MN | SV→MN |
|---|---|---|---|---|
| Source only | 76.0±1.8 | 79.3±0.7 | 59.5±1.9 | 62.1±1.2 |
| MMD (2015) | 81.1±0.3 | – | – | 71.1±0.5 |
| RevGrad (2015) | 77.1±1.8 | 85.1±0.8 | 73.0±2.0 | 73.9±1.2 |
| CoGAN (2016) | 91.2±0.8 | – | 89.1±1.0 | – |
| DRCN (2016) | 91.8±0.1 | – | 73.7±0.0 | 82.0±0.1 |
| ADDA (2017) | 89.4±0.2 | – | 90.1±0.8 | 76.0±1.8 |
| PixelDA (2017) | – | 95.9±0.7 | – | – |
| MSTN (2018) | 92.9±1.1 | – | – | 91.7±1.5 |
| GTA (2018) | 92.8±0.9 | 95.3±0.7 | 90.8±1.3 | 92.4±0.9 |
| ADR (2018a) | *93.2±2.5* | *96.1±0.3* | *93.1±1.3* | *95.0±1.9* |
| DM-ADA (ours) | **94.8±0.7** | **96.7±0.5** | **94.2±0.9** | **95.5±1.1** |

### Classification on Office-31

**Dataset.** Office-31 (Saenko et al. 2010) is a standard domain adaptation benchmark commonly used in previous research. Three distinct domains, Amazon (A), Webcam (W), and DSLR (D), make up the Office-31 dataset. Each domain contains the same 31 classes of office supplies. All transfer tasks between the three domains are used for evaluation.

**Results.** Table 2 reports the performance of our method compared with other works. The results of AlexNet trained with only source domain data serve as the lower bound. Our approach obtains the best performance in three of the four hard cases: A→W, W→A, and A→D. For the two easier cases, W→D and D→W, our approach achieves accuracies higher than 99.5% and ranks in the first two places. Given that the number of samples per class is limited in the Office-31 dataset, our approach manages to improve performance by providing augmented samples and features.

Table 2: Classification accuracy (mean±std %) on the target domain over five independent runs on the Office-31 dataset. The best performance is indicated in bold and the second best in italics.

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Average |
|---|---|---|---|---|---|---|---|
| AlexNet (source only) (2012) | 60.6±0.4 | 95.4±0.2 | 99.0±0.1 | 64.2±0.3 | 45.5±0.5 | 48.3±0.5 | 68.8 |
| TCA (2011) | 59.0±0.0 | 90.2±0.0 | 88.2±0.0 | 57.8±0.0 | 51.6±0.0 | 47.9±0.0 | 65.8 |
| DDC (2014) | 61.0±0.5 | 95.0±0.3 | 98.5±0.3 | 64.9±0.4 | 47.2±0.5 | 49.4±0.4 | 69.3 |
| DAN (2015) | 68.5±0.3 | 96.0±0.1 | 99.0±0.1 | 66.8±0.2 | 50.0±0.4 | 49.8±0.3 | 71.7 |
| RevGrad (2015) | 73.0±0.5 | 96.4±0.3 | 99.2±0.3 | 72.3±0.3 | 52.4±0.4 | 50.4±0.5 | 74.1 |
| DRCN (2016) | 68.7±0.3 | 96.4±0.3 | 99.0±0.2 | 66.8±0.5 | 56.0±0.5 | 54.9±0.5 | 73.6 |
| MADA (2018) | 78.5±0.2 | **99.8±0.1** | **100±0.0** | 74.1±0.1 | 56.0±0.2 | 54.5±0.3 | 77.1 |
| MSTN (2018) | 80.5±0.4 | 96.9±0.1 | *99.9±0.1* | 74.5±0.4 | 62.5±0.4 | 60.0±0.6 | 79.1 |
| GCAN (2019) | *82.7±0.1* | *97.1±0.1* | 99.8±0.1 | *76.4±0.5* | **64.9±0.1** | *62.6±0.3* | *80.6* |
| DM-ADA (ours) | **83.9±0.4** | **99.8±0.1** | *99.9±0.1* | **77.5±0.2** | *64.6±0.4* | **64.0±0.5** | **81.6** |

### Classification on VisDA-2017

**Dataset.** The VisDA-2017 (Peng et al. 2017) challenge provides a large-scale dataset for visual domain adaptation. The training domain is composed of synthetic renderings of 3D models. The validation domain is made up of photo-realistic images drawn from MSCOCO (Lin et al. 2014). Both domains contain the same 12 classes of objects.

**Results.** Table 3 reports the results on the VisDA-2017 cross-domain classification dataset. The ResNet-101 model pretrained on ImageNet acts as the baseline. Our approach achieves the highest accuracy among all adaptation approaches and exceeds the baseline by a large margin. Under conditions of large domain shift, such as transferring from synthetic objects to real images in this task, we believe the triplet loss and soft labels play a critical role in excavating intermediate states between the two domains.

Table 3: Classification accuracy on the validation set of the VisDA-2017 challenge.

| Method | Accuracy (%) |
|---|---|
| ResNet-101 (source only) (2016) | 52.4 |
| RevGrad (2015) | 57.4 |
| DAN (2015) | 62.8 |
| JAN (2017) | 65.7 |
| GTA (2018) | 69.5 |
| MCD-DA (2018b) | 71.9 |
| ADR (2018a) | 73.5 |
| DM-ADA (ours) | **75.6** |
### Ablation Study

**Metrics.** Two metrics are employed. (1) The A-distance (Ben-David et al. 2010; Mansour, Mohri, and Rostamizadeh 2009) serves as a measure of cross-domain discrepancy. Given the extracted features of the two domains, an SVM classifier is trained to separate source and target features, and its generalization error is defined as $\epsilon$. The A-distance is then calculated as $d_{\mathcal{A}} = 2(1 - 2\epsilon)$. (2) Classification accuracy on the target domain serves as a measure of task-specific performance. In this part of the experiments, both metrics are evaluated on the task A→W.

Figure 4: The t-SNE (Maaten and Hinton 2008) visualization of the target domain's feature distribution on the transfer task SVHN→MNIST under four configurations: (a) baseline, (b) w/ PM, (c) w/ PM & FM, (d) full model. (Best viewed in color.)

Figure 5: Generated images with confidence 0.9, 0.7, and 0.5 using a 0/1 discriminator and our discriminator (task: SV→MN).

**Effect of pixel-level and feature-level mixup.** Table 4 examines the effectiveness of pixel-level mixup (PM) and feature-level mixup (FM). The first row uses only the images and feature embeddings of the two domains for training and serves as the baseline. In the fourth row, feature-level mixup achieves a notable improvement over the baseline, since the domain-invariant latent space is encouraged to be more continuous in this configuration. In the fifth row, pixel-level mixup further enhances the model's performance by guiding the discriminator to output soft scores between 0 and 1, which shows that it is an essential auxiliary scheme for feature-level mixup. In Figure 5, compared with a traditional 0/1 discriminator, our discriminator leads to more source-like generated images, which means the domain discrepancy can be further narrowed via pixel-level mixup.

Table 4: Effectiveness of pixel-level mixup (PM), feature-level mixup (FM), and triplet loss (Tri).

| PM | FM | Tri | A-distance | Accuracy (%) |
|---|---|---|---|---|
|  |  |  | 1.528 | 76.7 |
| ✓ |  |  | 1.519 | 78.1 |
| ✓ |  | ✓ | 1.508 | 79.4 |
|  | ✓ |  | 1.497 | 82.1 |
| ✓ | ✓ |  | 1.492 | 83.2 |
| ✓ | ✓ | ✓ | 1.489 | 83.9 |

**Effect of triplet loss.** In Table 4, we also evaluate another key component, the triplet loss (Tri). In the third and sixth rows, it can be observed that the model's performance improves after adding the triplet loss to the discriminator's training process, since this loss eases the convergence of the domain discriminator. We further use t-SNE (Maaten and Hinton 2008) to visualize the feature distribution of the target domain on the task SV→MN. As shown in Figure 4, the features of different classes are separated most clearly in the full model, i.e., with domain mixup on two levels and the triplet loss.

**Effect of $D_{cls}$ and pseudo target labels.** In order to conduct category-aware alignment between the source and target domains, the classification branch of the discriminator $D_{cls}$ and pseudo target labels are employed; their effectiveness is examined in Table 5. After appending $D_{cls}$, classification accuracy increases by 1.7%, since this branch encourages generated images to preserve the class information contained in the inputs, which makes domain adaptation operate on the same categories of the two domains. On this basis, pseudo target labels introduce the discriminative information of the target domain into the adaptation process and make the model's performance state-of-the-art.

Table 5: Effectiveness of $D_{cls}$ and pseudo target labels.

| $D_{cls}$ | pseudo | A-distance | Accuracy (%) |
|---|---|---|---|
|  |  | 1.503 | 80.6 |
| ✓ |  | 1.496 | 82.3 |
| ✓ | ✓ | 1.489 | 83.9 |
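For reference, the proxy A-distance used in Tables 4 and 5 is simple to compute from extracted features. Below is a sketch of the standard recipe; the paper does not specify the SVM variant, regularization, or train/test split, so those choices are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def a_distance(feat_s: np.ndarray, feat_t: np.ndarray) -> float:
    """Proxy A-distance d_A = 2 * (1 - 2 * eps), where eps is the
    generalization error of a binary classifier separating the domains."""
    X = np.concatenate([feat_s, feat_t], axis=0)
    y = np.concatenate([np.zeros(len(feat_s)), np.ones(len(feat_t))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = LinearSVC(C=1.0).fit(X_tr, y_tr)   # assumed linear SVM with default C
    eps = 1.0 - clf.score(X_te, y_te)        # generalization error
    return 2.0 * (1.0 - 2.0 * eps)
```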
### Sensitivity Analysis

In this section, we discuss our approach's sensitivity to the hyper-parameters $\omega$ and $\phi$, which trade off among losses with different orders of magnitude. Four hard-to-transfer tasks of the Office-31 dataset are used for evaluation. As shown in Figure 6, the transfer performance is not sensitive to variations of $\omega$ and $\phi$ around 0.1 and 0.01, respectively. Consequently, we set $\omega$ and $\phi$ to 0.1 and 0.01 for all tasks, and the transfer performance remains satisfactory.

Figure 6: Sensitivity analysis of $\omega$ (left, varied over 0.03-0.2) and $\phi$ (right, varied over 0.003-0.02) in terms of accuracy (%) on four hard-to-transfer tasks (A→W, A→D, D→A, W→A) of the Office-31 dataset.

## Conclusion

In this paper, we address the problem of unsupervised domain adaptation. A GAN-based architecture is constructed to transfer knowledge from the source domain to the target domain. In order to facilitate a more continuous domain-invariant latent space and fully utilize the inter-domain information, we propose domain mixup on pixel and feature level. Extensive experiments on adaptation tasks with different extents of domain shift and data complexity demonstrate the superior performance of our approach.

## Acknowledgments

This work was supported by the National Science Foundation of China (61976137, U1611461). It was also supported by the SJTU-BIGO Joint Research Fund and the CCF-Tencent Open Fund.

## References

Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; and Marchand, M. 2013. Domain-adversarial neural networks. CoRR abs/1412.4446.
Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151-175.
Berthelot, D.; Raffel, C.; Roy, A.; and Goodfellow, I. 2019. Understanding and improving interpolation in autoencoders via an adversarial regularizer. In ICLR.
Bousmalis, K.; Silberman, N.; Dohan, D.; Erhan, D.; and Krishnan, D. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR.
Ganin, Y., and Lempitsky, V. S. 2015. Unsupervised domain adaptation by backpropagation. In ICML.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. JMLR 17(1):2096-2030.
Ghifary, M.; Kleijn, W. B.; Zhang, M.; Balduzzi, D.; and Li, W. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In ECCV.
Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative adversarial nets. In NeurIPS.
Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. J. 2012. A kernel two-sample test. JMLR 13:723-773.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML.
Hull, J. J. 1994. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5):550-554.
Kang, G.; Zheng, L.; Yan, Y.; and Yang, Y. 2018. Deep adversarial attention alignment for unsupervised domain adaptation: The benefit of target expectation maximization. In ECCV.
Kingma, D. P., and Welling, M. 2013. Auto-encoding variational Bayes. CoRR abs/1312.6114.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS.
Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2016. Autoencoding beyond pixels using a learned similarity metric. In ICML.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.
Lin, T.; Maire, M.; Belongie, S. J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
Liu, M., and Tuzel, O. 2016. Coupled generative adversarial networks. In NeurIPS.
Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML.
Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2017. Deep transfer learning with joint adaptation networks. In ICML.
Long, M.; Cao, Z.; Wang, J.; and Jordan, M. I. 2018. Conditional adversarial domain adaptation. In NeurIPS.
Ma, X.; Zhang, T.; and Xu, C. 2019. GCAN: Graph convolutional adversarial network for unsupervised domain adaptation. In CVPR.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. JMLR 9:2579-2605.
Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain adaptation: Learning bounds and algorithms. In COLT.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop.
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10):1345-1359.
Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE Trans. Neural Networks 22(2):199-210.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NeurIPS Workshop.
Pei, Z.; Cao, Z.; Long, M.; and Wang, J. 2018. Multi-adversarial domain adaptation. In AAAI.
Peng, X.; Usman, B.; Kaushik, N.; Hoffman, J.; Wang, D.; and Saenko, K. 2017. VisDA: The visual domain adaptation challenge. CoRR abs/1710.06924.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C.; and Li, F. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211-252.
Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In ECCV.
Saito, K.; Ushiku, Y.; Harada, T.; and Saenko, K. 2018a. Adversarial dropout regularization. In ICLR.
Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018b. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
Sankaranarayanan, S.; Balaji, Y.; Castillo, C. D.; and Chellappa, R. 2018. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR.
Sun, B., and Saenko, K. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In ECCV Workshop.
Tran, L.; Sohn, K.; Yu, X.; Liu, X.; and Chandraker, M. 2019. Gotta adapt 'em all: Joint pixel and feature-level domain adaptation for recognition in the wild. In CVPR.
Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. CoRR abs/1412.3474.
Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR.
Xie, S.; Zheng, Z.; Chen, L.; and Chen, C. 2018. Learning semantic representations for unsupervised domain adaptation. In ICML.
Yan, H.; Ding, Y.; Li, P.; Wang, Q.; Xu, Y.; and Zuo, W. 2017. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In CVPR.
Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In NeurIPS.
Zhang, H.; Cissé, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018a. mixup: Beyond empirical risk minimization. In ICLR.
Zhang, W.; Ouyang, W.; Li, W.; and Xu, D. 2018b. Collaborative and adversarial network for unsupervised domain adaptation. In CVPR.