The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Consensus Adversarial Domain Adaptation

Han Zou,1 Yuxun Zhou,1 Jianfei Yang,2 Huihan Liu,1 Hari Prasanna Das,1 Costas J. Spanos1
1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
2School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore
{hanzou, yxzhou, liuhh, hpdas, spanos}@berkeley.edu, yang0478@ntu.edu.sg

Abstract

We propose a novel domain adaptation framework, Consensus Adversarial Domain Adaptation (CADA), that gives freedom to both the target encoder and the source encoder to embed data from both domains into a common domain-invariant feature space until they reach consensus during adversarial learning. In this manner, the domain discrepancy can be further minimized in the embedded space, yielding more generalizable representations. The framework is also extended into a new few-shot domain adaptation scheme (F-CADA) that remarkably enhances adversarial domain adaptation (ADA) performance by efficiently propagating the few labels available in the target domain. Extensive experiments are conducted on digit recognition across multiple benchmark datasets and on a real-world problem, Wi-Fi-enabled device-free gesture recognition under spatial dynamics. The results show the compelling performance of CADA versus state-of-the-art unsupervised domain adaptation (UDA) and supervised domain adaptation (SDA) methods. Numerical experiments also demonstrate that F-CADA can significantly improve adaptation performance even with sparsely labeled data in the target domain.

Introduction

Recent years have witnessed a booming development of deep learning methods, partly as a consequence of the availability of large amounts of labeled data to train and validate increasingly advanced models.
More often than not, recognition models trained on these large datasets perform extremely well in one domain, i.e., the source domain, yet fail to generalize to new datasets or new environments, i.e., the target domain, due to domain shift or dataset bias (Tzeng et al. 2017). To alleviate domain shift, a large body of research has been devoted to domain adaptation, which aims to distill the knowledge shared across domains and thereby improve the generalization of the learned model. Domain adaptation methods fall into two classes, unsupervised domain adaptation (UDA) and supervised domain adaptation (SDA), depending on whether labeled data is available in the target domain. In many real-world cases, collecting and annotating a huge number of samples in the target domain is time-consuming, labor-intensive and expensive. Thus, in practice, research on UDA is arguably more popular than SDA, since collecting unlabeled target data is usually a trivial task. Conventional UDA methods (e.g., DDC (Tzeng et al. 2014), RevGrad (Ganin and Lempitsky 2015) and DRCN (Ghifary et al. 2016)) map data from both domains into a common feature space to reduce the domain shift. This is achieved by minimizing some measure of distance between the target and source feature distributions, e.g., correlation distances or maximum mean discrepancy. The goal is to identify a feature space in which samples from the target and source domains are indistinguishable. Once this is accomplished, the model constructed in the source domain can be applied to tasks in the target domain by embedding the dataset with the learned transformation.

(Han Zou is the corresponding author. Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)

Meanwhile, with the unprecedented success of the Generative Adversarial Network (GAN) (Goodfellow et al.
2014), some researchers have proposed to construct an adversarial loss to accommodate the domain shift, which is commonly referred to as adversarial domain adaptation (ADA) or adversarial UDA (Tzeng et al. 2017). A GAN trains a generator and a discriminator in a min-max fashion: the generator learns to produce high-quality data that fool the discriminator, while the discriminator aims to distinguish real from synthetic data. Similar to the setup of a GAN, ADA minimizes an approximate domain discrepancy through an adversarial objective with respect to a domain discriminator. Through adversarial learning, it trains a source encoder and a target encoder such that a well-formed domain discriminator cannot determine the domain label of the encoded samples. Adversarial UDA methods, e.g., CoGAN (Liu and Tuzel 2016) and ADDA (Tzeng et al. 2017), achieve appealing performance compared to traditional UDA methods. However, in these methods the feature mapping is usually defined by the source encoder alone. More specifically, previous methods align the embedded feature representations of the target domain to the source domain by fixing the parameters of the source encoder during adversarial learning. Additionally, their network settings follow those of GANs exactly, where the real image distribution remains fixed and the synthesized one is learned to match it. As in a GAN, the source feature representation is treated as an absolutely good reference for the target, which freezes the parameters of the source encoder. In practice, however, the domain discrepancy between source and target can be considerable; this assumption might not always hold, and the target data cannot be completely embedded into the imposed representation space. Both of these concerns result in sub-optimal adaptation, particularly when the target representation is far from the source features in the latent space or the source encoder already exhibits over-fitting.
In this paper, we propose Consensus Adversarial Domain Adaptation (CADA), a novel unsupervised ADA scheme that gives freedom to both the target encoder and the source encoder. As such, they can reach consensus and transform data from both domains into a general domain-invariant feature space, further accommodating the domain discrepancy and avoiding over-fitted models in either domain. After obtaining a source encoder and a source classifier as a good reference in the source domain, CADA trains a target encoder and also gives freedom to the source encoder by fine-tuning it through adversarial learning. In this manner, both unlabeled target data and labeled source data are embedded into a domain-invariant feature space defined by both domains, such that a domain discriminator cannot distinguish their domain labels. The original source classifier is then fine-tuned as a shared classifier using the source dataset and the source encoder refined via ADA. In the target domain, we employ the trained target encoder to embed target samples into the domain-invariant feature space and infer their classes with the shared classifier. In certain applications, a few labeled samples can be collected opportunistically in the target domain. Such labels are scarce in practice but precious for model improvement. To leverage this extra information, we also propose a few-shot version of CADA (F-CADA), which exploits the prior labeled data via greedy label propagation for further performance enhancement. In a nutshell, given a well-defined metric in the latent feature space, F-CADA assigns presumptive labels to unlabeled data points in the target domain by greedily minimizing an information entropy loss. The target encoder is then fine-tuned and a target classifier is constructed using both the prior and the presumptively labeled data. The whole process can be repeated until convergence.
The class of each target test sample is inferred using the final target encoder and classifier. The performance of CADA and F-CADA is validated on the task of digit recognition across domains using standard digit adaptation datasets (MNIST, USPS, and SVHN) and on the task of spatial adaptation for Wi-Fi-enabled device-free gesture recognition (GR). Experimental results demonstrate that CADA achieves outstanding domain adaptation results and outperforms state-of-the-art methods on both digit adaptation and spatial adaptation for GR. For the challenging SVHN→MNIST scenario, it improves digit recognition accuracy from 60% to 91%. It also enhances GR accuracy by 25% over a non-adapted classifier under environmental dynamics. Moreover, F-CADA achieves a further performance gain over the best few-shot ADA methods when only one labeled target sample per class is available, validating that the proposed label learning method indeed contributes to the overall performance improvement.

Related Work

Unsupervised Domain Adaptation

The performance of conventional classifiers degrades severely when the data distributions in the source domain and the target domain differ. Unsupervised domain adaptation (UDA) aims to reduce the difference between the source and target feature distributions to improve generalization performance without requiring any labeled data in the target domain (Tzeng et al. 2017). Several metrics have been proposed to measure, and then minimize, the domain shift between the source and target domains. For instance, maximum mean discrepancy is leveraged by DDC (Tzeng et al. 2014), which estimates the norm of the mean difference and matches higher-order statistics of the two distributions in a reproducing kernel Hilbert space. RevGrad (Ganin and Lempitsky 2015) and DRCN (Ghifary et al. 2016) treat domain invariance as a binary classification problem and maximize the domain classifier's loss by reversing its gradients.
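As a concrete illustration of such a discrepancy measure, the sketch below computes the simplest (linear-kernel) instance of maximum mean discrepancy: the squared distance between the feature means of the two domains. The function name and toy features are ours, not from DDC, which uses richer kernels that also match higher moments.

```python
def linear_mmd(src_feats, tgt_feats):
    """Squared norm of the difference between the mean feature vectors of
    the source and target domains -- the linear-kernel special case of
    maximum mean discrepancy."""
    dim = len(src_feats[0])
    mu_s = [sum(x[d] for x in src_feats) / len(src_feats) for d in range(dim)]
    mu_t = [sum(x[d] for x in tgt_feats) / len(tgt_feats) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

# Identical feature means -> zero discrepancy; a shifted target -> positive.
print(linear_mmd([[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0]]))        # 0.0
print(linear_mmd([[0.0, 1.0], [2.0, 3.0]], [[2.0, 3.0]]) > 0.0)  # True
```

Minimizing such a statistic over the encoder parameters pulls the two embedded distributions together, which is exactly the mechanism the UDA methods above exploit.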
Adversarial Domain Adaptation

Recently, with the booming development of the Generative Adversarial Network (GAN) (Goodfellow et al. 2014), researchers have proposed to construct an adversarial loss to accommodate the domain shift, which is commonly referred to as adversarial domain adaptation (ADA) (Shen et al. 2018). Similar to the learning configuration of a GAN, the generator in ADA aims to fool the discriminator by making target-domain samples look like source-domain ones, while the discriminator tries to identify the domain label (source or target) instead of fake versus real images as in a GAN. CoGAN (Liu and Tuzel 2016) trains two GANs to synthesize both source and target images and achieves a domain-invariant feature space by tying the high-level layer parameters of the two GANs to solve the domain transfer problem. ADDA (Tzeng et al. 2017) learns a discriminative representation using the labels in the source domain and then a separate encoding that maps the target data to the same space through an asymmetric mapping learned with a standard GAN loss without weight sharing. A cycle-consistency loss is designed in CyCADA (Hoffman et al. 2017) to enforce both structural and semantic consistency during ADA. One major limitation of these methods is that the adversarial discriminative models focus on aligning the feature embeddings of the target domain to a source domain defined by the source encoder. Since the parameters of the source encoder are fixed during ADA, the source encoder has no freedom, and the ADA performance is not guaranteed when the target representation is far from the source features.

Supervised Domain Adaptation

Though UDA achieves acceptable performance using large amounts of unlabeled target data, it still cannot handle a large covariate shift between the sample distributions of the two datasets.
In reality, it is often reasonable to label a few samples for each class in the target dataset, and supervised domain adaptation (SDA) can then efficiently transfer knowledge from the source domain to this sparsely labeled target domain. Moreover, SDA does not demand the large annotation overhead commonly required by standard supervised learning approaches.

Figure 1: An overview of CADA. Step 1: A source encoder and a source classifier are trained with the labeled source data. Step 2: A target encoder is trained and the source encoder is fine-tuned through unsupervised adversarial domain adaptation to map both target data and source data to a domain-invariant feature space such that a domain discriminator cannot distinguish the domain labels of the data. Step 3: A shared classifier is constructed with the labeled source data. Step 4: During testing, the class of each target test sample is inferred using the target encoder trained in Step 2 and the shared classifier obtained in Step 3. The network parameters in solid-line boxes are fixed and those in dashed-line boxes are trained in each step.

In (Luo et al. 2017), the authors propose a framework that learns a representation transferable across different domains and tasks in a label-efficient manner; it tackles the high sensitivity and overfitting of the fine-tuning stage with a novel end-to-end SDA approach. Apart from feature-based SDA methods, Ren et al. (2018) designed a prototypical network from the perspective of metric learning, which maps samples into a space where samples from the same class are close and those from different classes are far apart. Recently, few-shot learning has become attractive because only a few labeled samples are required for adaptation. In domain adaptation, few-shot adversarial domain adaptation (FADA) (Motiian et al. 2017a) was proposed to transfer knowledge with only a few annotated samples per class.
It exploits adversarial learning to learn an embedded subspace that simultaneously maximizes the confusion between the two domains while semantically aligning their embeddings. However, FADA only considers the few labeled target samples and never uses the unlabeled target samples that are far easier to obtain. In our approach, we build few-shot domain adaptation on semi-supervised learning, which minimizes the domain confusion via adversarial training and guides the adaptation process in a few-shot manner.

Consensus Adversarial Domain Adaptation

The objective of CADA is to improve the generalization capability of a classifier across domains via ADA, without collecting labeled data in the target domain. The rationale behind CADA is to embed data from both domains into a common feature space until they reach consensus during ADA. This differs from existing methods, which force the representation of the target to align with the source. The training procedure of CADA is illustrated in Fig. 1 and consists of four steps, elaborated as follows.

Step 1: Suppose $N_s$ samples $X_s$ with labels $Y_s$ are collected in the source domain, with $L$ possible classes. As the first step of CADA, we train a source encoder $M_s$ and a source classifier $C_s$ so that the source samples are recognized with high classification accuracy. Mathematically, Step 1 solves the following minimization via backpropagation:

$$\min_{M_s, C_s} \mathcal{L}_{C_s}(X_s, Y_s) = -\,\mathbb{E}_{(x_s, y_s)\sim(X_s, Y_s)} \sum_{l=1}^{L} \mathbb{1}_{[l=y_s]} \log C_s(M_s(x_s)) \tag{1}$$

This step is indispensable, since a good baseline of the feature space and the classifier is needed for the subsequent steps.

Step 2: More often than not, data labeling in the target domain is a time-consuming and expensive process. On the other hand, accumulating unlabeled data in the target domain, denoted by $X_t$, is usually a trivial task.
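A minimal numerical sketch of the Step 1 objective in equation (1): given the class probabilities produced by $C_s(M_s(x_s))$, the one-hot indicator picks out the log-probability of the true class, and the loss averages over the batch. The toy logits and helper names are illustrative, not from the paper.

```python
import math

def softmax(z):
    """Convert raw classifier scores into class probabilities."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def source_loss(logits, labels):
    """Eq. (1): mean cross-entropy of C_s(M_s(x_s)) against the true labels;
    the indicator I[l = y_s] selects a single log-probability per sample."""
    loss = 0.0
    for z, y in zip(logits, labels):
        probs = softmax(z)
        loss += -math.log(probs[y])
    return loss / len(logits)

# Toy batch with L = 3 classes: a more confident correct logit lowers the loss.
print(source_loss([[2.0, 0.1, -1.0]], [0]) > source_loss([[5.0, 0.1, -1.0]], [0]))  # True
```

Minimizing this loss over both $M_s$ and $C_s$ is what produces the "good reference" encoder and classifier used by the later steps.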
As the most essential step of CADA, in Step 2 we train a target encoder $M_t$ and fine-tune the source encoder $M_s$ such that a discriminator $D$ cannot tell whether a sample comes from the source domain or the target domain after the associated feature mapping. In other words, after the feature embeddings $M_t(X_t)$ and $M_s(X_s)$ in the target and source domains, respectively, the domain label cannot be effectively recognized by a well-formed discriminator $D$. This task is similar to the original GAN, which aims to generate a fake image indistinguishable from a real one; in our case, the labels for the discriminator $D$ are domain labels (source and target) instead of fake and real. We formulate this step as the optimization of the following adversarial loss:

$$\min_{M_s, M_t} \max_{D} \mathcal{L}_D(X_s, X_t, M_s, M_t) = \mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] + \mathbb{E}_{x_t \sim X_t}[\log(1 - D(M_t(x_t)))] \tag{2}$$

The GAN loss for the source encoder $M_s$ is

$$\min_{M_s} \mathcal{L}_{M_s}(X_s, X_t, D) = -\,\mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] \tag{3}$$

and the inverted-label GAN loss (Goodfellow et al. 2014) is employed to train the target encoder $M_t$ as follows:

$$\min_{M_t} \mathcal{L}_{M_t}(X_s, X_t, D) = -\,\mathbb{E}_{x_t \sim X_t}[\log D(M_t(x_t))] \tag{4}$$

Figure 2: An overview of F-CADA. Steps 1 and 2 are the same as in CADA. Step 3: in the target domain, presumptive labels are generated for the unlabeled target data using the few labeled target samples; then the target encoder is fine-tuned and a target classifier is constructed with the unlabeled and few labeled target data. Step 4: During testing, the class of each target test sample is inferred using the target encoder and target classifier obtained in Step 3. The network parameters in solid-line boxes are fixed and those in dashed-line boxes are trained in each step.

The parameters of $M_t$ and $M_s$ are initialized with those of the source encoder learned in Step 1 for burn-in training. It is worth pointing out the novelty of CADA and its difference from state-of-the-art ADA methods (e.g., ADDA (Tzeng et al. 2017) and DIFA (Volpi et al. 2017)).
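The three losses in equations (2)-(4) can be sketched numerically. Given the discriminator's output probability of "source" for each embedded source and target sample, the snippet below evaluates each objective; the function names and toy probabilities are ours, and the signs assume the standard formulation where each encoder minimizes a negative log-likelihood.

```python
import math

def disc_objective(d_src, d_tgt):
    """Eq. (2): E[log D(M_s(x_s))] + E[log(1 - D(M_t(x_t)))], maximized by D."""
    a = sum(math.log(p) for p in d_src) / len(d_src)
    b = sum(math.log(1.0 - p) for p in d_tgt) / len(d_tgt)
    return a + b

def source_encoder_loss(d_src):
    """Eq. (3): the fine-tuned M_s keeps its embeddings recognizable as source."""
    return -sum(math.log(p) for p in d_src) / len(d_src)

def target_encoder_loss(d_tgt):
    """Eq. (4): inverted-label loss -- M_t wants D to call its output source."""
    return -sum(math.log(p) for p in d_tgt) / len(d_tgt)

# At consensus D is maximally confused, outputting 0.5 everywhere, and the
# discriminator objective sits at its equilibrium value -2*log(2).
print(abs(disc_objective([0.5, 0.5], [0.5]) + 2.0 * math.log(2.0)) < 1e-12)  # True
```

The key difference from ADDA-style training is visible here: because equation (3) exists, gradients also flow into $M_s$, so both encoders move toward the shared feature space rather than the target chasing a frozen source.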
Under previous methods, the parameters of the source encoder are fixed while the target encoder is trained via ADA. Consequently, the feature mapping is defined by the source encoder, and ADA essentially tries to align the feature embeddings of the target domain with the source domain. The source encoder thus serves as an absolute reference, which may deteriorate the domain adaptation performance: the alignment can be sub-optimal when the target samples cannot be completely embedded into the imposed representation space. The issue becomes substantial when the source and target domains exhibit material discrepancy or the source encoder already bears some overfitting. This bottleneck is well addressed in the proposed CADA framework, where the parameters of the source encoder are not fixed but are instead given the freedom to be fine-tuned together with the target encoder. The feature space is therefore defined by the consensus between $M_t$ and $M_s$, yielding better generalization in both domains.

Step 3: When the discriminator $D$ in Step 2 can no longer identify the domain labels of target and source samples, it indicates that the target encoder $M_t$ and the source encoder $M_s$ have reached consensus by mapping their input data into a shared domain-invariant feature space. Given that, we fix the parameters of the source encoder $M_s$ and train a shared classifier $C_{sh}$ using the labeled source data $\{X_s, Y_s\}$. The learning process is equivalent to minimizing the cross-entropy loss:

$$\min_{C_{sh}} \mathcal{L}_{C_{sh}}(X_s, Y_s) = -\,\mathbb{E}_{(x_s, y_s)\sim(X_s, Y_s)} \sum_{l=1}^{L} \mathbb{1}_{[l=y_s]} \log C_{sh}(M_s(x_s)) \tag{5}$$

The shared classifier $C_{sh}$ can be used directly in the target domain, since the target encoder $M_t$ has embedded the target samples into the domain-invariant feature space.
Step 4: During testing in the target domain, we map the target test samples into the domain-invariant feature space through the target encoder $M_t$ trained in Step 2, and then use the shared classifier $C_{sh}$ obtained in Step 3 to identify their categories without collecting any labeled target data.

In summary, the complete learning objective of CADA can be formulated as follows:

$$\mathcal{L}_{CADA}(X_s, X_t, Y_s, D, M_s, M_t) = \mathcal{L}_{C_s}(X_s, Y_s) + \mathcal{L}_D(X_s, X_t, M_s, M_t) + \mathcal{L}_{M_s}(X_s, X_t, D) + \mathcal{L}_{M_t}(X_s, X_t, D) + \mathcal{L}_{C_{sh}}(X_s, Y_s) \tag{6}$$

The training of CADA is thus equivalent to solving:

$$\min_{C_{sh}} \; \min_{M_t, M_s} \max_{D} \; \min_{M_s, C_s} \mathcal{L}_{CADA}(X_s, X_t, Y_s, D, M_s, M_t)$$

As illustrated in Fig. 1, we first train a source encoder $M_s$ and a source classifier $C_s$ with the labeled source data by optimizing $\mathcal{L}_{C_s}$ as described in equation (1). In Step 2, we train a target encoder $M_t$ and fine-tune the source encoder $M_s$ via adversarial learning by optimizing $\mathcal{L}_D$, $\mathcal{L}_{M_s}$ and $\mathcal{L}_{M_t}$, i.e., equations (2)-(4). Then, a shared classifier $C_{sh}$ is constructed in Step 3 with the labeled source data by optimizing $\mathcal{L}_{C_{sh}}$ as described in equation (5). During testing in Step 4, we employ the trained target encoder $M_t$ to map each test sample from the target domain into the domain-invariant feature space and use the shared classifier $C_{sh}$ directly to identify its category.

Few-shot Consensus Adversarial Domain Adaptation (F-CADA)

A powerful extension of the CADA framework is the ability to integrate the few labeled samples that may be available in the target domain for information fusion and model improvement. This task, although seemingly challenging in many other learning paradigms, can be achieved efficiently with F-CADA. Notation-wise, we assume that $N_s$ samples $X_s$ with labels $Y_s$ are available in the source domain and the target domain contains $N_t^u$ unlabeled samples $X_t^u$.
Additionally, a few samples, numbering $N_t^l$ and denoted by $X_t^l$, are assumed available with associated labels $Y_t^l$ in the target domain. To conform with the few-shot learning scenario, the number of labeled samples in the target domain is much smaller than that of unlabeled ones, i.e., $N_t^l \ll N_t^u$; it is also much smaller than the number of source-domain samples, i.e., $N_t^l \ll N_s$. The overall training procedure of F-CADA is presented in Fig. 2. All steps are identical to CADA except Step 3, detailed below.

Step 3: Suppose the few labeled samples $\{X_t^l, Y_t^l\}$ are available in the target domain. As the most vital step of F-CADA, we design a label learning algorithm that assigns presumptive labels $\hat{Y}_t^u$ to the unlabeled target samples $X_t^u$. We then fine-tune the target encoder obtained in Step 2 and build a target classifier $C_t$ using both the presumptively labeled target samples $\{X_t^u, \hat{Y}_t^u\}$ and the labeled target samples $\{X_t^l, Y_t^l\}$. Assuming $k_i$ labeled samples are available for class $i$ in the target domain, we can compute in the embedded space (1) the centroid vector $c_i$ for each class and (2) a similarity metric between each unlabeled target sample $x_{t,j}^u \in X_t^u$ and a given centroid, denoted by $\psi(f(x_{t,j}^u), c_i)$. Depending on the dimension of the transformed feature space, this similarity metric can simply be a Gaussian kernel to capture local similarity (Maaten and Hinton 2008), or the inverse of the Wasserstein distance (Shen et al. 2018) for better generalization with complex networks. Ideally, the semi-supervised scheme should be able to (1) identify the correct labels of the unlabeled target samples, and (2) update the encoder with the additional information.
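The greedy assignment step can be sketched as follows: class centroids are computed from the few labeled target embeddings, and each unlabeled embedding presumptively takes the label of its closest centroid (a Gaussian-kernel similarity $\psi$ is monotone in this Euclidean distance, so nearest centroid means highest similarity). The 2-D toy embeddings and helper names are ours, not the paper's.

```python
import math

def class_centroids(feats, labels, n_classes):
    """Mean embedding c_i of the k_i labeled target samples of each class."""
    sums = [[0.0] * len(feats[0]) for _ in range(n_classes)]
    counts = [0] * n_classes
    for f, y in zip(feats, labels):
        for d, v in enumerate(f):
            sums[y][d] += v
        counts[y] += 1
    return [[v / counts[i] for v in sums[i]] for i in range(n_classes)]

def greedy_propagate(unlabeled, centroids):
    """Assign each unlabeled embedding the label of its nearest centroid --
    the greedy step that (approximately) minimizes the entropy objective."""
    assigned = []
    for f in unlabeled:
        dists = [math.dist(f, c) for c in centroids]
        assigned.append(dists.index(min(dists)))
    return assigned

# Toy 2-D embedded space with two classes and one labeled sample each.
cents = class_centroids([[0.0, 0.0], [4.0, 4.0]], [0, 1], 2)
print(greedy_propagate([[0.5, 0.2], [3.5, 3.9]], cents))  # [0, 1]
```

In the full F-CADA procedure this assignment alternates with re-training the encoder on the presumptive labels, as described next.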
Using information entropy as the measure of goodness of separation, we formulate the joint objective as the following minimization:

$$\min_{\hat{y}_{t,j}^u,\; f \in \mathcal{H}} \mathcal{L}_U(X_t^u, X_t^l, Y_t^l) = \sum_{x_{t,j}^u \in X_t^u} H\!\left(\sigma\!\left(\psi\big(f(x_{t,j}^u), c_{\hat{y}_{t,j}^u}\big)/\tau\right)\right)$$

where $H(\cdot)$ is the entropy function, $\sigma(\cdot)$ is the softmax function, and $\tau$ is a decay factor that controls the neighborhood proximity. The problem is combinatorial in nature due to the discrete presumptive labels $\hat{y}_{t,j}^u$. We therefore establish an alternating approach that recursively performs (1) fixing the feature mapping $f$ and propagating presumptive labels by a greedy assignment, i.e., the $j$-th unlabeled sample is presumed to have the same label as its closest centroid, and (2) updating the feature mapping (the encoder) via supervised learning, treating the presumptive labels as true labels. The proposed greedy propagation, intuitively simple and easy to implement, in fact has theoretical guarantees, since the entropy objective is approximately submodular when the feature mapping is fixed; interested readers are referred to (Zhou and Spanos 2016) for a detailed theoretical analysis. The two steps are conducted alternately until the feature mapping and the presumptive label assignment converge; in practice, convergence is usually reached within a few iterations. Adding the above objective to that of CADA in equation (6) yields the overall learning formulation of F-CADA. In the testing step (Step 4 in Fig. 2), we map the target test samples into the latent feature space through the updated target encoder $M_t$ and apply the updated target classifier $C_t$ to identify their classes.

Experiments

We evaluate CADA and F-CADA on two real-world domain adaptation problems: 1) digit classification adaptation across three public benchmark digit datasets; 2) spatial adaptation for Wi-Fi-enabled device-free gesture recognition.

Digit Adaptation

Three public digit datasets, MNIST (LeCun et al.
1998), USPS (Hull 1994), and SVHN (Netzer et al. 2011), each consisting of 10 digit classes, are used in our digit adaptation experiments. We evaluate our methods across three adaptation shifts commonly adopted for digit adaptation assessment: MNIST→USPS, USPS→MNIST, and SVHN→MNIST. The models are trained on the full training sets and evaluated on the full testing sets. We leverage a variant of the LeNet architecture as the encoder and the classifier for CADA and F-CADA on all digit shifts. We repeat each digit adaptation experiment 50 times and perform model selection based on a recent Bayesian optimization technique (Malkomes, Schaff, and Garnett 2016) to identify optimal choices of all hyperparameters, e.g., the structure and dropout rate of the encoder, the decay factor of F-CADA, etc.

Figure 3: The t-SNE visualization of features embedded using distinct encoders in the target domain (SVHN→MNIST).

Performance of CADA

We compare the performance of CADA with three traditional UDA methods (DDC, RevGrad, DRCN) and three state-of-the-art adversarial UDA methods (CoGAN, ADDA, CyCADA). Table 1 reports the classification accuracies of these methods for each shift. Accuracies of the non-adapted source classifiers serve as the lower baseline, and those of target classifiers trained with the full labeled target training sets serve as the upper baseline. It can be observed that CADA outperforms the others.

Table 1: Digit adaptation across MNIST-USPS-SVHN datasets (accuracy, %).
| Method | MNIST→USPS | USPS→MNIST | SVHN→MNIST |
|---|---|---|---|
| Source only | 75.2±1.6 | 57.1±1.7 | 60.1±1.1 |
| DDC | 79.1±0.5 | 66.5±3.3 | 68.1±0.3 |
| RevGrad | 77.1±1.8 | 73.0±2.0 | 73.9 |
| DRCN | 91.8±0.1 | 73.7±0.04 | 82.0±1.6 |
| CoGAN | 91.2±0.1 | 89.1±0.8 | - |
| ADDA | 89.4±0.2 | 90.1±0.8 | 76.0±1.8 |
| CyCADA | 95.6±0.2 | 96.5±0.1 | 90.4±0.4 |
| CADA | 96.4±0.1 | 97.0±0.1 | 90.9±0.2 |

SDA and few-shot results for k = 1, ..., 7 labeled target samples per class:

- MNIST→USPS: CCSA 85.0 / 89.0 / 90.1 / 91.4 / 92.4 / 93.0 / 92.9; FADA 89.1 / 91.3 / 91.9 / 93.3 / 93.4 / 94.0 / 94.4; F-CADA 97.2 / 97.5 / 97.9 / 98.1 / 98.3 / 98.4 / 98.6
- USPS→MNIST: CCSA 78.4 / 82.2 / 85.8 / 96.1 / 88.8 / 89.6 / 89.4; FADA 81.1 / 84.2 / 87.5 / 89.9 / 91.1 / 91.2 / 91.5; F-CADA 97.5 / 97.8 / 98.1 / 98.4 / 98.6 / 98.8 / 98.9
- SVHN→MNIST: FT 65.5 / 68.6 / 70.7 / 7.3 / 74.5 / 74.6 / 75.4; FADA 72.8 / 81.8 / 82.6 / 85.1 / 86.1 / 86.8 / 87.2; F-CADA 94.8 / 95.1 / 95.4 / 95.5 / 95.6 / 95.9 / 96.1

Figure 4: Confusion matrices for the digit adaptation (SVHN→MNIST).

CADA outperforms the prior methods in all the aforementioned digit adaptation scenarios. For the relatively easy shifts between MNIST and USPS (both greyscale hand-written digit datasets), CADA enhances the accuracy in both adaptation directions by at least 21% over the lower baseline, lifting performance close to the supervised learning methods as well as the upper baseline, as demonstrated in Table 1. The adaptation for SVHN→MNIST is much more challenging, since SVHN is a color digit dataset of house number plates while MNIST contains uniform greyscale digits. Even in this case, CADA improves the accuracy by 31% over the lower baseline and outperforms the prior works. We use t-SNE (Maaten and Hinton 2008) to map the feature representations embedded by the different encoders into a 2-D space for better visualization of the domain shift. Fig. 3(a) and Fig. 3(b) depict the embedded features using the non-adapted source encoder and the CADA source encoder, respectively (different colors represent different digits). Confusion matrices before and after using CADA for this adaptation are presented in Fig. 4(a) and Fig. 4(b).
If we directly apply the non-adapted source encoder in the target domain, as shown in Fig. 3(a), the clusters of 3s and 5s, and of 4s and 9s, overlap each other, which leads to large misclassification rates among these digits, as shown in Fig. 4(a). After employing CADA, the clusters of these commonly confused digits are separated in the latent feature space (Fig. 3(b)), which contributes to the corresponding performance gain presented in the confusion matrix (Fig. 4(b)).

Performance of F-CADA

We randomly chose k (k = 1, ..., 7) labeled samples per class as the labeled target samples and utilized them for the Step 3 label learning of F-CADA. The performance of F-CADA is compared with an SDA method, CCSA (Motiian et al. 2017b), and an advanced few-shot adversarial SDA method, FADA (Motiian et al. 2017a).

Figure 5: Impact of the number of labeled target samples on labeling accuracy and classification accuracy (SVHN→MNIST). The shaded area spans the 5% and 95% percentiles.

For the SVHN→MNIST scenario, we also compare F-CADA to the source-only model fine-tuned on the available labeled target data (denoted FT in Table 1). It can be observed from Table 1 that F-CADA achieves significant performance gains over the current best SDA benchmarks in all scenarios. Another noteworthy point is that it achieves accuracy comparable to the upper baseline with only 7 labeled target samples per category. This impressive performance has two main causes. First, F-CADA inherits the advantages of CADA: as shown in Table 1, the accuracy of CADA is already higher than that of the SDA methods in several cases, and the target dataset embedded via CADA is an ideal input for the subsequent label learning of F-CADA. Second, the proposed label learning method makes full use of the few labeled target samples for accuracy enhancement. For instance, Fig.
5 depicts the label learning accuracy and classification accuracy for SVHN→MNIST when different numbers of labeled target samples are available. We can observe a positive correlation between label learning accuracy and classification accuracy as the number of labeled target samples increases. Comparing the confusion matrices of CADA (Fig. 4(b)) and F-CADA with k = 1 (Fig. 4(c)), the misclassification between 1s and 4s is reduced significantly by F-CADA with only one labeled target sample per class, leading to a 4% overall accuracy improvement. These analyses demonstrate the excellent few-shot domain adaptation performance of F-CADA even when the number of labeled target samples is tiny.

Table 2: Spatial adaptation for gesture recognition across different environments (accuracy, %).

| Scenario | Source only | RevGrad | DRCN | CoGAN | ADDA | CADA | F-CADA (k=1) | F-CADA (k=3) | F-CADA (k=5) | Target fully supervised |
|---|---|---|---|---|---|---|---|---|---|---|
| Large→Small | 58.4±0.7 | 68.1±0.2 | 69.3±0.3 | 69.4±0.2 | 71.5±0.3 | 88.8±0.1 | 92.3 | 96.3 | 98.7 | 99.2±0.1 |
| Small→Large | 62.2±0.6 | 66.6±1.1 | 65.8±0.7 | 70.2±0.5 | 67.7±0.6 | 87.4±0.1 | 91.7 | 96.0 | 98.3 | 99.1±0.1 |

Figure 6: Confusion matrices for gesture recognition (large conference room → small conference room).

Figure 7: Floor plan of the testbeds for the spatial adaptation experiments and sample CSI frames from different spatial sources.

Spatial Adaptation for Gesture Recognition

We also apply our methods to enhance the spatial adaptation capability of Wi-Fi-enabled device-free gesture recognition (GR). By leveraging fine-grained channel measurements (Channel State Information (CSI)) from the Wi-Fi physical layer together with advanced machine learning methods, numerous occupancy sensing tasks, e.g., crowd counting (Zou et al. 2018b), human activity recognition (Zou et al. 2018c), and even human identification (Zou et al. 2018a), have been realized in a device-free, privacy-preserving and non-intrusive manner.
Since human gestures also alter Wi-Fi signal propagation among Wi-Fi-enabled IoT devices, we can identify gestures in a device-free manner via the CSI-enabled sensing platform proposed in (Zou et al. 2018a). One major bottleneck is the tedious data collection and labeling process required to train a new gesture classifier whenever the system is deployed in a new environment; the classifier is also vulnerable to spatial variations. Thus, we aim to use our methods to improve the accuracy and resilience of the classifier under spatial dynamics with 1) no labeled target samples at all, or 2) only sparsely labeled ones. As shown in Fig. 7, the experiments were conducted in 2 conference rooms of different sizes, a large conference room (7 m × 5 m) and a small conference room (6.1 m × 4.4 m). Volunteers performed 6 common gestures (moving a hand right and left, up and down, push and pull) between the two IoT devices. 200 samples per gesture were collected in each room on different days. After transforming the CSI time series into CSI frames (each of size 400 × 228, as depicted in Fig. 7), we modified the LeNet architecture and designed a dedicated encoder and classifier for our methods.

Performance of CADA We compare the performance of CADA with 2 state-of-the-art traditional UDA methods (RevGrad and DRCN) and 2 adversarial UDA methods (CoGAN and ADDA). Table 2 summarizes the gesture classification accuracies of these methods in both adaptation directions between the 2 conference rooms. CADA enhances the accuracy by at least 25% over the lower baseline (the non-adapted source encoder) in both adaptation scenarios, without a tedious labeled target data collection and training process. Comparing the confusion matrices before and after using CADA (Fig. 6(a) and Fig. 6(b), large → small), the recognition accuracy of every gesture is improved.
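The paper does not publish the exact hyperparameters of its modified LeNet encoder, so the following is only a minimal sketch of how the feature-map sizes could be traced for a 400 × 228 CSI frame. The kernel, stride, and padding choices (and the layer names) are illustrative assumptions, not the authors' architecture.

```python
# Sketch: trace spatial sizes through an assumed conv-pool-conv-pool
# LeNet-style encoder for a 400 x 228 CSI frame. All layer hyperparameters
# below are hypothetical; the paper does not report them.

def conv2d_out(h, w, kernel, stride=1, pad=0):
    """Output spatial size of a conv layer (floor division, as in most DL frameworks)."""
    return ((h + 2 * pad - kernel) // stride + 1,
            (w + 2 * pad - kernel) // stride + 1)

def pool2d_out(h, w, kernel=2, stride=2):
    """Output spatial size of a max-pooling layer."""
    return ((h - kernel) // stride + 1, (w - kernel) // stride + 1)

def trace_encoder(h=400, w=228):
    """Trace one CSI frame through the assumed encoder, recording shapes."""
    shapes = [("input", h, w)]
    h, w = conv2d_out(h, w, kernel=5)   # conv1: 5x5, stride 1, no padding
    shapes.append(("conv1", h, w))
    h, w = pool2d_out(h, w)             # pool1: 2x2, stride 2
    shapes.append(("pool1", h, w))
    h, w = conv2d_out(h, w, kernel=5)   # conv2: 5x5, stride 1, no padding
    shapes.append(("conv2", h, w))
    h, w = pool2d_out(h, w)             # pool2: 2x2, stride 2
    shapes.append(("pool2", h, w))
    return shapes

if __name__ == "__main__":
    for name, h, w in trace_encoder():
        print(f"{name}: {h} x {w}")
```

A trace like this makes it easy to size the fully connected classifier head that consumes the flattened encoder output.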
It can be easily observed that CADA outperforms all the traditional and adversarial UDA approaches. It realizes Wi-Fi-enabled device-free gesture recognition that is resilient to spatial variations, without a time-consuming and labor-intensive data collection and labeling process in the new environment.

Performance of F-CADA We randomly chose k = 1, 3, 5 labeled samples per gesture as the labeled target samples and used them for Step 3 of the F-CADA training process. Similar to the digit adaptation results, F-CADA achieves a significant performance gain of at least 3.5% over the UDA methods when only one labeled sample per gesture is available in the target domain. Moreover, as demonstrated in Table 2 and Fig. 6, its accuracy increases further when a few more labeled samples are available. Its accuracy reaches 98.5% with 5-shot learning, which is only 0.6% lower than the upper baseline (a new classifier trained on fully labeled target samples).

Conclusion
In this paper, we proposed Consensus Adversarial Domain Adaptation (CADA), which gives freedom to both the target encoder and the source encoder during adversarial learning by embedding data from both domains into a consensus domain-invariant feature space. In this manner, the domain discrepancy can be further minimized. CADA's feature representations are more robust to large domain shift and help avoid over-fitting in both domains. A novel few-shot domain adaptation scheme (F-CADA) is also proposed to enhance the ADA performance by exploiting a few labeled target data points in an efficient way. Building on CADA's feature representation, F-CADA assigns presumptive labels to unlabeled data points in the target domain by greedily minimizing an information entropy loss function. The greedy label learning method has theoretical guarantees since the entropy objective is approximately submodular.
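The greedy entropy-minimizing label learning is only summarized in this section, so the following is a minimal sketch under simplifying assumptions: a k-nearest-neighbour entropy surrogate over Euclidean distances in the embedded feature space, with points processed in order of distance to the current labeled set. The function names, the tie-breaking rule, and the neighbourhood size are illustrative choices, not the authors' implementation.

```python
# Sketch: greedy propagation of a few seed labels over embedded target
# features by locally minimizing a Shannon-entropy objective. This is a
# simplified stand-in for the paper's label learning step, not its code.
import numpy as np

def entropy(labels):
    """Shannon entropy of a label multiset."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def greedy_label_learning(features, seed_idx, seed_labels, k=3):
    """Greedily assign presumptive labels starting from a few labeled seeds.

    At each step, the unlabeled point closest to the labeled set receives the
    candidate label that minimizes the entropy of its k nearest labeled
    neighbours' labels together with the candidate itself.
    """
    n = len(features)
    labels = np.full(n, -1)
    labels[seed_idx] = seed_labels
    classes = np.unique(seed_labels)
    labeled = set(seed_idx)
    while len(labeled) < n:
        unlabeled = [i for i in range(n) if i not in labeled]
        lab = np.array(sorted(labeled))
        # pick the unlabeled point nearest to any labeled point
        d = np.linalg.norm(features[unlabeled][:, None, :] -
                           features[lab][None, :, :], axis=2)
        i = unlabeled[int(d.min(axis=1).argmin())]
        # k nearest labeled neighbours of point i
        di = np.linalg.norm(features[lab] - features[i], axis=1)
        nn = lab[np.argsort(di)[:k]]
        # break entropy ties in favour of the nearest neighbour's label
        cand = sorted(classes, key=lambda c: c != labels[nn[0]])
        best = min(cand, key=lambda c: entropy(np.append(labels[nn], c)))
        labels[i] = best
        labeled.add(i)
    return labels
```

On two well-separated clusters with one seed each, this propagation recovers the cluster labels; in the paper's setting the inputs would be the CADA-embedded target features rather than raw data.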
Extensive real-world experiments are conducted on digit recognition across multiple benchmark digit datasets and on Wi-Fi-enabled device-free gesture recognition under spatial dynamics. The results validate that CADA achieves compelling results and outperforms state-of-the-art UDA and SDA methods, and that F-CADA can further enhance the adaptation performance even with sparsely labeled target data.

Acknowledgments
This work is supported by a 2018 Seed Fund Award from CITRIS and the Banatao Institute at the University of California. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References
Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, 1180–1189.
Ghifary, M.; Kleijn, W. B.; Zhang, M.; Balduzzi, D.; and Li, W. 2016. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, 597–613. Springer.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2017. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213.
Hull, J. J. 1994. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
Liu, M.-Y., and Tuzel, O. 2016. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, 469–477.
Luo, Z.; Zou, Y.; Hoffman, J.; and Fei-Fei, L. 2017. Label efficient learning of transferable representations across domains and tasks.
In Advances in Neural Information Processing Systems, 164–176.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
Malkomes, G.; Schaff, C.; and Garnett, R. 2016. Bayesian optimization for automated model selection. In Advances in Neural Information Processing Systems, 2900–2908.
Motiian, S.; Jones, Q.; Iranmanesh, S.; and Doretto, G. 2017a. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, 6673–6683.
Motiian, S.; Piccirilli, M.; Adjeroh, D. A.; and Doretto, G. 2017b. Unified deep supervised domain adaptation and generalization. In The IEEE International Conference on Computer Vision (ICCV), volume 2.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.
Ren, M.; Triantafillou, E.; Ravi, S.; Snell, J.; Swersky, K.; Tenenbaum, J. B.; Larochelle, H.; and Zemel, R. S. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.
Shen, J.; Qu, Y.; Zhang, W.; and Yu, Y. 2018. Wasserstein distance guided representation learning for domain adaptation. In AAAI.
Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; and Darrell, T. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474.
Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464.
Volpi, R.; Morerio, P.; Savarese, S.; and Murino, V. 2017. Adversarial feature augmentation for unsupervised domain adaptation. arXiv preprint arXiv:1711.08561.
Zhou, Y., and Spanos, C. J. 2016. Causal meets submodular: Subset selection with directed information. In Advances in Neural Information Processing Systems, 2649–2657.
Zou, H.; Zhou, Y.; Yang, J.; Gu, W.; Xie, L.; and Spanos, C. J. 2018a.
WiFi-based human identification via convex tensor shapelet learning. In AAAI.
Zou, H.; Zhou, Y.; Yang, J.; and Spanos, C. J. 2018b. Device-free occupancy detection and crowd counting in smart buildings with WiFi-enabled IoT. Energy and Buildings 174:309–322.
Zou, H.; Zhou, Y.; Yang, J.; and Spanos, C. J. 2018c. Towards occupant activity driven smart buildings via WiFi-enabled IoT devices and deep learning. Energy and Buildings 177:12–22.