Published as a conference paper at ICLR 2020

DOMAIN ADAPTIVE MULTIBRANCH NETWORKS

Roger Bermúdez-Chacón, Mathieu Salzmann, & Pascal Fua
Computer Vision Laboratory
École Polytechnique Fédérale de Lausanne
Station 14, CH-1015 Lausanne, Switzerland
{roger.bermudez,mathieu.salzmann,pascal.fua}@epfl.ch

ABSTRACT

We tackle unsupervised domain adaptation by accounting for the fact that different domains may need to be processed differently to arrive at a common feature representation effective for recognition. To this end, we introduce a deep learning framework where each domain undergoes a different sequence of operations, allowing some, possibly more complex, domains to go through more computations than others. This contrasts with state-of-the-art domain adaptation techniques that force all domains to be processed with the same series of operations, even when using multi-stream architectures whose parameters are not shared. As evidenced by our experiments, the greater flexibility of our method translates to higher accuracy. Furthermore, it allows us to handle any number of domains simultaneously.

1 INTRODUCTION

While deep learning has ushered in great advances in automated image understanding, it still suffers from the same weaknesses as all other machine learning techniques: when trained with images obtained under specific conditions, deep networks typically perform poorly on images acquired under different ones. This is known as the domain shift problem: the changing conditions cause the statistical properties of the test, or target, data to differ from those of the training, or source, data, and the network's performance degrades accordingly. Domain adaptation aims to address this problem, especially when annotating images from the target domain is difficult, expensive, or downright infeasible.

The dominant trend is to map images to features that are immune to the domain shift, so that the classifier works equally well on the source and target domains (Fernando et al., 2013; Ganin & Lempitsky, 2015; Sun & Saenko, 2016). In the context of deep learning, the standard approach is to find those features using a single architecture for both domains (Tzeng et al., 2014; Ganin & Lempitsky, 2015; Sun & Saenko, 2016; Yan et al., 2017; Zhang et al., 2018). Intuitively, however, as the domains have different properties, it is not easy to find one network that does this effectively for both. A better approach is to allow domains to undergo different transformations to arrive at domain-invariant features. This has been the focus of recent work (Tzeng et al., 2017; Bermúdez-Chacón et al., 2018; Rozantsev et al., 2018; 2019), where source and target data pass through two different networks with the same architecture but different, though mutually related, weights.

In this paper, we introduce a novel, even more flexible paradigm for domain adaptation that allows the different domains to undergo different computations, not only in terms of layer weights but also in terms of number of operations, while selectively sharing subsets of these computations. This enables the network to automatically adapt to situations where, for example, one domain depicts simpler images, such as synthetic ones, which may not need as much processing power as those coming from more complex domains, such as images taken in the wild.
Our formulation reflects the intuition that source and target domain networks should be similar because they solve closely related problems, but should also perform domain-specific computations to offset the domain shift. To turn this intuition into a working algorithm, we develop a multibranch architecture that sends the data through multiple network branches in parallel. What gives it the necessary flexibility are trainable gates that are tuned to modulate and combine the outputs of these branches, as shown in Fig. 1. Assigning to each domain its own set of gates allows the global network to learn what set of computations should be carried out for each one. As an additional benefit, in contrast to previous strategies for untying the source and target streams (Rozantsev et al., 2018; 2019), our formulation naturally extends to more than two domains.

Figure 1: A Domain Adaptive Multibranch Network is a sequence of computational units $f^{(i)}$, each of which processes the data in parallel branches, whose outputs are then aggregated in a weighted manner by a gate to obtain a single response. To allow for domain-adaptive computations, each domain has its own set of gates, one for each computational unit, which combine the branches in different ways. As a result, some computations are shared across domains while others are domain-specific.

In other words, our contribution is a learning strategy that adaptively adjusts the specific computation to be performed for each domain. To demonstrate that it constitutes an effective approach to extracting domain-invariant features, we implement it in conjunction with the popular domain classifier-based method of Ganin & Lempitsky (2015). Our experiments demonstrate that our Domain Adaptive Multibranch Networks, which we will refer to as DAMNets, not only outperform the original technique of Ganin & Lempitsky (2015), but also the state-of-the-art strategy of Rozantsev et al. (2019) for untying the source and target weights, which relies on the same domain classifier. We will make our code publicly available upon acceptance of the paper.

2 RELATED WORK

Domain Adaptation. Domain adaptation has achieved important milestones in recent years (Dai et al., 2007; Gretton et al., 2009; Pan et al., 2010; Fernando et al., 2013; Sun et al., 2016; Shu et al., 2018), with deep learning-based methods largely taking the lead in performance. The dominant approach to deep domain adaptation is to learn a domain-invariant data representation. This is commonly achieved by finding a mapping to a feature space where the source and target features have the same distribution. In Tzeng et al. (2014); Long et al. (2015; 2017); Yan et al. (2017), the distribution similarity was measured in terms of Maximum Mean Discrepancy (Gretton et al., 2007), while other metrics based on second- and higher-order statistics were introduced in Sun & Saenko (2016); Koniusz et al. (2017); Sun et al. (2017). In Saito et al. (2018), the distribution alignment process was disambiguated by exploiting the class labels, and in Häusser et al. (2017); Shkodrani et al. (2018) by leveraging anchor points that associate embeddings between the domains.
Another popular approach to learning domain-invariant features is to train a classifier to recognize the domain from which a sample was drawn, and to use adversarial training to arrive at features that this classifier can no longer discriminate (Tzeng et al., 2015; Ganin et al., 2016; 2017). This idea has spawned several recent adversarial domain adaptation techniques for classification (Hu et al., 2018; Zhang et al., 2018), semantic segmentation (Hoffman et al., 2018; Chen et al., 2018; Hong et al., 2018), and active learning (Su et al., 2019), and we will use such a classifier as well.

Closest in spirit to our approach are methods that do not share the weights of the networks that process the source and target data (Tzeng et al., 2017; Bermúdez-Chacón et al., 2018; Rozantsev et al., 2018; 2019). In Tzeng et al. (2017), the weights were simply allowed to vary freely. In Rozantsev et al. (2018); Bermúdez-Chacón et al. (2018), it was shown that regularizing them to remain close to each other was beneficial. More recently, Rozantsev et al. (2019) proposed to train small networks to map the source weights to the target ones. While these methods indeed untie the source and target weights, the source and target data still undergo the same computations, i.e., the same number of operations. In this paper, we argue that the amount of computation, that is, the network capacity, should adapt to each domain and reflect its complexity. We rely on a domain classifier as in Tzeng et al. (2015); Ganin et al. (2016; 2017). However, we do not force the source and target samples to go through the same transformations, which is counterintuitive since they display different appearance statistics. Instead, we start from the premise that they should undergo different computations and use domain-specific gates to turn this premise into our DAMNet architecture.

Dynamic Network Architectures. As the performance of a neural network is tightly linked to its structure, there has been a recent push towards automatically determining the best architecture for the problem at hand. While neural architecture search techniques (Zoph & Le, 2017; Liu et al., 2018; 2019; Pham et al., 2018; Zoph et al., 2018; Real et al., 2019; Noy et al., 2019) aim to find one fixed architecture for a given dataset, other works have focused on dynamically adapting the network structure at inference time (Graves, 2016; Ahmed & Torresani, 2017; Shazeer et al., 2017; Veit & Belongie, 2018; Wu et al., 2018). In particular, in Ahmed & Torresani (2017); Shazeer et al. (2017); Veit & Belongie (2018); Bhatia et al. (2019), gates were introduced for this purpose. While our DAMNets also rely on gates, their role is very different: first, we work with data coming from different domains, whereas these gated methods, with the exception of Bhatia et al. (2019), were all designed for the single-domain scenario. Second, and more importantly, these techniques aim to define a different computational path for every test sample. By contrast, we seek to determine the right computation for each domain. Another consideration is that we freeze our gates for inference, while these methods must constantly update theirs. We believe this to be ill-suited to domain adaptation, particularly because learning to adapt the gates for the target domain, for which only unlabeled data is available, is severely under-constrained.
This lack of supervision may be manageable when one seeks to define operations for a whole domain, but not when these operations are sample-specific.

3 METHOD

We now describe our deep domain adaptation approach, which automatically adjusts the computations that the different domains undergo. We first introduce the multibranch networks that form the backbone of our DAMNet architecture and then discuss training in the domain adaptation scenario.

3.1 MULTIBRANCH NETWORKS

Figure 2: A computational unit $f^{(i)}$ is an aggregation of the outputs of parallel computations, or branches, $f^{(i)}_j$.

Let us first consider a single domain. In this context, a traditional deep neural network can be thought of as a sequence of $N_f$ operations $\{f^{(i)}(\cdot)\}_{1 \le i \le N_f}$, each transforming the output of the previous one. Given an input image $x$, this can be expressed as

$$x^{(i)} = f^{(i)}(x^{(i-1)}) \; . \quad (1)$$

As a general convention, each operation $f^{(i)}(\cdot)$ can represent either a single layer or multiple ones. Our formulation extends this definition by replacing each $f^{(i)}$ by multiple parallel computations, as shown in Fig. 2. More specifically, we replace each $f^{(i)}$ by a computational unit $\{f^{(i)}_1, \ldots, f^{(i)}_K\}$ consisting of $K$ parallel branches. Note that $K$ can be different at each stage of the network and should therefore be denoted as $K^{(i)}$. However, to simplify notation, we drop this index below. Given this definition, we write the output of each computational unit as

$$x^{(i)} = \hat{\Sigma}\big(f^{(i)}_1(x^{(i-1)}), \ldots, f^{(i)}_K(x^{(i-1)})\big) \; , \quad (2)$$

where $\hat{\Sigma}(\cdot)$ is an aggregation operator that could be defined in many ways. It could be a simple summation that gives all outputs equal importance, or, at the opposite end of the spectrum, a multiplexer that selects a single branch and ignores the rest. To cover the range between these two alternatives, we introduce learnable gates that enable the network to determine what relative importance the different branches should be given.

Our gates perform a weighted combination of the branch outputs. Each gate is controlled by a set of $K$ activation weights $\{\phi^{(i)}_j\}_{1 \le j \le K}$, and a unit returns

$$x^{(i)} = \sum_{j=1}^{K} \phi^{(i)}_j f^{(i)}_j(x^{(i-1)}) \; . \quad (3)$$

If $\phi^{(i)}_j = 1$ for all $j$, the gate performs a simple summation. If $\phi^{(i)}_j = 1$ for a single $j$ and $0$ for the others, it behaves as a multiplexer. The activation weights $\phi^{(i)}_j$ enable us to modulate the computational graph of network block $f^{(i)}$. To bound them and encourage the network to either select or discard each branch in a computational unit, we write them in terms of sigmoid functions with adaptive steepness. That is,

$$\phi^{(i)}_j = \Big(1 + \exp\big(-\pi^{(i)} g^{(i)}_j\big)\Big)^{-1} \; , \quad (4)$$

where the $g^{(i)}_j$ are learnable unbounded model parameters, and $\pi^{(i)}$ controls the plasticity of the activation, that is, the rate at which $\phi^{(i)}_j$ varies between the extreme values 0 and 1 for block $i$. During training, we initially set $\pi^{(i)}$ to a small value, which enables the network to explore different gate configurations. We then apply a cooling schedule to our activations, progressively increasing $\pi^{(i)}$ over time, so as to encourage the gates to reach a firm decision.

Note that our formulation does not require $\sum_{j=1}^{K} \phi^{(i)}_j = 1$, that is, we do not require the aggregated output $x^{(i)}$ to be a convex combination of the branch outputs $f^{(i)}_j(x^{(i-1)})$. This is deliberate: allowing the activation weights to be independent from one another provides additional flexibility for the network to learn general additive relationships.
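To make the gating mechanism concrete, the following is a minimal PyTorch sketch of a single computational unit implementing Eqs. 2-4. The names (ComputationalUnit, gate_logits, steepness) are ours, not from the paper's released code, and the sketch assumes all branches produce outputs of the same shape.

```python
import torch
import torch.nn as nn

class ComputationalUnit(nn.Module):
    """K parallel branches combined by sigmoid-activated gates (Eqs. 2-4)."""

    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)  # f_1, ..., f_K, matching output shapes
        # Unbounded gate parameters g_j of Eq. 4; the paper initializes them so
        # that every activation starts at 1/K (no branch is initially favored).
        self.gate_logits = nn.Parameter(torch.zeros(len(branches)))
        self.steepness = 1.0  # pi, raised over training by a cooling schedule

    def forward(self, x):
        # Eq. 4: phi_j = sigmoid(pi * g_j), bounded in (0, 1).
        phi = torch.sigmoid(self.steepness * self.gate_logits)
        # Eq. 3: weighted sum of branch outputs; phi need not sum to 1.
        return sum(w * f(x) for w, f in zip(phi, self.branches))
```

Because the weights are free rather than normalized, a unit can smoothly interpolate between a plain sum (all phi close to 1) and a multiplexer (a single phi close to 1, the rest near 0), which is exactly the spectrum described above.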
Finally, a Multibranch Network is the concatenation of multiple computational units, as shown in Fig. 1. For the aggregation within each unit $f^{(i)}$ to be possible, the outputs of the $f^{(i)}_j$ must be of matching shapes. Furthermore, as in standard networks, two computational units can be attached only if the output shape of the first one matches the input shape of the second. Although it would be possible to define computational units at any point in the network architecture, in practice we usually take them to correspond to groups of layers that are semantically related. For example, one would group a succession of convolution, pooling, and non-linear operations into the same computational unit.

3.2 DOMAIN ADAPTIVE MULTIBRANCH NETWORKS

3.2.1 TWO DOMAINS

Our goal is to perform domain adaptation, that is, to leverage a large amount of labeled images $X^s = \{x^s_1, \ldots, x^s_N\}$ with corresponding annotations $Y^s = \{y^s_1, \ldots, y^s_N\}$, drawn from a source domain, to train a model for a target domain whose data distribution is different and for which we only have access to unlabeled images $X^t = \{x^t_1, \ldots, x^t_M\}$.

To this end, we extend the gated networks of Section 3.1 by defining two sets of gates, one for the source domain and one for the target one. Let $\{(\phi^s)^{(i)}_j\}_{j=1}^{K}$ and $\{(\phi^t)^{(i)}_j\}_{j=1}^{K}$ be the corresponding source and target activation weights for computational unit $f^{(i)}$, respectively. Given a sample $x^d$ coming from a domain $d \in \{s, t\}$, we take the corresponding output of the $i$-th computational unit to be

$$(x^d)^{(i)} = \sum_{j=1}^{K} (\phi^d)^{(i)}_j f^{(i)}_j\big((x^d)^{(i-1)}\big) \; . \quad (5)$$

Note that, under this formulation, the domain identity $d$ of the sample is required in order to select the appropriate $(\phi^d)^{(i)}$. The concatenated computational units forming the DAMNet encode sample $x$ from domain $d$ into a feature vector $z = f(x, d)$. Since the gates for different domains are set independently from one another, the outputs of the branches of each computational unit are combined in a domain-specific manner, dictated by the activation weights $(\phi^d)^{(i)}_j$. Therefore, the samples are encoded to a common space, but arrive at it through potentially different computations. Fig. 3 depicts this process. Ultimately, the network can learn to share weights for computational unit $f^{(i)}$ by setting $(\phi^s)^{(i)}_j = (\phi^t)^{(i)}_j$ for all $j$. It can also learn to fully untie the weights by having $A^s_i \cap A^t_i = \emptyset$, where $A^s_i$ and $A^t_i$ denote the sets of non-zero activations in the two domains. Finally, in contrast to Tzeng et al. (2017); Bermúdez-Chacón et al. (2018); Rozantsev et al. (2018; 2019), it can learn to use more computation for one domain than for the other by setting $(\phi^s)^{(i)}_j > 0$ for two different branches $f^{(i)}_j$ while having only a single non-zero $(\phi^t)^{(i)}_j$, for a particular computational unit $f^{(i)}$.

Figure 3: Computational graphs for the source (top) and target (bottom) domains, for the same network. While both domains share the same computational units, their outputs are obtained by different aggregations of their inner operations; e.g., in the first unit, the source domain does not use the middle two operations, whereas the target domain does; by contrast, both exploit the fourth operation. In essence, this scheme adapts the amount of computation that each domain is subjected to.

The above formulation treats all branches of each computational unit as potentially sharable between domains. However, it is sometimes desirable not to share at all. For example, batch-normalization layers, which accumulate and update statistics of the data over time, even during the forward pass, are best exposed to a single domain so as to learn domain-specific statistics. We allow for this by introducing computational units whose gates are fixed, yet domain-specific, and that therefore act as multiplexers.
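A minimal sketch of the domain-specific gating of Eq. 5, extending the ComputationalUnit above, may help fix ideas. DomainAdaptiveUnit and DAMNetFeatures are our own names, and the sketch again assumes all branch outputs share one shape.

```python
import torch
import torch.nn as nn

class DomainAdaptiveUnit(nn.Module):
    """Shared branches, but one gate set per domain (Eq. 5)."""

    def __init__(self, branches, num_domains):
        super().__init__()
        self.branches = nn.ModuleList(branches)
        # One row of gate parameters (g^d)_j per domain d.
        self.gate_logits = nn.Parameter(torch.zeros(num_domains, len(branches)))
        self.steepness = 1.0

    def forward(self, x, d):
        # The domain index d selects the activation weights (phi^d)_j.
        phi = torch.sigmoid(self.steepness * self.gate_logits[d])
        return sum(w * f(x) for w, f in zip(phi, self.branches))

class DAMNetFeatures(nn.Module):
    """Feature extractor z = f(x, d): a chain of domain-adaptive units."""

    def __init__(self, units):
        super().__init__()
        self.units = nn.ModuleList(units)

    def forward(self, x, d):
        for unit in self.units:
            x = unit(x, d)
        return x  # common embedding, reached via domain-specific computations
```

In this sketch, the fixed, non-shared units mentioned above (e.g., around batch normalization) could be obtained by replacing gate_logits with a frozen one-hot buffer per domain, so that each domain is routed to its own branch.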
After the last computational unit, a small network $p_y$ operates directly on the encodings and returns the class assignment $\hat{y} = p_y(z)$, thus subjecting the encodings of all samples to the same set of operations.

3.2.2 MULTIPLE DOMAINS

The formulation outlined above extends naturally to more than two domains, by assigning one set of gates per domain. This enables us to exploit annotated data from different source domains, and even to potentially handle multiple target domains simultaneously. In this generalized case, we introduce governing sets of gates with activations $\phi^{d_1}, \ldots, \phi^{d_D}$ for $D$ different domains. They act in the same way as in the two-domain case, and the overall architecture remains similar.

3.2.3 TRAINING

When training our models, we jointly optimize the gate parameters $(g^d)^{(i)}_j$ of Eq. 4 along with the other network parameters, using standard back-propagation. To this end, we make use of a composite loss function, designed to encourage correct classification of labeled samples from the source domain(s) and to align the distributions of all domains, using labeled and unlabeled samples. This loss can be expressed as

$$\mathcal{L}_{\mathrm{DAMNet}} = \frac{1}{|\ell|} \sum_{n \in \ell} L_y(y_n, \hat{y}_n) + \frac{1}{|\ell \cup u|} \sum_{n \in \ell \cup u} L_d(d_n, \hat{d}_n) \; , \quad (6)$$

where $\ell$ and $u$ are the sets of labeled and unlabeled samples, respectively, and where we assume, without loss of generality, that the samples are ordered.

The first term in this loss, $L_y(y, \hat{y})$, is the standard cross-entropy, which compares the ground-truth class probabilities $y$ with the predicted ones $\hat{y} = p_y(z)$, where, as discussed in Section 3.2.1, $z = f(x, d)$ is the feature encoding of sample $x$ from domain $d$. For the second term, which encodes distribution alignment, we rely on the domain confusion strategy of Ganin & Lempitsky (2015), which is commonly used in existing frameworks. Specifically, for $D$ domains, we make use of an auxiliary domain classifier network $p_d$ that predicts a $D$-dimensional vector of domain probabilities $\hat{d}$ given the feature vector $z$. Following the gradient reversal technique of Ganin & Lempitsky (2015), we express the second term in our loss as $L_d(d, \hat{d}) = -\sum_{i=1}^{D} d_i \log(\hat{d}_i)$, where $d$ is the $D$-dimensional binary vector encoding the ground-truth domain, $d_i$ denotes the $i$-th element of $d$, and $\hat{d} = p_d(R(z))$, with $R$ the gradient reversal pseudo-function of Ganin & Lempitsky (2015), which makes it possible to incorporate adversarial training directly into back-propagation. That is, with this loss, standard back-propagation jointly trains the domain classifier to discriminate the domains and the feature extractor $f(\cdot)$ to produce features that fool this classifier.

When training is complete and the gates have reached a stable state, the branches whose activations are close to zero are deactivated. This prevents the network from performing irrelevant computations and yields a more compact network to process the target data.
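As a concrete reference, below is a minimal PyTorch sketch of the gradient reversal pseudo-function $R$ and of the composite loss of Eq. 6. GradReverse and damnet_loss are our own names, and the sketch assumes the classifiers output unnormalized logits.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """R(z): identity on the forward pass, sign-flipped gradient on the
    backward pass, so that minimizing L_d trains p_d to discriminate the
    domains while training f to confuse it (Ganin & Lempitsky, 2015)."""

    @staticmethod
    def forward(ctx, z):
        return z.view_as(z)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def damnet_loss(class_logits, labels, domain_logits, domain_ids, labeled):
    """Eq. 6: mean class cross-entropy over labeled samples, plus mean
    domain cross-entropy over all (labeled and unlabeled) samples.
    `labeled` is a boolean mask; domain_logits = p_d(GradReverse.apply(z))."""
    loss_y = F.cross_entropy(class_logits[labeled], labels)
    loss_d = F.cross_entropy(domain_logits, domain_ids)
    return loss_y + loss_d
```

With this formulation, a single backward pass updates the gate parameters, the branch weights, and both classifiers at once; no alternating optimization is needed.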
4 EVALUATION

4.1 BASELINES

Since we rely on the domain confusion loss to train our model, we treat the Domain-Adversarial Neural Network (DANN) method of Ganin & Lempitsky (2015) as our first baseline. To demonstrate the benefits of our approach over simply untying the source and target stream parameters, we compare it against the Residual Parameter Transfer (RPT) method of Rozantsev et al. (2019), which constitutes the state of the art in doing so. Note that RPT also relies on the domain confusion loss, which makes our comparison fair. In addition, we report the results of directly applying a model trained on the source domain to the target one, without any domain adaptation, which we refer to as "No DA". We also provide the oracle accuracy of a model trained on the fully-labeled target domain, referred to as "On TD".

4.2 IMPLEMENTATION DETAILS

We adapt different network architectures to the multibranch paradigm for different adaptation problems. In all cases, we initialize our networks' parameters by training the original versions of those architectures on the source domains, either from scratch, for simple architectures, or by fine-tuning weights learned on ImageNet, for very deep ones. We then set the parameters of all branches to the values of the corresponding layers. We perform this training on the predefined training splits, when available, or on 75% of the images, otherwise. The initial values of the gate parameters are defined so as to set the activations to $\frac{1}{K}$ for each of the $K$ branches. This prevents our networks from initially favoring a particular branch for any domain.

To train our networks, we use Stochastic Gradient Descent with a momentum of 0.9 and a variable learning rate defined by the annealing schedule of Ganin & Lempitsky (2015) as $\mu_p = \frac{\mu_0}{(1 + \alpha p)^{\beta}}$, where $p$ is the training progress relative to the total number of training epochs, $\mu_0$ is the initial learning rate, which we take to be $10^{-2}$, and $\alpha = 10$ and $\beta = 0.75$ as in Ganin & Lempitsky (2015). We eliminate exploding gradients by $\ell_2$-norm clipping. Furthermore, we modulate the plasticity of the activations at every gate as $\pi^{(i)} = \frac{1}{1-p}$, that is, we make the plasticity decay linearly as training progresses, so that the gates progressively harden. As data preprocessing, we apply mean subtraction, as in Ganin & Lempitsky (2015).

We train for 200 epochs, during which the network is exposed to all the image data from the source and target domains, but only to the annotations from the source domain(s). Our On TD oracle is trained on either the preset training splits, when available, or our defined training data, and evaluated on the corresponding test data. For the comparison to this oracle to be meaningful, we follow the same strategy for our DAMNets. That is, we use the unlabeled target data from the training splits only and report results on the testing splits. This protocol differs from that of Rozantsev et al. (2019), which relied on a transductive evaluation, where all the target images, training and test ones, were seen by the networks during training.
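The schedules above can be summarized in a few lines. This is a sketch under our reading of the text, in particular the plasticity schedule $\pi^{(i)} = 1/(1-p)$, under which the plasticity $1-p$ decays linearly while the steepness grows; the function names are hypothetical.

```python
import math

def learning_rate(p, mu0=1e-2, alpha=10.0, beta=0.75):
    """Annealing schedule of Ganin & Lempitsky (2015):
    mu_p = mu0 / (1 + alpha * p)^beta, with p in [0, 1) the training progress."""
    return mu0 / (1.0 + alpha * p) ** beta

def gate_steepness(p, cap=1e3):
    """Cooling schedule: pi grows as 1 / (1 - p), so the sigmoids of Eq. 4
    harden and the gates converge to firm, near-binary decisions."""
    return min(1.0 / (1.0 - p), cap)

def initial_gate_logit(num_branches, pi0):
    """Solve sigmoid(pi0 * g) = 1/K for g (K >= 2), so that no branch is
    favored at initialization, as prescribed in Section 4.2."""
    phi0 = 1.0 / num_branches
    return math.log(phi0 / (1.0 - phi0)) / pi0
```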
4.3 IMAGE RECOGNITION

We evaluate our method on the task of image recognition, for which we use several domain adaptation benchmark problems: Digits, which comprises three domains, MNIST (LeCun et al., 1998), MNIST-M (Ganin & Lempitsky, 2015), and SVHN (Netzer et al., 2011); Office (Saenko et al., 2010), which contains three domains, Amazon, DSLR, and Webcam; Office-Home (Venkateswara et al., 2017), with domains Art, Clipart, Product, and Real; and VisDA17 (Peng et al., 2018), with Synthetic and Real images. As all of these are well-studied benchmark datasets, we provide full descriptions and image examples evidencing the different degrees of domain shift in Appendix B.

Table 1: Domain adaptation datasets and results. We compare the accuracy of our DAMNet approach with that of DANN (Ganin & Lempitsky, 2015) and of RPT (Rozantsev et al., 2019), on image classification tasks commonly used to evaluate domain adaptation methods. Our DAMNets yield a significant accuracy boost in the presence of large domain shifts, particularly when using more than one source domain. A more comprehensive evaluation on all datasets is provided in Appendix D.

           Digits: MNIST (M), MNIST-M (MM), SVHN (S)     Office-Home: Art (A), Clipart (C), Product (P), Real (R)
Source(s)  M      S      M      MM     M,MM   M,MM†  |  A      C      C      R      A      C      P      C,P    A,C,P
Target     MM     M      S      S      S      S      |  P      P      A      A      R      R      R      R      R
No DA      52.25  54.90  25.57  27.49  33.52  22.88  |  37.03  36.67  29.65  50.91  53.12  43.03  46.42  59.39  58.72
DANN       76.66  73.90  31.69  37.43  44.16  49.02  |  58.50  70.50  47.93  57.68  56.40  57.90  62.30  70.53  72.00
RPT        82.24  78.70  34.72  37.90  n/a    n/a    |  54.51  63.18  47.32  51.90  52.15  55.05  62.16  n/a    n/a
Ours       88.80  81.30  37.95  39.41  51.83  79.45  |  59.30  77.50  51.24  60.74  59.90  62.70  65.00  72.25  77.65
On TD      96.21  99.26  89.23  89.23  89.23  96.07  |  87.66  87.66  64.42  64.42  77.80  77.80  77.80  77.80  77.80

† Evaluated with a ResNet-50 backbone.

Setup. As discussed in Section 3, our method is general and can work with any feed-forward network architecture. To showcase this, for the digit recognition datasets, we apply it to the LeNet and SVHNet architectures (Ganin & Lempitsky, 2015), which are very simple convolutional networks, well suited to small images. Following Ganin & Lempitsky (2015), we employ LeNet when using the synthetic datasets MNIST and MNIST-M as source domains, and SVHNet when SVHN acts as source domain. We extend these architectures to multibranch ones by defining the computational units as the groups of consecutive convolution, pooling, and non-linear operations defined in the original models. For simplicity, we use as many branches within each computational unit as we have domains, and all branches of a computational unit follow the same architecture, which we provide in Appendix A, Figures 1 and 2. As the backbone network for all the remaining datasets, we use a ResNet-50 (He et al., 2016) with the bottleneck-layer modification of Rozantsev et al. (2019). While many multibranch configurations can be designed for such a deep network, we choose to make our gated computational units coincide with the layer groupings defined in He et al. (2016), namely conv1, conv2_x, conv3_x, conv4_x, and conv5_x. The resulting multibranch network is depicted in Appendix A, Figure 4. We feed our DAMNets images resized to 224×224 pixels, as expected by ResNet-50.

Results. The results for the digit recognition and Office-Home datasets are provided in Table 1. Results for the Office and VisDA17 datasets are presented in Appendix D. Our approach outperforms the baselines in all cases. For the Digits datasets, in addition to the traditional two-domain setup, we also report results obtained when using two source domains simultaneously. Note that the reference method RPT (Rozantsev et al., 2019) does not apply to this setting, since it was designed to transform a single set of source parameters to the target ones. Altogether, our method consistently outperforms the others. Note that the first two columns correspond to the combinations reported in the literature.
We believe, however, that the SVHN → MNIST one is quite artificial since, in practice, one would typically annotate simpler, synthetic images and aim to use real ones at test time. We therefore also report synthetic → SVHN cases, which are much more challenging. The multi-source version of our method achieves a significant boost over the baselines in this scenario. To further demonstrate the potential of our approach in this setting, we replaced its backbone with the much deeper ResNet-50 network and applied it to upscaled versions of the images. As shown in the column marked with a †, this allowed us to achieve an accuracy close to 80%, which is remarkable for such a difficult adaptation task. On Office-Home, the gap between DAMNet and the baselines is again consistent across the different domain pairs. Note that, here, because of the relatively large number of classes, the overall performance is low for all methods. Importantly, our results show that we gain performance by training on more than one source domain; by leveraging all synthetic domains to transfer to the real one, our approach reaches an accuracy virtually equal to that of using full supervision on the target domain. Despite our best efforts, we were unable to obtain convincing results for RPT using the authors' publicly available code, as results for this dataset were not originally reported for RPT.

Gate dynamics. To understand the way our networks learn the domain-specific branch assignments, we track the state of the gates for all computational units over all training epochs. In Figure 4, we plot the corresponding evolution of the gate activations for the DSLR+Webcam → Amazon task on Office. Note that our DAMNet leverages different branches over time for each domain before reaching a firm decision. Interestingly, we can see that, with the exception of the first unit, which performs low-level computations, DSLR and Webcam share all branches. By contrast, Amazon, which has a significantly different appearance, mostly uses its own branches, except in two computational units. This evidences that our network successfully recognizes when domains are similar and can thus use similar computations.

Figure 4: Evolution of the gate activations for each of the computational units in a multibranch ResNet-50 network, for the Office DSLR + Webcam → Amazon domain adaptation problem. The top two rows show the gates for the source domains, and the bottom row those for the target one. All branches are initialized to parameters obtained from a single ResNet-50 trained on ImageNet. Note how, for the first computational unit, conv1, each domain chooses to process the data with different branches. In the remaining units, the two source domains, which have similar appearance, share all the computations. By contrast, the target domain still uses its own branches in conv3_x and conv4_x to account for its significantly different appearance. By conv5_x, the data has been converted to a domain-agnostic representation, and hence the same branch can operate on all domains.

4.4 OBJECT DETECTION

We evaluate our method on the detection of drones in video frames, using the UAV-200 dataset (Rozantsev et al., 2018), which contains examples of drones both generated artificially and captured from real video footage. Full details and example images are provided in Appendix B.3.

Table 2: Average precision of our DAMNet approach and of several reference methods, for domain adaptation from synthetic to real images of drones.

Method                                 Average precision
No adaptation                          0.377
DANN (Ganin & Lempitsky, 2015)         0.715
ADDA (Tzeng et al., 2017)              0.731
Two-stream (Rozantsev et al., 2018)    0.732
RPT (Rozantsev et al., 2019)           0.743
DAMNet                                 0.792

Setup. Our domain adaptation leverages both the synthetic examples of drones, as source domain, and the limited amount of annotated real drones, as target domain, as well as the background negative examples, to predict the class of patches from the validation set of real images. We closely follow the supervised setup and network architecture of Rozantsev et al. (2019), including the use of AdaDelta as optimizer, cross-entropy as loss function, and average precision as evaluation metric. Our multibranch computational units are defined as groupings of successive convolutions, non-linearities, and pooling operations.
The details of the architecture are provided in Appendix A, Figure 3.

Results. Our method considerably surpasses all the others in terms of average precision, as shown in Table 2, thus validating DAMNets as effective models for leveraging synthetic data for domain adaptation in real-world problems.

4.5 DAMNET AS A GENERAL FEATURE EXTRACTOR

We validate the effectiveness of our method as a feature extractor by combining it with the Maximum Classifier Discrepancy (MCD) method of Saito et al. (2018). As MCD operates on the extracted encodings, we replace the encoding strategy that MCD uses, which is the same as DANN's, with our DAMNet. In other words, we replace the domain classifier in our approach with the corresponding MCD term. Specifically, we use a single computational unit with two branches, each of which replicates the architecture proposed in Saito et al. (2018). We present the results of combining MCD with DAMNet in Table 3. In all tested scenarios, we obtain improvements over using MCD as originally proposed.

Table 3: We boost the method of Saito et al. (2018) by replacing its feature extraction with our DAMNets.

                    MCD accuracy
                    without DAMNet   with DAMNet
MNIST-M → SVHN      38.54            41.51
DSLR → Amazon       67.24            67.81
Webcam → Amazon     64.33            66.19
Clipart → Real      63.50            63.87

4.6 BRANCH ARCHITECTURES

To obtain more insight into specific branch decisions, we evaluate the effects of adding extra branches to the network, as well as of using branches with different capacities.

4.6.1 BRANCHES WITH DIFFERENT CAPACITIES

When computational units are composed of branches of different capacities, DAMNets often assign the branches with more capacity to the more complex domains. To exemplify this, we trained a modified multibranch SVHNet for adaptation between MNIST and SVHN. Instead of the identical branches originally used, we replace the second branch in each computational unit with a similar branch whose convolutions use 1×1 rather than 5×5 kernels. These second branches, with 25 times fewer parameters each, are mostly used by the simpler domain, MNIST, in this case. We provide the gate evolution that reflects this in Appendix C, Figures 5 and 6.
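For concreteness, here is a minimal sketch of one such mixed-capacity pair of branches, under our own assumptions about padding (the paper specifies only the kernel sizes). Per input/output channel pair, a 5×5 kernel carries 25 times the weights of a 1×1 kernel, which matches the parameter ratio quoted above.

```python
import torch.nn as nn

def mixed_capacity_branches(in_ch, out_ch):
    """Two branches with commensurate output shapes but very different
    capacities: per channel pair, the 5x5 branch has 25x the weights of
    the 1x1 branch. Padding keeps the spatial dimensions identical."""
    heavy = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
                          nn.ReLU())
    light = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                          nn.ReLU())
    return [heavy, light]  # e.g., ComputationalUnit(mixed_capacity_branches(1, 64))
```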
4.6.2 ADDITIONAL BRANCHES

We explore the effects of using more branches than domains, so as to provide the networks with alternative branches to choose from. In particular, we explore the case where K = D + 1. We evaluate multibranch LeNet and ResNet architectures under this setting and show the gate activation evolution in Appendix C, Figures 7 and 8. During the training process, we have observed that the networks quickly choose to ignore the extra branches when K > D, which suggests that these branches do not contribute to the learning of our feature extraction. We did not find experimental evidence that K > D is beneficial.

5 CONCLUSION

We have introduced a domain adaptation approach that allows for adaptive, separate computations for different domains. Our framework relies on computational units that aggregate the outputs of multiple parallel operations, and on sets of trainable domain-specific gates that adapt the aggregation process to each domain. Our experiments have demonstrated the benefits of this approach over the state-of-the-art weight-untying strategy; the greater flexibility of our method translates into consistently better accuracy. Although we only experimented with using the same branch architectures within a computational unit, our framework generalizes to arbitrary branch architectures, the only constraint being that their outputs are of commensurate shapes. An interesting avenue for future research would therefore be to automatically determine the best operation to perform for each domain, for example by combining our approach with neural architecture search strategies.

REFERENCES

K. Ahmed and L. Torresani. Connectivity Learning in Multi-Branch Networks. In NIPS Meta-Learning Workshop, 2017.

R. Bermúdez-Chacón, P. Márquez-Neila, M. Salzmann, and P. Fua. A Domain-Adaptive Two-Stream U-Net for Electron Microscopy Image Segmentation. In International Symposium on Biomedical Imaging, pp. 400–404, April 2018.

P. Bhatia, K. Arumae, and E. B. Celikkaya. Dynamic Transfer Learning for Named Entity Recognition. In International Workshop on Health Intelligence, pp. 69–81. Springer, 2019.

Y. Chen, W. Li, and L. Van Gool. ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes. In Conference on Computer Vision and Pattern Recognition, pp. 7892–7901, 2018.

W. Dai, Q. Yang, G. R. Xue, and Y. Yu. Boosting for Transfer Learning. In Machine Learning, pp. 193–200, 2007.

B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised Visual Domain Adaptation Using Subspace Alignment. In International Conference on Computer Vision, pp. 2960–2967, 2013.

Y. Ganin and V. Lempitsky. Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning, pp. 1180–1189, 2015.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-Adversarial Training of Neural Networks.
In Domain Adaptation in Computer Vision Applications, pp. 189–209, 2017.

A. Graves. Adaptive Computation Time for Recurrent Neural Networks. arXiv preprint, 2016.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Method for the Two-Sample Problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.

A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of the Royal Statistical Society, 3(4):5–13, 2009.

P. Häusser, T. Frerix, A. Mordvintsev, and D. Cremers. Associative Domain Adaptation. In International Conference on Computer Vision, pp. 2784–2792, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In International Conference on Machine Learning, pp. 1989–1998, 2018.

W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional Generative Adversarial Network for Structured Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 1335–1344, 2018.

L. Hu, M. Kan, S. Shan, and X. Chen. Duplex Generative Adversarial Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 1498–1507, 2018.

P. Koniusz, Y. Tas, and F. Porikli. Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors. In Conference on Computer Vision and Pattern Recognition, pp. 4478–4487, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, pp. 2278–2324, 1998.

C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive Neural Architecture Search. In European Conference on Computer Vision, 2018.

H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations, 2019.

M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning, pp. 97–105, 2015.

M. Long, J. Wang, and M. I. Jordan. Deep Transfer Learning with Joint Adaptation Networks. In International Conference on Machine Learning, pp. 2208–2217, 2017.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In Advances in Neural Information Processing Systems, 2011.

A. Noy, N. Nayman, T. Ridnik, N. Zamir, S. Doveh, T. Friedman, R. Giryes, and L. Zelnik-Manor. ASAP: Architecture Search, Anneal and Prune. arXiv preprint arXiv:1904.04123, 2019.

S. J. Pan, I. Tsang, J. Kwok, and Q. Yang. Domain Adaptation via Transfer Component Analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.

X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and K. Saenko. VisDA: A Synthetic-to-Real Benchmark for Visual Domain Adaptation. In Conference on Computer Vision and Pattern Recognition Workshops, 2018.

H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient Neural Architecture Search via Parameter Sharing. In International Conference on Machine Learning, 2018.

E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized Evolution for Image Classifier Architecture Search.
In AAAI Conference on Artificial Intelligence, volume 33, pp. 4780–4789, 2019.

A. Rozantsev, M. Salzmann, and P. Fua. Residual Parameter Transfer for Deep Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 4339–4348, 2018.

A. Rozantsev, M. Salzmann, and P. Fua. Beyond Sharing Weights for Deep Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801–814, 2019.

K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting Visual Category Models to New Domains. In European Conference on Computer Vision, pp. 213–226, 2010.

K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2018.

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations, 2017.

S. Shkodrani, M. Hofmann, and E. Gavves. Dynamic Adaptation on Non-Stationary Visual Domains. In European Conference on Computer Vision, 2018.

R. Shu, H. Bui, H. Narui, and S. Ermon. A DIRT-T Approach to Unsupervised Domain Adaptation. In International Conference on Learning Representations, 2018.

J. Su, Y. Tsai, K. Sohn, B. Liu, S. Maji, and M. Chandraker. Active Adversarial Domain Adaptation. In Conference on Computer Vision and Pattern Recognition Workshops, 2019.

B. Sun and K. Saenko. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In European Conference on Computer Vision, pp. 443–450, 2016.

B. Sun, J. Feng, and K. Saenko. Correlation Alignment for Unsupervised Domain Adaptation. arXiv preprint, 2016.

B. Sun, J. Feng, and K. Saenko. Correlation Alignment for Unsupervised Domain Adaptation. In Domain Adaptation in Computer Vision Applications, pp. 153–171, 2017.

E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep Domain Confusion: Maximizing for Domain Invariance. arXiv preprint, 2014.

E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous Deep Transfer Across Domains and Tasks. In International Conference on Computer Vision, pp. 4068–4076, 2015.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial Discriminative Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, 2017.

A. Veit and S. Belongie. Convolutional Networks with Adaptive Inference Graphs. In European Conference on Computer Vision, pp. 3–18, 2018.

H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep Hashing Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 5018–5027, 2017.

Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. Davis, K. Grauman, and R. Feris. BlockDrop: Dynamic Inference Paths in Residual Networks. In Conference on Computer Vision and Pattern Recognition, pp. 8817–8826, 2018.

H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo. Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 2272–2281, 2017.

W. Zhang, W. Ouyang, W. Li, and D. Xu. Collaborative and Adversarial Network for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, pp. 3801–3809, 2018.

B. Zoph and Q. Le. Neural Architecture Search with Reinforcement Learning. In International Conference on Learning Representations, 2017.
B. Zoph, V. Vasudevan, J. Shlens, and Q. Le. Learning Transferable Architectures for Scalable Image Recognition. In Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.

A MULTIBRANCH ARCHITECTURES

Below, we provide the network architectures and detailed building blocks of our Domain Adaptive Multibranch Networks, for the single source domain case (D = 2). In the figures, each computational unit is enclosed by dotted lines, and the input and output shapes of all layer groupings are provided.

Figure 1: Multibranch LeNet, a multibranch extension of the LeNet used by DANN (Ganin & Lempitsky, 2015). Branches: Conv 5×5, 32 ch → MaxPool 2×2 → Conv 5×5, 48 ch → MaxPool 2×2. Feature classifier: Fully-conn. 100 → Fully-conn. 100 → Fully-conn. 10. Domain classifier: Fully-conn. 100 → Fully-conn. D.

Figure 2: Multibranch SVHNet, a multibranch extension of the SVHNet used by DANN (Ganin & Lempitsky, 2015). Branches: Conv 5×5, 64 ch → MaxPool 3×3 str. 2 → Conv 5×5, 64 ch → MaxPool 3×3 str. 2 → Conv 5×5, 128 ch. Feature classifier: Fully-conn. 3072 → Fully-conn. 2048 → Fully-conn. 10. Domain classifier: Fully-conn. 1024 → Fully-conn. 1024 → Fully-conn. D.

Figure 3: Multibranch architecture for drone detection, a multibranch extension of the one used by Rozantsev et al. (2019). Branches: Conv 5×5, 32 ch → MaxPool 2×2 str. 2 → Conv 3×3, 64 ch → MaxPool 2×2 str. 2 → Conv 3×3, 128 ch → MaxPool 2×2 str. 2. Feature classifier: Fully-conn. 512 → Fully-conn. 2. Domain classifier: Fully-conn. 512 → Fully-conn. 128 → Fully-conn. D.

Figure 4: Multibranch ResNet-50, adapted from the original ResNet-50 (He et al., 2016), preserving the layer groupings described in the original paper: conv1 (Conv 7×7, 64 ch; MaxPool 3×3 str. 2), followed by the bottleneck blocks conv2_x (Conv 1×1, 64 ch; Conv 3×3, 64 ch; Conv 1×1, 256 ch), conv3_x (Conv 1×1, 128 ch; Conv 3×3, 128 ch; Conv 1×1, 512 ch), conv4_x (Conv 1×1, 256 ch; Conv 3×3, 256 ch; Conv 1×1, 1024 ch), and conv5_x (Conv 1×1, 512 ch; Conv 3×3, 512 ch; Conv 1×1, 2048 ch), with BatchNorm throughout, then Average Pool 7×7 and Fully-conn. 256. Feature classifier: Fully-conn. N, where N denotes the number of classes in the dataset. Domain classifier: Fully-conn. 1024 → Fully-conn. 1024 → Fully-conn. D.

B BENCHMARK DATASET DESCRIPTIONS

B.1 DIGIT RECOGNITION

MNIST (LeCun et al., 1998) consists of black-and-white images of handwritten digits from 0 to 9. All images are of size 28×28 pixels. The standard training and testing splits contain 60,000 and 10,000 examples, respectively. MNIST-M (Ganin & Lempitsky, 2015) is synthetically generated by randomly replacing the foreground and background pixels of random MNIST samples with natural images. Its image size is 32×32, and the standard training and testing splits contain 59,001 and 9,001 images, respectively.
SVHN (Netzer et al., 2011), the Street View House Numbers dataset, consists of natural scene images of numbers acquired from Google Street View. Its images are also of size 32×32 pixels, and its preset training and testing splits contain 73,257 and 26,032 images, respectively. The SVHN images are centered on the desired digit, but contain clutter, visual artifacts, and distractors from the surroundings.

B.2 OBJECT RECOGNITION

Office (Saenko et al., 2010) is a multiclass object recognition benchmark dataset containing images of 31 categories of objects commonly found in office environments. It contains color images from three different domains: 2,817 images of products scraped from Amazon, 498 images acquired using a DSLR digital camera, and 795 images captured with a webcam. The images are of arbitrary sizes and aspect ratios. Example classes include backpack, bicycle, bike helmet, bookcase, bottle, calculator, chair, desk lamp, computer, file cabinet, and headphones.

Office-Home (Venkateswara et al., 2017) comprises a larger corpus of color, arbitrarily-sized images from 65 different classes of objects commonly found in office and home environments, coming from four different domains. It contains 2,427 images extracted from paintings (Art), 4,365 clipart images (Clipart), 4,439 photographs of products (Product), and 4,357 pictures captured with a regular consumer camera (Real world). Example classes include alarm clock, backpack, battery, bed, bicycle, bottle, bucket, calculator, calendar, candle, chair, clipboard, and computer.

VisDA 2017 (Peng et al., 2018) includes images of diverse sizes from 12 different categories, coming from two different domains: 55,368 synthetic renders of 3D models and 152,397 photographs of the real-world objects. It is larger than the other two datasets and exhibits a more significant domain shift. Its categories are aeroplane, bicycle, bus, car, horse, knife, motorcycle, person, plant, skateboard, train, and truck.

B.3 OBJECT DETECTION

UAV-200 aggregates 200 images of real drones and around 33,000 synthetic ones, as well as around 190,000 patches obtained from the background of the video, which do not contain drones and are used as negative examples. All examples are of size 40×40 pixels. We evaluate performance on a validation set comprising 3,000 positive and 135,000 negative patches.

C ADDITIONAL EXPERIMENTS

Figure 5: Gate evolution for a multibranch SVHNet with branches of different capacities. Branch 1 is the original branch, which applies 5×5 convolutions to the image, whereas branch 2 has a similar architecture but with 1×1 convolutions instead. The network quickly recognizes that SVHN requires more complex processing and hence assigns the higher-capacity branch to it in computational units 1 and 3.

Figure 6: Gate evolution for a multibranch LeNet with branches of different capacities. We have simplified the architecture to encapsulate the feature extraction into a single computational unit in this case. Similarly to the above, we modify the second branch to perform a simpler computation. The original branches apply convolutions that extract 32 channels with a 5×5 kernel, and then extract 48 channels from those with another 5×5 kernel. In the second branch, we replace them with a 3×3, 24-channel convolution and a 1×1, 48-channel convolution, respectively, which yields shapes commensurate with those of the original branch but with more than 20 times fewer parameters. Unlike in the above experiment, we do not force the gates to open or close.
The network still assigns combinations of branches that reflect the difference in visual complexity of the domains.

Figure 7: Effect of adding extra branches to a multibranch LeNet. We augment the original multibranch LeNet with a third branch that has the same architecture as the original ones. The network rapidly decides to ignore this overparametrization: the additional branch has no effect on the final activation of the gates, nor does it help during training.

Figure 8: Augmenting a multibranch ResNet-50 has a similar effect to the above: one of the branches is discarded early on by each computational unit.

D FULL RESULTS

Table 4: Domain adaptation results. We compare the accuracy of our DAMNet approach with that of DANN (Ganin & Lempitsky, 2015) and of RPT (Rozantsev et al., 2019), on image classification tasks commonly used to evaluate domain adaptation methods. As illustrated in Appendix B, different source and target domain combinations present various degrees of domain shift, and some combinations are clearly more challenging than others. Our DAMNets yield a significant accuracy boost in the presence of large domain shifts, particularly when using more than one source domain.

Datasets      Source(s)                Target       No DA   DANN    RPT     DAMNet  On TD
Digits        MNIST                    MNIST-M      52.25   76.66   82.24   88.80   96.21
              SVHN                     MNIST        54.90   73.90   78.70   81.30   99.26
              MNIST                    SVHN         25.57   31.69   34.72   37.95   89.23
              MNIST-M                  SVHN         27.49   37.43   37.90   39.41   89.23
              MNIST + MNIST-M          SVHN         33.52   44.16   n/a     51.83   89.23
              MNIST + MNIST-M          SVHN †       22.88   49.02   n/a     79.45   96.07
Office        Webcam                   DSLR         93.60   99.20   99.40   99.62   95.20
              Amazon                   DSLR         32.80   79.10   82.70   84.14   95.20
              DSLR                     Webcam       90.45   97.70   98.00   98.11   98.49
              Amazon                   Webcam       34.67   78.90   81.50   85.28   98.49
              Webcam                   Amazon       41.42   62.80   63.60   65.67   85.11
              DSLR                     Amazon       34.47   63.60   64.70   64.82   85.11
              DSLR + Webcam            Amazon       45.82   64.86   n/a     68.87   85.11
Office-Home   Art                      Product      37.03   58.50   54.51   59.30   87.66
              Clipart                  Product      36.67   70.50   63.18   77.50   87.66
              Clipart                  Art          29.65   47.93   47.32   51.24   64.42
              Real world               Art          50.91   57.68   51.90   60.74   64.42
              Art                      Real world   53.12   56.40   52.15   59.90   77.80
              Clipart                  Real world   43.03   57.90   55.05   62.70   77.80
              Product                  Real world   46.42   62.30   62.16   65.00   77.80
              Clipart + Product        Real world   53.39   70.53   n/a     72.25   77.80
              Art + Clipart + Product  Real world   58.72   72.00   n/a     77.65   77.80
VisDA 2017    Synthetic                Real         35.46   59.90   61.10   61.40   84.72
              Real                     Synthetic    51.12   83.10   82.15   85.20   99.34
UAV-200       Synthetic                Real ‡       0.377   0.715   0.743   0.792   0.858

Accuracy reported in Ganin & Lempitsky (2015) and Rozantsev et al. (2019).
† Evaluated with a ResNet-50.
‡ Results reported as average precision.