# Unsupervised Domain Adaptation Based on Source-Guided Discrepancy

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, Masashi Sugiyama
The University of Tokyo, Tokyo, Japan / RIKEN, Tokyo, Japan

Unsupervised domain adaptation is the problem setting where the data-generating distributions in the source and target domains differ and labels in the target domain are unavailable. An important question in unsupervised domain adaptation is how to measure the difference between the source and target domains. Existing discrepancy measures for unsupervised domain adaptation either require high computation costs or have no theoretical guarantee. To mitigate these problems, this paper proposes a novel discrepancy measure called source-guided discrepancy (S-disc), which, unlike the existing ones, exploits labels in the source domain. As a consequence, S-disc can be computed efficiently with a finite-sample convergence guarantee. In addition, it is shown that S-disc can provide a tighter generalization error bound than the one based on an existing discrepancy measure. Finally, experimental results demonstrate the advantages of S-disc over the existing discrepancy measures.

## Introduction

In the conventional supervised learning framework, we often assume that the training and test distributions are the same. However, this assumption may not hold in many practical applications, such as natural language processing (Glorot, Bordes, and Bengio 2011), speech recognition (Sun et al. 2017), and computer vision (Saito, Ushiku, and Harada 2017). For instance, a personalized spam filter can be trained from the emails of all available users, but the training data may not represent the emails of the target user.
Such scenarios can be formulated in the framework of domain adaptation, which has been studied extensively (Ben-David et al. 2007; Mansour, Mohri, and Rostamizadeh 2009a; Zhang, Zhang, and Ye 2012; Pan and Yang 2010; Sugiyama and Kawanabe 2012). An important challenge in domain adaptation is to find a classifier for a label-scarce target domain by exploiting a label-rich source domain. In particular, there are cases where we only have access to labeled data from the source domain and unlabeled data from the target domain, because annotating labels in the target domain is often time-consuming and expensive (Saito et al. 2018).

A longer version of this paper is available in Kuroki et al. (2018). Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

This problem setting is called unsupervised domain adaptation (Ben-David et al. 2007), which is our interest in this paper. Since domain adaptation cannot be performed effectively if the source and target domains are too different, an important topic to be addressed is how to measure the difference, or discrepancy, between the two domains. Many discrepancy measures have been used in previous studies, such as the maximum mean discrepancy (Huang et al. 2007), the Kullback-Leibler divergence (Sugiyama et al. 2008), the Rényi divergence (Mansour, Mohri, and Rostamizadeh 2009b), and the Wasserstein distance (Courty et al. 2017).

Apart from the discrepancy measures described above, Ben-David et al. (2007) proposed a discrepancy measure for binary classification that explicitly takes a hypothesis class into account. They showed that their discrepancy measure leads to a tighter generalization error bound than the $L_1$ distance, which does not use information about the hypothesis class. Following this line of research, Mansour, Mohri, and Rostamizadeh (2009a) generalized the discrepancy measure of Ben-David et al. (2007) to arbitrary loss functions, and Germain et al.
(2013) provided a PAC-Bayesian analysis based on the average disagreement of a set of hypotheses. These results provided a theoretical foundation for various applications of domain adaptation based on estimation of source-target discrepancy, including sentiment analysis (Glorot, Bordes, and Bengio 2011), source selection (Bhatt, Rajkumar, and Roy 2016), and multi-source domain adaptation (Zhao et al. 2017). However, their discrepancy measure considers the worst pair of hypotheses to bound the maximum gap in the loss between the domains, which may lead to a loose generalization error bound and expensive computation costs. Zhang, Zhang, and Ye (2012) and Mohri and Medina (2012) analyzed another discrepancy measure with a generalization error bound. Nevertheless, this discrepancy measure cannot be computed without labels in the target domain and is therefore not suitable for unsupervised domain adaptation.

To alleviate the limitations of the existing discrepancy measures for domain adaptation, we propose a novel discrepancy measure called source-guided discrepancy (S-disc). By incorporating labels in the source domain, S-disc provides a tighter generalization error bound than the previously proposed discrepancy measure by Mansour, Mohri, and Rostamizadeh (2009a). Furthermore, we provide an estimator of S-disc for binary classification that can be computed efficiently. We also establish consistency and analyze the convergence rate of the proposed S-disc estimator. Our main contributions are as follows.

- We propose a novel discrepancy measure called source-guided discrepancy (S-disc), which uses labels in the source domain to measure the difference between the two domains for unsupervised domain adaptation (Definition 1).
- We propose an efficient algorithm for estimating S-disc for the 0-1 loss (Algorithm 1).
- We show the consistency and elucidate the convergence rate of the estimator of S-disc (Theorem 4).
- We derive a generalization error bound in the target domain based on S-disc (Theorem 7), which is tighter than the existing bound provided by Mansour, Mohri, and Rostamizadeh (2009a). In addition, we derive a generalization error bound for the finite-sample case (Theorem 8).
- We demonstrate the effectiveness of S-disc for unsupervised domain adaptation in experiments.

## Problem Setting and Notation

In this section, we formulate the unsupervised domain adaptation problem. Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output space, which is $\{+1, -1\}$ in binary classification. We define a domain as a pair $(P_D, f_D)$, where $P_D$ is an input distribution on $\mathcal{X}$ and $f_D : \mathcal{X} \to \mathcal{Y}$ is a labeling function. In domain adaptation, we denote the source domain and the target domain by $(P_S, f_S)$ and $(P_T, f_T)$, respectively. Unlike conventional supervised learning, we focus on the case where $(P_T, f_T)$ differs from $(P_S, f_S)$. In unsupervised domain adaptation, we are given the following data:

- Unlabeled data in the target domain: $T = \{x^T_i\}_{i=1}^{n_T}$.
- Labeled data in the source domain: $S = \{(x^S_j, y^S_j)\}_{j=1}^{n_S}$.

For simplicity, we denote the input features from the source domain $\{x^S_j\}_{j=1}^{n_S}$ by $S_X$, and the empirical distribution corresponding to $P_T$ (resp. $P_S$) by $\widehat{P}_T$ (resp. $\widehat{P}_S$). We denote a loss function by $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$. For example, the 0-1 loss is given as $\ell_{01}(y, y') = (1 - \mathrm{sign}(yy'))/2$. The expected loss for functions $h, h' : \mathcal{X} \to \mathbb{R}$ and a distribution $P_D$ over $\mathcal{X}$ is denoted by $R^\ell_D(h, h') = \mathbb{E}_{x \sim P_D}[\ell(h(x), h'(x))]$. We also denote the empirical risk by $\widehat{R}^\ell_D(h, h') = \mathbb{E}_{x \sim \widehat{P}_D}[\ell(h(x), h'(x))]$. In addition, we define the true risk minimizer and the empirical risk minimizer in a certain domain $(P_D, f_D)$ over a hypothesis class $H$ as $h^*_D = \arg\min_{h \in H} R^\ell_D(h, f_D)$ and $\widehat{h}_D = \arg\min_{h \in H} \widehat{R}^\ell_D(h, f_D)$, respectively. Here, note that the risk minimizer $h^*_D$ is not necessarily equal to the labeling function $f_D$, as we consider a restricted hypothesis class.
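To make the notation concrete, here is a minimal numerical sketch of the empirical 0-1 risk $\widehat{R}^{\ell_{01}}_D(h, f_D)$; the linear hypothesis and the toy points are illustrative assumptions, not part of the paper:

```python
import numpy as np

def loss01(u, v):
    # 0-1 loss on real-valued outputs: (1 - sign(u * v)) / 2
    return (1.0 - np.sign(u * v)) / 2.0

def empirical_risk(h, xs, labels):
    # Empirical risk: the average 0-1 loss of h(x) against the given labels
    return float(np.mean(loss01(h(xs), labels)))

# Illustrative linear hypothesis h(x) = w . x and three labeled points.
w = np.array([1.0, -1.0])
h = lambda xs: xs @ w

xs = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, -1.0]])
labels = np.array([+1.0, -1.0, +1.0])
print(empirical_risk(h, xs, labels))  # 0.0: every point is signed correctly
```

Any real-valued score whose sign matches the label incurs zero loss, which is why the risk minimizer over a restricted class need not equal the labeling function itself.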
The goal in domain adaptation is to find a hypothesis $h$ in a hypothesis class $H$ whose expected loss in the target domain, defined by $R^\ell_T(h, f_T) = \mathbb{E}_{x \sim P_T}[\ell(h(x), f_T(x))]$, is as small as possible.

## Related Work

In unsupervised domain adaptation, it is essential to measure the difference between the source and target domains, because using data collected from a source domain that is far from the target domain might degrade the performance (Pan and Yang 2010). Since we cannot access labels from the target domain, it is impossible to measure the difference between the two domains on the basis of the output space. One way is to measure the difference between the probability distributions over the input space. We will review such discrepancy measures in this section.

First, we introduce the discrepancy measure proposed by Mansour, Mohri, and Rostamizadeh (2009a), which is defined as

$$\mathrm{disc}^\ell_H(P_T, P_S) = \sup_{h, h' \in H} \left| R^\ell_T(h, h') - R^\ell_S(h, h') \right|. \quad (1)$$

We call this discrepancy measure X-disc, as it requires only input features and no labels from the source domain. Note that X-disc takes the hypothesis class $H$ into account. Mansour, Mohri, and Rostamizadeh (2009a) showed that the following inequalities hold for any $h, h' \in H$:

$$\left| R^\ell_T(h, h') - R^\ell_S(h, h') \right| \le \mathrm{disc}^\ell_H(P_T, P_S) \le M \, L_1(P_T, P_S),$$

where $M > 0$ is a constant and $L_1(\cdot, \cdot)$ is the $L_1$ distance between distributions. Therefore, a tighter bound on the difference in the expected loss between the two domains can be obtained by taking the hypothesis class $H$ into account. However, a drawback of X-disc is that it considers the worst pair of hypotheses, as can be seen from (1). This may lead to a loose generalization error bound, as we will show later, and also to an intractable computation cost for empirical estimation. Ben-David et al. (2007) provided a computationally efficient proxy of X-disc for the 0-1 loss, defined as follows:

$$d_H(P_T, P_S) = \sup_{h \in H} \left| R^{\ell_{01}}_T(h, 1) - R^{\ell_{01}}_S(h, 1) \right|.$$
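A common way to approximate $d_H$ empirically is to train a single domain classifier that separates target inputs from source inputs and read off the gap between the two domains' rejection rates; since one trained hypothesis stands in for the supremum, this yields a lower estimate. A hedged sketch (the logistic-regression hypothesis class and helper name are our illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dH_estimate(x_src, x_tgt):
    # Approximate d_H = sup_h |R_T(h, 1) - R_S(h, 1)| with one domain
    # classifier trained to output -1 on target inputs and +1 on source inputs.
    X = np.vstack([x_src, x_tgt])
    y = np.r_[np.ones(len(x_src)), -np.ones(len(x_tgt))]
    h = LogisticRegression().fit(X, y)
    # R_D(h, 1) is the fraction of domain D on which h predicts -1.
    r_src = np.mean(h.predict(x_src) == -1)
    r_tgt = np.mean(h.predict(x_tgt) == -1)
    return abs(r_tgt - r_src)

rng = np.random.default_rng(0)
src = rng.normal([0, 0], 1, size=(200, 2))        # source sample
far = rng.normal([10, 10], 1, size=(200, 2))      # well-separated target
same = rng.normal([0, 0], 1, size=(200, 2))       # target drawn like the source
print(dH_estimate(src, far))   # close to 1: domains fully distinguishable
print(dH_estimate(src, same))  # close to 0: domains indistinguishable
```

The two extremes illustrate the heuristic: $d_H$ only asks whether inputs from the two domains can be told apart, which is exactly what the paper later identifies as its failure mode.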
Although $d_H$ can be computed efficiently, it comes with no learning guarantee. There is another variant of X-disc called the generalized discrepancy (Cortes, Mohri, and Medina 2015). However, the theoretical analysis therein is only applicable to the regression task and, unlike the analysis for X-disc, is not suitable for binary classification. Another discrepancy measure, Y-disc, which also takes $H$ into account, is defined as follows:

$$\mathrm{Ydisc}^\ell_H(P_T, P_S) = \sup_{h \in H} \left| R^\ell_T(h, f_T) - R^\ell_S(h, f_S) \right|.$$

While Y-disc has been proposed to provide a tighter generalization error bound than X-disc (Mohri and Medina 2012; Zhang, Zhang, and Ye 2012), it requires the labeling function $f_T$ in the target domain, which in general cannot be estimated in unsupervised domain adaptation.

## Source-Guided Discrepancy (S-disc)

In this section, we propose a novel discrepancy measure called source-guided discrepancy (S-disc) to mitigate the limitations of the existing discrepancy measures. Later, we will show that S-disc can provide a tighter generalization bound and can be computed more efficiently than the existing measures. S-disc is obtained by fixing $h'$ in the definition (1) of X-disc to the true risk minimizer $h^*_S = \arg\min_{h \in H} R^\ell_S(h, f_S)$ in the source domain.

Definition 1 (Source-guided discrepancy). Let $H$ be a hypothesis class and let $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ be a loss function. S-disc between two distributions $P_{D_1}$ and $P_{D_2}$ is defined as

$$\varsigma^\ell_H(P_{D_1}, P_{D_2}) = \sup_{h \in H} \left| R^\ell_{D_1}(h, h^*_S) - R^\ell_{D_2}(h, h^*_S) \right|. \quad (2)$$

S-disc satisfies the triangle inequality, i.e., $\varsigma^\ell_H(P_{D_1}, P_{D_2}) \le \varsigma^\ell_H(P_{D_1}, P_{D_3}) + \varsigma^\ell_H(P_{D_3}, P_{D_2})$, and is symmetric, i.e., $\varsigma^\ell_H(P_{D_1}, P_{D_2}) = \varsigma^\ell_H(P_{D_2}, P_{D_1})$. However, in general, S-disc is not a distance, as we may have $\varsigma^\ell_H(P_{D_1}, P_{D_2}) = 0$ for some $P_{D_1} \neq P_{D_2}$.

S-disc has the following three advantages over the existing discrepancy measures. First, the computation cost of S-disc is low, since unlike X-disc we do not need to consider a pair of hypotheses, as can be seen by comparing (1) and (2).
Second, S-disc leads to a tighter bound, as discussed below. Mansour, Mohri, and Rostamizadeh (2009a) derived a generalization error bound by bounding the difference $R^\ell_T(h, h^*_S) - R^\ell_S(h, h^*_S)$ with X-disc. On the other hand, for any $h \in H$, it is easy to see from the definition of S-disc that

$$\left| R^\ell_T(h, h^*_S) - R^\ell_S(h, h^*_S) \right| \le \varsigma^\ell_H(P_T, P_S) \le \mathrm{disc}^\ell_H(P_T, P_S), \quad (3)$$

which implies that S-disc gives a tighter generalization error bound than X-disc. Third, since $h^*_S$ can be estimated from labeled data in the source domain, we can compute S-disc without labels from the target domain, unlike Y-disc. For these reasons, S-disc is more suitable for unsupervised domain adaptation than the existing discrepancy measures. A comparison of S-disc with the existing discrepancy measures is summarized in Table 1.

## S-disc Estimation for the 0-1 Loss

Here we consider the task of binary classification with the 0-1 loss, where the output space $\mathcal{Y}$ is $\{+1, -1\}$. The following theorem states that S-disc estimation can be reduced to a cost-sensitive classification problem. We consider a symmetric hypothesis class $H$, i.e., one closed under negation: for any $h \in H$, $-h$ is also contained in $H$.

Theorem 2. For the 0-1 loss and a symmetric hypothesis class $H$, the following equality holds:

$$\varsigma^{\ell_{01}}_H(\widehat{P}_T, \widehat{P}_S) = 1 - \min_{h \in H} J_{\ell_{01}}(h),$$

where $J_\ell(h)$ is defined as

$$J_\ell(h) = \frac{1}{n_S} \sum_{j=1}^{n_S} \ell\big(h(x^S_j), h^*_S(x^S_j)\big) + \frac{1}{n_T} \sum_{i=1}^{n_T} \ell\big(h(x^T_i), -h^*_S(x^T_i)\big).$$

Algorithm 1: S-disc Estimation for the 0-1 Loss

- Input: labeled source data $S$, unlabeled target data $T$, surrogate loss $\ell_{\mathrm{sur}}$, hypothesis class $H$.
- Output: $\varsigma^{\ell_{01}}_H(\widehat{P}_T, \widehat{P}_S)$.
- Source learning: learn a classifier $\widehat{h}_S$ using the labeled source data $S$.
- Pseudo-labeling: $\widetilde{S} = \{(x, \mathrm{sign}\,\widehat{h}_S(x)) \mid x \in S_X\}$, $\widetilde{T} = \{(x, -\mathrm{sign}\,\widehat{h}_S(x)) \mid x \in T\}$.
- Cost-sensitive learning from the pseudo-labeled data: learn another classifier $h^* \in H$ using $\widetilde{S}$ and $\widetilde{T}$ to minimize the surrogate cost-sensitive risk $J_{\ell_{\mathrm{sur}}}$.
- Return $\varsigma^{\ell_{01}}_H(\widehat{P}_T, \widehat{P}_S) = 1 - J_{\ell_{01}}(h^*)$.

The proof of Theorem 2 is given in Kuroki et al. (2018).
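Algorithm 1 can be sketched in code as follows. This is a minimal illustration, not the paper's implementation: the second classifier is trained to agree with $\widehat{h}_S$ on source points and disagree with it on target points, as the reduction in Theorem 2 requires, and the linear SVM hypothesis class (hinge surrogate) and helper names are our assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

def s_disc_01(x_src, y_src, x_tgt, C=1.0):
    """Estimate S-disc for the 0-1 loss following the three steps of Algorithm 1."""
    n_s, n_t = len(x_src), len(x_tgt)
    # Step 1 (source learning): learn h_S from the labeled source data.
    h_s = LinearSVC(C=C, max_iter=10000).fit(x_src, y_src)
    # Step 2 (pseudo-labeling): keep h_S's labels on source points and
    # flip them on target points (agree on source, disagree on target).
    X = np.vstack([x_src, x_tgt])
    y = np.r_[h_s.predict(x_src), -h_s.predict(x_tgt)]
    # Step 3 (cost-sensitive learning): weight source points by 1/n_S and
    # target points by 1/n_T, and minimize the surrogate (hinge) risk.
    weights = np.r_[np.full(n_s, 1.0 / n_s), np.full(n_t, 1.0 / n_t)]
    h = LinearSVC(C=C, max_iter=10000).fit(X, y, sample_weight=weights)
    # Final step: evaluate J with the 0-1 loss and return 1 - J.
    j = (np.mean(h.predict(x_src) != h_s.predict(x_src))
         + np.mean(h.predict(x_tgt) != -h_s.predict(x_tgt)))
    return 1.0 - j

rng = np.random.default_rng(0)
x_src = np.vstack([rng.normal((2, 0), 0.5, (100, 2)),
                   rng.normal((-2, 0), 0.5, (100, 2))])
y_src = np.r_[np.ones(100), -np.ones(100)]
x_same = np.vstack([rng.normal((2, 0), 0.5, (100, 2)),
                    rng.normal((-2, 0), 0.5, (100, 2))])  # same distribution
x_far = rng.normal((20, 20), 0.5, (200, 2))               # disjoint support
print(s_disc_01(x_src, y_src, x_same))  # near 0: domains coincide
print(s_disc_01(x_src, y_src, x_far))   # near 1: no hypothesis can both agree and disagree
```

If the two domains coincide, no hypothesis can simultaneously agree with $\widehat{h}_S$ on the source sample and disagree on the target sample, so $\min J \approx 1$ and the estimate is near zero.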
Note that the minimization of $J_{\ell_{01}}(h)$ corresponds to empirical risk minimization over $\widetilde{S} = \{(x, h^*_S(x)) \mid x \in S_X\}$ and $\widetilde{T} = \{(x, -h^*_S(x)) \mid x \in T\}$. Therefore, Theorem 2 naturally suggests the three-step procedure illustrated in Algorithm 1, where $h^*_S$ is replaced with $\widehat{h}_S$. This allows us to compute S-disc via the minimizer $h^*$ of the cost-sensitive risk $J_{\ell_{01}}(h)$, which weights the pseudo-labeled data $\widetilde{S}$ and $\widetilde{T}$ with costs $1/n_S$ and $1/n_T$, respectively. Since minimization of the 0-1 loss is computationally hard (Ben-David, Eiron, and Long 2003; Feldman et al. 2012), we use a surrogate loss such as the hinge loss $\ell_{\mathrm{hinge}}(y, y') = \max(0, 1 - yy')$ (Bartlett, Jordan, and McAuliffe 2006). Note that the 0-1 loss is used only for calculating S-disc in the final step, i.e., $1 - J_{\ell_{01}}(h^*)$, while the surrogate loss is used to train the classifier. Hinge loss minimization with the sequential minimal optimization (SMO) algorithm requires $O((n_T + n_S)^3)$ time for the entire procedure (Platt 1999). This low computation cost is a big advantage over X-disc (see Table 1).

## Theoretical Analysis

In this section, we show that S-disc can be estimated from finite data with a consistency guarantee. After that, we show that our proposed S-disc is useful for deriving a tighter generalization error bound than X-disc. To derive the theoretical results, the Rademacher complexity is used, which captures the complexity of a set of functions by measuring the capability of a hypothesis class to correlate with random noise.

Definition 3 (Rademacher complexity (Bartlett and Mendelson 2002)). Let $H$ be a set of real-valued functions defined over a set $\mathcal{X}$. Given a sample $(x_1, \ldots, x_m) \in \mathcal{X}^m$ drawn independently and identically from a distribution $\mu$, the Rademacher complexity of $H$ is defined as

$$\mathfrak{R}_{\mu,m}(H) = \mathbb{E}_{x_1,\ldots,x_m} \, \mathbb{E}_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right],$$

where the inner expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$.

Table 1: Comparison of S-disc with the existing discrepancy measures for unsupervised domain adaptation in binary classification.
We assume the hypothesis class $H$ satisfies (4). We consider the hinge loss for S-disc, X-disc, and $d_H$. The computational complexities of S-disc and $d_H$ are based on empirical hinge loss minimization, which is solved with the kernel support vector machine by the SMO algorithm (Platt 1999). The computational complexity of X-disc is based on the SDP relaxation by the ellipsoid method (Bubeck 2015) given in Kuroki et al. (2018). Y-disc is not computable with unlabeled target data.

|  | S-disc (proposed) | $d_H$ | X-disc | Y-disc |
|---|---|---|---|---|
| taking $H$ into account | ✓ | ✓ | ✓ | ✓ |
| convergence rate | $O_p(n_T^{-1/2} + n_S^{-1/2})$ | $O_p(n_T^{-1/2} + n_S^{-1/2})$ | $O_p(n_T^{-1/2} + n_S^{-1/2})$ | N/A |
| target generalization error bound | ✓ | N/A | ✓ | ✓ |
| computational complexity | $O((n_T + n_S)^3)$ | $O((n_T + n_S)^3)$ | $O((n_T + n_S + d)^8)$ | N/A |

The components $\sigma_1, \ldots, \sigma_m$ of $\sigma$ in Definition 3 are mutually independent uniform random variables taking values in $\{+1, -1\}$. Hereafter, we use the following notation:

$$H - H := \{x \mapsto h(x) - h'(x) \mid h, h' \in H\}, \qquad \ell \circ (H - H) := \{x \mapsto \ell(h(x), h'(x)) \mid h, h' \in H\}.$$

### Consistency of S-disc

In this section, we show that the estimator $\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S)$ converges to the true S-disc $\varsigma^\ell_H(P_T, P_S)$ as the sample sizes $n_T$ and $n_S$ increase. The following theorem bounds the deviation of the empirical S-disc estimator; it is a general result that does not depend on the specific choice of the loss $\ell$ and the hypothesis class $H$.

Theorem 4. Assume the loss function $\ell$ is bounded from above by $M > 0$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\left| \varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S) - \varsigma^\ell_H(P_T, P_S) \right| \le 2\mathfrak{R}_{P_T,n_T}(\ell \circ (H-H)) + 2\mathfrak{R}_{P_S,n_S}(\ell \circ (H-H)) + M\sqrt{\frac{\log(2/\delta)}{2 n_T}} + M\sqrt{\frac{\log(2/\delta)}{2 n_S}}.$$

The proof of this theorem is given in Kuroki et al. (2018). Theorem 4 guarantees the consistency of $\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S)$ under the condition that $\mathfrak{R}_{P_T,n_T}(\ell \circ (H-H))$ and $\mathfrak{R}_{P_S,n_S}(\ell \circ (H-H))$ are well-controlled. To derive a specific convergence rate, we consider the assumption

$$\mathfrak{R}_{P_T,n_T}(H - H) \le \frac{C_{H-H}}{\sqrt{n_T}}, \qquad \mathfrak{R}_{P_S,n_S}(H - H) \le \frac{C_{H-H}}{\sqrt{n_S}}, \quad (4)$$

for some constant $C_{H-H} > 0$ depending only on the hypothesis class $H$. Lemma 5 below shows that this assumption is naturally satisfied by the linear-in-parameter model.

Lemma 5.
Let $H$ be the linear-in-parameter model class, i.e., $H := \{x \mapsto w^\top \phi(x) \mid w \in \mathbb{R}^d, \|w\|_2 \le \Lambda\}$ for fixed basis functions $\phi : \mathcal{X} \to \mathbb{R}^d$ satisfying $\|\phi\|_2 \le D_\phi$. Then, for any distribution $\mu$ over $\mathcal{X}$ and $m \in \mathbb{N}$,

$$\mathfrak{R}_{\mu,m}(H - H) \le \frac{2\Lambda D_\phi}{\sqrt{m}}.$$

The proof of Lemma 5 is given in Kuroki et al. (2018). Subsequently, we derive the convergence rate bound for the 0-1 loss.

Corollary 6. For $\ell = \ell_{01}$, under the assumption in (4), it holds for any $\delta \in (0, 1)$ that, with probability at least $1 - \delta$,

$$\left| \varsigma^{\ell_{01}}_H(\widehat{P}_T, \widehat{P}_S) - \varsigma^{\ell_{01}}_H(P_T, P_S) \right| \le \frac{C_{H-H}}{\sqrt{n_T}} + \frac{C_{H-H}}{\sqrt{n_S}} + \sqrt{\frac{\log(2/\delta)}{2 n_T}} + \sqrt{\frac{\log(2/\delta)}{2 n_S}}.$$

Proof. It simply follows from Theorem 4, the assumption in (4), and the fact that $\mathfrak{R}_{\mu,k}(\ell_{01} \circ (H-H)) = \frac{1}{2}\mathfrak{R}_{\mu,k}(H-H)$ for any distribution $\mu$ and $k > 0$ (Mohri, Rostamizadeh, and Talwalkar 2012, Lemma 3.1).

From this corollary, we see that the empirical S-disc is consistent with convergence rate $O_p(n_T^{-1/2} + n_S^{-1/2})$ under a mild condition.

### Generalization Error Bound

In the previous section, we showed that S-disc $\varsigma^\ell_H(P_T, P_S)$ can be estimated by $\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S)$. In this section, we give two bounds on the generalization error $R^\ell_T(h, f_T)$ in the target domain in terms of $\varsigma^\ell_H(P_T, P_S)$ or $\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S)$. The first bound shows the relationship between the target risk $R^\ell_T(h, f_T)$ and the source risk $R^\ell_S(h, h^*_S)$.

Theorem 7. Assume that $\ell$ obeys the triangle inequality, i.e., $\ell(u, v) \le \ell(u, w) + \ell(w, v)$ for all $u, v, w \in \mathbb{R}$, as the 0-1 loss does. Then, for any hypothesis $h \in H$,

$$R^\ell_T(h, f_T) - R^\ell_T(h^*_T, f_T) \le R^\ell_S(h, h^*_S) + R^\ell_T(h^*_S, h^*_T) + \varsigma^\ell_H(P_T, P_S). \quad (5)$$

Proof. Since $R^\ell_T(h, f_T) \le R^\ell_T(h, h^*_S) + R^\ell_T(h^*_S, h^*_T) + R^\ell_T(h^*_T, f_T)$ holds by the triangle inequality, we have

$$R^\ell_T(h, f_T) - R^\ell_T(h^*_T, f_T) \le R^\ell_T(h, h^*_S) + R^\ell_T(h^*_S, h^*_T) \le R^\ell_S(h, h^*_S) + R^\ell_T(h^*_S, h^*_T) + \varsigma^\ell_H(P_T, P_S),$$

where the last inequality follows from the definition of S-disc.

The LHS of (5) represents the regret arising from the use of hypothesis $h$ instead of $h^*_T$ in the target domain.
Theorem 7 shows that the regret $R^\ell_T(h, f_T) - R^\ell_T(h^*_T, f_T)$ is bounded by three terms: (i) the expected loss with respect to $h^*_S$ in the source domain, (ii) the difference between $h^*_T$ and $h^*_S$ in the target domain, and (iii) S-disc between $P_T$ and $P_S$. Note that if the source and target domains are sufficiently close, we can expect both the second term $R^\ell_T(h^*_S, h^*_T)$ and the third term $\varsigma^\ell_H(P_T, P_S)$ to be small. This indicates that, for an appropriate source domain, minimizing the estimation error $R^\ell_S(h, h^*_S)$ in the source domain leads to better generalization in the target domain.

We can see an advantage of the generalization error bound based on S-disc through comparison with the bound based on X-disc (Mansour, Mohri, and Rostamizadeh 2009a, Theorem 8), given by

$$R^\ell_T(h, f_T) - R^\ell_T(h^*_T, f_T) \le R^\ell_S(h, h^*_S) + R^\ell_T(h^*_T, h^*_S) + \mathrm{disc}^\ell_H(P_T, P_S). \quad (6)$$

The upper bound (6) using X-disc has the same form as the upper bound (5), except that $\mathrm{disc}^\ell_H(P_T, P_S)$ appears in place of $\varsigma^\ell_H(P_T, P_S)$. Since S-disc is never larger than X-disc (see the inequality (3)), S-disc gives a tighter bound than X-disc.

The following theorem shows the generalization error bound for the finite-sample case.

Theorem 8. For $\ell = \ell_{01}$, under the assumption in (4), for any $h \in H$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$R^{\ell_{01}}_T(h, f_T) - R^{\ell_{01}}_T(h^*_T, f_T) \le \widehat{R}^{\ell_{01}}_S(h, h^*_S) + R^{\ell_{01}}_T(h^*_S, h^*_T) + \varsigma^{\ell_{01}}_H(\widehat{P}_T, \widehat{P}_S) + \frac{C_{H-H}}{\sqrt{n_T}} + \frac{C_{H-H}}{\sqrt{n_S}} + \sqrt{\frac{\log(2/\delta)}{2 n_T}} + \sqrt{\frac{\log(2/\delta)}{2 n_S}}.$$

The proof of this theorem is given in Kuroki et al. (2018). Theorem 8 tells us that as $n_S, n_T \to \infty$, three terms dominate the bound on the target-domain regret $R^\ell_T(h, f_T) - R^\ell_T(h^*_T, f_T)$: (i) the empirical loss with respect to $h^*_S$ in the source domain, (ii) the difference between $h^*_T$ and $h^*_S$ in the target domain, and (iii) S-disc between the two empirical distributions, $\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_S)$. Therefore, if $h^*_T$ is sufficiently close to $h^*_S$, selecting a good source in terms of S-disc allows us to achieve good target generalization.
## Comparison with Existing Discrepancy Measures

In the previous sections, we showed the consistency of the estimator of S-disc and derived a generalization error bound based on S-disc that is tighter than the one based on X-disc. In this section, we first compare these theoretical guarantees of S-disc with those of the existing measures in more detail. We then discuss their computation costs. Computation cost is also an important aspect when applying these discrepancy measures to sentiment analysis (Bhatt, Rajkumar, and Roy 2016), adversarial learning (Zhao et al. 2017), and computer vision (Saito et al. 2018) for source selection or reweighting of the source data. In fact, in these applications the discrepancy $d_H$ is used instead of X-disc for ease of computation, even though $d_H$ has no theoretical guarantee on the generalization error. The results of this section are summarized in Table 1.

### Convergence Rates of Discrepancy Estimators

Here we discuss the consistency and convergence rates of the estimators of the discrepancy measures. The empirical estimator of $d_H$ is consistent, and its convergence rate is $O_p(\sqrt{(\log n_T)/n_T} + \sqrt{(\log n_S)/n_S})$ (Ben-David et al. 2010, Lemma 1). This rate is slower than the rate $O_p(n_T^{-1/2} + n_S^{-1/2})$ for S-disc with appropriately controlled Rademacher complexities $\mathfrak{R}_{P_T,n_T}(H)$ and $\mathfrak{R}_{P_S,n_S}(H)$ of the hypothesis class, such as the linear-in-parameter model (Mohri, Rostamizadeh, and Talwalkar 2012, Theorem 4.3). Here, recall that no generalization error bound is known in terms of $d_H$, even though $d_H$ itself can be estimated consistently. On the other hand, the empirical estimator of X-disc is known to be consistent (Mansour, Mohri, and Rostamizadeh 2009a, Corollary 7), and its convergence rate is $O_p(n_T^{-1/2} + n_S^{-1/2})$ in the case where the loss function is the $\ell_q$ loss, i.e., $\ell_q(y, y') = |y - y'|^q$. Thus, the derived rate is the same as that of S-disc, whereas the requirement on the loss function is more restrictive than the one for S-disc in Theorem 8.
Note that the above difference in theoretical guarantees does not come from an inherent difference between these estimators. This is because we adopted an analysis based on the Rademacher complexity, which has not been well studied in the context of unsupervised domain adaptation. The Rademacher complexity is a distribution-dependent complexity measure and is less pessimistic than the VC dimension used in previous work (Ben-David et al. 2010). In fact, the known guarantees on $d_H$ and X-disc can be improved to Propositions 9 and 10 given below.

Proposition 9. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\left| d_H(\widehat{P}_T, \widehat{P}_S) - d_H(P_T, P_S) \right| \le 2\mathfrak{R}_{P_T,n_T}(H) + 2\mathfrak{R}_{P_S,n_S}(H) + \sqrt{\frac{\log(2/\delta)}{2 n_T}} + \sqrt{\frac{\log(2/\delta)}{2 n_S}}.$$

Proposition 10. Assume the loss function $\ell$ is upper bounded by $M > 0$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$\left| \mathrm{disc}^\ell_H(\widehat{P}_T, \widehat{P}_S) - \mathrm{disc}^\ell_H(P_T, P_S) \right| \le 2\mathfrak{R}_{P_T,n_T}(\ell \circ (H-H)) + 2\mathfrak{R}_{P_S,n_S}(\ell \circ (H-H)) + M\sqrt{\frac{\log(2/\delta)}{2 n_T}} + M\sqrt{\frac{\log(2/\delta)}{2 n_S}}.$$

The proofs of these propositions are quite similar to that of Theorem 4 and are omitted.

Figure 1: 2D plots of the three domains (sources $S_1$, $S_2$, and target $T$; positive and negative classes).

In summary, under mild assumptions, the convergence rates of the empirical estimators of $d_H$ and X-disc are $O_p(n_T^{-1/2} + n_S^{-1/2})$, the same as that of S-disc.

### Computational Complexity

Computation of $d_H$ can be done by empirical risk minimization (Ben-David et al. 2010, Lemma 2). The original form is given with the 0-1 loss, which can be efficiently minimized with a surrogate loss. When the hinge loss is applied, the minimization can be carried out with computation cost $O((n_T + n_S)^3)$ by the SMO algorithm (Platt 1999). On the other hand, no efficient algorithm is known for the computation of X-disc in the classification setting.¹ For a fair comparison, we give a relatively efficient algorithm to compute X-disc (1) in the classification setting with the hinge loss, based on semidefinite relaxation.
Unfortunately, the computational complexity of the relaxed algorithm is still $O((n_T + n_S + d)^8)$, which is prohibitive compared with the computation of S-disc and $d_H$.

## Experiments

In this section, we provide experimental results that illustrate the failure of existing discrepancy measures and the advantage of using S-disc. We illustrate the advantage of S-disc in terms of computation time, empirical convergence, and performance on a source selection task.

### Illustration

We illustrate the failure of $d_H$, the well-known proxy for X-disc, by comparing it with S-disc in a toy experiment. We generated 200 data points per class for each of two sources $S_1$ and $S_2$ and a target domain as follows:

$$P_{S_i}(X \mid Y = j) = \mathcal{N}(\mu^{(j)}_i, I) \quad (i = 1, 2), \qquad P_T(X \mid Y = j) = \mathcal{N}(\mu^{(j)}_T, I),$$
$$\mu^{(0)}_1 = (-5, 5), \quad \mu^{(0)}_2 = (0, -3), \quad \mu^{(0)}_T = (-5, -3),$$
$$\mu^{(1)}_1 = (5, 5), \quad \mu^{(1)}_2 = (2, 3), \quad \mu^{(1)}_T = (5, -3).$$

¹A computation algorithm of X-disc for the 0-1 loss is given only in the one-dimensional case (Mansour, Mohri, and Rostamizadeh 2009a, Section 5.2).

In this experiment, we used the support vector machine with a linear kernel.² For these data, we obtained the following results:

$$\varsigma^\ell_H(\widehat{P}_T, \widehat{P}_{S_1}) = 0.27, \quad \varsigma^\ell_H(\widehat{P}_T, \widehat{P}_{S_2}) = 0.49, \qquad d_H(\widehat{P}_T, \widehat{P}_{S_1}) = 0.69, \quad d_H(\widehat{P}_T, \widehat{P}_{S_2}) = 0.49.$$

These values indicate that while $d_H$ regards $S_2$ as the better source, S-disc regards $S_1$ as the better source for the target domain. In this example, the target-domain loss of the classifier trained on $S_1$ is 0.0, while that of the classifier trained on $S_2$ is 0.49. This implies that S-disc is the better discrepancy measure for judging the quality of source domains for generalization in a given target domain. Intuitively, $d_H$ is a heuristic measure based on a classifier separating the source and target domains without considering labels in the source domain.
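The train-on-source, evaluate-on-target comparison underlying this example can be reproduced along the following lines. This is a sketch: the class means are our reading of the setup above (signs reconstructed from the description), and the random seed and scikit-learn's LinearSVC are illustrative choices:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

def sample_domain(mu_neg, mu_pos, n=200):
    # n points per class, unit-variance Gaussians as in the toy setup
    X = np.vstack([rng.normal(mu_neg, 1.0, (n, 2)),
                   rng.normal(mu_pos, 1.0, (n, 2))])
    y = np.r_[-np.ones(n), np.ones(n)]
    return X, y

X1, y1 = sample_domain((-5, 5), (5, 5))    # source S1
X2, y2 = sample_domain((0, -3), (2, 3))    # source S2
Xt, yt = sample_domain((-5, -3), (5, -3))  # target T

errors = {}
for name, (X, y) in {"S1": (X1, y1), "S2": (X2, y2)}.items():
    clf = LinearSVC(max_iter=10000).fit(X, y)
    errors[name] = float(np.mean(clf.predict(Xt) != yt))  # target 0-1 loss
print(errors)  # S1 transfers almost perfectly; S2 fails on roughly half
```

The $S_1$ classifier transfers because its risk minimizer (a vertical boundary) matches the target's, even though the $S_1$ inputs lie far from the target inputs; this is precisely the case where $d_H$ is misled.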
When the supports of the input distributions of the source and target domains do not overlap much, $d_H$ may immediately regard the domains as totally different from each other, even if the risk minimizers of the two domains are highly similar; this is what caused the failure of $d_H$. On the other hand, S-disc can prevent such a problem by taking labels from the source domain into account, as illustrated in the toy experiment and justified by Theorems 7 and 8.

### Comparison of Computation Time

We compared the computation time for estimating S-disc, $d_H$, and X-disc. We used 200 two-dimensional examples for both the source and target domains; each domain consists of 100 positive and 100 negative examples. For the computation of X-disc, we used the relaxed algorithm³ shown in Kuroki et al. (2018). For both S-disc and $d_H$, we used the support vector machine with a linear kernel. The simulation was run on a 2.8 GHz Intel Core i7. The results shown in Figure 2 demonstrate that the computation time of X-disc is prohibitive, while both S-disc and $d_H$ are feasible to compute, as suggested in Table 1.

²We used scikit-learn (Pedregosa et al. 2011) to implement it.
³We used picos (https://picos-api.gitlab.io/picos/) as the SDP solver with the cvxopt (https://cvxopt.org/) backend.

Figure 2: Comparison of computation time in the log scale.

### Empirical Convergence

We compared the empirical convergence of S-disc and $d_H$ based on logistic regression implemented with scikit-learn with default parameters (Pedregosa et al. 2011). Note that X-disc is not used due to its computational intractability. Here, we used the MNIST dataset (LeCun, Cortes, and Burges 2010) and considered the binary classification task of separating odd and even digits.

Figure 3: Empirical convergence of $d_H$ and S-disc.
We defined the source and target domains as follows:

- Source domain $S_1$: MNIST,
- Source domain $S_2$: MNIST digits from zero to seven,
- Target domain: MNIST.

The number of examples ranged over $\{1000, 2000, \ldots, 20000\}$ for each domain. Note that while examples from the source domain $S_1$ are drawn from the same distribution as the target domain, the sample of the source domain $S_2$ is affected by sample selection bias (Cortes et al. 2008). Figure 3 shows the empirical convergence of the estimators of both discrepancy measures. It is observed that both discrepancy measures indicate that $S_1$ is a better source than $S_2$ for the target domain. However, since the true discrepancy between $S_1$ and the target domain is supposed to be zero, we can observe that $d_H$ either converges much more slowly than S-disc or converges to a non-zero value.

### Source Selection

We compared the performance of S-disc and $d_H$ on a source selection task. We used the following source and target domains:

- Clean source domains: five grayscale MNIST-M datasets (Ganin et al. 2016),
- Noisy source domains: five grayscale MNIST-M datasets corrupted by Gaussian random noise,
- Target domain: MNIST.

Figure 4: Source selection performance with varying noise rates. The blue dotted line denotes the maximum score (five).

The MNIST-M dataset is known to be useful for the domain adaptation task when the target domain is MNIST. The task in each domain is to classify between even and odd digits, and logistic regression with default parameters was used for computing S-disc and $d_H$. The objective of this source selection task is to correctly rank the five clean source domains above the noisy ones. We ranked the sources by computing the discrepancy value between the target domain and each source and sorting the sources in ascending order. The score was calculated by counting how many clean sources are ranked among the first five.
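The scoring rule can be sketched in a few lines of pure Python; the discrepancy values below are made up for illustration only, and the names are ours:

```python
def selection_score(disc_values, clean_sources, k=5):
    # Rank sources by discrepancy to the target (ascending) and count how
    # many clean sources appear among the first k.
    ranked = sorted(disc_values, key=disc_values.get)
    return sum(name in clean_sources for name in ranked[:k])

# Hypothetical discrepancy values for five clean (c*) and five noisy (n*)
# sources; a perfect measure would place all c* entries first.
disc = {"c1": 0.10, "c2": 0.12, "c3": 0.15, "c4": 0.11, "c5": 0.35,
        "n1": 0.20, "n2": 0.40, "n3": 0.45, "n4": 0.50, "n5": 0.55}
print(selection_score(disc, {"c1", "c2", "c3", "c4", "c5"}))  # -> 4
```

In this hypothetical example one clean source ($c_5$) is ranked below a noisy one, so the score is four out of the maximum five.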
We varied the number of examples over $\{200, 400, \ldots, 4000\}$ for each domain. Gaussian noise with standard deviation $\epsilon = 30, 40, 50$ was added, and the pixel values were clipped to the range 0 to 255. For each number of examples per class, the experiment was conducted 15 times and the average score was used. Figure 4 shows the performance of each discrepancy measure under different noise rates. As the number of examples increased, S-disc achieved a better performance. In contrast, $d_H$ cannot distinguish between noisy and clean source domains: in fact, $d_H$ always returned one, which would indicate that MNIST-M is unrelated to MNIST. Unlike the previous experiment on empirical convergence, the difference between the two domains may be harder to identify, since the source domains and the target domain are not drawn from the same distribution (MNIST-M vs. MNIST). As a result, our experiment demonstrates the failure of $d_H$ in the source selection task and suggests the advantage of S-disc in this task.

## Conclusion

We proposed a novel discrepancy measure for unsupervised domain adaptation called source-guided discrepancy (S-disc). We provided a computationally efficient algorithm for estimating S-disc with respect to the 0-1 loss, and also established the consistency of the estimator together with its convergence rate. Moreover, we derived a generalization bound based on S-disc that is tighter than that of X-disc. Finally, we demonstrated the failure of existing discrepancy measures and the advantages of S-disc through experiments.

## Acknowledgements

We thank Ikko Yamane, Futoshi Futami, and Kento Nozawa for the useful discussion. NC was supported by a MEXT scholarship. JH was supported by KAKENHI 17H00757, IS was supported by JST CREST Grant Number JPMJCR17A1, and MS was supported by KAKENHI 17H01760.

## References

Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3:463-482.
Bartlett, P. L.; Jordan, M.
I.; and McAuliffe, J. D. 2006. Convexity, classification, and risk bounds. Journal of the American Statistical Association 101(473):138–156.

Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In NIPS, 137–144.

Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(1-2):151–175.

Ben-David, S.; Eiron, N.; and Long, P. M. 2003. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences 66(3):496–514.

Bhatt, H. S.; Rajkumar, A.; and Roy, S. 2016. Multi-source iterative adaptation for cross-domain classification. In IJCAI, 3691–3697.

Bubeck, S. 2015. Convex Optimization: Algorithms and Complexity, volume 8. Now Publishers, Inc.

Cortes, C.; Mohri, M.; Riley, M.; and Rostamizadeh, A. 2008. Sample selection bias correction theory. In ALT, 38–53.

Cortes, C.; Mohri, M.; and Medina, A. M. 2015. Adaptation algorithm and theory based on generalized discrepancy. In SIGKDD, 169–178.

Courty, N.; Flamary, R.; Habrard, A.; and Rakotomamonjy, A. 2017. Joint distribution optimal transportation for domain adaptation. In NIPS, 3730–3739.

Feldman, V.; Guruswami, V.; Raghavendra, P.; and Wu, Y. 2012. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing 41(6):1558–1590.

Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(1):2096–2030.

Germain, P.; Habrard, A.; Laviolette, F.; and Morvant, E. 2013. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, 738–746.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 513–520.

Huang, J.; Gretton, A.; Borgwardt, K.
M.; Schölkopf, B.; and Smola, A. J. 2007. Correcting sample selection bias by unlabeled data. In NIPS, 601–608.

Kuroki, S.; Charoenphakdee, N.; Bao, H.; Honda, J.; Sato, I.; and Sugiyama, M. 2018. Unsupervised domain adaptation based on source-guided discrepancy. arXiv:1809.03839.

LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.

Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009a. Domain adaptation: Learning bounds and algorithms. In COLT.

Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009b. Multiple source adaptation and the Rényi divergence. In UAI, 367–374.

Mohri, M., and Medina, A. M. 2012. New analysis and algorithm for learning with drifting distributions. In ALT, 124–138.

Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.

Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Platt, J. C. 1999. Sequential minimal optimization: A fast algorithm for training support vector machines. In Advances in Kernel Methods: Support Vector Learning. MIT Press. 185–208.

Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 3723–3732.

Saito, K.; Ushiku, Y.; and Harada, T. 2017. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2988–2997.

Sugiyama, M., and Kawanabe, M. 2012. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press.

Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bünau, P.; and Kawanabe, M. 2008.
Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60(4):699–746.

Sun, S.; Zhang, B.; Xie, L.; and Zhang, Y. 2017. An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257:79–87.

Zhang, C.; Zhang, L.; and Ye, J. 2012. Generalization bounds for domain adaptation. In NIPS, 3320–3328.

Zhao, H.; Zhang, S.; Wu, G.; Costeira, J. P.; Moura, J. M. F.; and Gordon, G. J. 2017. Multiple source domain adaptation with adversarial training of neural networks. arXiv:1705.09684.