# a_new_pacbayesian_perspective_on_domain_adaptation__0e9ddb73.pdf

A New PAC-Bayesian Perspective on Domain Adaptation

Pascal Germain PASCAL.GERMAIN@INRIA.FR INRIA, SIERRA Project-Team, 75589, Paris, France, and D.I., Ecole Normale Superieure, 75230 Paris, France

Amaury Habrard AMAURY.HABRARD@UNIV-ST-ETIENNE.FR Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France

Franc ois Laviolette FRANCOIS.LAVIOLETTE@IFT.ULAVAL.CA D epartement d informatique et de g enie logiciel, Universit e Laval, Qu ebec, Canada

Emilie Morvant EMILIE.MORVANT@UNIV-ST-ETIENNE.FR Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France

We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper-bound on the target risk where the distributions divergence expressed as a ratio controls the trade-off between a source error measure and the target voters disagreement. Our bound suggests that one has to focus on regions where the source data is informative. From this result, we derive a PACBayesian generalization bound, and specialize it to linear classiﬁers. Then, we infer a learning algorithm and perform experiments on real data.

1. Introduction

Machine learning practitioners are commonly exposed to the issue of domain adaptation1 (Jiang, 2008; Margolis, 2011): One usually learns a model from a corpus, i.e., a ﬁxed yet unknown source distribution, then wants to apply it on a new corpus, i.e., a related but slightly different target distribution. Therefore, domain adaptation is widely studied in a lot of application ﬁelds like computer vision (Patel et al., 2015; Ganin & Lempitsky, 2015), bioinformatics (Liu et al., 2008), natural language processing (Blitzer, 2007; Daum e III, 2007), etc. A common example is the

1Domain adaptation is associated with transfer learning (Pan & Yang, 2010; Quionero-Candela et al., 2009).

Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

spam ﬁltering problem where a model needs to be adapted from one user mailbox to another receiving signiﬁcantly different emails. Many approaches exist to address domain adaptation, often with the same idea: If we can apply a transformation to move closer the distributions, then we can learn a model with the available labels. This is generally performed by reweighting the importance of labeled data (Huang et al., 2006; Sugiyama et al., 2007; Cortes et al., 2010; 2015), and/or by learning a common representation for the source and target distributions (Chen et al., 2012; Ganin et al., 2016), and/or by minimizing a measure of divergence between the distributions (Morvant et al., 2012; Germain et al., 2013; Cortes & Mohri, 2014). The divergence-based approach has especially been explored to derive generalization bounds for domain adaptation (e.g., Ben-David et al., 2006; 2010; Mansour et al., 2009; Li & Bilmes, 2007; Zhang et al., 2012). Recently, this issue has been studied through the PAC-Bayesian framework (Germain et al., 2013), which focuses on learning weighted majority votes2 without target label. Even the latter result opened the door to tackle domain adaptation in a PAC-Bayesian fashion, it shares the same philosophy as the seminal works of Ben-David et al. (2006; 2010); Mansour et al. (2009): The risk of the target model is upper-bounded jointly by the model s risk on the source distribution, the divergence between the marginal distributions, and a nonestimable term3 related to the ability to adapt in the current space. Note that Li & Bilmes (2007) proposed a PACBayesian generalization bound for domain adaptation but they considered target labels.

2This setting is not too restrictive since many algorithms can be seen as a majority vote learning. E.g., ensemble learning and kernel methods output models interpretable as majority votes. 3More precisely, this term can only be estimated in the presence of labeled data from both the source and the target domains.

A New PAC-Bayesian Perspective on Domain Adaptation

In this paper, we derive a novel domain adaptation bound for the weighted majority vote framework. Concretely, the risk of the target model is still upper-bounded by three terms, but they differ in the information they capture. The ﬁrst term is estimable from unlabeled data and relies on a notion of expected voters disagreement on the target domain. The second term depends on the expected accuracy of the voters on the source domain. Interestingly, this latter is weighted by a divergence between the source and the target domains that enables controlling the relationship between domains. The third term estimates the volume of the target domain living apart from the source one4, which has to be small for ensuring adaptation. From our bound, we deduce that a good adaptation strategy consists in ﬁnding a weighted majority vote leading to a suitable tradeoff controlled by the domains divergence between the ﬁrst two terms: Minimizing the ﬁrst one corresponds to look for voters that disagree on the target domain, and minimizing the second one to seek accurate voters on the source. Thereafter, we provide PAC-Bayesian generalization guarantees to justify the empirical minimization of our new domain adaptation bound, and specialize it to linear classiﬁers (following a methodology known to give rise to tight bound values). This allows to design DALC, a learning algorithm that improves the performances of the previous PAC-Bayesian domain adaptation algorithm.

The rest of the paper is organized as follows. Section 2 presents the PAC-Bayesian domain adaptation setting. Section 3 reviews previous theoretical results on domain adaptation. Section 4 states our new analysis of domain adaptation for majority votes, that we relate to other works in Section 5. Then, Section 6 provides generalization bounds, specialized to linear classiﬁers in Section 7 to motivate the DALC learning algorithm, evaluated in Section 8.

2. Unsupervised Domain Adaptation Setting

We tackle domain adaptation for binary classiﬁcation, from a d-dimensional input space X Rd to an output space Y ={ 1, 1}. Our goal is to perform domain adaptation from a distribution S the source domain to another (related) distribution T the target domain on X Y ; SX and TX being the associated marginal distributions on X. Given a distribution D, we denote (D)m the distribution of a m-sample constituted by m elements drawn i.i.d. from D. We consider the unsupervised domain adaptation setting in which the algorithm is provided with a labeled source mssample S ={(xi, yi)}ms i=1 (S)ms, and with an unlabeled target mt-sample T ={xi}mt i=1 (TX)mt. PAC-Bayesian domain adaptation. Our work is inspired by the PAC-Bayesian theory (ﬁrst introduced by

4Here we do not focus on learning a new representation to help the adaptation: We directly aim at adapting in the current space.

Mc Allester, 1999). More precisely, we adopt the PACBayesian domain adaptation setting previously studied in Germain et al. (2013). Given H, a set of voters h : X Y , the elements of this approach are a prior distribution π on H, a pair of source-target learning samples (S, T) and a posterior distribution ρ on H. The prior distribution π models an a priori belief before observing (S, T) of the voters accuracy. Then, given the information provided by (S, T), we aim at learning a posterior distribution ρ leading to a ρ-weighted majority vote over H,

Bρ( ) = sign E h ρh( ) ,

with nice generalization guarantees on the target domain T . In other words, we want to ﬁnd the posterior distribution ρ minimizing the true target risk of Bρ :

RT (Bρ) = E (x,y) T I Bρ(x) = y ,

where I a = 1 if a is true, and 0 otherwise. However, in most PAC-Bayesian analyses one does not directly focus on this majority vote risk, but studies the expectation of the risks over H according to ρ, designed as the Gibbs risk :

RD(Gρ) = E (x,y) D E h ρ I h(x) = y . (1)

It is well-known in the PAC-Bayesian literature that RD(Bρ) 2 RD(Gρ) (e.g., Herbrich & Graepel, 2000). Unfortunately, this worst case bound often leads to poor generalization guarantees on the majority vote risk. To address this issue, Lacasse et al. (2006) (reﬁned in Germain et al., 2015) have exhibited that one can obtain a tighter bound on RD(Bρ) by studying the expected disagreement d D(ρ) of pairs of voters, deﬁned as

d D(ρ) = E x DX E h ρ E h ρ I h(x) = h (x) , (2)

as RD(Bρ) 1 (1 2 RD(Gρ))2

1 2 d D(ρ) . Note that, although relying on d D(ρ), our present work does not reuse the latter result.5 Instead, we adopt another well-known strategy to obtain tight majority vote bounds, by specializing our PAC-Bayesian bound to linear classiﬁers. We describe this approach, and refer to related works, in Section 7.

3. Some Previous Domain Adaptation Bounds

Many approaches tackling domain adaptation share the same underlying philosophy , pulling its origins in the work of Ben-David et al. (2006; 2010) which proposed a domain adaptation bound (Theorem 1, below). To summarize, the domain adaptation bounds reviewed in this section (see Zhang et al., 2012; Cortes et al., 2010; 2015, for other

5The quantity d D(ρ) is also used in the domain adaptation bound of Germain et al. (2013) to measure divergence between distributions. See forthcoming Theorem 2.

A New PAC-Bayesian Perspective on Domain Adaptation

bounds) express a similar trade-off between three terms: (i) the source risk, (ii) the distance between source and target marginal distributions over X, (iii) a non-estimable term (without target label) quantifying the difﬁculty of the task. Ben-David et al. (2006) assumed that the domains are related in the sense that there exists a (unknown) model performing well on both domains. Formally, their domain adaptation bound depends on the error µh =RS(h )+RT (h ) of the best hypothesis overall h =argminh H RS(h) + RT (h) . In practice, when no target label is available, µh is non-estimable and is assumed to be low when domain adaptation is achievable (or at least that there exists a representation space in which this assumption can be veriﬁed). In such a scenario, the domain adaptation strategy is then to look for a set H of possible models that behave similarly on both the source and target data, and to learn a model in H with a good accuracy on the source data. This similarity, called the H H-distance, d H H(SX, TX) =

2 sup (h,h ) H2

E x SXI h(x) = h (x) E x TXI h(x) = h (x) ,

gives rise to the following domain adaptation bound. Theorem 1 (Ben-David et al., 2006; 2010). Let H be a (symmetric6) hypothesis class. We have,

h H, RT (h) RS(h)+ 1

2d H H(SX, TX)+µh . (3)

Pursuing in the same line of research, Mansour et al. (2009) generalizes the H H-distance to real-valued loss functions L : [ 1, 1]2 R+, to express a similar theorem for regression. Their discrepancy disc L(SX, TX) is deﬁned as

sup (h,h ) H2

E x SX L h(x), h (x) E x TX L h(x), h (x) .

The accuracy of the Mansour et al. (2009) s bound also relies on a non-estimable term assumed to be low when adaptation is achievable. Roughly, this term depends on the risk of the best target hypothesis and its agreement with the best source hypothesis on the source domain.

Building on previous domain adaptation analyses, Germain et al. (2013) derived a PAC-Bayesian domain adaptation bound. This bound is based on a divergence suitable for PAC-Bayes, i.e., for the risk of a ρ-weighted majority vote of the voters of H (instead of a single classiﬁer h H). This domain disagreement disρ(SX, TX) is deﬁned as

disρ(SX, TX) = d S(ρ) d T (ρ) . (4)

Theorem 2 (below) needs the strong assumption that, in favorable adaptation situations, the learned posterior agrees with the best target one ρT = argminρRT (Gρ). Indeed, it relies on the following non-estimable term: λ(ρ) = RT (GρT )+Eh ρ Eh ρT Ex SX I h(x) =h (x) + Eh ρ Eh ρT Ex TX I h(x) =h (x) .

6In a symmetric H, for all h H, its inverse h is also in H.

Theorem 2 (Germain et al., 2013). Let H be a set of voters. For any domains S and T over X Y , we have, ρ on H, RT (Gρ) RS(Gρ) + disρ(SX, TX) + λ(ρ).

A compelling aspect of this PAC-Bayesian analysis is the suggested trade-off, which is function of ρ. Indeed, given a ﬁxed instance space X and a ﬁxed class H, apart from using importance weighting methods, the only way to minimize the bound of Theorem 1 is to ﬁnd h H that minimizes RS(h). In Germain et al. (2013), the bound of Theorem 2 inspired an algorithm named PBDA selecting ρ over H that achieves a trade-off between RS(Gρ) and disρ(SX, TX). However, the term λ(ρ) does not appear in the optimization process of PBDA, even if it relies on the learned weight distribution ρ. It is assumed that the value of λ(ρ) should be negligible (uniformly for all ρ) when adaptation is achievable. Nevertheless, this strong assumption cannot be veriﬁed because the best target posterior distribution ρT is unknown. This is a major weakness of the previous PAC-Bayesian work that our new approach overcomes.

4. A New Domain Adaptation Perspective

In this section, we introduce an original approach to upperbound the non-estimable risk of a ρ-weighted majority vote on a target distribution T thanks to a term depending on its marginal distribution TX, another one on a related source domain S, and a term capturing the volume of the source distribution uninformative for the target task. We base our bound on the expected disagreement d D(ρ) of Equation (2) and the expected joint error e D(ρ), deﬁned as

e D(ρ) = E (x,y) D E h ρ E h ρ I h(x) = y I h (x) = y . (5)

Indeed, Lacasse et al. (2006); Germain et al. (2015) observed that, given a domain D on X Y and a distribution ρ on H, we can decompose the Gibbs risk as

2 E (x,y) D E h ρ E h ρ I h(x) =y +I h (x) =y

= E (x,y) D E h ρ E h ρ I h(x) =h (x) +2 I h(x) =y h (x) =y

= 1 2 d D(ρ) + e D(ρ) . (6)

A key observation is that the voters disagreement does not rely on labels; we can compute d D(ρ) using the marginal distribution DX. Thus, in the present domain adaptation context, we have access to d T (ρ) even if the target labels are unknown. However, the expected joint error can only be computed on the labeled source domain.

Domains divergence. In order to link the target joint error e T (ρ) with the source one e S(ρ), we weight the latter thanks to a divergence measure between the domains

A New PAC-Bayesian Perspective on Domain Adaptation

βq(T S) parametrized by a real value q > 0 :

βq(T S) = E (x,y) S

It is worth noting that considering some q values allow us to recover well-known divergences. For instance, choosing q=2 relates our result to the χ2-distance, as β2(T S)= p

χ2(T S) + 1 . Moreover, we can link βq(T S) to the R enyi divergence7, which has led to generalization bounds in the context of importance weighting (Cortes et al., 2010). We denote the limit case q by

β (T S) = sup (x,y) SUPP(S)

with SUPP(S) the support of S. The divergence βq(T S) handles the input space areas where the source domain support SUPP(S) is included in the target one SUPP(T ). It seems reasonable to assume that, when adaptation is achievable, such areas are fairly large. However, it is likely that SUPP(T ) is not entirely included in SUPP(S). We denote T \S the distribution of (x, y) T conditional to (x, y) SUPP(T )\SUPP(S). Since it is hardly conceivable to estimate the joint error e T \S(ρ) without making extra assumptions, we deﬁne the worst risk for this unknown area

ηT \S = Pr (x,y) T

(x, y) / SUPP(S) sup h H RT \S(h) . (8)

Even if we cannot evaluate sup H RT \S(h), the value of ηT \S is necessarily lower than Pr T ((x, y)/ SUPP(S)).

The domain adaptation bound. Let us state the result underlying the domain adaptation perspective of this paper. Theorem 3. Let H be a hypothesis space, let S and T respectively be the source and the target domains on X Y . Let q > 0 be a constant. We have, for all ρ on H,

2 d T (ρ) + βq(T S) h e S(ρ) i1 1

q + ηT \S ,

where d T (ρ), e S(ρ), βq(T S) and ηT \S are respectively deﬁned by Equations (2), (5), (7) and (8).

Proof. Let us deﬁne t=E(x,y) T I (x, y)/ SUPP(S) , then

ηρ = E (x,y) TI (x, y)/ SUPP(S) E h ρ E h ρI h(x) =y I h (x) =y

= t E (x,y) T \S E h ρ E h ρ I h(x) =y I h (x) =y = t e T \S(ρ)

= t RT \S(Gρ) 1

2d T \S(ρ) t sup h H RT \S(h) = ηT \S .

Then, with βq =βq(T S) and p such that 1

e T (ρ) = E (x,y) T E h ρ E h ρ I h(x) =y I h (x) =y

7For q 0, we can show βq(T S)=2 q 1

q Dq(T S), where Dq(T S) is the R enyi divergence between T and S.

= E (x,y) S

T (x,y) S(x,y) E h ρ E h ρI h(x) =y I h (x) =y +ηρ (9)

E h ρ E h ρ E (x,y) S

I h(x) =y I h (x) =y p 1

Last line is due to H older inequality. Finally, we remove the exponent from expression (I h(x) = y I h (x) = y )p

without affecting its value, which is either 1 or 0, and the ﬁnal result follows from Equation (6).

Note that the bound of Theorem 3 is reached whenever the domains are equal (S = T ). Thus, when adaptation is not necessary, our analysis is still sound and non-degenerated:

RS(Gρ) = RT (Gρ) 1 2 d T (ρ) + 1 [e S(ρ)]1 + 0

= 1 2 d S(ρ) + e S(ρ) = RS(Gρ) .

Meaningful quantities. Similarly to the previous results recalled in Section 3, our domain adaptation theorem bounds the target risk by a sum of three terms. However, our approach breaks the problem into atypical quantities: (i) The expected disagreement d T (ρ) captures second degree information about the target domain. (ii) The domains divergence βq(T S) weights the inﬂuence of the expected joint error e S(ρ) of the source domain; the parameter q allows us to consider different relationships between βq(T S) and e S(ρ). (iii) The term ηT \S quantiﬁes the worst feasible target error on the regions where the source domain is uninformative for the target one. In the current work, we assume that this area is small.

5. Comparison With Related Works

In this section, we discuss how our domain adaptation bound can be related to some previous works.

5.1. On the previous PAC-Bayesian bound

It is instructive to compare the new bound of Theorem 3 with the previous PAC-Bayesian domain adaptation bound of Theorem 2. In Theorem 3, the non-estimable terms are the domain divergence βq(T S) and the term ηT \S. Contrary to the non-controllable term λ(ρ) of Theorem 2, these terms do not depend on the learned posterior distribution ρ: For every ρ on H, βq(T S) and ηT \S are constant values measuring the relation between the domains. Moreover, the fact that the domain divergence βq(T S) is not an additive term but a multiplicative one (as opposed to disρ(SX, TX)+λ(ρ) in Theorem 2) is a contribution of our new analysis. Consequently, βq(T S) can be viewed as a hyperparameter allowing us to tune the trade-off between the target voters disagreement and the source joint error. Experiments of Section 8 conﬁrm that this hyperparameter can be successfully selected.

A New PAC-Bayesian Perspective on Domain Adaptation

5.2. On some domain adaptation assumptions

In order to characterize which domain adaptation task may be learnable, Ben-David et al. (2012) presented three assumptions that can help domain adaptation. Our Theorem 3 does not rely on these assumptions, but they can be interpreted in our framework as discussed below. On the covariate shift. A domain adaptation task fulﬁlls the covariate shift assumption (Shimodaira, 2000) if the source and target domains only differ in their marginals according to the input space, i.e., TY |x(y) = SY |x(y). In this scenario, one may estimate βq(TX SX), and even ηT \S, by using unsupervised density estimation methods. Interestingly, by also assuming that the domains share the same support, we have ηT \S =0. Then from Line (9) we obtain

2d T (ρ)+ E x SX

TX(x) SX(x) E h ρ E h ρI h(x) =y I h (x) =y ,

which suggests a way to correct the shift between the domains by reweighting the labeled source distribution, while considering the information from the target disagreement. On the weight ratio. The weight ratio (Ben-David et al., 2012) of source and target domains, with respect to a collection of input space subsets B 2X, is given by

CB(S, T ) = inf b B, TX(b) =0 SX(b) TX(b) .

When CB(S, T ) is bounded away from 0, adaptation should be achievable under covariate shift. In this context, and when SUPP(S)= SUPP(T ), the limit case of β (T S) is equal to the inverse of the point-wise weight ratio obtained by letting B = {{x} : x X} in CB(S, T ). Indeed, both βq and CB compare the densities of source and target domains, but provide distinct strategies to relax the point-wise weight ratio; the former by lowering the value of q and the latter by considering larger subspaces B. On the cluster assumption. A target domain fulﬁlls the cluster assumption when examples of the same label belong to a common area of the input space, and the differently labeled areas are well separated by low-density regions (formalized by the probabilistic Lipschitzness of Urner et al., 2011). Once specialized to linear classiﬁers, d T (ρ) behaves nicely in this context (see Section 7).

5.3. On representation learning

The main assumption underlying our domain adaptation algorithm exhibited in Section 7 is that the support of the target domain is mostly included in the support of the source domain, i.e., the value of the term ηT \S is small. When T \S is sufﬁciently large to prevent proper adaptation, one could try to reduce its volume while taking care to preserve a good compromise between d T (ρ) and e S(ρ), using a representation learning approach, i.e., by projecting source and target examples into a new common space, as done for example by Chen et al. (2012); Ganin et al. (2016).

6. PAC-Bayesian Generalization Guarantees

To compute our domain adaptation bound, one needs to know the distributions S and TX, which is never the case in real life tasks. The PAC-Bayesian theory provides tools to convert the bound of Theorem 3 into a generalization bound on the target risk computable from a pair of sourcetarget samples (S, T) (S)ms (TX)mt. To achieve this, we ﬁrst provide generalization guarantees for d T (ρ) and e S(ρ). These results are presented as corollaries of Theorem 4 below, that generalizes a PAC-Bayesian theorem of Catoni (2007) to arbitrary loss functions.8 Indeed, Theorem 4, with ℓ(h, x, y)=I h(x) =y and Equation (1), gives the usual bound on the Gibbs risk.

Theorem 4. For any domain D over X Y , any set of voters H, any prior π over H, any loss ℓ: H X Y [0, 1], any real number c>0, with a probability at least 1 δ over the choice of {(xi, yi)}m i=1 (D)m, we have for all ρ on H:

E (x,y) D E h ρ ℓ(h, x, y)

i=1 E h ρℓ(h, xi, yi) + KL(ρ π) + ln 1

Note that, similarly to Mc Allester & Keshet (2011), we could choose to restrict c (0, 2) to obtain a slightly looser but simpler bound. Using e c 1 c 1

2c2, an upper bound on the right hand side of above equation is given by

h 1 m Pm i=1 Eh ρ ℓ(h, xi, yi) + KL(ρ π)+ln 1

We now exploit Theorem 4 to obtain generalization guarantees on the expected disagreement and the expected joint error. PAC-Bayesian bounds on these quantities appeared in Germain et al. (2015), but under different forms. In Corollary 5 below, we are especially interested in the possibility of controlling the trade-off between the empirical estimate computed on the samples and the complexity term KL(ρ π) with the help of parameters b and c.

Corollary 5. For any domains S and T over X Y , any set of voters H, any prior π over H, any δ (0, 1], any real numbers b > 0 and c > 0, we have: with a probability at least 1 δ over T (TX)mt,

ρ on H, d T (ρ) c 1 e c

bd T (ρ)+ 2KL(ρ π)+ln 1

with a probability at least 1 δ over S (S)ms,

ρ on H, e S(ρ) b 1 e b

be S(ρ)+ 2KL(ρ π)+ln 1

where bd T (ρ) and be S(ρ) are the empirical estimations of the target voters disagreement and the source joint error.

8To do so, we exploit a result of Maurer (2004) that allows to generalize PAC-Bayes theorems to arbitrary bounded loss function (see the proof of Theorem 4 in supplemental).

A New PAC-Bayesian Perspective on Domain Adaptation

Proof. Given π and ρ over H, we consider a new prior π2 and a new posterior ρ2, both over H2, such that: hij = (hi, hj) H2, π2(hij) = π(hi)π(hj), and ρ2(hij) = ρ(hi)ρ(hj). Thus, KL(ρ2 π2) = 2KL(ρ2 π2) (see Germain et al., 2015). Let us deﬁne two new loss functions for a paired voter hij H2:

ℓd(hij, x, y) = I hi(x) = hj(x) ,

and ℓe(hij, x, y) = I hi(x) = y I hj(x) = y .

Then, the bound on d T (ρ) is obtained from Theorem 4 with ℓ:= ℓd, and Equation (2). The bound on e S(ρ) is similarly obtained with ℓ:= ℓe and using Equation (5).

For algorithmic simplicity, we deal with Theorem 3 when q . Thanks to Corollary 5, we obtain the following generalization bound deﬁned with respect to the empirical estimates of the target disagreement and the source joint error. Theorem 6. For any domains S and T over X Y , any set of voters H, any prior π over H, any δ (0, 1], any b>0 and c>0, with a probability at least 1 δ over the choices of S (S)ms and T (TX)mt, we have

ρ on H, RT (Gρ) c 1

2 bd T (ρ) + b be S(ρ) + ηT \S

+ c mt c+ b ms b 2 KL(ρ π) + ln 2

where bd T (ρ) and be S(ρ) are the empirical estimations of the target voters disagreement and the source joint error, and b = b 1 e b β (T S), and c = c 1 e c .

Proof. We bound separately d T (ρ) and e S(ρ) using Corollary 5 (with probability 1 δ

2 each), and then combine the two upper bounds according to Theorem 3.

From an optimization perspective, the problem suggested by the bound of Theorem 6 is much more convenient to minimize than the PAC-Bayesian bound derived from Theorem 2 in Germain et al. (2013). The former is smoother than the latter: the absolute value related to the domain disagreement disρ(SX, TX) of Equation (4) disappears in beneﬁt of the domain divergence β (T S), which is constant and can be considered as an hyperparameter of the algorithm. Additionally, Theorem 2 requires equal source and target sample sizes while Theorem 6 allows ms =mt. Moreover, recall that in Germain et al. (2013) the ρ-dependent non-constant term λ(ρ) is ignored. In our new analysis, such compromise is not mandatory in order to apply the theoretical result to real problems, since the non-estimable term ηT \S is constant and does not depend on the learned ρ. Hence, we can neglect ηT \S without any impact on the optimization problem described in the next section. Beside, it is realistic to consider ηT \S as a small quantity in situations where the source and target supports are similar.

7. Specialization to Linear Classiﬁers

In order to derive an algorithm, we now specialize the bounds of Theorems 3 and 6 to the risk of a linear classiﬁer hw, deﬁned by a weight vector w Rd :

x X, hw(x) = sign (w x) .

The taken approach is the one privileged in numerous PACBayesian works (e.g., Langford & Shawe-Taylor, 2002; Ambroladze et al., 2006; Mc Allester & Keshet, 2011; Parrado-Hern andez et al., 2012; Germain et al., 2009; 2013), as it makes the risk of the linear classiﬁer hw and the risk of a (properly parametrized) majority vote coincide, while in the same time promoting large margin classiﬁers. To this end, let H be the set of all linear classiﬁers over the input space, H = hw | w Rd , and let ρw over H be a posterior distribution, resp. a prior distribution π0, that is constrained to be a spherical Gaussian with identity covariance matrix centered on vector w, resp. 0,

hw H, ρw(hw ) = 1

and π0(hw ) = 1

The KL-divergence between ρw and π0 simply is

KL(ρw π0) = 1

2 w 2 . (10)

Thanks to this parameterization, the majority vote classiﬁer Bρw corresponds to the one of the linear classiﬁer hw (see above cited PAC-Bayesian works). That is,

x X, w H, hw(x) = sign E hw ρwhw (x) =Bρw(x) .

Then, RD(hw) = RD(Bρw) for any data distribution D. Moreover, Langford & Shawe-Taylor (2002) showed that the closely related Gibbs risk (Equation 1) is related to the linear classiﬁer margin y w x

x , as follows:

RD(Gρw) = E (x,y) D Φ y w x

where Φ(x)= 1

2 , and Erf(x) = 2 π R x 0 e t2dt is the Gauss error function. Here, Φ(x) can be seen as a smooth surrogate sometimes called the probit loss (e.g., Mc Allester & Keshet, 2011) of the zero-one loss function I x 0 relying on y w x

x . Note that w plays an important role on the value of RD(Gρw), but not on RD(hw). Indeed, RD(Gρw) tends to RD(hw) as w grows, which can provide very tight bounds (see the empirical analyses of Ambroladze et al., 2006; Germain et al., 2009). In the PAC-Bayesian context, w turns out to be a measure of complexity of the learned classiﬁer, as Equation (10) shows. We now seek to express the expected disagreement d D(ρw) and the expected joint error e D(ρw) of Equations (2)

A New PAC-Bayesian Perspective on Domain Adaptation

Figure 1. Graphical representation of the loss functions given by the specialization to linear classiﬁers.

Figure 2. Decision boundaries of DALC on the intertwining moons toy problem, for ﬁxed parameters B=C=1, and a RBF kernel k(x, x ) = exp( x x 2). The target points are black. The positive, resp. negative, source points are red, resp. green. The blue dashed line shows the decision boundaries of algorithm PBDA (Germain et al., 2013).

and (5) related to the parameterized distribution ρw. As shown in Germain et al. (2013) the former is given by

d D(ρw) = E x DX Φdis

where Φdis(x) = 2 Φ(x) Φ( x). Following a similar approach, we obtain, for all w R,

e D(ρw) = E (x,y) D E h ρw E h ρw I h(x) =y I h (x) =y

= E (x,y) D E h ρw I h(x) =y E h ρw I h (x) =y

= E (x,y) D Φerr

with Φerr(x) = Φ(x) 2. As function Φ in Equation (11), functions Φerr and Φdis deﬁned above can be interpreted as loss functions for linear classiﬁers (illustrated by Figure 1). Domain adaptation bound. Theorem 3 specialized to linear classiﬁers gives the following corollary. Note that, as mentioned above, RT (hw) = RT (Bρw) 2 RT (Gρw). Corollary 7. Let S and T respectively be the source and the target domains on X Y . For all w R, we have : RT (hw) d T (ρw) + 2 β (T S) e S(ρw) + 2 ηT \S ,

Figure 1 leads to an insightful geometric interpretation of the domain adaptation trade-off promoted by Corollary 7. For ﬁxed values of β (T S) and ηT \S, the target risk RT (hw) is upper-bounded by a (β -weighted) sum of two losses. The expected Φerr-loss (i.e., the joint error) is computed on the (labeled) source domain; it aims to label the source examples correctly, but is more permissive on the required margin than the Φ-loss (i.e., the Gibbs risk). The expected Φdis-loss (i.e., the disagreement) is computed on the target (unlabeled) domain; it promotes large unsigned target margins. Thus, if a target domain fulﬁlls the cluster assumption (described in Section 5.2), d T (ρw) will be low when the decision boundary crosses a low-density region between the homogeneous labeled clusters. Hence, Corollary 7 reﬂects that some source errors may be allowed if, doing so, the separation of the target domain is improved. Generalization bound and learning algorithm. Theorem 6 specialized to linear classiﬁers gives the following.

Corollary 8. For any domains S and T over X Y , any δ (0, 1], any a>0 and b>0, with a probability at least 1 δ over the choices of S (S)ms and T (TX)mt, we have

w R : RT (hw) c bd T (ρw) + 2 b be S(ρw) + 2 ηT \S

+ 2 c mt c + b ms b w 2 + ln 2

For a source S={(xi, yi)}ms i=1 and a target T={(x i)}mt i=1 samples of potentially different size, and some hyperparameters C>0, B>0, minimizing the next objective function w.r.t w R is equivalent to minimize the above bound.

C bd T (ρw) + B be S(ρw) + w 2 (12)

i=1 Φdis w x i x i + B

i=1 Φerr yi w xi

We call the optimization of Equation (12) by gradient descent the DALC algorithm, for Domain Adaptation of Linear Classiﬁers. The kernel trick applies to DALC. That is, given a kernel k:Rd Rd R, one can express a linear classiﬁer in a RKHS9 by a dual weight vector α Rms+mt :

hw( ) = sign

i=1 αik(xi, ) +

i=1 αi+msk(x i, )

Even though the objective function is highly non-convex, we achieved good empirical results by minimizing the kernelized version of Equation (12) by gradient descent, with a uniform weight vector as a starting point. More details are given in the supplementary material.

8. Experimental Results

Firstly, Figure 2 illustrates the behavior of the decision boundary of our algorithm DALC on an intertwining moons toy problem10, where each moon corresponds to a label.

9It is non-trivial to show that the kernel trick holds when π0 and ρw are Gaussian over inﬁnite-dimensional feature space. As mentioned by Mc Allester & Keshet (2011), it is, however, the case provided we consider Gaussian processes as measure of distributions π0 and ρw over (inﬁnite) H. 10We generate each pair of moons with the make moons function provided in scikit-learn (Pedregosa et al., 2011).

A New PAC-Bayesian Perspective on Domain Adaptation

Table 1. Error rates on Amazon dataset. Best risks appear in bold and seconds are in italic.

SVM DASVM CODA PBDA DALC (CV) (RCV) (RCV) (RCV) (RCV)

books DVDs 0.179 0.193 0.181 0.183 0.178 books electro 0.290 0.226 0.232 0.263 0.212 books kitchen 0.251 0.179 0.215 0.229 0.194 DVDs books 0.203 0.202 0.217 0.197 0.186 DVDs electro 0.269 0.186 0.214 0.241 0.245 DVDs kitchen 0.232 0.183 0.181 0.186 0.175 electro books 0.287 0.305 0.275 0.232 0.240 electro DVDs 0.267 0.214 0.239 0.221 0.256 electro kitchen 0.129 0.149 0.134 0.141 0.123 kitchen books 0.267 0.259 0.247 0.247 0.236 kitchen DVDs 0.253 0.198 0.238 0.233 0.225 kitchen electro 0.149 0.157 0.153 0.129 0.131 Average 0.231 0.204 0.210 0.208 0.200

The target domain, for which we have no label, is a rotation of the source one. The ﬁgure shows clearly that DALC succeeds to adapt to the target domain, even for a rotation angle of 50 . We see that DALC does not rely on the restrictive covariate shift assumption, as some source examples are misclassiﬁed. This behavior illustrates the DALC trade-off in action, that concedes some errors on the source sample to lower the disagreement on the target sample.

Secondly, we evaluate DALC on the classical Amazon.com Reviews benchmark (Blitzer et al., 2006) according to the setting used by Chen et al. (2011); Germain et al. (2013). This dataset contains reviews of four types of products (books, DVDs, electronics, and kitchen appliances) described with about 100, 000 attributes. Originally, the reviews were labeled with a rating from 1 to 5. Chen et al. (2011) proposed a simpliﬁed binary setting by regrouping ratings into two classes (products rated lower than 3 and products rated higher than 4). Moreover, they reduced the dimensionality to about 40,000 by only keeping the features appearing at least ten times for a given domain adaptation task. Finally, the data are pre-processed with a tf-idf re-weighting. A domain corresponds to a kind of product. Therefore, we perform twelve domain adaptation tasks. For instance, books DVD s is the task for which the source domain is books and the target one is DVDs . We compare DALC with the classical non-adaptive algorithm SVM (trained only on the source sample), the adaptive algorithm DASVM (Bruzzone & Marconcini, 2010), the adaptive cotraining CODA (Chen et al., 2011), and the PAC-Bayesian domain adaptation algorithm PBDA (Germain et al., 2013) based on Theorem 2. Note that, in Germain et al. (2013), DASVM has shown better accuracy than SVM, CODA and PBDA. Each parameter is selected with a grid search thanks to a usual cross-validation (CV) on the source sample for SVM, and thanks to a reverse validation procedure11 (RCV)

11For details on the reverse validation procedure, see Bruzzone & Marconcini (2010); Zhong et al. (2010). Other details on our

for CODA, DASVM, PBDA, and DALC. The algorithms use a linear kernel and consider 2,000 labeled source examples and 2,000 unlabeled target examples. Table 1 reports the error rates of all the methods evaluated on the same separate target test sets proposed by Chen et al. (2011).

Above all, the adaptive approaches show the best result, implying that tackling this problem with a domain adaptation method is reasonable. Then, our new method DALC is the best algorithm overall on this task. Except for the two adaptive tasks between electronics and DVDs , DALC is either the best one (six times), or the second one (four times). Moreover, according to a Wilcoxon signed rank test with a 5% signiﬁcance level, we obtain a probability of 89.5% that DALC is better than PBDA. This test tends to conﬁrm that our new bound improves the analysis done previously in Germain et al. (2013), in addition to being more interpretable.

9. Conclusion

We propose a new domain adaptation analysis for majority vote learning. It relies on an upper bound on the target risk, expressed as a trade-off between the voters disagreement on the target domain, the voters joint errors on the source one, and a term reﬂecting the worst case error in regions where the source domain is non-informative. To the best of our knowledge, a crucial novelty of our contribution is that the trade-off is controlled by the divergence βq(T S) (Equation 7) between the domains: The divergence is not an additive term (as in many domain adaptation bounds) but is a factor weighting the importance of the source information. Our analysis, combined with a PAC-Bayesian generalization bound, leads to a new domain adaptation algorithm for linear classiﬁers. The empirical experiments show that our new algorithm outperforms the previous PAC-Bayesian approach (Germain et al., 2013).

As future work, we ﬁrst aim at investigating the case where the domains divergence βq(T S) can be estimated, i.e., when the covariate shift assumption holds or when some target labels are available. In these scenarios, βq(T S) might not be considered as a hyperparameter to tune. Last but not least, the term ηT \S of our bound suggesting that the two domains should live in the same regions can be dealt with a representation learning approach. As mentioned in Section 5.3, this could be an incentive to combine our learning algorithm with existing representation learning techniques. In another vein, considering an active learning setup (as in Berlind & Urner, 2015), one could query the labels of target examples to estimate the value bounded by ηT \S. We see this as a great source of inspiration for new algorithms for this learning paradigm.

experimental protocol are given in supplementary material.

A New PAC-Bayesian Perspective on Domain Adaptation

Acknowledgements

This work was supported in part by the French project LIVES ANR-15-CE23-0026-03, and in part by NSERC discovery grant 262067.

Ambroladze, A., Parrado-Hern andez, E., and Shawe Taylor, J. Tighter PAC-Bayes bounds. In NIPS, pp. 9 16, 2006.

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In NIPS, pp. 137 144, 2006.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. Wortman. A theory of learning from different domains. Mach. Learn., 79(1-2):151 175, 2010.

Ben-David, S., Shalev-Shwartz, S., and Urner, R. Domain adaptation can quantity compensate for quality? In ISAIM, 2012.

Berlind, C. and Urner, R. Active nearest neighbors in changing environments. In ICML, pp. 1870 1879, 2015.

Blitzer, J. Domain adaptation of natural language processing systems. Ph D thesis, UPenn, 2007.

Blitzer, J., Mc Donald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120 128, 2006.

Bruzzone, L. and Marconcini, M. Domain adaptation problems: A DASVM classiﬁcation technique and a circular validation strategy. IEEE Trans. Pattern Anal. Mach. Intel., 32(5):770 787, 2010.

Catoni, O. PAC-Bayesian supervised classiﬁcation: the thermodynamics of statistical learning, volume 56. Inst. of Mathematical Statistic, 2007.

Chen, M., Weinberger, K.Q., and Blitzer, J. Co-training for domain adaptation. In NIPS, pp. 2456 2464, 2011.

Chen, M., Xu, Z. E., Weinberger, K. Q., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In ICML, pp. 767 774, 2012.

Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103 126, 2014.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In NIPS, pp. 442 450, 2010.

Cortes, C., Mohri, M., and Medina, A. Mu noz. Adaptation algorithm and theory based on generalized discrepancy. In ACM SIGKDD, pp. 169 178, 2015.

Daum e III, H. Frustratingly easy domain adaptation. In ACL, 2007.

Ganin, Y. and Lempitsky, V. S. Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180 1189, 2015.

Ganin, Y., Ustinova, E., H, Ajakan, Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. JMLR, 17(59):1 35, 2016.

Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classiﬁers. In ICML, pp. 353 360, 2009.

Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A PAC-Bayesian approach for domain adaptation with specialization to linear classiﬁers. In ICML, pp. 738 746, 2013.

Germain, P., Lacasse, A., Laviolette, F., Marchand, ., and Roy, J.-F. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. JMLR, 16:787 860, 2015.

Herbrich, R. and Graepel, T. A PAC-Bayesian margin bound for linear classiﬁers: Why svms work. In NIPS, pp. 224 230, 2000.

Huang, J., Smola, A., Gretton, A., Borgwardt, K., and Sch olkopf, B. Correcting sample selection bias by unlabeled data. In NIPS, pp. 601 608, 2006.

Jiang, J. A literature survey on domain adaptation of statistical classiﬁers, 2008.

Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classiﬁer. In NIPS, pp. 769 776, 2006.

Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. In NIPS, pp. 439 446, 2002.

Li, X. and Bilmes, J. A Bayesian divergence prior for classifﬁer adaptation. In AISTATS, pp. 275 282, 2007.

Liu, Q., Mackey, A. J., Roos, D. S., and Pereira, F. Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics, 24(5): 597 605, 2008.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.

A New PAC-Bayesian Perspective on Domain Adaptation

Margolis, A. A literature review of domain adaptation with unlabeled data, 2011.

Maurer, A. A note on the PAC-Bayesian theorem. Co RR, cs.LG/0411099, 2004.

Mc Allester, D. A. Some PAC-Bayesian theorems. Mach. Learn., 37:355 363, 1999.

Mc Allester, D. A. and Keshet, J. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pp. 2205 2212, 2011.

Morvant, E., Habrard, A., and Ayache, S. Parsimonious Unsupervised and Semi-Supervised Domain Adaptation with Good Similarity Functions. KAIS, 33(2):309 349, 2012.

Pan, S. J. and Yang, Q. A survey on transfer learning. T. Knowl. Data En., 22(10):1345 1359, 2010.

Parrado-Hern andez, E., Ambroladze, A., Shawe-Taylor, J., and Sun, S. PAC-Bayes bounds with data dependent priors. JMLR, 13:3507 3531, 2012.

Patel, V. M., Gopalan, R., Li, R., and Chellappa, R. Visual domain adaptation: A survey of recent advances. IEEE Signal Proc. Mag., 32(3):53 69, 2015.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. JMLR, 12: 2825 2830, 2011.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Plann. Inference, 90(2):227 244, 2000.

Sugiyama, M., Nakajima, S., Kashima, H., von B unau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2007.

Urner, R., Shalev-Shwartz, S., and Ben-David, S. Access to unlabeled data can speed up prediction time. In ICML, pp. 641 648, 2011.

Zhang, C., Zhang, L., and Ye, J. Generalization bounds for domain adaptation. In NIPS, pp. 3320 3328, 2012.

Zhong, E., Fan, W., Yang, Q., Verscheure, O., and Ren, J. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML-PKDD, pp. 547 562, 2010.