# a_new_pacbayesian_perspective_on_domain_adaptation__0e9ddb73.pdf A New PAC-Bayesian Perspective on Domain Adaptation Pascal Germain PASCAL.GERMAIN@INRIA.FR INRIA, SIERRA Project-Team, 75589, Paris, France, and D.I., Ecole Normale Superieure, 75230 Paris, France Amaury Habrard AMAURY.HABRARD@UNIV-ST-ETIENNE.FR Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France Franc ois Laviolette FRANCOIS.LAVIOLETTE@IFT.ULAVAL.CA D epartement d informatique et de g enie logiciel, Universit e Laval, Qu ebec, Canada Emilie Morvant EMILIE.MORVANT@UNIV-ST-ETIENNE.FR Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper-bound on the target risk where the distributions divergence expressed as a ratio controls the trade-off between a source error measure and the target voters disagreement. Our bound suggests that one has to focus on regions where the source data is informative. From this result, we derive a PACBayesian generalization bound, and specialize it to linear classifiers. Then, we infer a learning algorithm and perform experiments on real data. 1. Introduction Machine learning practitioners are commonly exposed to the issue of domain adaptation1 (Jiang, 2008; Margolis, 2011): One usually learns a model from a corpus, i.e., a fixed yet unknown source distribution, then wants to apply it on a new corpus, i.e., a related but slightly different target distribution. Therefore, domain adaptation is widely studied in a lot of application fields like computer vision (Patel et al., 2015; Ganin & Lempitsky, 2015), bioinformatics (Liu et al., 2008), natural language processing (Blitzer, 2007; Daum e III, 2007), etc. A common example is the 1Domain adaptation is associated with transfer learning (Pan & Yang, 2010; Quionero-Candela et al., 2009). Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). spam filtering problem where a model needs to be adapted from one user mailbox to another receiving significantly different emails. Many approaches exist to address domain adaptation, often with the same idea: If we can apply a transformation to move closer the distributions, then we can learn a model with the available labels. This is generally performed by reweighting the importance of labeled data (Huang et al., 2006; Sugiyama et al., 2007; Cortes et al., 2010; 2015), and/or by learning a common representation for the source and target distributions (Chen et al., 2012; Ganin et al., 2016), and/or by minimizing a measure of divergence between the distributions (Morvant et al., 2012; Germain et al., 2013; Cortes & Mohri, 2014). The divergence-based approach has especially been explored to derive generalization bounds for domain adaptation (e.g., Ben-David et al., 2006; 2010; Mansour et al., 2009; Li & Bilmes, 2007; Zhang et al., 2012). Recently, this issue has been studied through the PAC-Bayesian framework (Germain et al., 2013), which focuses on learning weighted majority votes2 without target label. Even the latter result opened the door to tackle domain adaptation in a PAC-Bayesian fashion, it shares the same philosophy as the seminal works of Ben-David et al. (2006; 2010); Mansour et al. (2009): The risk of the target model is upper-bounded jointly by the model s risk on the source distribution, the divergence between the marginal distributions, and a nonestimable term3 related to the ability to adapt in the current space. Note that Li & Bilmes (2007) proposed a PACBayesian generalization bound for domain adaptation but they considered target labels. 2This setting is not too restrictive since many algorithms can be seen as a majority vote learning. E.g., ensemble learning and kernel methods output models interpretable as majority votes. 3More precisely, this term can only be estimated in the presence of labeled data from both the source and the target domains. A New PAC-Bayesian Perspective on Domain Adaptation In this paper, we derive a novel domain adaptation bound for the weighted majority vote framework. Concretely, the risk of the target model is still upper-bounded by three terms, but they differ in the information they capture. The first term is estimable from unlabeled data and relies on a notion of expected voters disagreement on the target domain. The second term depends on the expected accuracy of the voters on the source domain. Interestingly, this latter is weighted by a divergence between the source and the target domains that enables controlling the relationship between domains. The third term estimates the volume of the target domain living apart from the source one4, which has to be small for ensuring adaptation. From our bound, we deduce that a good adaptation strategy consists in finding a weighted majority vote leading to a suitable tradeoff controlled by the domains divergence between the first two terms: Minimizing the first one corresponds to look for voters that disagree on the target domain, and minimizing the second one to seek accurate voters on the source. Thereafter, we provide PAC-Bayesian generalization guarantees to justify the empirical minimization of our new domain adaptation bound, and specialize it to linear classifiers (following a methodology known to give rise to tight bound values). This allows to design DALC, a learning algorithm that improves the performances of the previous PAC-Bayesian domain adaptation algorithm. The rest of the paper is organized as follows. Section 2 presents the PAC-Bayesian domain adaptation setting. Section 3 reviews previous theoretical results on domain adaptation. Section 4 states our new analysis of domain adaptation for majority votes, that we relate to other works in Section 5. Then, Section 6 provides generalization bounds, specialized to linear classifiers in Section 7 to motivate the DALC learning algorithm, evaluated in Section 8. 2. Unsupervised Domain Adaptation Setting We tackle domain adaptation for binary classification, from a d-dimensional input space X Rd to an output space Y ={ 1, 1}. Our goal is to perform domain adaptation from a distribution S the source domain to another (related) distribution T the target domain on X Y ; SX and TX being the associated marginal distributions on X. Given a distribution D, we denote (D)m the distribution of a m-sample constituted by m elements drawn i.i.d. from D. We consider the unsupervised domain adaptation setting in which the algorithm is provided with a labeled source mssample S ={(xi, yi)}ms i=1 (S)ms, and with an unlabeled target mt-sample T ={xi}mt i=1 (TX)mt. PAC-Bayesian domain adaptation. Our work is inspired by the PAC-Bayesian theory (first introduced by 4Here we do not focus on learning a new representation to help the adaptation: We directly aim at adapting in the current space. Mc Allester, 1999). More precisely, we adopt the PACBayesian domain adaptation setting previously studied in Germain et al. (2013). Given H, a set of voters h : X Y , the elements of this approach are a prior distribution π on H, a pair of source-target learning samples (S, T) and a posterior distribution ρ on H. The prior distribution π models an a priori belief before observing (S, T) of the voters accuracy. Then, given the information provided by (S, T), we aim at learning a posterior distribution ρ leading to a ρ-weighted majority vote over H, Bρ( ) = sign E h ρh( ) , with nice generalization guarantees on the target domain T . In other words, we want to find the posterior distribution ρ minimizing the true target risk of Bρ : RT (Bρ) = E (x,y) T I Bρ(x) = y , where I a = 1 if a is true, and 0 otherwise. However, in most PAC-Bayesian analyses one does not directly focus on this majority vote risk, but studies the expectation of the risks over H according to ρ, designed as the Gibbs risk : RD(Gρ) = E (x,y) D E h ρ I h(x) = y . (1) It is well-known in the PAC-Bayesian literature that RD(Bρ) 2 RD(Gρ) (e.g., Herbrich & Graepel, 2000). Unfortunately, this worst case bound often leads to poor generalization guarantees on the majority vote risk. To address this issue, Lacasse et al. (2006) (refined in Germain et al., 2015) have exhibited that one can obtain a tighter bound on RD(Bρ) by studying the expected disagreement d D(ρ) of pairs of voters, defined as d D(ρ) = E x DX E h ρ E h ρ I h(x) = h (x) , (2) as RD(Bρ) 1 (1 2 RD(Gρ))2 1 2 d D(ρ) . Note that, although relying on d D(ρ), our present work does not reuse the latter result.5 Instead, we adopt another well-known strategy to obtain tight majority vote bounds, by specializing our PAC-Bayesian bound to linear classifiers. We describe this approach, and refer to related works, in Section 7. 3. Some Previous Domain Adaptation Bounds Many approaches tackling domain adaptation share the same underlying philosophy , pulling its origins in the work of Ben-David et al. (2006; 2010) which proposed a domain adaptation bound (Theorem 1, below). To summarize, the domain adaptation bounds reviewed in this section (see Zhang et al., 2012; Cortes et al., 2010; 2015, for other 5The quantity d D(ρ) is also used in the domain adaptation bound of Germain et al. (2013) to measure divergence between distributions. See forthcoming Theorem 2. A New PAC-Bayesian Perspective on Domain Adaptation bounds) express a similar trade-off between three terms: (i) the source risk, (ii) the distance between source and target marginal distributions over X, (iii) a non-estimable term (without target label) quantifying the difficulty of the task. Ben-David et al. (2006) assumed that the domains are related in the sense that there exists a (unknown) model performing well on both domains. Formally, their domain adaptation bound depends on the error µh =RS(h )+RT (h ) of the best hypothesis overall h =argminh H RS(h) + RT (h) . In practice, when no target label is available, µh is non-estimable and is assumed to be low when domain adaptation is achievable (or at least that there exists a representation space in which this assumption can be verified). In such a scenario, the domain adaptation strategy is then to look for a set H of possible models that behave similarly on both the source and target data, and to learn a model in H with a good accuracy on the source data. This similarity, called the H H-distance, d H H(SX, TX) = 2 sup (h,h ) H2 E x SXI h(x) = h (x) E x TXI h(x) = h (x) , gives rise to the following domain adaptation bound. Theorem 1 (Ben-David et al., 2006; 2010). Let H be a (symmetric6) hypothesis class. We have, h H, RT (h) RS(h)+ 1 2d H H(SX, TX)+µh . (3) Pursuing in the same line of research, Mansour et al. (2009) generalizes the H H-distance to real-valued loss functions L : [ 1, 1]2 R+, to express a similar theorem for regression. Their discrepancy disc L(SX, TX) is defined as sup (h,h ) H2 E x SX L h(x), h (x) E x TX L h(x), h (x) . The accuracy of the Mansour et al. (2009) s bound also relies on a non-estimable term assumed to be low when adaptation is achievable. Roughly, this term depends on the risk of the best target hypothesis and its agreement with the best source hypothesis on the source domain. Building on previous domain adaptation analyses, Germain et al. (2013) derived a PAC-Bayesian domain adaptation bound. This bound is based on a divergence suitable for PAC-Bayes, i.e., for the risk of a ρ-weighted majority vote of the voters of H (instead of a single classifier h H). This domain disagreement disρ(SX, TX) is defined as disρ(SX, TX) = d S(ρ) d T (ρ) . (4) Theorem 2 (below) needs the strong assumption that, in favorable adaptation situations, the learned posterior agrees with the best target one ρT = argminρRT (Gρ). Indeed, it relies on the following non-estimable term: λ(ρ) = RT (GρT )+Eh ρ Eh ρT Ex SX I h(x) =h (x) + Eh ρ Eh ρT Ex TX I h(x) =h (x) . 6In a symmetric H, for all h H, its inverse h is also in H. Theorem 2 (Germain et al., 2013). Let H be a set of voters. For any domains S and T over X Y , we have, ρ on H, RT (Gρ) RS(Gρ) + disρ(SX, TX) + λ(ρ). A compelling aspect of this PAC-Bayesian analysis is the suggested trade-off, which is function of ρ. Indeed, given a fixed instance space X and a fixed class H, apart from using importance weighting methods, the only way to minimize the bound of Theorem 1 is to find h H that minimizes RS(h). In Germain et al. (2013), the bound of Theorem 2 inspired an algorithm named PBDA selecting ρ over H that achieves a trade-off between RS(Gρ) and disρ(SX, TX). However, the term λ(ρ) does not appear in the optimization process of PBDA, even if it relies on the learned weight distribution ρ. It is assumed that the value of λ(ρ) should be negligible (uniformly for all ρ) when adaptation is achievable. Nevertheless, this strong assumption cannot be verified because the best target posterior distribution ρT is unknown. This is a major weakness of the previous PAC-Bayesian work that our new approach overcomes. 4. A New Domain Adaptation Perspective In this section, we introduce an original approach to upperbound the non-estimable risk of a ρ-weighted majority vote on a target distribution T thanks to a term depending on its marginal distribution TX, another one on a related source domain S, and a term capturing the volume of the source distribution uninformative for the target task. We base our bound on the expected disagreement d D(ρ) of Equation (2) and the expected joint error e D(ρ), defined as e D(ρ) = E (x,y) D E h ρ E h ρ I h(x) = y I h (x) = y . (5) Indeed, Lacasse et al. (2006); Germain et al. (2015) observed that, given a domain D on X Y and a distribution ρ on H, we can decompose the Gibbs risk as 2 E (x,y) D E h ρ E h ρ I h(x) =y +I h (x) =y = E (x,y) D E h ρ E h ρ I h(x) =h (x) +2 I h(x) =y h (x) =y = 1 2 d D(ρ) + e D(ρ) . (6) A key observation is that the voters disagreement does not rely on labels; we can compute d D(ρ) using the marginal distribution DX. Thus, in the present domain adaptation context, we have access to d T (ρ) even if the target labels are unknown. However, the expected joint error can only be computed on the labeled source domain. Domains divergence. In order to link the target joint error e T (ρ) with the source one e S(ρ), we weight the latter thanks to a divergence measure between the domains A New PAC-Bayesian Perspective on Domain Adaptation βq(T S) parametrized by a real value q > 0 : βq(T S) = E (x,y) S It is worth noting that considering some q values allow us to recover well-known divergences. For instance, choosing q=2 relates our result to the χ2-distance, as β2(T S)= p χ2(T S) + 1 . Moreover, we can link βq(T S) to the R enyi divergence7, which has led to generalization bounds in the context of importance weighting (Cortes et al., 2010). We denote the limit case q by β (T S) = sup (x,y) SUPP(S) with SUPP(S) the support of S. The divergence βq(T S) handles the input space areas where the source domain support SUPP(S) is included in the target one SUPP(T ). It seems reasonable to assume that, when adaptation is achievable, such areas are fairly large. However, it is likely that SUPP(T ) is not entirely included in SUPP(S). We denote T \S the distribution of (x, y) T conditional to (x, y) SUPP(T )\SUPP(S). Since it is hardly conceivable to estimate the joint error e T \S(ρ) without making extra assumptions, we define the worst risk for this unknown area ηT \S = Pr (x,y) T (x, y) / SUPP(S) sup h H RT \S(h) . (8) Even if we cannot evaluate sup H RT \S(h), the value of ηT \S is necessarily lower than Pr T ((x, y)/ SUPP(S)). The domain adaptation bound. Let us state the result underlying the domain adaptation perspective of this paper. Theorem 3. Let H be a hypothesis space, let S and T respectively be the source and the target domains on X Y . Let q > 0 be a constant. We have, for all ρ on H, 2 d T (ρ) + βq(T S) h e S(ρ) i1 1 q + ηT \S , where d T (ρ), e S(ρ), βq(T S) and ηT \S are respectively defined by Equations (2), (5), (7) and (8). Proof. Let us define t=E(x,y) T I (x, y)/ SUPP(S) , then ηρ = E (x,y) TI (x, y)/ SUPP(S) E h ρ E h ρI h(x) =y I h (x) =y = t E (x,y) T \S E h ρ E h ρ I h(x) =y I h (x) =y = t e T \S(ρ) = t RT \S(Gρ) 1 2d T \S(ρ) t sup h H RT \S(h) = ηT \S . Then, with βq =βq(T S) and p such that 1 e T (ρ) = E (x,y) T E h ρ E h ρ I h(x) =y I h (x) =y 7For q 0, we can show βq(T S)=2 q 1 q Dq(T S), where Dq(T S) is the R enyi divergence between T and S. = E (x,y) S T (x,y) S(x,y) E h ρ E h ρI h(x) =y I h (x) =y +ηρ (9) E h ρ E h ρ E (x,y) S I h(x) =y I h (x) =y p 1 Last line is due to H older inequality. Finally, we remove the exponent from expression (I h(x) = y I h (x) = y )p without affecting its value, which is either 1 or 0, and the final result follows from Equation (6). Note that the bound of Theorem 3 is reached whenever the domains are equal (S = T ). Thus, when adaptation is not necessary, our analysis is still sound and non-degenerated: RS(Gρ) = RT (Gρ) 1 2 d T (ρ) + 1 [e S(ρ)]1 + 0 = 1 2 d S(ρ) + e S(ρ) = RS(Gρ) . Meaningful quantities. Similarly to the previous results recalled in Section 3, our domain adaptation theorem bounds the target risk by a sum of three terms. However, our approach breaks the problem into atypical quantities: (i) The expected disagreement d T (ρ) captures second degree information about the target domain. (ii) The domains divergence βq(T S) weights the influence of the expected joint error e S(ρ) of the source domain; the parameter q allows us to consider different relationships between βq(T S) and e S(ρ). (iii) The term ηT \S quantifies the worst feasible target error on the regions where the source domain is uninformative for the target one. In the current work, we assume that this area is small. 5. Comparison With Related Works In this section, we discuss how our domain adaptation bound can be related to some previous works. 5.1. On the previous PAC-Bayesian bound It is instructive to compare the new bound of Theorem 3 with the previous PAC-Bayesian domain adaptation bound of Theorem 2. In Theorem 3, the non-estimable terms are the domain divergence βq(T S) and the term ηT \S. Contrary to the non-controllable term λ(ρ) of Theorem 2, these terms do not depend on the learned posterior distribution ρ: For every ρ on H, βq(T S) and ηT \S are constant values measuring the relation between the domains. Moreover, the fact that the domain divergence βq(T S) is not an additive term but a multiplicative one (as opposed to disρ(SX, TX)+λ(ρ) in Theorem 2) is a contribution of our new analysis. Consequently, βq(T S) can be viewed as a hyperparameter allowing us to tune the trade-off between the target voters disagreement and the source joint error. Experiments of Section 8 confirm that this hyperparameter can be successfully selected. A New PAC-Bayesian Perspective on Domain Adaptation 5.2. On some domain adaptation assumptions In order to characterize which domain adaptation task may be learnable, Ben-David et al. (2012) presented three assumptions that can help domain adaptation. Our Theorem 3 does not rely on these assumptions, but they can be interpreted in our framework as discussed below. On the covariate shift. A domain adaptation task fulfills the covariate shift assumption (Shimodaira, 2000) if the source and target domains only differ in their marginals according to the input space, i.e., TY |x(y) = SY |x(y). In this scenario, one may estimate βq(TX SX), and even ηT \S, by using unsupervised density estimation methods. Interestingly, by also assuming that the domains share the same support, we have ηT \S =0. Then from Line (9) we obtain 2d T (ρ)+ E x SX TX(x) SX(x) E h ρ E h ρI h(x) =y I h (x) =y , which suggests a way to correct the shift between the domains by reweighting the labeled source distribution, while considering the information from the target disagreement. On the weight ratio. The weight ratio (Ben-David et al., 2012) of source and target domains, with respect to a collection of input space subsets B 2X, is given by CB(S, T ) = inf b B, TX(b) =0 SX(b) TX(b) . When CB(S, T ) is bounded away from 0, adaptation should be achievable under covariate shift. In this context, and when SUPP(S)= SUPP(T ), the limit case of β (T S) is equal to the inverse of the point-wise weight ratio obtained by letting B = {{x} : x X} in CB(S, T ). Indeed, both βq and CB compare the densities of source and target domains, but provide distinct strategies to relax the point-wise weight ratio; the former by lowering the value of q and the latter by considering larger subspaces B. On the cluster assumption. A target domain fulfills the cluster assumption when examples of the same label belong to a common area of the input space, and the differently labeled areas are well separated by low-density regions (formalized by the probabilistic Lipschitzness of Urner et al., 2011). Once specialized to linear classifiers, d T (ρ) behaves nicely in this context (see Section 7). 5.3. On representation learning The main assumption underlying our domain adaptation algorithm exhibited in Section 7 is that the support of the target domain is mostly included in the support of the source domain, i.e., the value of the term ηT \S is small. When T \S is sufficiently large to prevent proper adaptation, one could try to reduce its volume while taking care to preserve a good compromise between d T (ρ) and e S(ρ), using a representation learning approach, i.e., by projecting source and target examples into a new common space, as done for example by Chen et al. (2012); Ganin et al. (2016). 6. PAC-Bayesian Generalization Guarantees To compute our domain adaptation bound, one needs to know the distributions S and TX, which is never the case in real life tasks. The PAC-Bayesian theory provides tools to convert the bound of Theorem 3 into a generalization bound on the target risk computable from a pair of sourcetarget samples (S, T) (S)ms (TX)mt. To achieve this, we first provide generalization guarantees for d T (ρ) and e S(ρ). These results are presented as corollaries of Theorem 4 below, that generalizes a PAC-Bayesian theorem of Catoni (2007) to arbitrary loss functions.8 Indeed, Theorem 4, with ℓ(h, x, y)=I h(x) =y and Equation (1), gives the usual bound on the Gibbs risk. Theorem 4. For any domain D over X Y , any set of voters H, any prior π over H, any loss ℓ: H X Y [0, 1], any real number c>0, with a probability at least 1 δ over the choice of {(xi, yi)}m i=1 (D)m, we have for all ρ on H: E (x,y) D E h ρ ℓ(h, x, y) i=1 E h ρℓ(h, xi, yi) + KL(ρ π) + ln 1 Note that, similarly to Mc Allester & Keshet (2011), we could choose to restrict c (0, 2) to obtain a slightly looser but simpler bound. Using e c 1 c 1 2c2, an upper bound on the right hand side of above equation is given by h 1 m Pm i=1 Eh ρ ℓ(h, xi, yi) + KL(ρ π)+ln 1 We now exploit Theorem 4 to obtain generalization guarantees on the expected disagreement and the expected joint error. PAC-Bayesian bounds on these quantities appeared in Germain et al. (2015), but under different forms. In Corollary 5 below, we are especially interested in the possibility of controlling the trade-off between the empirical estimate computed on the samples and the complexity term KL(ρ π) with the help of parameters b and c. Corollary 5. For any domains S and T over X Y , any set of voters H, any prior π over H, any δ (0, 1], any real numbers b > 0 and c > 0, we have: with a probability at least 1 δ over T (TX)mt, ρ on H, d T (ρ) c 1 e c bd T (ρ)+ 2KL(ρ π)+ln 1 with a probability at least 1 δ over S (S)ms, ρ on H, e S(ρ) b 1 e b be S(ρ)+ 2KL(ρ π)+ln 1 where bd T (ρ) and be S(ρ) are the empirical estimations of the target voters disagreement and the source joint error. 8To do so, we exploit a result of Maurer (2004) that allows to generalize PAC-Bayes theorems to arbitrary bounded loss function (see the proof of Theorem 4 in supplemental). A New PAC-Bayesian Perspective on Domain Adaptation Proof. Given π and ρ over H, we consider a new prior π2 and a new posterior ρ2, both over H2, such that: hij = (hi, hj) H2, π2(hij) = π(hi)π(hj), and ρ2(hij) = ρ(hi)ρ(hj). Thus, KL(ρ2 π2) = 2KL(ρ2 π2) (see Germain et al., 2015). Let us define two new loss functions for a paired voter hij H2: ℓd(hij, x, y) = I hi(x) = hj(x) , and ℓe(hij, x, y) = I hi(x) = y I hj(x) = y . Then, the bound on d T (ρ) is obtained from Theorem 4 with ℓ:= ℓd, and Equation (2). The bound on e S(ρ) is similarly obtained with ℓ:= ℓe and using Equation (5). For algorithmic simplicity, we deal with Theorem 3 when q . Thanks to Corollary 5, we obtain the following generalization bound defined with respect to the empirical estimates of the target disagreement and the source joint error. Theorem 6. For any domains S and T over X Y , any set of voters H, any prior π over H, any δ (0, 1], any b>0 and c>0, with a probability at least 1 δ over the choices of S (S)ms and T (TX)mt, we have ρ on H, RT (Gρ) c 1 2 bd T (ρ) + b be S(ρ) + ηT \S + c mt c+ b ms b 2 KL(ρ π) + ln 2 where bd T (ρ) and be S(ρ) are the empirical estimations of the target voters disagreement and the source joint error, and b = b 1 e b β (T S), and c = c 1 e c . Proof. We bound separately d T (ρ) and e S(ρ) using Corollary 5 (with probability 1 δ 2 each), and then combine the two upper bounds according to Theorem 3. From an optimization perspective, the problem suggested by the bound of Theorem 6 is much more convenient to minimize than the PAC-Bayesian bound derived from Theorem 2 in Germain et al. (2013). The former is smoother than the latter: the absolute value related to the domain disagreement disρ(SX, TX) of Equation (4) disappears in benefit of the domain divergence β (T S), which is constant and can be considered as an hyperparameter of the algorithm. Additionally, Theorem 2 requires equal source and target sample sizes while Theorem 6 allows ms =mt. Moreover, recall that in Germain et al. (2013) the ρ-dependent non-constant term λ(ρ) is ignored. In our new analysis, such compromise is not mandatory in order to apply the theoretical result to real problems, since the non-estimable term ηT \S is constant and does not depend on the learned ρ. Hence, we can neglect ηT \S without any impact on the optimization problem described in the next section. Beside, it is realistic to consider ηT \S as a small quantity in situations where the source and target supports are similar. 7. Specialization to Linear Classifiers In order to derive an algorithm, we now specialize the bounds of Theorems 3 and 6 to the risk of a linear classifier hw, defined by a weight vector w Rd : x X, hw(x) = sign (w x) . The taken approach is the one privileged in numerous PACBayesian works (e.g., Langford & Shawe-Taylor, 2002; Ambroladze et al., 2006; Mc Allester & Keshet, 2011; Parrado-Hern andez et al., 2012; Germain et al., 2009; 2013), as it makes the risk of the linear classifier hw and the risk of a (properly parametrized) majority vote coincide, while in the same time promoting large margin classifiers. To this end, let H be the set of all linear classifiers over the input space, H = hw | w Rd , and let ρw over H be a posterior distribution, resp. a prior distribution π0, that is constrained to be a spherical Gaussian with identity covariance matrix centered on vector w, resp. 0, hw H, ρw(hw ) = 1 and π0(hw ) = 1 The KL-divergence between ρw and π0 simply is KL(ρw π0) = 1 2 w 2 . (10) Thanks to this parameterization, the majority vote classifier Bρw corresponds to the one of the linear classifier hw (see above cited PAC-Bayesian works). That is, x X, w H, hw(x) = sign E hw ρwhw (x) =Bρw(x) . Then, RD(hw) = RD(Bρw) for any data distribution D. Moreover, Langford & Shawe-Taylor (2002) showed that the closely related Gibbs risk (Equation 1) is related to the linear classifier margin y w x x , as follows: RD(Gρw) = E (x,y) D Φ y w x where Φ(x)= 1 2 , and Erf(x) = 2 π R x 0 e t2dt is the Gauss error function. Here, Φ(x) can be seen as a smooth surrogate sometimes called the probit loss (e.g., Mc Allester & Keshet, 2011) of the zero-one loss function I x 0 relying on y w x x . Note that w plays an important role on the value of RD(Gρw), but not on RD(hw). Indeed, RD(Gρw) tends to RD(hw) as w grows, which can provide very tight bounds (see the empirical analyses of Ambroladze et al., 2006; Germain et al., 2009). In the PAC-Bayesian context, w turns out to be a measure of complexity of the learned classifier, as Equation (10) shows. We now seek to express the expected disagreement d D(ρw) and the expected joint error e D(ρw) of Equations (2) A New PAC-Bayesian Perspective on Domain Adaptation Figure 1. Graphical representation of the loss functions given by the specialization to linear classifiers. Figure 2. Decision boundaries of DALC on the intertwining moons toy problem, for fixed parameters B=C=1, and a RBF kernel k(x, x ) = exp( x x 2). The target points are black. The positive, resp. negative, source points are red, resp. green. The blue dashed line shows the decision boundaries of algorithm PBDA (Germain et al., 2013). and (5) related to the parameterized distribution ρw. As shown in Germain et al. (2013) the former is given by d D(ρw) = E x DX Φdis where Φdis(x) = 2 Φ(x) Φ( x). Following a similar approach, we obtain, for all w R, e D(ρw) = E (x,y) D E h ρw E h ρw I h(x) =y I h (x) =y = E (x,y) D E h ρw I h(x) =y E h ρw I h (x) =y = E (x,y) D Φerr with Φerr(x) = Φ(x) 2. As function Φ in Equation (11), functions Φerr and Φdis defined above can be interpreted as loss functions for linear classifiers (illustrated by Figure 1). Domain adaptation bound. Theorem 3 specialized to linear classifiers gives the following corollary. Note that, as mentioned above, RT (hw) = RT (Bρw) 2 RT (Gρw). Corollary 7. Let S and T respectively be the source and the target domains on X Y . For all w R, we have : RT (hw) d T (ρw) + 2 β (T S) e S(ρw) + 2 ηT \S , Figure 1 leads to an insightful geometric interpretation of the domain adaptation trade-off promoted by Corollary 7. For fixed values of β (T S) and ηT \S, the target risk RT (hw) is upper-bounded by a (β -weighted) sum of two losses. The expected Φerr-loss (i.e., the joint error) is computed on the (labeled) source domain; it aims to label the source examples correctly, but is more permissive on the required margin than the Φ-loss (i.e., the Gibbs risk). The expected Φdis-loss (i.e., the disagreement) is computed on the target (unlabeled) domain; it promotes large unsigned target margins. Thus, if a target domain fulfills the cluster assumption (described in Section 5.2), d T (ρw) will be low when the decision boundary crosses a low-density region between the homogeneous labeled clusters. Hence, Corollary 7 reflects that some source errors may be allowed if, doing so, the separation of the target domain is improved. Generalization bound and learning algorithm. Theorem 6 specialized to linear classifiers gives the following. Corollary 8. For any domains S and T over X Y , any δ (0, 1], any a>0 and b>0, with a probability at least 1 δ over the choices of S (S)ms and T (TX)mt, we have w R : RT (hw) c bd T (ρw) + 2 b be S(ρw) + 2 ηT \S + 2 c mt c + b ms b w 2 + ln 2 For a source S={(xi, yi)}ms i=1 and a target T={(x i)}mt i=1 samples of potentially different size, and some hyperparameters C>0, B>0, minimizing the next objective function w.r.t w R is equivalent to minimize the above bound. C bd T (ρw) + B be S(ρw) + w 2 (12) i=1 Φdis w x i x i + B i=1 Φerr yi w xi We call the optimization of Equation (12) by gradient descent the DALC algorithm, for Domain Adaptation of Linear Classifiers. The kernel trick applies to DALC. That is, given a kernel k:Rd Rd R, one can express a linear classifier in a RKHS9 by a dual weight vector α Rms+mt : hw( ) = sign i=1 αik(xi, ) + i=1 αi+msk(x i, ) Even though the objective function is highly non-convex, we achieved good empirical results by minimizing the kernelized version of Equation (12) by gradient descent, with a uniform weight vector as a starting point. More details are given in the supplementary material. 8. Experimental Results Firstly, Figure 2 illustrates the behavior of the decision boundary of our algorithm DALC on an intertwining moons toy problem10, where each moon corresponds to a label. 9It is non-trivial to show that the kernel trick holds when π0 and ρw are Gaussian over infinite-dimensional feature space. As mentioned by Mc Allester & Keshet (2011), it is, however, the case provided we consider Gaussian processes as measure of distributions π0 and ρw over (infinite) H. 10We generate each pair of moons with the make moons function provided in scikit-learn (Pedregosa et al., 2011). A New PAC-Bayesian Perspective on Domain Adaptation Table 1. Error rates on Amazon dataset. Best risks appear in bold and seconds are in italic. SVM DASVM CODA PBDA DALC (CV) (RCV) (RCV) (RCV) (RCV) books DVDs 0.179 0.193 0.181 0.183 0.178 books electro 0.290 0.226 0.232 0.263 0.212 books kitchen 0.251 0.179 0.215 0.229 0.194 DVDs books 0.203 0.202 0.217 0.197 0.186 DVDs electro 0.269 0.186 0.214 0.241 0.245 DVDs kitchen 0.232 0.183 0.181 0.186 0.175 electro books 0.287 0.305 0.275 0.232 0.240 electro DVDs 0.267 0.214 0.239 0.221 0.256 electro kitchen 0.129 0.149 0.134 0.141 0.123 kitchen books 0.267 0.259 0.247 0.247 0.236 kitchen DVDs 0.253 0.198 0.238 0.233 0.225 kitchen electro 0.149 0.157 0.153 0.129 0.131 Average 0.231 0.204 0.210 0.208 0.200 The target domain, for which we have no label, is a rotation of the source one. The figure shows clearly that DALC succeeds to adapt to the target domain, even for a rotation angle of 50 . We see that DALC does not rely on the restrictive covariate shift assumption, as some source examples are misclassified. This behavior illustrates the DALC trade-off in action, that concedes some errors on the source sample to lower the disagreement on the target sample. Secondly, we evaluate DALC on the classical Amazon.com Reviews benchmark (Blitzer et al., 2006) according to the setting used by Chen et al. (2011); Germain et al. (2013). This dataset contains reviews of four types of products (books, DVDs, electronics, and kitchen appliances) described with about 100, 000 attributes. Originally, the reviews were labeled with a rating from 1 to 5. Chen et al. (2011) proposed a simplified binary setting by regrouping ratings into two classes (products rated lower than 3 and products rated higher than 4). Moreover, they reduced the dimensionality to about 40,000 by only keeping the features appearing at least ten times for a given domain adaptation task. Finally, the data are pre-processed with a tf-idf re-weighting. A domain corresponds to a kind of product. Therefore, we perform twelve domain adaptation tasks. For instance, books DVD s is the task for which the source domain is books and the target one is DVDs . We compare DALC with the classical non-adaptive algorithm SVM (trained only on the source sample), the adaptive algorithm DASVM (Bruzzone & Marconcini, 2010), the adaptive cotraining CODA (Chen et al., 2011), and the PAC-Bayesian domain adaptation algorithm PBDA (Germain et al., 2013) based on Theorem 2. Note that, in Germain et al. (2013), DASVM has shown better accuracy than SVM, CODA and PBDA. Each parameter is selected with a grid search thanks to a usual cross-validation (CV) on the source sample for SVM, and thanks to a reverse validation procedure11 (RCV) 11For details on the reverse validation procedure, see Bruzzone & Marconcini (2010); Zhong et al. (2010). Other details on our for CODA, DASVM, PBDA, and DALC. The algorithms use a linear kernel and consider 2,000 labeled source examples and 2,000 unlabeled target examples. Table 1 reports the error rates of all the methods evaluated on the same separate target test sets proposed by Chen et al. (2011). Above all, the adaptive approaches show the best result, implying that tackling this problem with a domain adaptation method is reasonable. Then, our new method DALC is the best algorithm overall on this task. Except for the two adaptive tasks between electronics and DVDs , DALC is either the best one (six times), or the second one (four times). Moreover, according to a Wilcoxon signed rank test with a 5% significance level, we obtain a probability of 89.5% that DALC is better than PBDA. This test tends to confirm that our new bound improves the analysis done previously in Germain et al. (2013), in addition to being more interpretable. 9. Conclusion We propose a new domain adaptation analysis for majority vote learning. It relies on an upper bound on the target risk, expressed as a trade-off between the voters disagreement on the target domain, the voters joint errors on the source one, and a term reflecting the worst case error in regions where the source domain is non-informative. To the best of our knowledge, a crucial novelty of our contribution is that the trade-off is controlled by the divergence βq(T S) (Equation 7) between the domains: The divergence is not an additive term (as in many domain adaptation bounds) but is a factor weighting the importance of the source information. Our analysis, combined with a PAC-Bayesian generalization bound, leads to a new domain adaptation algorithm for linear classifiers. The empirical experiments show that our new algorithm outperforms the previous PAC-Bayesian approach (Germain et al., 2013). As future work, we first aim at investigating the case where the domains divergence βq(T S) can be estimated, i.e., when the covariate shift assumption holds or when some target labels are available. In these scenarios, βq(T S) might not be considered as a hyperparameter to tune. Last but not least, the term ηT \S of our bound suggesting that the two domains should live in the same regions can be dealt with a representation learning approach. As mentioned in Section 5.3, this could be an incentive to combine our learning algorithm with existing representation learning techniques. In another vein, considering an active learning setup (as in Berlind & Urner, 2015), one could query the labels of target examples to estimate the value bounded by ηT \S. We see this as a great source of inspiration for new algorithms for this learning paradigm. experimental protocol are given in supplementary material. A New PAC-Bayesian Perspective on Domain Adaptation Acknowledgements This work was supported in part by the French project LIVES ANR-15-CE23-0026-03, and in part by NSERC discovery grant 262067. Ambroladze, A., Parrado-Hern andez, E., and Shawe Taylor, J. Tighter PAC-Bayes bounds. In NIPS, pp. 9 16, 2006. Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In NIPS, pp. 137 144, 2006. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. Wortman. A theory of learning from different domains. Mach. Learn., 79(1-2):151 175, 2010. Ben-David, S., Shalev-Shwartz, S., and Urner, R. Domain adaptation can quantity compensate for quality? In ISAIM, 2012. Berlind, C. and Urner, R. Active nearest neighbors in changing environments. In ICML, pp. 1870 1879, 2015. Blitzer, J. Domain adaptation of natural language processing systems. Ph D thesis, UPenn, 2007. Blitzer, J., Mc Donald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, pp. 120 128, 2006. Bruzzone, L. and Marconcini, M. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Trans. Pattern Anal. Mach. Intel., 32(5):770 787, 2010. Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Inst. of Mathematical Statistic, 2007. Chen, M., Weinberger, K.Q., and Blitzer, J. Co-training for domain adaptation. In NIPS, pp. 2456 2464, 2011. Chen, M., Xu, Z. E., Weinberger, K. Q., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In ICML, pp. 767 774, 2012. Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103 126, 2014. Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In NIPS, pp. 442 450, 2010. Cortes, C., Mohri, M., and Medina, A. Mu noz. Adaptation algorithm and theory based on generalized discrepancy. In ACM SIGKDD, pp. 169 178, 2015. Daum e III, H. Frustratingly easy domain adaptation. In ACL, 2007. Ganin, Y. and Lempitsky, V. S. Unsupervised domain adaptation by backpropagation. In ICML, pp. 1180 1189, 2015. Ganin, Y., Ustinova, E., H, Ajakan, Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. JMLR, 17(59):1 35, 2016. Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In ICML, pp. 353 360, 2009. Germain, P., Habrard, A., Laviolette, F., and Morvant, E. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pp. 738 746, 2013. Germain, P., Lacasse, A., Laviolette, F., Marchand, ., and Roy, J.-F. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. JMLR, 16:787 860, 2015. Herbrich, R. and Graepel, T. A PAC-Bayesian margin bound for linear classifiers: Why svms work. In NIPS, pp. 224 230, 2000. Huang, J., Smola, A., Gretton, A., Borgwardt, K., and Sch olkopf, B. Correcting sample selection bias by unlabeled data. In NIPS, pp. 601 608, 2006. Jiang, J. A literature survey on domain adaptation of statistical classifiers, 2008. Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, pp. 769 776, 2006. Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. In NIPS, pp. 439 446, 2002. Li, X. and Bilmes, J. A Bayesian divergence prior for classiffier adaptation. In AISTATS, pp. 275 282, 2007. Liu, Q., Mackey, A. J., Roos, D. S., and Pereira, F. Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics, 24(5): 597 605, 2008. Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009. A New PAC-Bayesian Perspective on Domain Adaptation Margolis, A. A literature review of domain adaptation with unlabeled data, 2011. Maurer, A. A note on the PAC-Bayesian theorem. Co RR, cs.LG/0411099, 2004. Mc Allester, D. A. Some PAC-Bayesian theorems. Mach. Learn., 37:355 363, 1999. Mc Allester, D. A. and Keshet, J. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pp. 2205 2212, 2011. Morvant, E., Habrard, A., and Ayache, S. Parsimonious Unsupervised and Semi-Supervised Domain Adaptation with Good Similarity Functions. KAIS, 33(2):309 349, 2012. Pan, S. J. and Yang, Q. A survey on transfer learning. T. Knowl. Data En., 22(10):1345 1359, 2010. Parrado-Hern andez, E., Ambroladze, A., Shawe-Taylor, J., and Sun, S. PAC-Bayes bounds with data dependent priors. JMLR, 13:3507 3531, 2012. Patel, V. M., Gopalan, R., Li, R., and Chellappa, R. Visual domain adaptation: A survey of recent advances. IEEE Signal Proc. Mag., 32(3):53 69, 2015. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. JMLR, 12: 2825 2830, 2011. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. The MIT Press, 2009. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Plann. Inference, 90(2):227 244, 2000. Sugiyama, M., Nakajima, S., Kashima, H., von B unau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2007. Urner, R., Shalev-Shwartz, S., and Ben-David, S. Access to unlabeled data can speed up prediction time. In ICML, pp. 641 648, 2011. Zhang, C., Zhang, L., and Ye, J. Generalization bounds for domain adaptation. In NIPS, pp. 3320 3328, 2012. Zhong, E., Fan, W., Yang, Q., Verscheure, O., and Ren, J. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML-PKDD, pp. 547 562, 2010.