# adaptation_based_on_generalized_discrepancy__517c7f1a.pdf Journal of Machine Learning Research 20 (2019) 1-30 Submitted 5/15; Revised 10/15; Published 1/19 Adaptation Based on Generalized Discrepancy Corinna Cortes corinna@google.com Google Research, 111 8th ave, New York, NY 10011 Mehryar Mohri mohri@cims.nyu.edu Andr es Mu noz Medina munoz@cims.nyu.edu Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012 Editor: Multi Task Learning We present a new algorithm for domain adaptation improving upon a discrepancy minimization algorithm, (DM), previously shown to outperform a number of algorithms for this problem. Unlike many previously proposed solutions for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the reweighting depends on the hypothesis sought. The algorithm is derived from a less conservative notion of discrepancy than the DM algorithm called generalized discrepancy. We present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also give a detailed theoretical analysis of its learning guarantees which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon discrepancy minimization. Keywords: domain adaptation, learning theory 1. Introduction A standard assumption in statistical learning theory and PAC learning is that training and test samples are drawn from the same distribution (Vapnik, 1998; Valiant, 1984). In practice, however, this assumption often does not hold: the source and target distributions may somewhat differ. This problem is known as domain adaptation and arises in a variety of applications such as natural language processing and computer vision (Dredze et al., 2007; Blitzer et al., 2007; Jiang and Zhai, 2007; Leggetter and Woodland, 1995; Mart ınez, 2002; Hoffman et al., 2014). The domain adaptation problem may appear when the distributions over the instance space differ, the so-called covariate shift problem, or when the labeling functions associated with each domain disagree. In practice, a combination of both issues occurs and, for adaptation to succeed, the divergence between the two domains needs to be relatively small. This is clear for the labeling functions since, if the learner receives source labels that are vastly different from the target ones, no learning algorithm can generalize well to the target domain. The same holds when input distributions largely differ. This intuition was formalized by Ben-David et al. (2010) and Ben-David and Urner (2012) who showed that even in the favorable scenario where the source and target distribution admit the same support, a sample of size in the order of that of the support is needed in order to solve the domain adaptation problem. As the authors point out, the domain adaptation problem becomes intractable when the labeling function for the training c 2019 Corinna Cortes, Mehryar Mohri, Andr es Mu noz Medina. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/15-192.html. Cortes, Mohri and Mu noz Medina data is vastly different from the labeling function used for testing. On the other hand, when some similarity between domains exist, it has been empirically and theoretically shown that adaptation algorithms can be beneficial and in fact a large number of algorithms for this task have been proposed over the past decade. The large majority of them fall in one of the following paradigms: 1. Learning a new feature representation. The core idea behind these algorithms is to map the source and target data into a new feature space where the difference between source and target distributions is reduced. Transfer Component Analysis (TCA) (Pan et al., 2011) and the work on Frustratingly Easy Domain Adaptation (FE) (Daum e III, 2007) belong to this family of algorithms. Whereas some empirical evidence of the effectiveness of these algorithms exists in the literature, to the best of our knowledge, no work has been done to provide learning guarantees for these algorithms. 2. Reweighting. Originated in the Statistics literature on sample bias correction, these techniques attempt to correct the difference between distributions by multiplying the loss at each training example by a positive weight. Most of the classical algorithms such as KMM (Huang et al., 2006), KLIEP (Sugiyama et al., 2007) and a two-step algorithm by Bickel et al. (2007) fall in this category. The main focus of this work will be on the latter. A common trait shared by most algorithms in this category is that their reweighting schemes are based on the minimization of a divergence measure between the empirical source and target distributions. For instance, the KL-divergence in the case of KLIEP and the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) for KMM. The guarantees of these algorithms are therefore given as a function of the chosen divergence. The main drawback of these measures is that they do not take into account the hypothesis set or the loss function, both crucial components of any learning algorithm. In contrast, the discrepancy introduced by Mansour et al. (2009) and further studied by Cortes and Mohri (2011) is a measure of the divergence between distributions tailored to domain adaptation that precisely takes into account both the loss function and the hypothesis set. The d A-distance, introduced by Devroye et al. (1996)[pp. 271-272] under the name of generalized Kolmogorov-Smirnov distance, later by Ben-David et al. (2006), coincides with the discrepancy when the binary loss function is used. The discrepancy is a pivotal concept used in the analysis of several adaptation scenarios: the Y-discrepancy or integral probability metric (Zhang et al., 2012) was successfully used by Mohri and Mu noz (2012) to provide tight learning guarantees for the related task of learning with drifting distributions, whereas a modified version of the discrepancy was used by Germain et al. (2013) to study the problem of domain adaptation in a PAC-Bayesian setting. The discrepancy-based generalization bounds given by Mansour et al. (2009) motivated a discrepancy minimization (DM) algorithm (Cortes and Mohri, 2013), which attempts to minimize said bounds. Besides its favorable theoretical guarantees, this algorithm was shown to perform well in a number of adaptation tasks and to match or outperform several other algorithms such as KMM, KLIEP and the aforementioned two stage algorithm by Bickel et al. (2007). One shortcoming of the DM algorithm, however, is that it seeks to reweight the loss on the training samples to minimize a quantity defined as the maximum over all pairs of hypotheses, including hypotheses that the learning algorithm might not ever consider as candidates. Thus, the algorithm tends to be too conservative on its choice of weights. We Adaptation Based on Generalized Discrepancy present an alternative theoretically well founded algorithm for domain adaptation that is based on minimizing a finer quantity, the generalized discrepancy, and that seeks to improve upon DM. Unlike the DM algorithm, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the weights assigned to training sample losses vary as a function of the hypothesis h. This helps us ensure that for every hypothesis, h, the empirical loss on the source distribution is as close as possible to the empirical loss on the target distribution for that particular h. We describe the learning scenario considered (Section 2), then present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem (Section 3). Next, we analyze the theoretical properties of our algorithm, which guide us in choosing the surrogate hypothesis set defining our algorithm (Section 4). In Section 5, we further analyze the optimization problem defining our algorithm and derive an equivalent form that can be handled by a standard convex optimization solver. In Section 6, we report the results of experiments demonstrating that our algorithm improves upon the DM algorithm in several tasks. 2. Learning Scenario This section defines the learning scenario of domain adaptation we consider, which coincides with that of Ben-David et al. (2006) or Mansour et al. (2009); Cortes and Mohri (2013). We first introduce the definitions and concepts needed for the following sections. For the most part, we follow the definitions and notation of Cortes and Mohri (2013). Let X denote the input space and Y R the output space. We define a domain as a pair formed by a distribution over X and a target labeling function mapping from X to Y. Throughout the paper, (Q, f Q) denotes the source domain and (P, f P ) the target domain with Q the source and P the target distribution over X and with f Q, f P : X Y the source and target labeling functions, respectively. In the scenario of domain adaptation we consider, the learner receives two samples: a labeled sample of m points from the source domain S = ((x1, y1), . . . , (xm, ym)) (X Y)m with x1, . . . , xm drawn i.i.d. according to Q and yi = f Q(xi) for i [1, m]; and an unlabeled sample T = (x 1, . . . , x n) X n of size n drawn i.i.d. according to the target distribution P. We denote by b Q the empirical distribution corresponding to the (unlabeled) sample SX = (x1, . . . , xm) and by b P the empirical distribution corresponding to T . We will be in fact more interested in the scenario commonly encountered in practice where, in addition to these two samples, the learner receives a small amount of labeled data T = ((x 1, y 1), . . . , (x s, y s)) (X Y)s from the target domain. We consider a loss function L: Y Y R+ jointly convex in its two arguments. The Lp losses commonly used in regression and defined by Lp(y, y ) = |y y|p for p 1 are special instances of this definition. For any two functions h, h : X Y and any distribution D over X, we denote by LD(h, h ) the expected loss of h(x) and h (x): LD(h, h ) = Ex D[L(h(x), h (x))]. The learning problem consists of selecting a hypothesis h out of a hypothesis set H with a small expected loss LP (h, f P ) with respect to the target domain. We further extend this notation to arbitrary functions q: X R with a finite support as follows: Lq(h, h ) = P x X q(x)L(h(x), h (x)). Cortes, Mohri and Mu noz Medina X Input space Y Output space P Target distribution Q Source distribution b P Empirical target distribution b Q Empirical source distribution T Target unlabeled sample S Labeled source sample T Small target labeled sample SX Unlabeled source sample f P Target labeling function f Q Source labeling function LP (h,f P ) Expected target loss LQ(h,f Q) Expected source loss L b P (h,f P ) Empirical target loss L b Q(h,f Q) Empirical source loss disc(P,Q) Discrepancy DISC( b P,U) Generalized discrepancy disc H (P,Q) Local Discrepancy disc Y(P,Q) Y-discrepancy qmin DM solution Qh GDM solution Table 1: Notation table. 3. Algorithm In this section, we describe our new adaptation algorithm. We first review some related previous work. Next, we present the key idea behind our algorithm and derive its general form, and finally, formulate it as a convex optimization problem. 3.1. Previous Work It was shown by Mansour et al. (2009) and Cortes and Mohri (2011) (see also the d A-distance (Ben-David et al., 2006) in the case of binary loss for classification) that a key measure of the difference of two distributions in the context of adaptation is the discrepancy. Given a hypothesis set H, the discrepancy, disc, between two distributions P and Q over X is defined by: disc(P, Q) = max h,h H LP (h , h) LQ(h , h) . (1) The discrepancy has several advantages over other common divergence measures such as the L1 distance. We refer the reader to (Medina, 2015) for a detailed discussion on this subject. Several generalization bounds for adaptation in terms of the discrepancy have been given in the past (Ben-David et al., 2006; Mansour et al., 2009; Cortes and Mohri, 2011, 2013). including pointwise guarantees in the case of kernel-based regularization algorithms, which includes algorithms such as support vector machines (SVM), kernel ridge regression, or support vector regression (SVR). The bounds given in (Mansour et al., 2009) motivated a discrepancy minimization algorithm. Given a positive semi-definite (PSD) kernel K, the hypothesis returned by the algorithm is the solution of the following optimization problem min h H λ h 2 K + Lqmin(h, f Q), (2) where K is the norm in the reproducing Hilbert space H induced by the kernel K and qmin is a distribution over the support of b Q such that qmin = argminq Q disc(q, b P), where Q = [0, 1]SX is the set of all distributions defined over the support of b Q. Besides its theoretical motivation, this algorithm has been shown to outperform several other algorithms in a series of experiments carried out by Cortes and Mohri (2013). Adaptation Based on Generalized Discrepancy Observe that, by definition, the objective function optimized by qmin corresponds to a maximum over all pairs of hypotheses. But, the maximizing pair of hypotheses may not be among the candidates ever considered by the learning algorithm. Thus, a learning algorithm based on discrepancy minimization tends to be too conservative. 3.2. Main Idea From here on we assume the algorithm selected by the learner is an instance of a regularized risk minimization algorithm over the Hilbert space H induced by a PSD kernel K. With knowledge of the target labels, these algorithms return a hypothesis h solution of minh H F(h) where F(h) = λ h 2 K + L b P (h, f P ), (3) where λ 0 is a regularization parameter. Thus, h can be viewed as the ideal hypothesis. In view of that, we can formulate our objective, in the presence of a domain adaptation problem, as that of finding a hypothesis h whose loss LP (h, f P ) with respect to the target domain is as close as possible to LP (h , f P ). To do so, we will seek in fact a hypothesis h that is as close as possible to h , which would imply the closeness of the losses with respect to the target domains. We do not have access to f P and can only access the labels of the training sample S. Thus, we must resort to using in our objective function, instead of L b P (h, f P ), a reweighted empirical loss over the training sample S. The main idea behind our algorithm is to define, for any h H, a reweighting function Qh : SX = {x1, . . . , xm} R such that the objective function G defined for all h H by G(h) = λ h 2 K + LQh(h, f Q) (4) is uniformly close to F, thereby resulting in close minimizers. Since the first term of (3) and (4) coincide, the idea consists equivalently of seeking Qh such that LQh(h, f Q) and L b P (h, f P ) be as close as possible. Observe that this departs from the standard reweighting methods: instead of reweighting the training sample with some fixed set of weights, we allow the weights to vary as a function of the hypothesis h. Note that we have further relaxed the condition commonly adopted by reweighting techniques that the weights must be non-negative and sum to one. Of course, searching for Qh to directly minimize |LQh(h, f Q) L b P (h, f P )| is in general not possible since we do not have access to f P , but it is instructive to consider the imaginary case where the average loss L b P (h, f P ) is known to us for any h H. Qh could then be determined via Qh = argmin q F(SX ,R) |Lq(h, f Q) L b P (h, f P )|, (5) where F(SX , R) is the set of real-valued functions defined over SX . For any h, we can in fact select Qh such that LQh(h, f Q) = L b P (h, f P ) since Lq(h, f Q) is a linear function of q. Thus, the optimization problem (5) reduces to solving a simple linear equation. With this choice of Qh, the objective functions F and G coincide and by minimizing G we can recover the ideal solution h . Note that, in general, the DM algorithm could not recover that ideal solution. Even a finer discrepancy minimization algorithm exploiting the knowledge of L b P (h, f P ) for all h and seeking a distribution q min minimizing maxh H |Lq(h, f Q) L b P (h, f P )| could not, Cortes, Mohri and Mu noz Medina in general, recover the ideal solution since we could not have Lq min(h, f Q) = L b P (h, f P ) for all h H. Of course, L b P (h, f P ) is not accessible since the sample T is unlabeled. Instead, we will consider a non-empty convex set of candidate hypotheses H H that could contain a good approximation of f P . Using H as a set of surrogate labeling functions leads to the following definition of Qh instead of (5): Qh = argmin q F(SX ,R) max h H |Lq(h, f Q) L b P (h, h )|. (6) The choice of the subset H is of course key. Our choice will be based on the theoretical analysis of Section 4. Nevertheless, we now present the formulation of the optimization problem for an arbitrary choice of the convex subset H . Proposition 1 For any h H, let Qh be defined by (6). Then, the following identity holds for any h H: LQh(h, f Q) = 1 max h H L b P (h, h ) + min h H L b P (h, h ) . Proof For any h H, the equation Lq(h, f Q) = l with l R admits a solution q F(SX , R). Thus, {Lq(h, f Q) : q F(SX , R)} = R and for any h H, we can write LQh(h, f Q) = argmin l {Lq(h,f Q): q F(SX ,R)} max h H |l L b P (h, h )| = argmin l R max h H |l L b P (h, h )| = argmin l R max h H max n L b P (h, h ) l, l L b P (h, h ) o = argmin l R max n max h H L b P (h, h ) l, l min h H L b P (h, h ) o max h H L b P (h, h ) + min h H L b P (h, h ) , since the minimizing l is obtained for max h H L b P (h, h ) l=l min h H L b P (h, h ). In view of this proposition, with our choice of Qh based on (6), the objective function G of our algorithm (4) can be equivalently written for all h H as follows: G(h) = λ h 2 K + 1 max h H L b P (h, h ) + min h H L b P (h, h ) . (7) Using the fact the L b P is a jointly convex function, it is easy to show (see for instance Boyd and Vandenberghe, 2004) that G is in fact a convex function too. 4. Learning Guarantees Here, we present two different types of guarantees: a tight learning bound based on the Rademacher complexity and a pointwise bound derived from a stability analysis. We further show that our algorithm is in fact minimizing this pointwise bound. As in previous work, we assume that the loss function L is µ-admissible. Adaptation Based on Generalized Discrepancy Definition 2 A loss function L is µ-admissible if there exists µ > 0 such that the inequality |L(h(x), y) L(h (x), y)| µ|h(x) h (x)| (8) holds for all (x, y) X Y and h , h H. The Lp losses commonly used in regression, p 1, verify this condition (see Appendix C). 4.1. Rademacher Complexity Bounds Definition 3 Let Z be any set and G be a family of functions mapping Z to R. Given a sample S = {z1, . . . , zn} Z, the empirical Rademacher complexity of G is denoted by b RS(G) and defined by b RS(G) = 1 i=1 σig(zi) , where σis, called Rademacher variables, are independent random variables distributed according to the uniform distribution over { 1, 1}. The Rademacher complexity of G is defined as Rn(G) = E S b RS(G) . Our first generalization bound is given in terms of the Y-discrepancy, which is a generalization of the discrepancy distance. The Y-discrepancy was first introduced by Mohri and Mu noz (2012) in the context of learning with drifting distributions. Definition 4 The Y-discrepancy between two domains (P, f P ) and (Q, f Q) is defined by disc Y(P, Q) = sup h H LQ(h, f Q) LP (h, f P ) . Note that the definition depends on the labeling functions f P and f Q. We do not explicitly indicate that dependency for the sake of simplicity of the notation. We follow the analysis of (Mohri and Mu noz, 2012) to derive the following tight generalization bounds based on the notion of Y-discrepancy. Proposition 5 Let HQ and HP be the families of functions defined as follows: HQ := {x 7 L(h(x), f Q(x)): h H} and HP := {x 7 L(h(x), f P (x)): h H}. Define MQ and MP as MQ = supx X,h H L(h(x), f Q(x)) and MP = supx X,h H L(h(x), f P (x)). Then, for any δ > 0, 1. with probability at least 1 δ over the choice of a labeled sample S of size m, the following inequality holds for all h H: LP (h, f P ) L b Q(h, f Q) + disc Y(P, Q) + 2Rm(HQ) + MQ δ) 2m ; (9) 2. with probability at least 1 δ over the choice of a sample T of size n, the following inequality holds for all h H and any distribution q over a sample SX : LP (h, f P ) Lq(h, f Q) + disc Y( b P, q) + 2Rn(HP ) + MP δ) 2n . (10) Cortes, Mohri and Mu noz Medina Proof Let Φ(S) denote suph H L b Q(h, f Q) LP (h, f P ). Changing one point in S changes Φ(S) by at most MQ m . Thus, by Mc Diarmid s inequality, we have P Φ(S) E[Φ(S)] > ϵ M2 Q . Therefore, for any δ > 0, with probability at least 1 δ, the following holds for all h H: LP (h, f P ) L b Q(h, f Q) + E[Φ(S)] + MQ Next, we can bound E[Φ(S)] as follows: E[Φ(S)] = E sup h H L b Q(h, f Q) LP (h, f P ) E sup h H L b Q(h, f Q) LQ(h, f Q) + sup h H LQ(h, f Q) LP (h, f P ) 2Rm(HQ) + disc Y(P, Q), where the last inequality follows from a standard symmetrization inequality in terms of the Rademacher complexity and the definition of disc Y(P, Q). For the second bound we have, starting with a standard Rademacher complexity bound for HP , for any δ > 0, with probability at least 1 δ, the following holds for all h H: LP (h, f P ) L b P (h, f P ) + 2Rn(HP ) + MP Lq(h, f Q) + L b P (h, f P ) Lq(h, f Q) + 2Rn(HP ) + MP δ) 2n . (11) Moreover, by definition L b P (h, f P ) Lq(h, f Q) disc Y( b P, q) for any q. Replacing this bound in (11) yields the result. Observe that these bounds are tight as a function of the divergence measure (discrepancy) we use: in the absence of adaptation, the following standard Rademacher complexity learning bound holds: L b P (h, f P ) L b P (h, f P ) + 2Rn(HP ) + MP Our second adaptation bound differs from this inequality only by the fact that L b P (h, f P ) is replaced with Lq(h, f Q) + disc Y( b P, q). But, by definition of Y-discrepancy, there exists an h H such that |L b P (h, f P ) Lq(h, f Q)| = disc Y( b P, q). A similar analysis shows that our first bound is also tight. Given a labeled sample S from the source domain, Proposition 5 suggests choosing a distribution q with support SX that minimizes the right-hand side of (10). However, the quantity disc Y( b P, q) depends, by definition, on the unknown labels from the target domain and therefore cannot be minimized. Thus, we will instead upper bound the Y-discrepancy in terms of quantities that can be estimated. Adaptation Based on Generalized Discrepancy Let A(H) denote the set of all functions U: h 7 Uh mapping H to F(SX , R) such that for all h H, h 7 LUh(h, f Q) is a convex function. Thus, for any h H, Uh is a reweighting function defined over SX . A(H) contains all constant functions U such that Uh = q for all h H, where q is a distribution over SX . We will abuse the notation and denote this functions also by q. By Proposition 1, A(H) also includes the function Q: h Qh used by our algorithm. Definition 6 (Generalized discrepancy) For any U A(H), the generalized discrepancy between b P and U is denoted by DISC( b P, U) and is defined by DISC( b P, U) = sup h H,h H |L b P (h, h ) LUh(h, f Q)|. (12) We also denote by d1(f P , H ) the L1 distance of f P to H : d1(f P , H ) = min h0 H E b P |h0(x) f P (x)|. (13) The following theorem gives an upper bound on the Y-discrepancy in terms of the generalized discrepancy and d1(f P , H ). Proposition 7 For any distribution q over SX and any set H , the following inequality holds: disc Y( b P, q) DISC( b P, q) + µ d1(f P , H ). Proof Let h0 H , by the triangle inequality, we can write disc Y( b P, q) = sup h H |Lq(h, f Q) L b P (h, f P )| sup h H |Lq(h, f Q) L b P (h, h0)| + sup h H |L b P (h, h0) L b P (h, f P )| sup h H max h H |Lq(h, f Q) L b P (h, h )| + sup h H |L b P (h, h0) L b P (h, f P )|. The hypothesis h0 will later be chosen to minimize the distance of f P to H . By the µ-admissibility of the loss, the last term can be bounded as follows: sup h H |L b P (h, h0) L b P (h, f P )| µ E b P |f P (x) h0(x)|. Using this inequality and minimizing over h0 H yields: disc Y( b P, q) sup h H max h H |Lq(h, f Q) L b P (h, h )| + µ d1(f P , H ) = DISC( b P, q) + µ d1(f P , H ), which completes the proof. We can also bound the Y-discrepancy in terms of the discrepancy measure and the following measure of the difference of the source and target labeling functions: ηH(f P , f Q) = min h0 H max x supp( b P) |f P (x) h0(x)| + max x supp( b Q) |f Q(x) h0(x)| . Cortes, Mohri and Mu noz Medina Proposition 8 The following inequality holds for all distributions q over SX : disc Y( b P, q) disc( b P, q) + µ ηH(f P , f Q). Proof By the triangle inequality and the µ-admissibility of the loss, the following inequality holds for all h0 H: disc Y( b P, q) = sup h H |Lq(h, f Q) L b P (h, f P )| |L b P (h, h0) L b P (h, f P )| + |Lq(h, f Q) Lq(h, h0)| + sup h H |Lq(h, h0) L b P (h, h0)| µ sup x supp( b P) |h0(x) f P (x)|] + sup x supp( b Q) [|f Q(x) h0(x)|] + disc( b P, q). Minimizing over all h0 H gives disc Y( b P, q) µ ηH(f P , f Q)+disc( b P, q) and completes the proof. The following learning guarantees are immediate consequences of Propositions 5, 7 and 8. Corollary 9 Let H H be a convex set and q a distribution over SX . Then, for any δ > 0, each of the following inequalities holds with probability at least 1 δ for all h H: LP (h, f P ) Lq(h, f Q) + DISC( b P, q) + µ d1(f P , H ) + 2Rn(HP ) + MP δ) 2n , (14) LP (h, f P ) Lq(h, f Q) + disc( b P, q) + µ ηH(f P , f Q) + 2Rn(HP ) + MP δ) 2n . (15) In general, the bounds (14) and (15) are not comparable. However, when L is an LP loss for some p 1, we can show the existence of a set H for which (14) is a tighter bound than (15). The result is expressed in terms of the local discrepancy defined by: disc H ( b P, q) = sup h H,h H |L b P (h, h ) Lq(h, h )|, which is a finer measure than the standard discrepancy for which the supremum is defined over a pair of hypotheses both in H H . Theorem 10 Let L be the LP loss for some p 1. Let H := {B(r): r 0} be a set of all balls B(r) = {h H|Lq(h , f Q) rp}. Then, for any distribution q over SX , there exists H H such that the following holds: DISC( b P, q) + µ d1(f P , H ) disc H ( b P, q) + µ ηH(f P , f Q). Adaptation Based on Generalized Discrepancy Proof Fix a distribution q over SX . Let h 0 be an element of argminh0 H L b P (h0, f P ) 1 p + Lq(h0, f Q) 1 p . Choose H H as H = {h H|Lq(h , f Q) rp} with r = Lq(h 0, f Q) 1 p . Then, by definition, h 0 is in H . For the Lp loss, it is not hard to show that for all h, h H, |Lq(h, h ) Lq(h, f Q)| µ[Lq(h , f Q)] 1 p (see Appendix C). In view of this inequality, we can write: DISC( b P, q) = sup h H,h H |L b P (h, h ) Lq(h, f Q)| sup h H,h H |L b P (h, h ) Lq(h, h )| + sup h H,h H |Lq(h, h ) Lq(h, f Q)| disc H ( b P, q) + max h H µ[Lq(h , f Q)] 1 p = disc H ( b P, q) + µr = disc H ( b P, q) + µLq(h 0, f Q) 1 p . Using this inequality, Jensen s inequality, and the fact that h 0 is in H , we can write µ d1(f P , H ) + DISC( b P, q) µ min h0 H E x b P [|f P (x) h0(x)|] + µLq(h 0, f Q) 1 p + disc H ( b P, q) µ min h0 H E x b P [|f P (x) h0(x)|p] 1 p + µLq(h 0, f Q) 1 p + disc H ( b P, q) µ L b P (h 0, f P ) 1 p + µLq(h 0, f Q) 1 p + disc H ( b P, q). Moreover, by definition of h 0 the last expression is equal to L b P (h0, f P ) 1 p + Lq(h0, f Q) 1 p + disc H ( b P, q) µ min h0 H max x supp( b P) |f P (x) h0(x)| + max x supp( b Q) |f Q(x) h0(x)| + disc H ( b P, q) = µ ηH(f P , f Q) + disc H ( b P, q). which concludes the proof. Theorem 10 shows that the generalized discrepancy can provide a finer measure of the difference between two domains for some choices of H . Therefore, for a good choice of H , an algorithm minimizing the right-hand side of (14) would benefit from better theoretical guarantees than the DM algorithm. However, the optimization problem defined by (14) is not jointly convex in q and h. Instead, we propose to first minimize the generalized discrepancy and then use this reweighting function as input to our learning algorithm. Further motivation for this two-stage algorithm is given in the following section. 4.2. Pointwise Guarantees Similar to the guarantee presented by Cortes and Mohri (2013), we will seek to bound the difference between an ideal solution h and the solution obtained by our algorithm. We begin by stating the following bound motivating the DM algorithm. Cortes, Mohri and Mu noz Medina Theorem 11 (Cortes and Mohri, 2013) Let q be an arbitrary distribution over SX and let h and hq be the hypotheses minimizing λ h 2 K + L b P (h, f P ) and λ h 2 K + Lq(h, f Q) respectively. Then, the following inequality holds: λ h hq 2 K µ ηH(f P , f Q) + disc( b P, q). (16) Notice that the solution of DM minimizes the right-hand side of (16), that is disc( b P, q). The following theorem provides an analogous bound for our algorithm. Theorem 12 Let U be an arbitrary element of A(H) and let h and h U be the hypotheses minimizing λ h 2 K + L b P (h, f P ) and λ h 2 K + LUh(h, f Q) respectively. Then, the following inequality holds for any convex set H H: λ h h U 2 K µ d1(f P , H ) + DISC( b P, U). (17) Proof Fix U A(H) and let G b P denote h 7 L b P (h, f P ) and GU the function h 7 LUh(h, f Q). Since h 7 λ h 2 K + G b P (h) is convex and differentiable and since h is its minimizer, the gradient is zero at h , that is 2λh = G b P (h ). Similarly, since h 7 λ h 2 K +GU(h) is convex, it admits a sub-differential at any h H. Since h U is a minimizer, its sub-differential at h U must contain 0. Thus, there exists a sub-gradient g0 GU(h U) such that 2λh U = g0, where GU(h U) denotes the sub-differential of GU at h U. Using these two equalities we can write 2λ h h U 2 K = h h U, g0 G b P (h ) = g0, h h U G b P (h ), h h U GU(h ) GU(h U) + G b P (h U) G b P (h ) = L b P (h U, f P ) LUh(h U, f Q) + LUh(h , f Q) L b P (h , f P ) 2 sup h H |L b P (h, f P ) LUh(h, f Q)|, where we used for the first inequality the convexity of GU combined with the sub-gradient property of g0 GU(h U), and the convexity of G b P . For any h H, using the µadmissibility of the loss, we can upper bound the operand of the max operator as follows: |L b P (h, f P ) LUh(h, f Q)| |L b P (h, f P ) L b P (h, h0)| + |L b P (h, h0) LUh(h, f Q)| µ E x b P |f P (x) h0(x)| + max h H |L b P (h, h ) LUh(h, f Q)|, where h0 is an arbitrary element of H . Since this bound holds for all h0 H , it follows immediately that λ h h U 2 K µ min h0 H E b P |f P (x) h0(x)| + sup h H max h H |L b P (h, h ) LUh(h, f Q)|, which concludes the proof. Note that our choice of Q: h 7 Qh minimizes the right-hand side of (17) among all functions U A(H) since, for any U, we can write DISC( b P, U)= sup h H max h H |L b P (h, h ) LUh(h, f Q)| sup h H min q F(SX ) max h H |L b P (h, h ) Lq(h, f Q)| = sup h H max h H |L b P (h, h ) LQh(h, f Q)| = DISC( b P, Q). Adaptation Based on Generalized Discrepancy Thus, in view of Theorem 10, for any constant function U A(H) with Uh = q for some fixed distribution q over SX , the right-hand side of the bound of Theorem 11 is lower bounded by the right-hand side of the bound of Theorem 12, since the local discrepancy is a finer quantity than the discrepancy: disc H ( b P, q) disc( b P, q). Thus, as expected from the discussion after Theorem 10, our algorithm benefits from a more favorable guarantee than the DM algorithm for some particular choices of H , especially since, our choice of Q is based on the minimization over all elements in A(H) and not just the subset of constant functions mapping to a distribution. The following pointwise guarantee follows directly from Theorem 12. Corollary 13 Let h be a minimizer of λ h 2 K +L b P (h, f P ) and h Q a minimizer of λ h 2 K + LQh(h, f Q). Then, the following holds for any convex set H H and for all (x, y) X Y: |L(h Q(x), y) L(h (x), y)| µR µ d1(f P , H ) + DISC( b P, Q) where R2 = supx X K(x, x). Proof By the µ-admissibility of the loss, the reproducing property of H, and the Cauchy Schwarz inequality, the following holds for all x X and y Y: |L(h Q(x), y) L(h (x), y)| µ|h Q(x) h (x)| = µ| h Q h , K(x, ) K| µ h Q h K p K(x, x) R h Q h K. Upper bounding h Q h K using Theorem 12 and using the fact that Q: h Qh is a minimizer of the bound over all choices of U A(H) yields the desired result. The pointwise loss guarantee just presented can be directly used to bound the difference of the expected loss of h and h Q in terms of the same upper bounds, e.g., LP (h Q, f P ) LP (h , f P ) + µR µ d1(f P , H ) + DISC( b P, Q) Similarly, Theorem 10 directly implies the following Corollary. Corollary 14 Let h be a minimizer of λ h 2 K +L b P (h, f P ) and h Q a minimizer of λ h 2 K + LQh(h, f Q). Let supx X K(x, x) = R2. Then, there exists a choice of H H for which the following inequality holds uniformly over over (x, y) X Y: |L(h Q(x), y) L(h (x), y)| µR µηH(f P , f Q)+disc H ( b P, qmin) where qmin is the solution of the DM algorithm. Cortes, Mohri and Mu noz Medina The choice of the set H defining our algorithm is strongly motivated by the theoretical results of this section. In view of Theorem 10, we restrict our choice of H to the family H, parametrized only by the radius r. Since the generalized discrepancy DISC is a function of the set H which in turn depends only on r, the radius r is chosen to minimize (19). This can be done by using as a validation set a small amount of labeled data from the target domain which is typically available in practice. In particular, as the size of the unlabeled sample T increases, our estimate of the optimal radius r becomes more accurate. We provide a detailed description of our algorithm s implementation in Section 5. 4.3. Comparison against Other Learning Bounds We now compare the learning bounds just derived for our algorithm with those of some common reweighting techniques. In particular, we compare our bounds with those of Cortes et al. (2008) for the KMM algorithm. A similar comparison however can be derived for other algorithms based on importance weighting such as KLIEP or u LSIF. Assume P and Q admit densities p and q respectively. For every x X we denote by β(x) = p(x) q(x) the importance ratio and by β = β SX its restriction to SX . We also let bβ be the solution to the optimization problem solved by the KMM algorithm. Let hβ denote the solution to min h H λ h 2 + Lβ(h, f Q), (20) and hbβ be the solution to min h H λ h 2 + Lbβ(h, f Q). (21) The following proposition due to Cortes et al. (2008) relates the error of these hypotheses. The proposition requires the kernel K to be a strictly positive definite universal kernel, with Gram matrix K given by Kij = K(xi, xj). Proposition 15 Assume L(h(x), y) 1 for all (x, y) X Y, h H. For any δ > 0, with probability at least 1 δ we have: |LP (hβ, f P ) LP (hbβ, f P )| µ2R2λ 1 2max(K) λ λ1/2 min(K) where ϵ and B are the hyperparameters defining the KMM algorithm and λmax(K), λmin(K) denote the largest and smallest eigenvalues of K respectively. This bound and the one obtained in (19) are of course not comparable since the dependence on µ, R and λ is different. In some cases this dependency can be more favorable in (22) whereas for other values of these parameters (19) provides a better bound. Moreover, (22) depends on the condition number of K which can become really large in practice. However, the most important difference between these bounds is that (19) is given in terms of the ideal hypothesis h while (22) is given in terms of hβ, which, in view of the results of Cortes et al. (2010) is not guaranteed to have a good performance on the target distribution. Therefore (22) does not, in general, provide an informative bound. Adaptation Based on Generalized Discrepancy 4.4. Scenario of Additional Labeled Data Here, we consider a rather common scenario in practice where, in addition to the labeled sample S drawn from the source domain and the unlabeled sample T from the target domain, the learner receives a small amount of labeled data from the target domain T = ((x 1, y 1), . . . , (x s, y s)) (X Y)s. This sample is typically too small to be used solely to train an algorithm and achieve a good performance. However, it can be useful in at least two ways that we discuss here. One important benefit of T is to serve as a validation set to determine the parameter r that defines the convex set H used by our algorithm. The sample T can also be used to enhance the discrepancy minimization algorithm as we now show. Let b P denote the empirical distribution associated with T . To take advantage of T , the DM algorithm can be trained on the sample of size (m+s) obtained by combining S and T , which corresponds to the new empirical distribution b Q = m m+s b Q + s m+s b P . Note that for a fixed value m and large values of s, b Q essentially ignores the points from the source distribution Q, which corresponds to the standard supervised learning scenario in the absence of adaptation. Let q min denote the discrepancy minimization solution when using b Q . Since supp( b Q ) supp( b Q), the discrepancy using q min is a lower bound on the discrepancy using qmin: disc(q min, b P) = min supp(q) supp( b Q ) disc( b P, q) min supp(q) supp( b Q) disc( b P, q) = disc(qmin, b P). 5. Optimization Solution As shown in Section 3.2, the function G defining our algorithm is convex and the problem of minimizing the expression (7) is a convex optimization problem. Nevertheless, the problem is not straightforward to solve, in particular because evaluating a term like maxh H L b P (h, h ) that it contains requires solving a non-convex optimization problem. Here, we present an exact solution in the case of the L2 loss by solving a semi-definite programming (SDP) problem. 5.1. SDP Formulation As discussed in Section 4, the choice of H is a key component of our algorithm. In view of Corollary 14, we will consider the set H = {h | Lqmin(h , f Q) r2}. Equivalently, as a result of the reproducing property of H and the representer theorem, H may be defined as {a Rm| Pm j=1 qmin(xj)(Pm i=1 aiqmin(xi)1/2K(xi, xj) yj)2 r2}. Also by the representer theorem, the solution to (7) will be of the form h = n 1/2 Pn i=1 bi K(x i, ). Therefore, given normalized kernel matrices Kt, Ks, Kst defined respectively as Kij t = n 1K(x i, x j), Kij s = qmin(xi)1/2qmin(xj)1/2K(xi, xj) and Kij st = n 1/2qmin(xj)1/2K(x i, xj), problem (7) is equivalent to min b Rn λb Ktb + 1 max a Rm Ksa y 2 r2 Ksta Ktb 2 + min a Rm Ksa y 2 r2 Ksta Ktb 2 , (23) where y = (qmin(x1)1/2y1, . . . , qmin(xm)1/2ym) is the vector of normalized labels. Cortes, Mohri and Mu noz Medina Figure 1: Illustration of the sampling process on the set H . Lemma 16 The Lagrangian dual of the problem max a Rm Ksa y 2 r2 1 2 Ksta 2 b Kt Ksta, is given by min η 0,γ γ 2K st Kst + ηK2 s 1 2K st Ktb ηKsy 1 2b Kt Kst ηy Ks η( y 2 r2) + γ Furthermore, the duality gap for these problems is zero. The proof of the lemma is given in Appendix A. The lemma helps us derive the following equivalent SDP formulation for our original optimization problem. Its solution can be found in polynomial time using standard convex optimization solvers. Proposition 17 The optimization problem (23) is equivalent to the following SDP: max α,β,ν,Z,z 1 2 Tr(K st Kst Z) β α 2K st Kst 1 4 e K νKsy + 1 4z e K α + ν( y 2 r2) λKt + K2 t 1 2Kt Kstz 1 2z K st Kt β 0 ν 0 Tr(K2 s Z) 2y Ksz + y 2 r2, where e K = K st Kt(λKt + K2 t ) Kt Kst, and A denotes the pseudo-inverse of matrix A. In the following section we derive a more efficient approximate solution to the optimization problem using sampling, which helps reducing the problem to a simple QP. 5.2. QP Formulation The SDP formulation described in the previous section is applicable for a specific choice of H . In this section, we present an analysis that holds for an arbitrary compact, convex set H . First, notice that the problem of minimizing G (expression (7)) is related to the Adaptation Based on Generalized Discrepancy minimum enclosing ball (MEB) problem. For a set D Rd, the MEB problem is defined as follows: min u Rd max v D u v 2. Omitting the regularization and the min term from (7) leads to a problem similar to the MEB. Thus, we could benefit from the extensive literature and algorithmic study available for this problem (Kumar et al., 2003; Sch onherr, 2002; Yildirim, 2008). However, to the best of our knowledge, there is currently no solution available to this problem in the case of an infinite set D, as in the case of our problem. Instead, we present a solution for solving an approximation of (7) based on sampling. Let {h1, . . . , hk} be a set of hypotheses on the boundary of H , H and let C = C(h1, . . . , hk) denote their convex hull. The following is the sampling-based approximation of (7) that we consider: min h H λ h 2 K + 1 2 max i=1,...,k L b P (h, hi) + 1 2 min h C L b P (h, h ). (24) Proposition 18 Let Y = (Yij) Rn k be the matrix defined by Yij = n 1/2hj(x i) and y = (y 1, . . . , y k) Rk the vector defined by y i = n 1 Pn j=1 hi(x j)2. Then, the dual problem of (24) is given by max α,γ,β Yα + γ 2Kt 1 Yα + γ 2γ Kt K tγ + α y β (25) s.t. 1 α = 1 2, 1β Y γ, α 0, where 1 is the vector in Rk with all components equal to 1. Furthermore, the solution h of (24) can be recovered from a solution (α, γ, β) of (25) by x, h(x) = Pn i=1 ai K(xi, x), where a = λI + 1 2Kt) 1(Yα + 1 The proof of the proposition is given in Appendix B. The result shows that, given a finite sample h1, . . . , hk on the boundary of H , (24) is in fact equivalent to a standard QP. Hence, a solution can be found efficiently with one of the many off-the-shelf algorithms for quadratic programming. We now describe the process of sampling from the boundary of the set H , which is a necessary step for defining problem (24). We consider compact sets of the form H := {h H | gi(h ) 0}, where the functions gi are continuous and convex. For instance, we could consider the set H defined in the previous section. More generally, we can consider a family of sets H p = {h H| | Pm i=1 qmin(xi)|h(xi) yi|p rp}. Assume that there exists h0 satisfying gi(h0) < 0. Our sampling process is illustrated by Figure 1 and works as follows: pick a random direction bh and define λi to be the minimal solution to the system (λ 0) (gi(h0 + λbh) = 0). Set λi = if no solution is found and define λ = mini λi. By the convexity and compactness of H we can guarantee that λ < . The hypothesis h = h0 + λ bh satisfies h H and gj(h) = 0 for j such that λj = λ . The latter is straightforward. To verify the former, Cortes, Mohri and Mu noz Medina assume that gi(h0 + λ bh) > 0 for some i. The continuity of gi would imply the existence of λ i with 0 < λ i < λ λi such that gi(h0 + λ ibh) = 0. This would contradict the choice of λi, thus, the inequality gi(h0 + λ bh) 0 must hold for all i. Since a point h0 with gi(h0) < 0 can be obtained by solving a convex program and solving the equations defining λi is, in general, simple, the process described provides an efficient way of sampling points from the convex set H . 5.3. Implementation for the L2 Loss We now describe how to fully implement our sampling-based algorithm for the case where L is equal to the L2 loss. In view of the results of Section 4, we let H = {h | h K Λ Lq(h , f Q) r2}. We first describe the steps needed to find a point h0 H . Let hΛ be such that hΛ K = Λ and λr R+ be such that the solution hr to the optimization problem min h H λr h 2 + Lq(h, f Q), satisfies Lq(hr, f Q) = r2. It is easy to verify that the existence of λr is guaranteed for minh H Lq(h, f Q) r2 Pm i=1 q(xi)y2 i . It is clear that the point h0 = 1 2(hr + hΛ) is in the interior of H . Of course, finding λr with the desired properties is not straightforward. However, since r is chosen via validation, we do not need to find λr as a function of r. Instead, we can simply select λr through validation too. In order to complete the sampling process, we must have an efficient way of selecting a random direction bh. If H Rd is a set of linear hypotheses, a direction bh can be sampled uniformly by letting bh = ξ ξ , where ξ is a standard Gaussian random variable in Rd. If H is a subset of a RKHS, by the representer theorem, we only need to consider hypotheses of the form h = Pm i=1 αi K(xi, ). Therefore, we can sample a direction bh = Pm i=1 α i K(xi, ), where the vector α = (α 1, . . . , α m) is drawn uniformly from the unit sphere in Rm. A full implementation of our algorithm thus consists of the following steps: find the distribution qmin = argminq Q disc(q, b P). This can be done by using the smooth approximation algorithm of Cortes and Mohri (2013); sample points from the set H using the sampling process described above; solve the QP introduced in Section 5.2. Notice that our algorithm only requires solving a simple QP and therefore its complexity is the same as other adaptation algorithms such as KMM, KLIEP and DM. 6. Experiments Here, we report the results of extensive comparisons between GDM and several other adaptation algorithms which demonstrate the benefits of our algorithm. We use the implementation described in the previous section. The source code for our algorithm as well as all other baselines described in this section can be found at http://cims.nyu.edu/~munoz. 6.1. Synthetic Data Set To compare the performances of the GDM and DM algorithms, we considered the following synthetic one-dimensional task, which is similar to the one considered by Huang et al. Adaptation Based on Generalized Discrepancy 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.2 0.8 0.4 0.0 0.2 Source Target DM GDM 1.5 1.0 0.5 0.0 0.5 1.0 0.00 0.05 0.10 0.15 0.20 Source Target DM GDM Figure 2: (a) Hypotheses obtained by training on source (green circles), target (red triangles) and using DM (dashed blue) and GDM algorithms (solid blue). (b) Objective functions for source and target distribution as well as GDM and DM algorithms. Sets H and surrogate hypothesis set H H are shown at the bottom. The vertical lines represent the minimizing hypothesis for each loss. (2006): the source domain examples were sampled from the uniform distribution over the interval [.2, 1] and target ones sampled uniformly over [0, .25]. The labels were given by the map x 7 x + x3 + ξ, where ξ is a Gaussian random variable with mean 0 and standard deviation 0.1. Our hypothesis set was defined by the family of linear functions without an offset. Figure 2(a) shows the regression hypotheses obtained by training the DM and GDM algorithm as well as those obtained by training on the source and target distributions. The ideal hypothesis is shown in red. Notice how the GDM solution gives a closer approximation than DM to the ideal solution. In order to better understand the difference between the solutions of these algorithms, Figure 2(b) depicts the objective function minimized by each algorithm as a function of the slope w of the linear function, the only variable of the hypothesis. The vertical lines show the value of the minimizing hypothesis for each loss. Keeping in mind that the regularization parameter λ used in ridge regression corresponds to a Lagrange multiplier for the constraint w2 Λ2 for some Λ (Cortes and Mohri, 2013) [Lemma 1], the hypothesis set H = {w|w2 Λ2} is depicted at the bottom of this plot. The shaded region represents the set H = H {h |Lqmin(h ) r}. It is clear from this plot that DM helps approximate the target loss function. Nevertheless, only GDM seems to uniformly approach it. This should come as no surprise since our algorithm was precisely designed to achieve that. 6.2. Adaptation Data Sets We now present the results of evaluating our algorithm against several other adaptation algorithms. GDM is compared against DM and training on the uniform distribution. The following baselines were also used: 1. The KMM algorithm (Huang et al., 2006), which reweights examples from the source distribution in order to match the mean of the source and target data in a feature space induced by a universal kernel. The hyper-parameters of this algorithm were set to the recommended values of B = 1000 and ϵ = m m 1 . 2. KLIEP (Sugiyama et al., 2007). This algorithm estimates the importance ratio of the source and target distribution by modeling this ratio as a mixture of basis functions and Cortes, Mohri and Mu noz Medina 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Source distribution: kin 8fh 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 GDM DM Unif Target KMM KLIEP FE kin 8nm kin 8fm kin 8nh 0.0 0.5 1.0 1.5 2.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 1.5 2.0 0.2 0.4 0.6 0.8 1.0 G fh to nm fh to fm fh to nh Figure 3: (a)MSE performance for different adaptation algorithms when adapting from kin-8fh to the three other kin-8xy domains. (b) Relative error of DM over GDM as a function of the ratio r learning the mixture coefficients from the data. Gaussian kernels were used as basis functions for this algorithm and KMM. The bandwidth for the kernel was selected from the set σd: σ = 2 5, . . . , 25 via validation on the test set, where d is the mean distance between points sampled from the source domain. 3. FE (Daum e III, 2007). This algorithm maps source and target data into a common high-dimensional feature space where the difference of the distributions is expected to reduce. Since the two-stage algorithm of Bickel et al. (2007) was already shown to perform similarly to KMM and KLIEP (Cortes and Mohri, 2013), for the sake of readability of our results, we omitted the results of comparison with this algorithm. Finally, we compare our algorithm with the ideal hypothesis h returned by training on the target sample T , which we denote by Tar. Notice that in practice, this is impossible as T is unlabeled so we use this result only to show the best attainable performance. We selected the set of linear functions as our hypothesis set. The learning algorithm used for all tasks was ridge regression and the performance evaluated by the mean squared error. We follow the setup of Cortes and Mohri (2011) and for all adaptation algorithms we selected the parameter λ via 10-fold cross validation over the training data by using a grid search over the set of values λ {2 10, . . . , 210}. The results of training on the target distribution are presented for a parameter λ tuned via 10-fold cross validation over the target data. We used the QP implementation of our algorithm with the sampling set H and the sampling mechanism defined at the end of Section 5.2, where the parameter λr was chosen from the same set as λ via validation on a small amount of data from the target distribution. Whereas there are other methods such as transfer cross validation (Zhong et al., 2010) to select the parameters for our algorithm, these methods require the use of importance weighting which as shown in (Cortes et al., 2010) is not theoretically justified. In order to achieve a fair comparison, all other algorithms were allowed to use the small amount of labeled data too. Since, with the exception of FE, all other baselines do not propose a way of dealing with labeled data from the target distribution, we simply added this data to the training set and ran the algorithms on the extended source data as discussed in Section 4.4. Adaptation Based on Generalized Discrepancy Task: Sentiment S T GDM DM Unif Tar KMM KLIEP F K 0.763 (0.222) 1.056 (0.289) 1.00 0.517 (0.152) 3.328 (0.845) 3.494 (1.144) 0.942 (0.093) E 0.574 (0.211) 1.018 (0.206) 1.00 0.367 (0.124) 3.018 (0.319) 3.022 (0.318) 0.857 (0.135) D 0.936 (0.256) 1.215 (0.255) 1.00 0.623 (0.152) 2.842 (0.492) 2.764 (0.446) 0.936 (0.110) B 0.854 (0.119) 1.258 (0.117) 1.00 0.665 (0.085) 2.784 (0.244) 2.642 (0.218) 1.047 (0.047) E 0.975 (0.131) 1.460 (0.633) 1.00 0.653 (0.201) 2.408 (0.582) 2.157 (0.255) 0.969 (0.131) D 0.884 (0.101) 1.174 (0.140) 1.00 0.665 (0.071) 2.771 (0.157) 2.620 (0.210) 1.111 (0.059) B 0.723 (0.138) 1.016 (0.187) 1.00 0.551 (0.109) 3.433 (0.694) 3.290 (0.583) 1.035 (0.059) K 1.030 (0.312) 1.277 (0.283) 1.00 0.636 (0.176) 2.173 (0.249) 2.223 (0.293) 0.955 (0.199) D 0.731 (0.171) 1.005 (0.166) 1.00 0.518 (0.117) 3.363 (0.402) 3.231 (0.483) 0.974 (0.102) B 0.992 (0.191) 1.026 (0.090) 1.00 0.740 (0.138) 2.571 (0.616) 2.475 (0.400) 0.986 (0.041) K 0.870 (0.212) 1.062 (0.318) 1.00 0.557 (0.137) 2.755 (0.375) 2.741 (0.347) 0.940 (0.087) E 0.674 (0.135) 0.994 (0.171) 1.00 0.478 (0.098) 2.939 (0.501) 2.878 (0.418) 0.907 (0.081) Table 2: Adaptation from books (B), kitchen (K), electronics (E) and dvd (D) to all other domains. Normalized results: MSE of training on the unweighted source data is equal to 1. Results in bold represent the algorithm with the lowest MSE. The first task we considered is given by the 4 kin-8xy Delve data sets (Rasmussen et al., 1996). These data sets are variations of the same model: a realistic simulation of the forward dynamics of an 8 link all-revolute robot arm. The task in all data sets consists of predicting the distance of the end-effector from a target. The data sets differ by the degree of non-linearity (fairly linear, x=f, or non-linear, x=n) and the amount of noise in the output (moderate, y=m, or high, y=h). The data set defines 4 different domains, that is 12 pairs of different distributions and labeling functions. A sample of 200 points from each domain was used for training and 10 labeled points from the target distribution were used to select H . The experiment was carried out 10 times and the results of testing on a sample of 400 points from the target domain are reported in Figure 3(a). The bars represent the median performance of each algorithm. The error bars are the low and high 25% quartiles respectively. All results were normalized in such a way that the median performance of training on the source is equal to 1. Notice that the performance of all algorithms is comparable when adapting to kin8-fm since both labeling functions are fairly linear, yet only GDM is able to reasonably adapt to the two data sets with different labeling functions. In order to better understand the advantages of GDM over DM we plot the relative error of DM against GDM as a function of the ratio r/Λ in Figure 3(b), where r is the radius defining H and is selected through cross validation. Notice that when the ratio r/Λ is small then both algorithms behave similarly which is most of the times for the adaptation task fh to fm. On the other hand, a better performance of GDM can be obtained when the ratio is larger. This is due to the fact that r/Λ measures the effective size of the set H . A small ratio means that the size of H is small and therefore the hypothesis returned by GDM will be close to that of DM where as if H is large then GDM has the possibility of finding a better hypothesis. For our next experiment we considered the cross-domain sentiment analysis data set of Blitzer et al. (2007). This data set consists of consumer reviews from 4 different domains: books, kitchen, electronics and dvds. We used the top 1000 uni-grams and bi-grams Cortes, Mohri and Mu noz Medina Task: Images S T GDM DM Unif Tar KMM KLIEP F I 0.927 (0.051) 1.005 (0.010) 1.00 0.879 (0.048) 2.752 (3.820) 0.936 (0.016) 0.959 (0.035) S 0.938 (0.064) 0.993 (0.018) 1.00 0.840 (0.057) 0.827 (0.017) 0.835 (0.020) 0.947 (0.025) B 0.909 (0.040) 1.003 (0.013) 1.00 0.886 (0.052) 0.945 (0.022) 0.942 (0.017) 0.947 (0.019) C 1.011 (0.015) 0.951 (0.011) 1.00 0.802 (0.040) 0.989 (0.036) 1.009 (0.042) 0.971 (0.024) S 1.006 (0.030) 0.992 (0.016) 1.00 0.871 (0.030) 0.930 (0.018) 0.936 (0.016) 0.973 (0.017) B 0.987 (0.022) 1.009 (0.010) 1.00 0.986 (0.028) 1.011 (0.028) 1.011 (0.028) 0.994 (0.018) C 1.022 (0.037) 0.982 (0.035) 1.00 0.759 (0.033) 1.172 (0.043) 1.201 (0.038) 0.938 (0.036) I 0.924 (0.049) 0.998 (0.030) 1.00 0.831 (0.047) 3.868 (4.231) 1.227 (0.039) 0.947 (0.028) B 0.898 (0.072) 1.003 (0.044) 1.00 0.821 (0.053) 1.240 (0.039) 1.248 (0.041) 0.945 (0.021) C 1.010 (0.014) 0.956 (0.017) 1.00 0.777 (0.031) 1.028 (0.033) 1.032 (0.031) 0.980 (0.019) I 1.012 (0.010) 1.004 (0.007) 1.00 0.966 (0.009) 2.785 (3.803) 0.981 (0.018) 1.000 (0.004) S 1.009 (0.018) 0.988 (0.010) 1.00 0.850 (0.035) 0.930 (0.022) 0.934 (0.024) 0.983 (0.013) Table 3: Adaptation from caltech256 (C), imagenet (I), sun (S) and bing (B). as the features for this task. For each pair of adaptation tasks we sampled 700 points from the source distribution and 700 unlabeled points from the target. Only 50 labeled points from the target distribution were used to tune the parameter r of our algorithm. The final evaluation is done on a test set of 1000 points. The mean results and standard deviations of this task are shown in Table 2 where the MSE values have been normalized in such a way that the performance of training on the source without reweighting is always 1. Finally, we considered a novel domain adaptation task (Tommasi et al., 2014) of paramount importance in the computer vision community. The domains correspond to 4 well known collections of images: bing, caltech256, sun and imagenet. These data sets have been standardized so that they all share the same feature representation and labeling function (Tommasi et al., 2014). We sampled 800 labeled points from the source distribution and 800 unlabeled points from the target distribution as well as 50 labeled target points to be used for validation of r. The results of testing on 1000 points from the target domain are presented in Table 3 where, again, the results were normalized in such a way that the performance of training on the source data is always 1. After analyzing the results of this section we notice that the GDM algorithm consistently outperforms DM and achieves similar or better performance than all other common adaptation algorithms. It is worth noticing that in some cases, other algorithms perform even worse than training on the unweighted sample. This deficiency of the KLIEP algorithm had already been pointed out by Sugiyama et al. (2007) but here we observe that this problem can also affect the KMM algorithm. Finally, let us point out that even though the FE algorithm also achieved performances similar to GDM on the sentiment and image adaptation, its performance was far from optimal adapting on the kin-8xy task. Since there is a lack of theoretical understanding for this algorithm, it is hard to characterize the scenarios where FE would perform better than GDM. 7. Conclusion We presented a new theoretically well-founded domain adaptation algorithm seeking to minimize a less conservative quantity than the DM algorithm. We presented an SDP so- Adaptation Based on Generalized Discrepancy lution for the particular case of the L2 loss which can be solved in polynomial time. Our empirical results show that our new algorithm always performs better than or is on par with the otherwise state-of-the-art DM algorithm. We also provided tight generalization bounds for the domain adaptation problem based on the Y-discrepancy. As pointed out in Section 4, an algorithm that minimizes the Y-discrepancy would benefit from the best possible guarantees. However, the lack of labeled data from the target distribution makes this algorithm not viable. This suggests analyzing a richer scenario where the learner is allowed to ask for a limited number of labels from the target distribution. This setup, which is related to active learning, seems to be in fact the closest one to real-life applications and has started to receive attention from the research community (Berlind and Urner, 2015). We believe that the discrepancy disc will play a central role in the analysis of that scenario as well. Acknowledgments We would like to thank the reviewers of KDD and JMLR for their suggestions to improve this paper. This work was partly funded by the NSF awards IIS-1117591 and CCF-1535987. Appendix A. SDP Formulation Lemma 19 The Lagrangian dual of the problem max a Rm Ksa y 2 r2 1 2 Ksta 2 b Kt Ksta, (26) is given by min η 0,γ γ 2K st Kst + ηK2 s 1 2K st Ktb ηKsy 1 2b Kt Kst ηy Ks η( y 2 r2) + γ Furthermore, the duality gap for these problems is zero. Proof For η 0 the Lagrangian of (26) is given by L(a, η) = 1 2 Ksta 2 b Kt Ksta η( Ksa y 2 r2) 2K st Kst ηK2 s a + (2ηKsy K st Ktb) a η( y 2 r2). Since the Lagrangian is a quadratic function of a and that the conjugate function of a quadratic can be expressed in terms of the pseudo-inverse, the dual is given by min η 0 1 4(2ηKsy K st Ktb) ηK2 s 1 2K st Kst (2ηKsy K st Ktb) η( y 2 r2) s. t. ηK2 s 1 2K st Kst 0. Cortes, Mohri and Mu noz Medina Introducing the variable γ to replace the objective function yields the equivalent problem s. t. ηK2 s 1 2K st Kst 0 4(2ηKsy K st Ktb) ηK2 s 1 2K st Kst (2ηKsy K st Ktb) + η( y 2 r2) 0 Finally, by the properties of the Schur complement (Boyd and Vandenberghe, 2004), the two constraints above are equivalent to 1 2K st Kst + ηK2 s 1 2K st Ktb ηKsy 1 2K st Ktb ηKsy η( y 2 r) + γ Since duality holds for a general QCQP with only one constraint (Boyd and Vandenberghe, 2004)[Appendix B], the duality gap between these problems is 0. Proposition 20 The optimization problem (23) is equivalent to the following SDP: max α,β,ν,Z,z 1 2 Tr(K st Kst Z) β α 2K st Kst 1 4 e K νKsy + 1 4z e K α + ν( y 2 r2) λKt + K2 t 1 2Kt Kstz 1 2z K st Kt β 0 Tr(K2 s Z) 2y Ksz + y 2 r2 ν 0, where e K = K st Kt(λKt + K2 t ) Kt Kst. Proof By Lemma 16, we may rewrite (23) as min a,γ,η,b b (λKt + K2 t )b + 1 2a K st Ksta a K st Ktb + γ (27) 2K st Kst + ηK2 s 1 2K st Ktb ηKsy 1 2b Kt Kst ηy Ks η( y 2 r2) + γ Ksa y 2 r2. Let us apply the change of variables b = 1 2(λKt +K2 t ) Kt Ksta+v. The following equalities can be easily verified. b (λKt + K2 t )b = 1 4a K st Kt(λKt + K2 t ) Kt Ksta + v Kt Ksta + v (λKt + K2 t )v. a K st Ktb = 1 2a K st Kt(λKt + K2 t ) Kt Ksta + v Kt Ksta. Adaptation Based on Generalized Discrepancy Thus, replacing b on (27) yields min a,v,γ,η v (λKt + K2 t )v + a 1 2K st Kst 1 4 e K a + γ 2K st Kst + ηK2 s 1 4 e Ka + 1 2K st Ktv ηKsy 1 4a e K + 1 2v Kt Kst ηy Ks η( y 2 r2) + γ Ksa y 2 r2. Introducing the scalar multipliers µ, ν 0 and the matrix Z z z ez, as a multiplier for the matrix constraint, we can form the Lagrangian: L := v (λKt + K2 t )v + a 1 2K st Kst 1 4 e K a + γ µη + ν( Ksa y 2 r2) 2K st Kst + ηK2 s 1 4 e Ka + 1 2K st Ktv ηKsy 1 4a e K + 1 2v Kt Kst ηy Ks η( y 2 r2) + γ The KKT conditions L γ = 0 trivially imply ez = 1 and Tr(K2 s Z) 2y Ksz + y 2 r2 + µ = 0. These constraints on the dual variables guarantee that the primal variables η and γ will vanish from the Lagrangian, thus yielding 2 Tr(K st Kst Z) + ν( y 2 r2) + v (λKt + K2 t )v z K st Ktv + a νK2 s + 1 2K st Kst 1 4 e K a 2νKsy + 1 This is a quadratic function on the primal variables a and v with minimizing solutions 2K st Kst 1 4 e K 2νKsy + 1 2 e Kz and v = 1 2(λKt + K2 t ) Kt Kstz, and optimal value equal to the objective of the Lagrangian dual: 1 2 Tr(K st Kst Z) + ν( y 2 r2) 1 2 e Kz νK2 s + 1 2K st Kst 1 4 e K 2νKsy + 1 As in Lemma 16, we apply the properties of the Schur complement to show that the dual is given by max α,β,ν,Z,z 1 2 Tr(K st Kst Z) β α 2K st Kst 1 4 e K νKsy + 1 4z e K α + ν( y 2 r2) Tr(K2 s Z) 2y Ksz + y 2 r2 β 1 4z e Kz ν 0. Cortes, Mohri and Mu noz Medina Finally, recalling the definition of e K and using the Schur complement one more time we arrive to the final SDP formulation max α,β,ν,Z,z 1 2 Tr(K st Kst Z) β α 2K st Kst 1 4 e K νKsy + 1 4z e K α + ν( y 2 r2) λKt + K2 t 1 2Kt Kstz 1 2z K st Kt β 0 Tr(K2 s Z) 2y Ksz + y 2 r2 ν 0. Appendix B. QP Formulation Proposition 21 Let Y = (Yij) Rn k be the matrix defined by Yij = n 1/2hj(x i) and y = (y 1, . . . , y k) Rk the vector defined by y i = n 1 Pn j=1 hi(x j)2. Then, the dual problem of (24) is given by max α,γ,β Yα + γ 2Kt 1 Yα + γ 2γ Kt K tγ + α y β (28) s.t. 1 α = 1 2, 1β Y γ, α 0, where 1 is the vector in Rk with all components equal to 1. Furthermore, the solution h of (24) can be recovered from a solution (α, γ, β) of (28) by x, h(x) = Pn i=1 ai K(xi, x), where a = λI + 1 2Kt) 1(Yα + 1 We will first prove a simplified version of the proposition for the case of linear hypotheses, i.e. we can represent hypotheses in H and elements of X as vectors w, x Rd respectively. Define X = n 1/2(x 1, . . . , x n) to be the matrix whose columns are the normalized sample points from the target distribution. Let also {w1, . . . , wk} be a sample taken from H and define W := (w1, . . . , wk) Rd k. Under this notation, problem (24) may be rewritten as min w Rd λ w 2 + 1 2 max i=1,...,k X (w wi) 2 + 1 2 min w C X (w w ) 2. (29) Lemma 22 The Lagrange dual of problem (29) is given by max α,γ,β Yα+ γ 2γ X (X X ) X γ + α y β s. t. 1 α = 1 2 1β Y γ α 0, where Y = X W and y i = X wi 2. Adaptation Based on Generalized Discrepancy Proof By applying the change of variable u = w w, problem (29) is can be made equivalent to min w Rdu C w λ w 2 + 1 2 X w 2 + 1 2 X u 2 + 1 2 max i=1,...,k X wi 2 2w i X X w. By making the constraints on u explicit and replacing the maximization term with the variable r the above problem becomes min w,u,r,µ λ w 2 + 1 2 X w 2 + 1 2 X u 2 + 1 s. t. 1r y 2Y X w 1 µ = 1 µ 0 W µ w = u. For α, δ 0, the Lagrangian of this problem is defined as L(w, u, µ, r, α, β, δ, γ ) = λ w 2 + 1 2 X w 2 + 1 2 X u 2 + 1 2r + β(1 µ 1) + α (y 2(X Y) w 1r) δ µ + γ (W µ w u). Minimizing with respect to the primal variables yields the following KKT conditions: 2 1β = δ W γ . (30) X X u = γ 2 λI + X X w = 2(X Y )α + γ . (31) Condition (30) implies that the terms involving r and µ will vanish from the Lagrangian. Furthermore, the first equation in (31) implies that any feasible γ must satisfy γ = X γ for some γ Rn. Finally, it is immediate that γ u = u X X u and 2w λI + X X 2α (X Y) w + γ w. Thus, at the optimal point, the Lagrangian becomes 2u X X u + α y β s. t. 1 α = 1 2 1β = δ W γ α 0 δ 0. The positivity of δ implies that 1β W γ . Solving for w and u on (31) and applying the change of variable X γ = γ we obtain the final expression for the dual problem: max α,γ,β Yα+ γ 2γ X (X X ) X γ + α y β s. t. 1 α = 1 2 1β Y γ α 0, where we have used the fact that Y γ = WX γ to simplify the constraints. Notice also that we can recover the solution w of problem (29) as w = (λI + 1 2X X ) 1X (Yα + 1 The proof of Proposition 18 follows from a straightforward application of the well known matrix identities X (λI + X X ) 1 = (λI + X X ) 1X and X X (X X ) = X (X X ) X , and by the fact that the kernel matrix Kt is equal to X X . Cortes, Mohri and Mu noz Medina Appendix C. µ-admissibility Lemma 23 Assume that Lp(h(x), y) M for all x X and y Y, then Lp is µ-admissible with µ = p Mp 1. Proof Since x 7 xp is p-Lipschitz over [0, 1] we can write |L(h(x), y) L(h (x), y)| = Mp |h(x) y| p |h (x) y| p Mp 1|h(x) y + y h (x)| = p Mp 1|h(x) h (x)|. Lemma 24 Let L be the Lp loss for some p 1 and let h, h , h be functions satisfying Lp(h(x), h (x)) M and Lp(h (x), h (x)) M for all x X, for some M 0. Then, for any distribution D over X, the following inequality holds: |LD(h, h ) LD(h , h )| p Mp 1[LD(h, h )] 1 p . (32) Proof Proceeding as in the proof of Lemma 23, we obtain |LD(h, h ) LD(h , h )| = | E x D Lp(h(x), h (x)) Lp(h (x), h (x) | p Mp 1 E x D |h(x) h (x)| . Since p 1, by Jensen s inequality, we can write Ex D |h(x) h (x)| Ex D |h(x) h (x)|p 1/p = [LD(h, h )] 1 p . Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In Proceedings of ALT, pages 139 153, 2012. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Proceedings of NIPS, pages 137 144, 2006. Shai Ben-David, Tyler Lu, Teresa Luu, and D avid P al. Impossibility theorems for domain adaptation. JMLR - Proceedings Track, 9:129 136, 2010. Christopher Berlind and Ruth Urner. Active nearest neighbors in changing environments. In Proceedings of ICML, pages 1870 1879, 2015. Steffen Bickel, Michael Br uckner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of ICML, pages 81 88, 2007. Adaptation Based on Generalized Discrepancy John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL, 2007. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004. Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In Proceedings of ALT, 2011. Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 9474, 2013. Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT, pages 38 53, 2008. Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Proceedings of NIPS, pages 442 450, 2010. Hal Daum e III. Frustratingly easy domain adaptation. In Proceedings of ACL, 2007. Luc Devroye, L azl o Gy orfi, and G abor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996. Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Jo ao Gra ca, and Fernando Pereira. Frustratingly hard domain adaptation for dependency parsing. In EMNLP-Co NLL, 2007. Pascal Germain, Amaury Habrard, Fran cois Laviolette, and Emilie Morvant. A PACBayesian approach for domain adaptation with specialization to linear classifiers. In Proceedings of ICML, 2013. Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch olkopf, and Alexander J. Smola. A kernel two-sample test. JMLR, 13:723 773, 2012. Judy Hoffman, Trevor Darrell, and Kate Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of IEEE CVPR, pages 867 874, 2014. Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Sch olkopf. Correcting sample selection bias by unlabeled data. In Proceedings of NIPS, volume 19, pages 601 608, 2006. Jing Jiang and Cheng Xiang Zhai. Instance Weighting for Domain Adaptation in NLP. In Proceedings of ACL, pages 264 271, 2007. Piyush Kumar, Joseph S. B. Mitchell, and E. Alper Yildirim. Computing core-sets and approximate smallest enclosing hyperspheres in high dimensions. In ALENEX, Lecture Notes Comput. Sci, pages 45 55, 2003. Cortes, Mohri and Mu noz Medina C. J. Leggetter and Philip C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, 9(2):171 185, 1995. Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of COLT. Omnipress, 2009. Aleix M. Mart ınez. Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal., 24(6), 2002. Andr es Mu noz Medina. Learning Theory and Algorithms for Auctioning and Adaptation Problems. Ph D thesis, New York University, 2015. Mehryar Mohri and Andres Mu noz. New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT. Springer, 2012. Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Trans. on Neural Networks, 22(2):199 210, 2011. Carl Edward Rasmussen, Radford M. Neal, Geoffrey Hinton, Drew van Camp, Michael Revow Zoubin Ghahramani, Rafal Kustra, and Rob Tibshirani. The delve project. http: //www.cs.toronto.edu/~delve/data/datasets.html, 1996. version 1.0. Sven Sch onherr. Quadratic Programming in Geometric Optimization: Theory, Implementation, and Applications. Ph D thesis, Swiss Federal Institute of Technology, 2002. Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von B unau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of NIPS, pages 1433 1440, 2007. Tatiana Tommasi, Tinne Tuytelaars, and Barbara Caputo. A testbed for cross-dataset analysis. Co RR, abs/1402.5923, 2014. URL http://arxiv.org/abs/1402.5923. Leslie G. Valiant. A Theory of the Learnable. ACM Press New York, NY, USA, 1984. Vladimir N. Vapnik. Statistical Learning Theory. J. Wiley & Sons, 1998. E. Alper Yildirim. Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3):1368 1391, 2008. Chao Zhang, Lei Zhang, and Jieping Ye. Generalization bounds for domain adaptation. In Proceedings of NIPS, pages 1790 1798. MIT Press, 2012. Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, and Jiangtao Ren. Cross validation framework to choose amongst models and datasets for transfer learning. In Proceedings of ECML PKDD 2010 Part III, pages 547 562, 2010.