# Multi-Source Domain Adaptation: A Causal View

Kun Zhang (MPI for Intelligent Systems, 72076 Tübingen, Germany; kzhang@tuebingen.mpg.de)

Mingming Gong (QCIS, University of Technology Sydney, Ultimo, NSW 2007, Australia; gongmingnju@gmail.com)

Bernhard Schölkopf (MPI for Intelligent Systems, 72076 Tübingen, Germany; bs@tuebingen.mpg.de)

## Abstract

This paper is concerned with the problem of domain adaptation with multiple sources from a causal point of view. In particular, we use causal models to represent the relationship between the features $X$ and the class label $Y$, and consider possible situations where different modules of the causal model change with the domain. In each situation, we investigate what knowledge is appropriate to transfer and find the optimal target-domain hypothesis. This gives an intuitive interpretation of the assumptions underlying certain previous methods and motivates new ones. We finally focus on the case where $Y$ is the cause for $X$ with changing $P_Y$ and $P_{X|Y}$, that is, $P_Y$ and $P_{X|Y}$ change independently across domains. Under appropriate assumptions, the availability of multiple source domains allows a natural way to reconstruct the conditional distribution on the target domain; we propose to model $P_{X|Y}$ (the process to generate effect $X$ from cause $Y$) on the target domain as a linear mixture of those on the source domains, and estimate all involved parameters by matching the target-domain feature distribution. Experimental results on both synthetic and real-world data verify our theoretical results.

## Introduction

Traditional machine learning relies on the assumption that both training and test data are from the same distribution. In practice, however, training and test data are often sampled under different conditions, thus violating this assumption, and the problem of domain adaptation (DA) arises. Consider remote sensing image classification as an example. Suppose we already have several data sets on which the class labels are known; they are called source domains here. For a new data set, or a target domain, it is usually difficult to find the ground truth reference labels, and we aim to determine the labels by making use of the information from the source domains. Note that those domains are usually obtained in different areas and time periods, and that the corresponding data distribution varies due to changes in illumination conditions, physical factors related to the ground (e.g., different soil moisture or composition), vegetation, and atmospheric conditions. Other well-known instances of this situation include sentiment data analysis (Blitzer, Dredze, and Pereira 2007) and flow cytometry data analysis (Blanchard, Lee, and Scott 2011). DA approaches have many applications in various areas including natural language processing, computer vision, and biology. For surveys on DA, see, e.g., (Jiang 2008; Pan and Yang 2010; Candela et al. 2009).

In this paper, we consider the situation with $n$ source domains on which both the features $X$ and the label $Y$ are given, i.e., we are given $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) = (x^{(i)}_k, y^{(i)}_k)_{k=1}^{m_i}$, where $i = 1, \ldots, n$ and $m_i$ is the sample size of the $i$th source domain. Our goal is to find the classifier for the target domain, on which only the features $\mathbf{x}^t = (x^t_k)_{k=1}^{m}$ are available.
Here we are concerned with a difficult scenario where no labeled point is available in the target domain, known as unsupervised domain adaptation.

Since $P_{XY}$ changes across domains, we have to find what knowledge in the source domains should be transferred to the target one. Previous work in domain adaptation has usually assumed that $P_X$ changes but $P_{Y|X}$ remains the same, i.e., the covariate shift situation; see, e.g., (Shimodaira 2000; Huang et al. 2007; Sugiyama et al. 2008; Ben-David, Shalev-Shwartz, and Urner 2012). It is also known as sample selection bias (particularly on the features $X$) in (Zadrozny 2004). In practice it is very common that both $P_X$ and $P_{Y|X}$ change simultaneously across domains. For instance, both of them are likely to change over time and location for a satellite image classification system. If the data distribution changes arbitrarily across domains, clearly knowledge from the sources may not help in predicting $Y$ on the target domain (Rosenstein et al. 2005). One has to find what type of information should be transferred from the sources to the target. One possibility is to assume that the change in both $P_X$ and $P_{Y|X}$ is due to the change in $P_Y$, while $P_{X|Y}$ remains the same, which is known as prior probability shift (Storkey 2009; Plessis and Sugiyama 2012) or target shift (Zhang et al. 2013). The latter work further models the change in $P_{X|Y}$ caused by a location-scale (LS) transformation of the features for each class. The constraint of the LS transformation renders $P_{X|Y}$ on the target domain, denoted by $P^t_{X|Y}$, identifiable; however, it might be too restrictive. Fortunately, the availability of multiple source domains provides more hints for finding $P^t_{X|Y}$, as well as $P^t_{Y|X}$.

Several algorithms have been proposed to combine knowledge from multiple source domains. For instance, (Mansour, Mohri, and Rostamizadeh 2008) proposed to form the target hypothesis by combining source hypotheses with a distribution weighted rule. (Gao et al. 2008), (Duan et al. 2009), and (Chattopadhyay et al. 2011) combine the predictions made by the source hypotheses, with the weights determined in different ways. An intuitive interpretation of the assumptions underlying those algorithms would facilitate choosing or developing DA methods for the problem at hand. To the best of our knowledge, however, such an interpretation is still missing in the literature. One of our contributions in this paper is to provide it.

This paper studies the multi-source DA problem from a causal point of view, where we consider the underlying data generating process behind the observed domains. We are particularly interested in what types of information stay the same, what types of information change, and how they change across domains. This enables us to construct the optimal hypothesis for the target domain in various situations. To this end, we use causal models to represent the relationship between $X$ and $Y$, because they provide a compact description of the properties of the change in the data distribution.[^1] They, for instance, help characterize transportability of experimental findings (Pearl and Bareinboim 2011) or recoverability from selection bias (Bareinboim, Tian, and Pearl 2014). As another contribution, we further focus on a typical DA scenario where both $P_Y$ and $P_{X|Y}$ (or the causal mechanism to generate effect $X$ from cause $Y$) change across domains, but their changes are independent from each other, as implied by the causal model $Y \rightarrow X$.
We assume that the source domains contain rich information such that, for each class, $P^t_{X|Y}$ can be approximated by a linear mixture of $P_{X|Y}$ on the source domains. Together with other mild conditions on $P_{X|Y}$, we then show that $P^t_{X|Y}$, as well as $P^t_Y$, is identifiable (or can be uniquely recovered). We present a computationally efficient method to estimate the involved parameters based on kernel mean distribution embedding (Smola et al. 2007; Gretton et al. 2007), followed by several approaches to constructing the target classifier using those parameters.

One might wonder how to find the causal information underlying the data to facilitate domain adaptation. We note that in practice, background causal knowledge is usually available, helping to formulate how to transfer the knowledge from the source domains to the target. Even if this is not the case, multiple source domains with different data distributions may allow one to identify the causal structure, since the causal knowledge can be seen from the change in data distributions; see, e.g., (Tian and Pearl 2001).

## 1 Possible DA Situations and Their Solutions

DA can be considered as a learning problem in nonstationary environments (Sugiyama and Kawanabe 2012). It is helpful to find out how the data distribution changes; this provides clues for finding the learning machine for the target domain. We focus on how causality, which provides a compact and intuitive description of distribution changes, helps us in DA.

[^1]: The causal model also describes how the components of the joint distribution are related to each other, which, for instance, gives a causal explanation of the behavior of semi-supervised learning (Schölkopf et al. 2012).

| Notation | Meaning |
| --- | --- |
| $X$, $Y$ | random variables |
| $\mathcal{X}$, $\mathcal{Y}$ | domains |
| $P^{(i)}_{XY}$ | distribution in the $i$th source domain |
| $P^t_{XY}$ | distribution in the target domain |
| $(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) = (x^{(i)}_k, y^{(i)}_k)_{k=1}^{m_i}$ | sample in the $i$th source domain |
| $\mathbf{x}^{(i)}_j = (x^{(i)}_{jk})_{k=1}^{m_{ij}}$ | $X$ values with $Y = c_j$ in the $i$th source domain |
| $\mathbf{x}^t = (x^t_k)_{k=1}^{m}$ | $X$ values in the target domain |
| $K^t$ | kernel matrix on $\mathbf{x}^t$ |
| $K^{it}$ | cross kernel matrix between $\mathbf{x}^{(i)}$ and $\mathbf{x}^t$ |
| $\psi(X)$ | feature map of $X$ |

Table 1: Notation used in this paper.

Generally speaking, in the unconfounded case, the process that generates the effect from the cause does not depend on the process generating the cause (Pearl 2000). We can represent such knowledge with graphical models, or selection diagrams as defined in (Pearl and Bareinboim 2011). In particular, let us consider four situations which are often the case in practice; see Fig. 1. Here $W_s$ and $V_s$ represent domain-specific selection variables, and they are hidden variables.[^2] Below we discuss what knowledge to transfer from the source domains to the target, and how to construct the optimal target-domain hypothesis in each situation. For clarity and simplicity of the presentation, the causal models in the figure are simplified: we do not consider the existence of possible confounders underlying $X$ and $Y$ or the relationship between the components of $X$. We would like to remark that in many supervised tasks, $Y$ is the cause of $X$, e.g., in clinical diagnosis and handwritten digit recognition problems. The analysis in this section applies to both classification and regression.

**Situation 1 (Fig. 1.a):** $X \rightarrow Y$ with changing $P_X$ and fixed $P_{Y|X}$ (covariate shift).
Theoretically speaking, in this case $P_X$ is irrelevant for modeling $P_{Y|X}$; however, if one uses a simple model to predict $Y$, which is usually the case, underfitting of the conditional model causes the predicted $Y$ to depend on the input distribution $P_X$; importance reweighting according to the difference in $P_X$ between the source and target domains is widely used to correct covariate shift (Shimodaira 2000; Sugiyama et al. 2008).

*Figure 1: Possible situations of DA. $X$ denotes the feature vector, and $Y$ is the target to be predicted. $W_s$ and $V_s$ are domain-specific selection variables assumed to be independent, leading to changing $P_{XY}$ across domains. (a) Covariate shift: $P_X$ is changed by $W_s$, but $P_{Y|X}$ does not change. (b) $W_s$ and $V_s$ change $P_X$ and $P_{Y|X}$, respectively. (c) Target shift: $W_s$ changes $P_Y$, with $P_{X|Y}$ unchanged. (d) $W_s$ and $V_s$ change $P_Y$ and $P_{X|Y}$, respectively. In the first two situations, we consider $X$ as a cause for $Y$, whilst in the last two situations, $Y$ is a cause of $X$.*

[^2]: Such variables are graphically depicted as square nodes in (Pearl and Bareinboim 2011). We would like to distinguish between the domain-specific selection diagram and the sample selection bias procedure used in (Bareinboim, Tian, and Pearl 2014). In the former, the selection variables $W_s$ and $V_s$ are root variables and encode the information that they change the corresponding data-generating process across domains. In the latter, the selection variable is a sink node and encodes the property of the final sampling procedure.

**Situation 2 (Fig. 1.b):** $X \rightarrow Y$ with changing $P_{Y|X}$ (and possibly changing $P_X$). Below we derive the optimal hypothesis for the target domain. Let $P^t_{Y|X}$ be the underlying optimal posterior of $Y$ on the target domain; see Table 1 for the notation used in this paper. Since $V_s$ is unknown, we can estimate the optimal hypothesis by minimizing the expected Kullback-Leibler divergence between $P^t_{XY|W_s,V_s} = P^t_{X|W_s} P^t_{Y|X,V_s} = P^t_X P^t_{Y|X,V_s}$ and $P^t_X P^t_{Y|X}$ (or maximizing the expected likelihood), which is given below; the following proposition gives the solution.

$$
\mathbb{E}_{V_s}\, \mathrm{KL}\!\left(P^t_{XY|W_s,V_s} \,\big\|\, P^t_X P^t_{Y|X}\right)
= \mathbb{E}_{X,Y,V_s} \log \frac{P^t_X P^t_{Y|X,V_s}}{P^t_X P^t_{Y|X}}
= \mathbb{E}_{X,Y,V_s} \log \frac{P^t_{Y|X,V_s}}{P^t_{Y|X}}. \tag{1}
$$

**Proposition 1.** Minimizing (1) w.r.t. a valid conditional distribution $P^t_{Y|X}$ has the solution $P^t_{Y|X} = \int P_{Y|X,V_s}\, dP_{V_s} = \mathbb{E}_{V_s}\!\left[P_{Y|X,V_s}\right]$.

In practice, the constructed optimal hypothesis would be $\hat{P}^t_{Y|X} = \frac{1}{n} \sum_{i=1}^{n} P^{(i)}_{Y|X}$. That is, the learned target hypothesis is a convex combination (or more specifically, the average) of the source hypotheses. In (Mansour, Mohri, and Rostamizadeh 2008) this is known as the convex combination rule.

**Situation 3 (Fig. 1.c):** $Y \rightarrow X$, with changing $P_Y$ and fixed $P_{X|Y}$. This is called prior probability shift (Storkey 2009) or target shift (Zhang et al. 2013). (Plessis and Sugiyama 2012) and (Zhang et al. 2013) studied how to estimate the change in $P_Y$ in this situation, and the latter also applies to regression problems (i.e., with continuous $Y$). Here we consider multiple source domains. Suppose $P^t_Y$ can be represented as $P^t_Y = \sum_{i=1}^{n} \alpha_i P^{(i)}_Y$; we can then derive the posterior of $Y$ on the target domain:

$$
P^t_{Y|X} = \frac{P_{X|Y}\, P^t_Y}{P^t_X} = \frac{P_{X|Y} \sum_{i=1}^{n} \alpha_i P^{(i)}_Y}{P^t_X} = \frac{\sum_{i=1}^{n} \alpha_i P^{(i)}_{XY}}{\sum_{i=1}^{n} \alpha_i P^{(i)}_X} = \sum_{i=1}^{n} \frac{\alpha_i P^{(i)}_X}{\sum_{q=1}^{n} \alpha_q P^{(q)}_X}\, P^{(i)}_{Y|X}. \tag{2}
$$

The hypothesis for the target domain is then a distribution weighted combination of the individual hypotheses on the source domains.
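
To make the rule in (2) concrete, here is a minimal sketch (not from the paper) of a distribution weighted combination of source posteriors, assuming the source marginals $P^{(i)}_X$ and posteriors $P^{(i)}_{Y|X}$ have already been estimated and are available as callables, and that the mixing weights $\alpha_i$ are given.

```python
import numpy as np

def distribution_weighted_posterior(x, densities, posteriors, alpha):
    """Target posterior P^t(Y | X = x) via the distribution weighted rule (2).

    densities[i](x)  -> estimate of the source marginal P^(i)_X(x)        (scalar)
    posteriors[i](x) -> estimate of the source posterior P^(i)_{Y|X}(.|x) (vector over classes)
    alpha            -> weights such that P^t_Y = sum_i alpha_i P^(i)_Y
    """
    px = np.array([d(x) for d in densities])      # P^(i)_X(x) for each source domain
    w = np.asarray(alpha) * px
    w = w / w.sum()                               # alpha_i P^(i)_X(x) / sum_q alpha_q P^(q)_X(x)
    post = np.stack([p(x) for p in posteriors])   # shape (n_sources, n_classes)
    return w @ post                               # distribution weighted combination
```

Replacing `w` by the uniform weights $1/n$ recovers the convex combination rule of Situation 2.
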
This combination rule has been discussed in (Mansour, Mohri, and Rostamizadeh 2008), and here we have shown that in Situation 3 it is actually optimal. (Mansour, Mohri, and Rostamizadeh 2008) also compared this combination rule against the convex combination rule (see Situation 2), and the former was shown to be superior. This is consistent with the fact that in most classification problems $Y$ is the cause for $X$; one can think of handwritten digit recognition and medical diagnosis as typical examples.

**Situation 4 (Fig. 1.d):** $Y \rightarrow X$ with changing $P_{X|Y}$ (and possibly changing $P_Y$). This is known as generalized target shift in (Zhang et al. 2013), where only a single source domain was considered. In Situation 4 we have to make certain assumptions on how $P_{X|Y}$ changes; with them, fortunately, $P^t_X$ might provide additional knowledge to find the optimal classifier. This case will be discussed in detail in the next section.

## 2 DA with Independently Changing $P_Y$ & $P_{X|Y}$

Here we consider Situation 4, where $P_Y$ and $P_{X|Y}$ both change across domains, as shown in Fig. 1.d. According to the graphical model, or the causal explanation $Y \rightarrow X$, we know that $P_Y$ and $P_{X|Y}$ change independently of each other. In this section we restrict our attention to classification problems.

Generally speaking, without further conditions on the data generating process, it is not possible to recover $P^t_{X|Y}$, the conditional distribution on the target domain. It is possible to solve the problem under rather restrictive assumptions. For instance, (Zhang et al. 2013) considers DA with a single source domain and assumes that the change in $P_{X|Y}$ follows a location-scale (LS) transformation; $P^t_{X|Y}$ is then generally identifiable. They reported that LS generalized target shift produces much better performance on remote sensing image classification than all alternatives, which demonstrates that Situation 4 is practically relevant for some rather complex DA problems. Compared to a single source domain, multiple source domains contain much richer information as to how to determine $P_{X|Y}$ on the target domain, and we can avoid the constraint of the LS transformation.

### 2.1 Model: Target Conditional as a Linear Mixture of Source Conditionals

**Motivation.** One can consider $P_{X|Y,V_s}$ (which is the conditional $P_{X|Y}$ in the domain associated with $V_s$; see Fig. 1.d) as the mechanism to generate features from the class label given the domain. Imagine that there exist $L$ elementary sub-mechanisms, or class conditional feature distributions, $P^{(l)}_{X|Y}$, $l = 1, \ldots, L$, so that the mechanism in each domain, $P_{X|Y,V_s}$, is a mixture of those sub-mechanisms, i.e., $P_{X|Y=c_j,V_s} = \sum_{l=1}^{L} \alpha_{V_s,j,l}\, P^{(l)}_{X|Y=c_j}$, where the $\alpha_{V_s,j,l}$ depend on both $V_s$ and $j$, $\alpha_{V_s,j,l} \geq 0$, and $\sum_{l=1}^{L} \alpha_{V_s,j,l} = 1$. Consequently, in the multi-source DA scenario, if for each $j$ the rank of $\{P^{(i)}_{X|Y=c_j} \,|\, i = 1, \ldots, n\}$ is equal to $L$, then $P^t_{X|Y=c_j}$ can always be represented as a linear mixture of the $P^{(i)}_{X|Y=c_j}$.

More generally speaking, the proposed approach was also inspired by latent variable modeling. According to Fig. 1.d, we know that $P_{X|Y=c_j}$, or, computationally more conveniently, its kernel embedding (Smola et al. 2007; Gretton et al. 2007), is actually a function of $V_s$:

$$
\mu[P_{X|Y=c_j,V_s}] = \int \psi(x)\, P_{X|Y=c_j,V_s}\, dx = F_j(V_s), \tag{3}
$$

where the $F_j$ are infinite-dimensional vector functions, which might vary for different values of $j$. Here $V_s$ contains domain-specific conditions. For instance, for object recognition, it may contain the illumination condition, the angle from which the image was taken, etc.
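
As a concrete, finite-dimensional illustration of the embedding in (3) (not part of the method itself, which works with the implicit feature map through kernel matrices), one can approximate $\psi$ for a Gaussian kernel with random Fourier features and compute the empirical embedding of each class conditional as an ordinary vector, one per domain; the feature dimension `D` and kernel width `sigma` below are assumed values.

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier feature map approximating psi(x) for a Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def empirical_embedding(X_j, W, b):
    """Empirical kernel mean embedding of P_{X|Y=c_j}: the average of psi(x) over the class sample."""
    return rff_map(X_j, W, b).mean(axis=0)

# Example setup: inputs of dimension d, D random features, kernel width sigma (assumed values).
d, D, sigma = 2, 500, 1.0
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
# empirical_embedding(x_j_of_source_i, W, b) yields one vector per source domain;
# the rank of the stack of these vectors reflects d_f, discussed next.
```
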
One can see that the intrinsic dimensionality of $\{\mu[P^{(i)}_{X|Y=c_j}] \,|\, i = 1, \ldots, n\}$ is upper bounded by the intrinsic dimensionality of $V_s$, denoted by $d_f$. They are equal if $F_j$ is non-degenerate, i.e., if there is no loss of degrees of freedom in the transformation (3). We define $d_f$ as the degree of freedom in the conditional distribution change. Generally speaking, the higher $d_f$, the more complex the change in $P_{X|Y=c_j}$ across domains. Since on the source domains we only know that $V_s$ might change across domains but cannot access its values, we cannot directly find $d_f$.

For simplicity, let us assume that $F_j$ in (3) can be approximated by a linear function,[^3] i.e., $\mu[P_{X|Y=c_j,V_s}] = L_{F_j} V_s$, where $L_{F_j}$ has an infinite number of rows and $d_f$ columns. That is,

$$
\big[\mu[P^{(1)}_{X|Y=c_j}], \ldots, \mu[P^{(n)}_{X|Y=c_j}]\big] = L_{F_j} \big[V^{(1)}_s, \ldots, V^{(n)}_s\big].
$$

If we further assume that $V^t_s$ can be constructed as a linear mixture of the $V_s$ on the source domains, then $P^t_{X|Y=c_j}$ is a linear mixture of the $P_{X|Y=c_j}$ on the source domains. This tends to be the case if $d_f$ is small: in this case, the rank of $V_s$ is small, and the class conditional feature distributions are likely to be linearly dependent; that is, the target-domain conditional distribution is likely to be representable as a linear mixture of those on the source domains. If needed, in such situations we can directly estimate $d_f$ from the source domains by finding the rank of the estimated $\mu[P^{(i)}_{X|Y=c_j}]$, $i = 1, \ldots, n$, under the condition that we have enough source domains which are diverse enough. More specifically, let

$$
\hat{\boldsymbol{\mu}}_j = \big[\hat{\mu}[P^{(1)}_{X|Y=c_j}], \ldots, \hat{\mu}[P^{(n)}_{X|Y=c_j}]\big] = \Big[\tfrac{1}{m_{1j}} \psi(\mathbf{x}^{(1)}_j)\mathbf{1}, \ldots, \tfrac{1}{m_{nj}} \psi(\mathbf{x}^{(n)}_j)\mathbf{1}\Big],
$$

where $\mathbf{1}$ denotes the vector of all ones of an appropriate size; under this condition, $d_f$ can be estimated as the maximum of the following quantity over all $j$:

$$
\operatorname{rank}(\hat{\boldsymbol{\mu}}_j) = \operatorname{rank}(\hat{\boldsymbol{\mu}}_j^{\top} \hat{\boldsymbol{\mu}}_j) = \operatorname{rank}(Q_j), \tag{4}
$$

where the $(i, i')$th entry of $Q_j$ is $\frac{1}{m_{ij} m_{i'j}} \mathbf{1}^{\top} K(\mathbf{x}^{(i)}_j, \mathbf{x}^{(i')}_j) \mathbf{1}$. In practice, an appropriately chosen threshold is needed to determine the rank, due to the estimation error in the kernel mean embedding.

[^3]: This holds if $F_j$ is essentially linear, or if $V_s$ does not change too much, so that one can use a linear approximation for $F_j$ on all observed domains.
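
The rank check in (4) only requires kernel matrices between the class-$c_j$ samples of the source domains. Below is a minimal sketch under our assumptions (a Gaussian kernel and a user-chosen eigenvalue threshold to decide the numerical rank); the function name and the threshold are illustrative, not prescribed by the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def estimate_rank_Qj(sources_xj, gamma=None, tol=1e-3):
    """Estimate rank(Q_j) for one class c_j, cf. (4).

    sources_xj: list of arrays; sources_xj[i] holds the class-c_j points of source domain i.
    """
    n = len(sources_xj)
    Q = np.zeros((n, n))
    for i in range(n):
        for ip in range(n):
            K = rbf_kernel(sources_xj[i], sources_xj[ip], gamma=gamma)
            # (i, i') entry: inner product of the two empirical kernel mean embeddings,
            # i.e. (1 / (m_ij * m_i'j)) * 1' K(x_j^(i), x_j^(i')) 1
            Q[i, ip] = K.mean()
    # numerical rank via an eigenvalue threshold (Q_j is PSD up to estimation error)
    eig = np.linalg.eigvalsh(Q)
    return int(np.sum(eig > tol * eig.max()))

# d_f is then estimated as the maximum of estimate_rank_Qj(...) over the classes j.
```
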
**Formulation.** Motivated by this, we make the following assumption on $P_{X|Y}$ on the target domain.

**A1.** For each $y$, $P^t_{X|Y=y}$ is a mixture of the $P_{X|Y=y}$ on the source domains, i.e., there exist $\alpha_{ij}$, which satisfy the constraint $\sum_{i=1}^{n} \alpha_{ij} = 1$ for all $j$, such that

$$
P^{\mathrm{new}}_{X|Y=c_j} = \sum_{i=1}^{n} \alpha_{ij}\, P^{(i)}_{X|Y=c_j} \tag{5}
$$

is equal to $P^t_{X|Y=c_j}$, where $c_j$ is the $j$th possible value of $Y$.[^4]

Denote by $P^{\mathrm{new}}_Y$ a marginal distribution of $Y$, and use $P^{\mathrm{new}}_Y(c_j)$ as shorthand for $P^{\mathrm{new}}_Y(Y = c_j)$. The corresponding joint distribution is

$$
P^{\mathrm{new}}_{X,Y=c_j} = P^{\mathrm{new}}_Y(c_j)\, P^{\mathrm{new}}_{X|Y=c_j}, \tag{6}
$$

and the marginal distribution of $X$ is then

$$
P^{\mathrm{new}}_X = \sum_{j=1}^{C} P^{\mathrm{new}}_Y(c_j) \sum_{i=1}^{n} \alpha_{ij}\, P^{(i)}_{X|Y=c_j}. \tag{7}
$$

We aim to match $P^{\mathrm{new}}_X$ with $P^t_X$ by tuning the parameters $\alpha_{ij}$ and $P^{\mathrm{new}}_Y(c_j)$. Here we have the constraints $P^{\mathrm{new}}_Y(c_j) \geq 0$ and $\sum_{j=1}^{C} P^{\mathrm{new}}_Y(c_j) = 1$. Let $\beta_{ij} \triangleq P^{\mathrm{new}}_Y(c_j)\, \alpha_{ij}$, which satisfy the condition

$$
\sum_{j=1}^{C} \sum_{i=1}^{n} \beta_{ij} = 1. \tag{8}
$$

Once we find the values of $\beta_{ij}$, we can reconstruct $P^{\mathrm{new}}_Y$ and $\alpha_{ij}$ by $P^{\mathrm{new}}_Y(c_j) = \sum_{i=1}^{n} \beta_{ij}$ and $\alpha_{ij} = \beta_{ij} / P^{\mathrm{new}}_Y(c_j)$. The following theorem states that under mild conditions, $P^t_{X|Y}$ can be uniquely recovered.

**Theorem 1.** Let Assumption A1 hold. Further make the following assumption:

**A2.** For any constants $d_{ij}$ that satisfy $\sum_{i=1}^{n} d_{ij}^2 \neq 0$, the combinations $\sum_{i=1}^{n} d_{ij}\, P^{(i)}_{X|Y=c_j}$, $j = 1, \ldots, C$, are always linearly independent, if they are not zero.

Then if $P^{\mathrm{new}}_X = P^t_X$, we have $P^{\mathrm{new}}_Y = P^t_Y$ and $P^{\mathrm{new}}_{X|Y} = P^t_{X|Y}$, i.e., $P^{\mathrm{new}}_{XY}$ is identical to $P^t_{XY}$.

[^4]: We have two remarks here. First, for the domains with $P^{(i)}_Y(c_j) = 0$, $P^{(i)}_{X|Y=c_j}$ is undefined, and one can simply set $\alpha_{ij} = 0$. Second, usually the weights $\alpha_{ij}$ in a distribution mixture model are assumed to be nonnegative; however, this is not necessary to guarantee that the constructed $P^t_{X|Y=c_j}$ is a valid distribution. For flexibility of the mixture model, we allow $\alpha_{ij}$ to be negative, as long as $P^{\mathrm{new}}_{X|Y=c_j}$ is a valid distribution, which, under appropriate assumptions, is achieved by matching $P^{\mathrm{new}}_X$ with $P^t_X$, as implied by Theorem 1.

To get an idea of how strong (or weak) Assumption A2 is, note that it is an assumption of linear independence of probability measures, or of densities (as functions of $x$). For continuous $x$, those are objects in infinite-dimensional spaces, and linear independence is the generic case rather than a special situation. A sufficient condition for Assumption A2 is that $P^{(i)}_{X|Y=c_j}$, $i = 1, \ldots, n$, $j = 1, \ldots, C$, are linearly independent. Note that this condition is much stronger: Assumption A2 allows $P^{(i)}_{X|Y=c_j}$, $i = 1, \ldots, n$, to be linearly dependent for the same $j$. In fact, here we do not care about the identifiability of the parameters $\beta_{ij}$ (or $\alpha_{ij}$ and $P_Y$), but about the identifiability of $P^{\mathrm{new}}_{X|Y}$.

### 2.2 Parameter Estimation by Reproducing the Target Feature Distribution

We can estimate the parameters $\beta_{ij}$, and hence $\alpha_{ij}$ and $P^{\mathrm{new}}_Y$, by minimizing the maximum mean discrepancy (MMD; see (Gretton et al. 2007)):

$$
\big\| \mu[P^{\mathrm{new}}_X] - \mu[P^t_X] \big\| = \big\| \mathbb{E}_{P^{\mathrm{new}}_X}[\psi(X)] - \mu[P^t_X] \big\| = \Big\| \sum_{j=1}^{C} P^{\mathrm{new}}_Y(c_j) \sum_{i=1}^{n} \alpha_{ij}\, \mu[P^{(i)}_{X|Y=c_j}] - \mu[P^t_X] \Big\|. \tag{9}
$$

Let $x^{(i)}_{jk}$, $k = 1, \ldots, m_{ij}$, denote the data points of $X$ in the $i$th source domain for which $Y = c_j$, where $m_{ij}$ is the total number of points in the $i$th source domain for which $Y = c_j$. Similarly, $x^t_k$ denotes the $k$th point of $X$ in the target domain. In practice, we minimize the square of the empirical version of (9):

$$
\Big\| \sum_{j=1}^{C} P^{\mathrm{new}}_Y(c_j) \sum_{i=1}^{n} \frac{\alpha_{ij}}{m_{ij}} \sum_{k=1}^{m_{ij}} \psi(x^{(i)}_{jk}) - \frac{1}{m} \sum_{k=1}^{m} \psi(x^t_k) \Big\|^2 = \sum_{i,i'=1}^{n} \sum_{j,j'=1}^{C} \frac{\beta_{ij} \beta_{i'j'}}{m_{ij} m_{i'j'}} \sum_{k=1}^{m_{ij}} \sum_{k'=1}^{m_{i'j'}} k(x^{(i)}_{jk}, x^{(i')}_{j'k'}) - \frac{2}{m} \sum_{i=1}^{n} \sum_{j=1}^{C} \frac{\beta_{ij}}{m_{ij}} \sum_{k=1}^{m_{ij}} \sum_{k'=1}^{m} k(x^{(i)}_{jk}, x^t_{k'}) + \text{const}. \tag{10}
$$

Let $\boldsymbol{\beta} \triangleq (\beta_{11}, \ldots, \beta_{1C}, \beta_{21}, \ldots, \beta_{2C}, \ldots, \beta_{n1}, \ldots, \beta_{nC})^{\top}$, let $A$ be the $nC \times nC$ matrix with entries $A_{(i-1)C+j,\,(i'-1)C+j'} = \frac{1}{m_{ij} m_{i'j'}} \sum_{k} \sum_{k'} k(x^{(i)}_{jk}, x^{(i')}_{j'k'}) = \frac{1}{m_{ij} m_{i'j'}} \mathbf{1}^{\top} K(\mathbf{x}^{(i)}_j, \mathbf{x}^{(i')}_{j'}) \mathbf{1}$ for $i, i' \in \{1, \ldots, n\}$ and $j, j' \in \{1, \ldots, C\}$, and let $\mathbf{b}$ be the $nC$-dimensional vector with entries $b_{(i-1)C+j} = -\frac{1}{m\, m_{ij}} \sum_{k=1}^{m_{ij}} \sum_{k'=1}^{m} k(x^{(i)}_{jk}, x^t_{k'}) = -\frac{1}{m\, m_{ij}} \mathbf{1}^{\top} K(\mathbf{x}^{(i)}_j, \mathbf{x}^t) \mathbf{1}$ for $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, C\}$. Then $\boldsymbol{\beta}$ can be estimated by minimizing

$$
J_0 = \boldsymbol{\beta}^{\top} A \boldsymbol{\beta} + 2 \mathbf{b}^{\top} \boldsymbol{\beta} + \text{const}, \tag{11}
$$

subject to the constraint (8).[^5] This is a quadratic programming (QP) problem. After finding the values of $\boldsymbol{\beta}$, we can then construct $\alpha_{ij}$ and $P^{\mathrm{new}}_Y(c_j)$. For some practical issues in this optimization procedure, including enforcing a sparsity constraint on $\alpha_{ij}$, see the Supplementary Material. In our experiments, we use the Gaussian kernel, which is known to be characteristic; unless specified otherwise, we adopt the median heuristic to set the kernel width.

[^5]: Here we use a hard constraint on $\beta_{ij}$. Note that in (Huang et al. 2007; Gretton et al. 2008), a slightly different constraint was used for importance weights to correct for covariate shift.
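
A minimal sketch of this estimation step under our assumptions (Gaussian kernel, `scipy.optimize.minimize` with SLSQP in place of a dedicated QP solver, and no extra sparsity constraint); the function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def estimate_alpha_py(source_xj, x_t, gamma=None):
    """Estimate alpha_{ij} and P^new_Y(c_j) by matching P^new_X to P^t_X, cf. (10)-(11).

    source_xj[i][j]: array with the class-c_j points of the i-th source domain.
    x_t:             array with the unlabeled target-domain points.
    """
    n, C = len(source_xj), len(source_xj[0])
    A = np.zeros((n * C, n * C))
    b = np.zeros(n * C)
    for i in range(n):
        for j in range(C):
            r = i * C + j
            # b entry: -(1 / (m * m_ij)) 1' K(x_j^(i), x^t) 1
            b[r] = -rbf_kernel(source_xj[i][j], x_t, gamma=gamma).mean()
            for ip in range(n):
                for jp in range(C):
                    # A entry: (1 / (m_ij * m_i'j')) 1' K(x_j^(i), x_j'^(i')) 1
                    A[r, ip * C + jp] = rbf_kernel(source_xj[i][j],
                                                   source_xj[ip][jp], gamma=gamma).mean()

    J0 = lambda beta: beta @ A @ beta + 2.0 * b @ beta           # objective (11)
    cons = {"type": "eq", "fun": lambda beta: beta.sum() - 1.0}  # constraint (8)
    beta0 = np.full(n * C, 1.0 / (n * C))
    beta = minimize(J0, beta0, constraints=[cons], method="SLSQP").x.reshape(n, C)

    p_new_y = beta.sum(axis=0)          # P^new_Y(c_j) = sum_i beta_ij
    alpha = beta / p_new_y              # alpha_ij = beta_ij / P^new_Y(c_j)
    return alpha, p_new_y
```

The nonnegativity of $P^{\mathrm{new}}_Y(c_j)$ is not enforced in this sketch; it can be added as inequality constraints on the corresponding column sums of $\beta$.
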
### 2.3 Construction of Target Classifiers

Given the estimated parameters $\beta_{ij}$ (or $\alpha_{ij}$), we then present several natural ways to construct the target-domain classifier or to directly determine the class labels on the target domain.

**By importance reweighting on source samples (denoted `weigh_sample`).** The first approach is to train the classifier on the original data points in the source domains with appropriate importance weights. Once we find $\alpha_{ij}$ and $P^{\mathrm{new}}_Y(c_j)$, we can construct $P^{\mathrm{new}}_{XY}$, which mimics $P^t_{XY}$. According to (5), since an empirical estimator of $P^{(i)}_{X|Y}(x \,|\, y = c_j)$ is $\hat{P}^{(i)}_{X|Y}(x \,|\, y = c_j) = \frac{1}{m_{ij}} \sum_{k=1}^{m_{ij}} \delta\big(x - x^{(i)}_{jk}\big)$, where $\delta(\cdot)$ is the Dirac delta function, an empirical estimator of $P^{\mathrm{new}}_{XY}(x, y = c_j)$ is $\hat{P}^{\mathrm{new}}_{XY} = P^{\mathrm{new}}_Y(c_j) \sum_{i=1}^{n} \frac{\alpha_{ij}}{m_{ij}} \sum_{k=1}^{m_{ij}} \delta\big(x - x^{(i)}_{jk}\big)$. We aim to find the function $f(x)$ which minimizes the expected loss on the target domain. Denote by $\ell(x, y; \theta)$ the loss function, where $\theta$ denotes the involved parameters; the expected loss is $R[P^t_{XY}, \theta] = \mathbb{E}_{P^t_{XY}}[\ell(x, y; \theta)]$. Its empirical estimator is $R_{\mathrm{emp}}[\hat{P}^{\mathrm{new}}_{XY}, \theta] = \int \hat{P}^{\mathrm{new}}_{XY}\, \ell(x, y; \theta)\, dx\, dy = \sum_{j=1}^{C} \sum_{i=1}^{n} \sum_{k=1}^{m_{ij}} \frac{\alpha_{ij}\, P^{\mathrm{new}}_Y(c_j)}{m_{ij}}\, \ell(x^{(i)}_{jk}, c_j; \theta)$. We can then train the classifier on all source data points with the reweighting coefficients $\frac{\alpha_{ij}\, P^{\mathrm{new}}_Y(c_j)}{m_{ij}}$; a sketch is given at the end of this subsection.

**By generative modeling (denoted `genar_model`).** The second approach is purely generative. Let $\eta_j(x) \triangleq P^t_{Y=c_j|X}(x) = \frac{P^t_Y(c_j)\, P^t_{X|Y=c_j}}{P^t_X}$. For any value of $x$, if $\eta_j(x)$ is known, one can directly find the class label for $x$ by comparing $\eta_j(x)$, $j = 1, \ldots, C$. We propose a method to estimate $\eta_j(x)$ without explicitly estimating the involved distributions. Again, we make use of the kernel mean embedding of distributions. For details see the Supplementary Material.

**By weighted combination of source classifiers (denoted `combn_classf`).** Alternatively, we can combine the individual source classifiers to form the one for the target domain:

$$
P^t_{Y|X}(y = c_j \,|\, x) = \frac{P^t_Y(c_j) \sum_{i=1}^{n} \alpha_{ij}\, P^{(i)}_{X|Y}(x \,|\, y = c_j)}{P^t_X(x)} = \sum_{i=1}^{n} \gamma^{(i)}_j(x)\, P^{(i)}_{Y|X}(y = c_j \,|\, x), \tag{12}
$$

where $\gamma^{(i)}_j(x) \triangleq \frac{\alpha_{ij}\, P^t_Y(c_j)\, P^{(i)}_X(x)}{P^{(i)}_Y(c_j)\, P^t_X(x)}$. Note that under Assumption A1, we have $\sum_{i=1}^{n} \gamma^{(i)}_j(x) = 1$. The weights $\gamma^{(i)}_j(x)$ can be estimated in a similar way to $\eta_j$ in the approach `genar_model`. This method involves the construction of $n$ classifiers and combines them with weights $\gamma^{(i)}_j(x)$, which depend on the test point $x$, the domain $i$, and the class $j$.

Comparing those approaches would involve theoretical studies of discriminative and generative classifiers and of the behavior of importance reweighting and weighted combination of classifiers. Generally speaking, as a generative approach, `genar_model` might not work well when $X$ is high-dimensional. We will next compare them empirically.
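
As one concrete instance, here is a minimal sketch of the `weigh_sample` approach under our assumptions: the estimated $\alpha_{ij}$ and $P^{\mathrm{new}}_Y(c_j)$ are turned into per-sample weights $\alpha_{ij} P^{\mathrm{new}}_Y(c_j) / m_{ij}$ and passed to any classifier that accepts sample weights; scikit-learn's `SVC` is used for illustration, and negative $\alpha_{ij}$ are clipped to zero here, since `SVC` expects nonnegative weights.

```python
import numpy as np
from sklearn.svm import SVC

def train_weigh_sample(source_xj, alpha, p_new_y):
    """Train a target-domain classifier on reweighted source samples (weigh_sample).

    source_xj[i][j]: class-c_j points of source domain i.
    alpha[i, j], p_new_y[j]: parameters estimated by matching P^new_X with P^t_X.
    """
    X, y, w = [], [], []
    for i in range(len(source_xj)):
        for j in range(len(source_xj[i])):
            x_ij = source_xj[i][j]
            m_ij = len(x_ij)
            if m_ij == 0:
                continue
            weight = alpha[i, j] * p_new_y[j] / m_ij      # reweighting coefficient from Sec. 2.3
            X.append(x_ij)
            y.append(np.full(m_ij, j))
            w.append(np.full(m_ij, weight))
    X, y, w = np.vstack(X), np.concatenate(y), np.concatenate(w)
    w = np.clip(w, 0.0, None)                             # SVC expects nonnegative sample weights
    clf = SVC(kernel="rbf")
    clf.fit(X, y, sample_weight=w)
    return clf
```
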
### 2.4 Special Case: Distribution Weighted Hypothesis Combination

The distribution weighted hypothesis combination rule (Mansour, Mohri, and Rostamizadeh 2008) is actually a special case of the proposed `combn_classf` under additional constraints, as stated in the following theorem.

**Theorem 2.** Suppose the conditions in Theorem 1 hold. The source hypothesis combination rule (12) reduces to the distribution weighted combination rule in the form of (2) under any of the following conditions:

1. $P_{X|Y}$ does not change across domains, and $P^t_Y$ is a linear mixture of the $P_Y$ on the source domains, or
2. $P_Y$ does not change, and the $\alpha_{ij}$ in (5) are the same for all classes $j = 1, \ldots, C$, or
3. both $P_{X|Y}$ and $P_Y$ change, but $\alpha_{ij} P^{(i)}_Y(c_j) / P^t_Y(c_j)$ are the same for all $j$.

The three conditions in the above theorem all constrain how $P_Y$ or $P_{X|Y}$ change. For the distribution weighted rule, the same coefficient, $1/n$, was used in (Mansour, Mohri, and Rostamizadeh 2008) for all sources; here we denote this method by `simple_adapt`. We propose to use kernel mean matching (KMM; see (Huang et al. 2007)) to estimate the $\alpha_i$ in the distribution weighted rule (2) from data such that $\sum_i \alpha_i P^{(i)}_X$ is as close to $P^t_X$ as possible; the resulting hybrid method is denoted by `dstr_wgh (H)`. Moreover, note that in our `dstr_wgh (H)` the weights can be negative, while in (Mansour, Mohri, and Rostamizadeh 2008) all coefficients have to be nonnegative.

## 3 Experiments

### 3.1 Simulations

We first test the performance of the multi-source DA methods proposed in Section 2.3 for classification on simulated data. We generated the data according to Assumption A1 in Section 2.1: on each domain, we generated the data points belonging to each class as a mixture of three fixed Gaussians, which have different means or variances, with random mixing coefficients, and $P_Y$ was also randomly chosen on each domain. We used three source domains, and in each domain the number of points in each class is a random number between 50 and 600. Fig. 2 shows the simulated data in one replication.

*Figure 2: Simulated data in one replication.*

We compare the three classification approaches proposed in Section 2.3 against a number of alternatives. We include the following representative hypothesis combination methods for comparison: LWE (Gao et al. 2008), convex hypothesis combination (Mansour, Mohri, and Rostamizadeh 2008), denoted `convex`, `simple_adapt` (Mansour, Mohri, and Rostamizadeh 2008), and `dstr_wgh (H)`, which adopts the distribution weighted combination rule (2) with the weights $\alpha_i$ estimated from data. KMM for correcting covariate shift (Huang et al. 2007), the pooling SVM (denoted `pool_SVM`), which merges all source data to train the SVM, domain-invariant component analysis (DICA) (Muandet, Balduzzi, and Schölkopf 2013), and learning marginal predictors (LMP), proposed by (Blanchard, Lee, and Scott 2011), are also included. In our methods, we simply set the kernel width to 0.5, and the SVM parameters were selected by 5-fold cross validation on the parameter grids.

Fig. 3 gives the boxplot of the misclassification rate of each method over 50 replications. We use both the Wilcoxon signed ranks test and the Friedman test, recommended by (Demšar 2006), for performance comparison. With both tests, we found that on simulated data, the proposed approaches `weigh_sample` and `combn_classf` outperform all alternatives with p-values smaller than 0.01, and that `genar_model` outperforms all the remaining methods with p-values smaller than 0.05. `dstr_wgh (H)` and `simple_adapt` are closely behind, verifying the finding reported in (Mansour, Mohri, and Rostamizadeh 2008) that the distribution weighted rule outperforms the convex combination of the source hypotheses.

*Figure 3: Boxplot of the misclassification error (%) of each method on simulated data (50 replications); the compared methods are `pool_SVM`, KMM, DICA, LWE, LMP, `convex`, `simple_adapt`, `dstr_wgh (H)`, `weigh_sample`, `genar_model`, and `combn_classf`.*

Since the data points from each class were drawn from the mixture of three Gaussians with random coefficients, for each class, $d_f$, the degree of freedom in the conditional distribution change as defined in Section 2.1, is 3. Recall that it indicates how many non-redundant source domains are needed to reconstruct $P^t_{X|Y}$. On the simulated data we found that $\operatorname{rank}(Q_j) = 3$. We also varied the number of source domains from 3 to 5, and the rank of $Q_j$ is still 3, as confirmed by the test of the rank of Hermitian positive semidefinite matrices (Camba-Mendez and Kapetanios 2005).
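
For reproducibility, the following is a minimal sketch of the simulation setup described above (not the authors' exact generator): each domain draws each class from a random mixture of three fixed Gaussian components, with 50 to 600 points per class, so that both $P_Y$ and $P_{X|Y}$ vary across domains.

```python
import numpy as np

def make_domain(rng, means, stds, n_classes=2):
    """One simulated domain: each class is a random mixture of three fixed Gaussians.

    means[j], stds[j]: arrays of shape (3, d) with the component means / standard deviations
    for class j (fixed across domains); the mixing coefficients and class sizes are redrawn
    per domain, which makes P_{X|Y} and P_Y domain specific.
    """
    X, y = [], []
    for j in range(n_classes):
        m_j = rng.integers(50, 601)                  # class size between 50 and 600
        mix = rng.dirichlet(np.ones(3))              # random mixing coefficients over the 3 components
        comp = rng.choice(3, size=m_j, p=mix)        # component assignment per point
        X.append(rng.normal(means[j][comp], stds[j][comp]))
        y.append(np.full(m_j, j))
    return np.vstack(X), np.concatenate(y)
```

Calling `make_domain` once per domain with the same `means` and `stds` (three source domains plus one target, whose labels are withheld) reproduces the setting above; with three sources, the target class conditionals generically lie in the span of the source class conditionals, as required by Assumption A1.
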
### 3.2 Real data: Sentiment analysis & Object recognition

The sentiment data (Blitzer, Dredze, and Pereira 2007) consist of review text and labels for four categories of goods (domains): book, dvd, electronics, and kitchen; each domain contains 2000 data points (or reviews) with four labels (or classes). We repeated the experiments on this dataset by (Mansour, Mohri, and Rostamizadeh 2008), but with a more general setting. (Mansour, Mohri, and Rostamizadeh 2008) constructed the target domain as a uniform mixture of data points randomly sampled from the four domains; the rest of the data were used as source-domain data. For each class, we sampled $w\%$ ($w$ is a random number between 20 and 50) of the points from each source domain as the target-domain data. Our sampling scheme is more general: in our case $P^t_{XY}$ is not necessarily a uniform mixture of the $P^{(i)}_{XY}$. We use the frequency of the unigrams that appear 50 times or more in every domain as the features (in total there are 308 features). Each method was repeated 10 times by randomly sampling the data. The mean and standard deviation of the accuracies on the target domains for each method are given in the upper part of Table 2. `combn_classf` and `weigh_sample` give the best accuracies.

We also compared our approaches with alternatives on the object recognition data (Griffin, Holub, and Perona 2007), as done by (Gong et al. 2012). We evaluated different methods on four object recognition datasets (domains): Amazon (images downloaded from Amazon), Webcam (low-resolution images by a web camera), DSLR (high-resolution images by a SLR camera), and Caltech-256 (Griffin, Holub, and Perona 2007). We extracted 10 common categories among all domains. There are 8 to 151 samples per category per domain, and 2533 images in total. We used three domains as sources and the remaining one as the target. We followed the feature extraction scheme in (Gong et al. 2012). We used SVM for all the DA methods, and the SVM hyperparameters were selected by 5-fold cross validation on a grid. The results are shown in Table 2. The Friedman test gives a p-value of 0.02, indicating that those approaches give different performances at the significance level 0.05; furthermore, `combn_classf` performs best, closely followed by `dstr_wgh (H)`.

| dataset | pool_SVM | KMM | DICA | LWE | convex | simple_adapt | dstr_wgh (H) | weigh_sample | combn_classf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentiment | 48.37 (1.55) | 41.75 (1.03) | 27.69 (0.71) | 35.39 (5.62) | 43.95 (1.04) | 44.76 (1.21) | 43.85 (1.19) | 50.66 (1.36) | 51.72 (1.12) |
| amazon | 51.93 | 49.00 | 51.62 | 54.34 | 41.38 | 41.07 | 53.50 | 52.35 | 52.14 |
| caltech | 43.49 | 40.37 | 43.40 | 40.37 | 40.55 | 40.02 | 44.92 | 43.49 | 43.32 |
| dslr | 49.68 | 49.68 | 49.68 | 17.83 | 54.14 | 54.14 | 57.96 | 50.95 | 62.42 |
| webcam | 56.95 | 53.90 | 55.25 | 37.97 | 61.02 | 61.69 | 62.37 | 60.67 | 62.66 |

Table 2: Classification accuracy (%) of different methods on the sentiment and object recognition data sets; for the sentiment data, the mean (standard deviation) over 10 repetitions is reported.

## 4 Conclusion and discussions

We provided a causal view of domain adaptation with multiple source domains and noted that background causal knowledge about the data-generating process helps greatly in domain adaptation. Under different causal assumptions, the knowledge to be transferred from source domains to the target may be different, leading to different algorithms for domain adaptation. We considered several simplified causal models for this task and accordingly gave the optimal hypothesis for the target domain. In particular, we have focused on a multi-source domain adaptation problem in which $P_Y$ and $P_{X|Y}$ change independently across domains, where $X$ denotes the features and $Y$ the target. The proposed methods consist of two steps. One first recovers $P_{X|Y}$ and $P_Y$ on the target domain by tuning the involved parameters to reproduce the corresponding observed feature distribution. The second step constructs the classifier for the target domain or directly determines the target-domain class labels; to this end we presented three natural approaches for target-domain classification, which exploit importance reweighting, use generative learning, or resort to a weighted combination of source hypotheses.
The proposed methods rely on the assumption that for each class, the target-domain conditional distribution $P_{X|Y}$ can be represented as a mixture of those on the source domains. For some real problems, certain features could be highly noisy, and this assumption might not hold for some features or components of features; it would therefore be beneficial to find appropriate feature representations, as in (Ben-David et al. 2007). Furthermore, another future line of research is to derive convergence bounds and learning guarantees for the proposed domain adaptation approaches, following (Cortes, Mansour, and Mohri 2010; Iyer, Nath, and Sarawagi 2014).

## Acknowledgments

KZ would like to thank Elias Bareinboim for helpful discussions on selection diagrams and sample selection bias. We are grateful to the anonymous reviewers for their helpful comments and suggestions.

## References

Bareinboim, E.; Tian, J.; and Pearl, J. 2014. Recovering from selection bias in causal and statistical inference. In Proc. 28th AAAI Conference on Artificial Intelligence, 2410-2416.

Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In Proceedings of NIPS 2006.

Ben-David, S.; Shalev-Shwartz, S.; and Urner, R. 2012. Domain adaptation: Can quantity compensate for quality? In ISAIM 2012.

Blanchard, G.; Lee, G.; and Scott, C. 2011. Generalizing from several related classification tasks to a new unlabeled sample. In NIPS 2011, 2178-2186.

Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In ACL, 187-205.

Camba-Mendez, G., and Kapetanios, G. 2005. Estimating the rank of the spectral density matrix. Journal of Time Series Analysis 26:37-48.

Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N., eds. 2009. Dataset Shift in Machine Learning. MIT Press.

Chattopadhyay, R.; Ye, J.; Panchanathan, S.; Fan, W.; and Davidson, I. 2011. Multi-source domain adaptation and its application to early detection of fatigue. In KDD.

Cortes, C.; Mansour, Y.; and Mohri, M. 2010. Learning bounds for importance weighting. In NIPS 23.

Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1-30.

Duan, L.; Tsang, I. W.; Xu, D.; and Chua, T. S. 2009. Domain adaptation from multiple sources via auxiliary classifiers. In ICML.

Gao, J.; Fan, W.; Jiang, J.; and Han, J. 2008. Knowledge transfer via multiple model local structure mapping. In International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV.

Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2066-2073.

Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; and Smola, A. 2007. A kernel method for the two-sample-problem. In NIPS 19, 513-520.
Cambridge, MA: MIT Press.

Gretton, A.; Smola, A.; Huang, J.; Schmittfull, M.; Borgwardt, K.; and Schölkopf, B. 2008. Covariate shift and local learning by distribution matching. In Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N., eds., Dataset Shift in Machine Learning. Cambridge, MA: MIT Press. 131-160.

Griffin, G.; Holub, A.; and Perona, P. 2007. Caltech-256 object category dataset.

Huang, J.; Smola, A.; Gretton, A.; Borgwardt, K.; and Schölkopf, B. 2007. Correcting sample selection bias by unlabeled data. In NIPS 19, 601-608.

Iyer, A.; Nath, A.; and Sarawagi, S. 2014. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In Proc. ICML 2014.

Jiang, J. 2008. A literature survey on domain adaptation of statistical classifiers. http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey.

Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2008. Domain adaptation with multiple sources. In NIPS 19, 1041-1048. Cambridge, MA: MIT Press.

Muandet, K.; Balduzzi, D.; and Schölkopf, B. 2013. Domain generalization via invariant feature representation. In Proceedings of the 30th International Conference on Machine Learning, JMLR: W&CP Vol. 28.

Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22:1345-1359.

Pearl, J., and Bareinboim, E. 2011. Transportability of causal and statistical relations: A formal approach. In Proc. AAAI 2011, 247-254.

Pearl, J. 2000. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press.

Plessis, M. D., and Sugiyama, M. 2012. Semi-supervised learning of class balance under class-prior change by distribution matching. In Proceedings of the 29th International Conference on Machine Learning (ICML).

Rosenstein, M.; Marx, Z.; Kaelbling, L.; and Dietterich, T. 2005. To transfer or not to transfer. In NIPS 2005 Workshop on Inductive Transfer: 10 Years Later.

Schölkopf, B.; Janzing, D.; Peters, J.; Sgouritsa, E.; Zhang, K.; and Mooij, J. 2012. On causal and anticausal learning. In Proc. 29th International Conference on Machine Learning (ICML 2012).

Shimodaira, H. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90:227-244.

Smola, A.; Gretton, A.; Song, L.; and Schölkopf, B. 2007. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, 13-31. Springer-Verlag.

Storkey, A. 2009. When training and test sets are different: Characterizing learning transfer. In Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N., eds., Dataset Shift in Machine Learning. MIT Press. 3-28.

Sugiyama, M., and Kawanabe, M. 2012. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. Cambridge, MA, USA: MIT Press.

Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bünau, P.; and Kawanabe, M. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60:699-746.

Tian, J., and Pearl, J. 2001. Causal discovery from changes: A Bayesian approach. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI 2001), 512-521.

Zadrozny, B. 2004. Learning and evaluating classifiers under sample selection bias. In 21st International Conference on Machine Learning, 114-121.

Zhang, K.; Schölkopf, B.; Muandet, K.; and Wang, Z. 2013.
Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, JMLR: W&CP Vol. 28.