# Domain adaptation under structural causal models

Journal of Machine Learning Research 22 (2021) 1-80. Submitted 10/20; Revised 11/21; Published 11/21.

Yuansi Chen (yuansi.chen@stat.math.ethz.ch)
Peter Bühlmann (buhlmann@stat.math.ethz.ch)
Seminar for Statistics, ETH Zürich, Zürich, Switzerland

Editor: Isabelle Guyon

Abstract

Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions in our theory are violated.

Keywords: anticausal, conditionally invariant components, domain generalization, domain invariant projection, label shift, structural equation models

1. Introduction

Domain adaptation (DA) is a statistical machine learning problem in which one aims at learning a model from a labeled source dataset and expecting it to perform well on an unlabeled target dataset drawn from a different but related data distribution.
Domain adaptation is considered a sub-field of transfer learning and also a sub-field of semi-supervised learning (Pan and Yang, 2009). The possibility of DA is inspired by the human ability to apply knowledge acquired on previous tasks to unseen tasks with minimal or no supervision. For example, it is commonly believed that humans who learned to drive on sunny days can adapt their skills to drive reasonably well on a rainy day without additional training. However, the scenario where the source and target data distributions differ (e.g. sunny vs. rainy) is difficult to handle for many machine learning systems. This is mainly because classical statistical learning theory mostly focuses on methods and guarantees for the case where the training and test data are generated from the same distribution.

While existing DA theory is limited, more and more application scenarios have emerged where DA is needed and useful. DA is desired especially when obtaining unlabeled data is cheap while labeling data is difficult. The difficulties of labeling data typically arise from the required human expertise or the large amount of human labor. For example, to annotate the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset (Russakovsky et al., 2015) with more than 1.2 million labeled images, it takes an average worker on Amazon Mechanical Turk (www.mturk.com) about 0.5 seconds per image (Fei-Fei, 2010). Thus, annotating more than 1.2 million images requires more than 150 human hours. The actual time needed for labeling is much longer, due to the additional time spent on label verification and on labeling the images which are not used in the final challenge.

© 2021 Yuansi Chen and Peter Bühlmann. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1227.html.
Now if one is only interested in a similar but different task such as classifying objects in oil paintings, one cannot expect the ILSVRC dataset, which contains pictures of natural objects, to be representative. As a consequence, one needs to collect new data. It is relatively easy to collect digital images for the new task, but it is costly to label them if human experts have to be involved. Similar situations where labeling is costly emerge in many other fields such as part-of-speech tagging (Ratnaparkhi, 1996; Ben-David et al., 2007), web-page classification (Blum and Mitchell, 1998), etc.

Despite the broad need for DA methods in practice, a priori it is impossible to provide a generic solution to DA. In fact, the DA problem is ill-posed if assumptions on the relationship between the source and target datasets are absent. One cannot learn a model with good performance if the target dataset can be arbitrary. As it is often difficult to specify the relationship between the source and target datasets, a large body of existing DA work is driven by applications. This line of work often focuses on developing new DA methods to solve very specific data problems. Despite the empirical success on these specific problems, it is not clear why DA succeeds or how applicable the proposed DA methods are to new related data problems.

The growing development of domain adaptation calls for a theoretical framework to analyze the existing methods and to guide the design of new procedures. More specifically, can one formulate the assumptions needed for a DA method to have a low target error? Can these assumptions be itemized, so that once specified the performance of different DA methods can be compared? In particular, does the causal direction of the data generation or the type and location of data perturbation (Figure 1) affect the choice of the best DA method?
To answer the above questions, this work develops a theoretical framework via structural causal models that enables the comparison of various existing DA methods. Since there are no clear winning DA methods in general, the performance of DA methods has to be analyzed and compared with precise assumptions on the underlying data structure. Through analysis on simple models, we aim to give insights on when and why one DA method outperforms others.

[Figure 1, panels: (i) causal prediction, (ii) anticausal, (iii) mixed.]
Figure 1: (i) Causal prediction setting, where the label Y is a descendant of the covariates X. The hammers indicate the location of data perturbation. (ii) Anticausal prediction setting, where all the covariates X are descendants of the label Y. (iii) Mixed-causal-anticausal setting, where some covariates are descendants of the label Y and some covariates are ancestors of Y. Would the causal direction of the data generation affect DA performance? Would changing the location of the perturbation (black hammer vs. white hammer) cause some DA methods to fail?

Our contributions: Our contributions are three-fold. First, we develop a theoretical framework via structural causal models (SCM) to analyze and compare the prediction performance of existing DA methods, such as domain invariant projection (DIP) (Pan et al., 2010; Baktashmotlagh et al., 2013) and conditional invariance penalty (CIP) or conditional transferable component (Gong et al., 2016; Heinze-Deml and Meinshausen, 2021), under precise assumptions relating source and target data. In particular, we show that under linear SCM the popular DA method DIP is guaranteed to have a low target error when the prediction problem is anticausal without label distribution perturbation. However, DIP fails to outperform the estimator trained solely on the source data when there are perturbations on the label distribution or when the prediction problem is either causal or mixed-causal-anticausal. Second, based on our theory, we introduce a new DA method called CIRM and develop its variants, which can have better prediction performance than DIP in mixed-causal-anticausal DA scenarios and those with label distribution perturbation. Third, we illustrate via extensive simulations and real data experiments that our theoretical DA framework enables a better understanding of the success and failure of DA methods even in cases where presumably not all assumptions in our theory are satisfied. Our theory and experiments make it clear that knowing the relevant information about the data generation process, such as the causal direction or the existence of label distribution perturbation, is the key to the success of domain adaptation.

The rest of the paper is organized as follows. In Section 2 we review previous ways to mathematically formulate the DA problem and summarize the existing DA methods. Section 3 contains background on structural causal models, our problem setup, a formal introduction of the DA methods to study and three simple motivating examples. In Section 4 we analyze and compare the performance of three DA methods (DIP, CIP and CIRM) under our theoretical framework. Based on whether the DA problem is causal, anticausal or mixed and whether there is label distribution perturbation, we identify scenarios where these DA methods are guaranteed to have low target errors. Section 5 contains numerical experiments on synthetic and real datasets to illustrate the validity of our theoretical results and the implication of our theory for practical scenarios.

2. Related work

Domain adaptation is a sub-field of the broader research area of transfer learning.
More specifically, domain adaptation is also named transductive transfer learning (Pan and Yang, 2009; Redko et al., 2020). In this paper, we concentrate on the DA problem without diving into the broader field of transfer learning. Consequently, much previous work on transfer learning is omitted for the sake of space. We direct the interested readers to the survey paper by Pan and Yang (2009) and references therein for a literature review on transfer learning. Focusing on the DA problem, we first review existing theoretical frameworks of DA and then provide an overview of DA methods and algorithms.

Theoretical DA frameworks: The DA problem is ill-posed if one does not specify any assumptions on the relationship between the source and target data distributions. For this reason, depending on how this relationship is specified, many ways to formulate the DA problem have been introduced.

Ben-David et al. (2007) were the first to provide a DA prediction performance bound via Vapnik-Chervonenkis (VC) theory for classifiers from a generic hypothesis class. Since this bound is obtained without making explicit assumptions on the relationship between the source and target data distributions, it involves a divergence term that characterizes the closeness of the source and target distributions. A follow-up from Ben-David et al. (2010) further formally proves the necessity of assuming similarity between source and target distributions to ensure learnability. Ben-David et al.'s work laid the foundation of many further studies that attempt to bound the difference between target and source prediction performance via divergence measures (Mansour et al., 2009; Cortes and Mohri, 2011, 2014; Cortes et al., 2015; Hoffman et al., 2018). See also the survey paper by Redko et al. (2020) for a complete review of VC-theory-type DA prediction performance bounds.
One natural way to explicitly relate the source and target data distributions is to assume that both of them are generated via the same well-specified generative model and to treat the missing target label problem as a missing data problem. Given a well-specified probabilistic generative model, the problem of imputing the missing data is well-studied and it is commonly solved via the expectation maximization (EM) algorithm (see McLachlan and Krishnan, 2007). This idea of casting a DA problem as a missing data problem was introduced in Amini and Gallinari (2003) and Nigam et al. (2006).

It is also possible to loosely relate the source and target data distributions by assuming that they differ only by a small amount. How to specify this small amount depends on the application. If one assumes that the source data distribution is contaminated so that it is ϵ away from the target data distribution in total variation distance, then the problem goes back to the classical robust statistics literature (Huber, 1964; Yuan et al., 2019). More recently, Wasserstein distance or f-divergence have been considered to describe the difference between source and target data distributions. This line of work is referred to as distributional robust learning (Sinha et al., 2018; Duchi and Namkoong, 2018; Gao et al., 2017). A closely related line of work directly assumes that target data points can be interpreted as source data points contaminated with arbitrarily small additive noise quantified via norm constraints ($\ell_1$ or $\ell_\infty$). This direction is called adversarial machine learning (Goodfellow et al., 2018; Raghunathan et al., 2018).

Another way to make the DA problem tractable is to assume that the conditional distribution Y | X is invariant across source and target data, where Y and X denote the response (label) and covariates, respectively. The only difference between source and target distributions comes from the change in the distribution of the covariates X.
This type of assumption is called the covariate shift assumption (Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2012; Storkey, 2009). Alternatively, assuming the other conditional distribution X | Y to be invariant is also plausible in certain applications. This approach goes by the analogous name of the label shift assumption (Lipton et al., 2018; Azizzadenesheli et al., 2019; Garg et al., 2020).

Finally, on the causality side, it has been pointed out by Pearl and Bareinboim (2014) that full specification of a structural causal model (SCM) allows one to study the transportability of learning methods on the relationship between variables in the structural causal model. We refer to this approach as full-SCM transfer learning. The full-SCM transfer learning approach is very powerful in describing many data generation models. However, the main drawback of this framework is that the full specification of the structural causal model might be difficult to learn in many applications with limited data. On the other hand, the pioneering work by Schölkopf et al. (2012) reveals that distinguishing between causal and anticausal prediction may already be useful to facilitate the selection of DA and semi-supervised learning methods. Figuring out the right amount of causal information needed to carry out DA is one of our main motivations in this paper.

Previous DA methods: While it is in general helpful to have theoretical DA frameworks to relate the source and target data, theory is not essential for the development of new DA methods. A large number of DA methods and algorithms were introduced with the focus of addressing DA for specific datasets. Here we highlight several popular ones.

Self-training (Amini and Gallinari, 2003) is one of the earliest DA methods; it originated in the semi-supervised learning literature (see the book by Chapelle et al., 2009).
The self-training algorithm begins with an estimator trained on the source data, gradually labels a part of the unlabeled target data, and then updates the estimator with appropriate regularization after adding the newly labeled target data. It has been shown to have good empirical performance on several computer vision domain adaptation tasks with small labeled source datasets (Xie et al., 2020; Carmon et al., 2019). A theoretical analysis of the performance of self-training under a gradual shift assumption on Gaussian mixture data was recently provided by Kumar et al. (2020).

An important line of DA methods relates the source and target data by assuming the existence of a common subspace. Pan et al. (2010) first came up with the idea of projecting the source and target data onto a reproducing kernel Hilbert space to preserve common properties and applied the idea to text classification datasets. The existence of an intermediate subspace that relates source and target data was made precise in Gopalan et al. (2011) for visual object recognition datasets. This method was further developed and analyzed for sentiment analysis and web-page classification (Blitzer et al., 2011; Gong et al., 2012; Muandet et al., 2013). Baktashmotlagh et al. (2013) simplified the idea of enforcing a common subspace to adding a regularization term based on the maximum mean discrepancy (MMD) (Gretton et al., 2012). Their method is named domain invariant projection (DIP) because the regularization term enforces a projection of the source and target data on the subspace to be invariant in distribution. Recently, with the development of deep neural networks and the introduction of distributional distance measures based on generative adversarial nets, the common subspace approach was further extended to allow for neural network implementations (Ganin et al., 2016; Peng et al., 2019).
The empirical success of DIP-like DA methods and their neural network variants such as in Ganin et al. (2016) has sparked a wide discussion on the general validity of these methods. Zhao et al. (2019) constructed a simple counterexample showing that domain invariant projection is not sufficient to guarantee successful domain adaptation. Furthermore, Zhao et al. (2019) provided a lower bound on the target error when the label distribution is perturbed. Johansson et al. (2019) illustrated via a simple example the danger of the blind use of distributional distances such as MMD for domain invariant penalization. Li et al. (2019) and Tachet des Combes et al. (2020) discussed the failure of DIP in the presence of target label perturbation and proposed label shift correction using conditionally invariant features. However, it is not clear whether their proposed algorithms have any guarantees for estimating the conditionally invariant features or achieving low target error. The mixed messages about the success and failure of DIP-like DA methods motivate us to set up rigorous target error comparisons of DIP and other DA methods.

Another line of DA methods worth mentioning is the one that only makes use of source data. Gong et al. (2016) introduced conditional transferable components, which consist of features that have invariant distributions given the label. The search for conditional transferable components is achieved via a penalty that matches the conditional distribution for any label across source environments. A related idea was proposed by Heinze-Deml and Meinshausen (2021). They extract the conditionally invariant (or core) components (CICs) across data points that share the same identifier but have different style features. The invariance is enforced by adding a conditional variance penalty to the training loss. Enforcing the conditional invariance allows them to learn models that are robust across perturbed computer vision datasets.
Later, other concepts of invariance beyond conditional invariance across source environments were developed, such as invariant risk minimization (Arjovsky et al., 2019). It should be noted that, though with a different focus, the idea of using the heterogeneity across multiple source datasets to learn invariant or robust models has also appeared in the causal inference literature (Peters et al., 2016; Meinshausen, 2018; Rothenhäusler et al., 2021). Based on these ideas, Rojas-Carulla et al. (2018) and Magliacane et al. (2018) provided guarantees for domain adaptation under the assumption that the conditional distribution of the label given some subset of covariates is invariant.

There are many other interesting DA methods that are less related to our work. For the sake of space, we direct the interested readers to the book by Chapelle et al. (2009) on semi-supervised learning and other surveys (Zhu, 2005; Wilson and Cook, 2020; Wang and Deng, 2018) for additional references.

3. Preliminaries and problem setup

In this section, we first provide a brief summary of structural causal models, which are essential components of our theoretical framework. Then we formalize our domain adaptation problem setup and introduce the DA methods we study. Finally, we provide three simple motivating examples to illustrate the need for a DA theory.

3.1 Background on structural causal models

Structural causal models (SCMs) (see Pearl, 2000) are introduced to describe causal relationships between variables in a system. An SCM integrates the structural equation models (SEMs) used in economics and social sciences, the potential-outcome framework of Neyman (1923) and Rubin (1974), and the graphical models developed for probabilistic reasoning and causal analysis. An SCM can be seen as a set of generative equations that describe not only the data generation process of the observational data, but also that of the interventional data.
We refer the readers to Chapter 7 of the book by Pearl (2000) for a detailed description of SCMs for causal inference.

In the context of DA, SCMs can be used to describe the data generation process of both the source and target domains (or environments). The SCMs are specified via a set of structural equations with a corresponding causal graph to describe the relationship between variables. While SCMs are very powerful tools to describe data generation processes in interventional environments, fully specifying an SCM for a DA problem has two main drawbacks in practice: first, defining the functional forms that relate variables in an SCM can be difficult for data involving many variables; second, even if the functional forms are specified, learning all the functions from data may result in a more complicated statistical learning task than the original DA problem.

Focusing on solving DA problems, the way we address the two main drawbacks differentiates our work from the full-SCM transfer learning approach by Pearl and Bareinboim (2014). To address the first drawback and to ease our theoretical analysis, we adopt a simplification that replaces all the functional models in the SCM with linear models as in previous work (Peters et al., 2016; Rothenhäusler et al., 2019, 2021). We call this simplified SCM a linear SCM. The simplification allows us to develop rigorous DA theory and to study more complicated DA problems as extensions of the linear case. Regarding the second drawback, we focus on DA methods that can be applied without relying on the full specification of the SCM. Only when we analyze the performance of these DA methods do we bring in SCMs to set forth the assumptions needed for the DA methods to have low target errors.

3.2 Domain adaptation problem setup

In this subsection, we set up the domain adaptation problem with $M$ ($M \geq 1$) labeled source environments and one unlabeled target environment.
Although SCMs are general enough to also handle classification problems, we focus our theory on regression to simplify the presentation. Later in Section 5, we show the adaptation of our results to the classification setting through numerical experiments.

For the $m$-th source environment ($m \in \{1, 2, \ldots, M\}$), we observe $n_m$ i.i.d. samples $S^{(m)} = \big((x_1^{(m)}, y_1^{(m)}), \ldots, (x_{n_m}^{(m)}, y_{n_m}^{(m)})\big)$ drawn from the source data distribution $P^{(m)}$, with $(x_k^{(m)}, y_k^{(m)}) \in \mathbb{R}^{d+1}$ for each $k \in \{1, 2, \ldots, n_m\}$. The $m$-th dataset is also called the $m$-th source environment. Furthermore, there are $\tilde{n}$ i.i.d. samples $\tilde{S} = \big((\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_{\tilde{n}}, \tilde{y}_{\tilde{n}})\big)$ from the target distribution $\tilde{P}$, but we only observe the covariates $\tilde{S}_X = (\tilde{x}_1, \ldots, \tilde{x}_{\tilde{n}})$ from $\tilde{P}_X$. Here $\tilde{P}_X$ is used to denote the marginal distribution of $\tilde{P}$ on $X$.

The goal of the DA problem is to estimate a function $f_\beta : \mathbb{R}^d \to \mathbb{R}$, mapping covariates to the label and parametrized by $\beta \in \Theta$, so that the target population risk is small. Here the parameter space $\Theta$ is a subset of a finite-dimensional space. The performance metric, the target population risk of an estimator $f$, is defined as

$$R(f) = \mathbb{E}_{(X,Y) \sim \tilde{P}} \left[ l(f(X), Y) \right], \tag{1}$$

where $l$ is a loss function; if not specified otherwise, it is the squared loss, that is, $x \mapsto x^2$ applied to the residual $f(X) - Y$. Similarly, we can define the $m$-th source population risk as

$$R^{(m)}(f) = \mathbb{E}_{(X,Y) \sim P^{(m)}} \left[ l(f(X), Y) \right]. \tag{2}$$

In addition to the risk on an absolute scale, one can quantify the target population risk achieved by an arbitrary estimator on a relative scale by comparing it with the oracle target population risk $R(f_{\beta_{\text{oracle}}})$, where $\beta_{\text{oracle}}$ is defined as

$$\beta_{\text{oracle}} \in \arg\min_{\beta \in \Theta} \mathbb{E}_{(X,Y) \sim \tilde{P}} \left[ l(f_\beta(X), Y) \right].$$

If we do not assume any relationship between the source distribution $P^{(m)}$ and the target distribution $\tilde{P}$, the target population risk of an estimator learned from the source and unlabeled target data can be arbitrarily larger than the oracle target population risk.
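As a small illustration of definitions (1) and (2), the population risks can be approximated by sample averages. The following is a minimal sketch, assuming linear predictors with an intercept and the squared loss; the helper names (`squared_loss`, `empirical_risk`, `fit_linear`) and the toy data are hypothetical and not part of the paper. Target labels are used here only to evaluate the risk and to compute the oracle fit.

```python
import numpy as np

def squared_loss(pred, y):
    """Squared loss: x -> x^2 applied to the residual pred - y."""
    return (pred - y) ** 2

def empirical_risk(beta, beta0, X, y, loss=squared_loss):
    """Sample average standing in for E[l(f_beta(X), Y)], f_beta(x) = x^T beta + beta0."""
    return float(np.mean(loss(X @ beta + beta0, y)))

def fit_linear(X, y):
    """Least-squares fit with intercept; returns (beta, beta0)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

# Hypothetical toy data: the target differs from the source by a shift in X
# and by an intercept shift in Y.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X_src = rng.standard_normal((2000, 3))
y_src = X_src @ beta_true + 0.1 * rng.standard_normal(2000)
X_tar = rng.standard_normal((2000, 3)) + 2.0
y_tar = X_tar @ beta_true + 1.0 + 0.1 * rng.standard_normal(2000)

beta_src, beta0_src = fit_linear(X_src, y_src)  # trained on source data only
beta_orc, beta0_orc = fit_linear(X_tar, y_tar)  # oracle: requires target labels

risk_src_fit = empirical_risk(beta_src, beta0_src, X_tar, y_tar)
risk_oracle = empirical_risk(beta_orc, beta0_orc, X_tar, y_tar)
print(f"target risk of source fit: {risk_src_fit:.3f}, oracle: {risk_oracle:.3f}")
```

The gap between the two printed numbers is the relative-scale comparison described above; in this toy setup it is driven by the intercept shift in the target labels, which the source-only fit cannot see.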
Assumptions on the relationship between source and target distribution are needed to make the DA problem tractable. In this work, we consider the DA setting where source and target data are both generated through similar linear SCMs with additional structural assumptions on the interventions.

Domain adaptation under linear SCM with noise interventions: For $m \in \{1, 2, \ldots, M\}$, the data distribution $P^{(m)}$ of the $m$-th source environment is specified by the following data generation equations on $(X^{(m)}, Y^{(m)})$ from $P^{(m)}$,

$$\begin{bmatrix} X^{(m)} \\ Y^{(m)} \end{bmatrix} = \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix} \begin{bmatrix} X^{(m)} \\ Y^{(m)} \end{bmatrix} + g(a^{(m)}, \varepsilon^{(m)}), \tag{3}$$

and the target data distribution $\tilde{P}$ is specified via the same equation except for the noise distribution,

$$\begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix} = \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix} \begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix} + g(\tilde{a}, \tilde{\varepsilon}). \tag{4}$$

Here $B \in \mathbb{R}^{d \times d}$ is an unknown constant matrix with zero diagonal such that $I_d - B$ is invertible, $b \in \mathbb{R}^d$ and $\omega \in \mathbb{R}^d$ are unknown constant vectors; $\varepsilon^{(m)}$ and $\tilde{\varepsilon}$ are $(d+1)$-dimensional random vectors drawn from the same noise distribution $\mathcal{E}$; $g$ is a fixed function to model the change (or intervention) across source and target environments; $a^{(m)} \in \mathbb{R}^{d+1}$ is an unknown (random or non-random) intervention that changes from one environment to another. Note that the assumption on the invertibility of $I_d - B$ ensures the uniqueness of the data generation given a draw of the noise $\varepsilon^{(m)}$. This assumption is in general weaker than requiring the corresponding causal graph to be directed acyclic (Rothenhäusler et al., 2019).

The only difference between the source and target data distributions is due to the difference between the interventions $a^{(m)}$ and $\tilde{a}$. According to the SCM, the way we specify the difference between source and target distributions is through the term $g(\tilde{a}, \tilde{\varepsilon})$. This type of intervention is often called noise intervention or soft intervention (Eberhardt and Scheines, 2007; Peters et al., 2016).
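To make the data generation in Equations (3) and (4) concrete, the sketch below draws samples by solving the linear system for $(X, Y)$ given the intervened noise. It is a minimal illustration under stated assumptions: the noise distribution is taken to be standard Gaussian, the default `g` is the additive choice $(a, \varepsilon) \mapsto a + \varepsilon$, and the function name `sample_linear_scm` as well as the example coefficients are hypothetical.

```python
import numpy as np

def sample_linear_scm(B, b, omega, a, n, rng, g=lambda a, eps: a + eps):
    """Draw n samples of (X, Y) from [X; Y] = A [X; Y] + g(a, eps),
    where A = [[B, b], [omega^T, 0]] and (I - A) is assumed invertible."""
    d = B.shape[0]
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = B       # covariate-to-covariate effects
    A[:d, d] = b        # effect of Y on the covariates (anticausal part)
    A[d, :d] = omega    # effect of the covariates on Y (causal part)
    eps = rng.standard_normal((n, d + 1))  # assumed Gaussian noise distribution
    noise = g(a, eps)                      # intervened noise, shape (n, d + 1)
    Z = np.linalg.solve(np.eye(d + 1) - A, noise.T).T
    return Z[:, :d], Z[:, d]

# Hypothetical mixed example with d = 2: X1 -> Y -> X2.
B = np.zeros((2, 2))
b = np.array([0.0, 1.0])       # X2 is a child of Y
omega = np.array([1.0, 0.0])   # X1 is a parent of Y

rng = np.random.default_rng(0)
a_src = np.zeros(3)                  # no intervention in the source
a_tar = np.array([1.0, 0.0, 0.0])    # shift on the noise of X1 in the target
X_src, y_src = sample_linear_scm(B, b, omega, a_src, 20000, rng)
X_tar, y_tar = sample_linear_scm(B, b, omega, a_tar, 20000, rng)
print(X_tar.mean(axis=0), y_tar.mean())
```

In this example the shift on the noise of X1 propagates through Y to X2, so all three target means move by roughly one unit. Passing `g=lambda a, eps: a * eps` instead would rescale the noise element-wise rather than shift it.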
As a concrete example, if the mean shift noise intervention is considered, then we specify the function $g : \mathbb{R}^{d+1} \times \mathbb{R}^{d+1} \to \mathbb{R}^{d+1}$ as $(a, \varepsilon) \mapsto a + \varepsilon$, and define the intervention term with a deterministic vector in $\mathbb{R}^{d+1}$. As another example, if the variance shift noise intervention is considered, then we specify $g$ as $(a, \varepsilon) \mapsto a \odot \varepsilon$, where $\odot$ is the element-wise product. We focus our theoretical results on the mean shift noise intervention. Other types of noise interventions are discussed in numerical experiments.

The linear SCM with noise interventions clearly does not cover all kinds of perturbations to the data. For example, our theory does not apply when the matrix $B$ is no longer invariant across source and target environments. We show via numerical experiments that assuming that the perturbations are due to noise interventions is plausible in many settings.

3.3 Oracle and baseline DA methods

As we aim to compare existing and new DA methods rigorously under our theoretical framework, it is useful to start with several estimators to establish the basis of comparison. First, we introduce two oracle estimators. They are defined using unobserved information such as target labels or SCM parameters.

OLSTar: the population ordinary least squares (OLS) estimator on the target data.

$$f_{\text{OLSTar}}(x) := x^\top \beta_{\text{OLSTar}} + \beta_{\text{OLSTar},0}, \qquad \beta_{\text{OLSTar}}, \beta_{\text{OLSTar},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim \tilde{P}} \left( Y - X^\top \beta - \beta_0 \right)^2. \tag{5}$$

This is the oracle target population estimator when we restrict the function class to be linear. Hence the target risk of OLSTar defines the lowest target risk that any linear DA estimator can achieve.

Causal: the population causal estimator via the linear SCM.

$$f_{\text{Causal}}(x) := x^\top \beta_{\text{Causal}}, \qquad \beta_{\text{Causal}} := \omega, \tag{6}$$

where $\omega$ appears in the last row of the SCM matrix in Equation (3). Note that this formulation of the causal estimator assumes that there is no intervention on $Y$ and the intercept is also zero.
The Causal estimator is closely related to distributional robust estimators. That is, the Causal estimator is the robust estimator which achieves the minimum worst-case risk when the perturbations on the covariates are allowed to be arbitrary (Bühlmann, 2020). However, in our DA setting where observed target covariates provide additional information, it is no longer clear whether Causal still achieves the lowest target risk.

Second, we introduce two population estimators that only use the source data.

OLSSrc(m): the population OLS estimator on the single $m$-th source environment.

$$f^{(m)}_{\text{OLSSrc}}(x) := x^\top \beta^{(m)}_{\text{OLSSrc}} + \beta^{(m)}_{\text{OLSSrc},0}, \qquad \beta^{(m)}_{\text{OLSSrc}}, \beta^{(m)}_{\text{OLSSrc},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim P^{(m)}} \left( Y - X^\top \beta - \beta_0 \right)^2. \tag{7}$$

SrcPool: the population OLS estimator obtained by pooling all source data together.

$$f_{\text{SrcPool}}(x) := x^\top \beta_{\text{SrcPool}} + \beta_{\text{SrcPool},0}, \qquad \beta_{\text{SrcPool}}, \beta_{\text{SrcPool},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim P^{\text{allsrc}}} \left( Y - X^\top \beta - \beta_0 \right)^2, \tag{8}$$

where $P^{\text{allsrc}}$ is the uniform mixture over the $M$ source distributions $\{P^{(m)}\}_{m \in \{1, \ldots, M\}}$. We omit the SrcPool formulation with weighted mixtures because it is not the main focus of our study.

These two estimators are natural estimators in the classical statistical learning setting where the source and target data share the same distribution. A DA method that has larger target risk than SrcPool is clearly unfavorable. One goal throughout our paper is to understand under which conditions DA methods are guaranteed to outperform OLSSrc(m) and SrcPool.

3.4 Advanced DA methods

In this subsection, we introduce three advanced DA methods. The first two have been introduced previously and the third one is our new DA method.

First, we consider the DA method called domain invariant projection (DIP). DIP is a subspace-based DA method which aims at learning a common intermediate subspace that relates the source and target domains. The specific form of DIP we study follows from Baktashmotlagh et al. (2013).
Specifically, the population DIP estimator involves the following optimization problem

$$f_{\text{DIP}}(x) := u_{\text{DIP}} \circ v_{\text{DIP}}(x), \qquad u_{\text{DIP}}, v_{\text{DIP}} := \arg\min_{u \in U, v \in V} \mathbb{E}_{(X,Y) \sim P^{(1)},\, \tilde{X} \sim \tilde{P}_X}\, l(u \circ v(X), Y) + \lambda\, D(v(X), v(\tilde{X})), \tag{9}$$

where $D(\cdot, \cdot)$ measures the distance between two distributions, $U$ and $V$ are function classes that are specified through expert knowledge of the problem and $\lambda$ is a positive regularization parameter. DIP, in its simple form, only uses a single source environment. Hence, without loss of generality, we use the first source environment.

In Baktashmotlagh et al. (2013), maximum mean discrepancy (MMD) is used as the distributional distance measure and both $U$ and $V$ are set to be linear mappings. This DIP idea of matching the mappings of source and target data was later extended in many DA papers with various choices of distance and function classes (see e.g. Ghifary et al., 2016; Li et al., 2018). A noteworthy line of follow-up work consists of replacing the distributional distance and function classes in DIP with neural networks. For example, Ganin et al. (2016) introduced the domain-adversarial neural network (DANN), which uses generative adversarial nets (GAN) in place of MMD to measure distributional distance and takes both $U$ and $V$ to be neural networks.

Analyzing the most generic form of DIP is out of the scope of this study. Instead, we start with a simple DIP formulation.

DIP(m)-mean: the population DIP estimator where the mean squared difference is used as distributional distance, $V$ is linear, $U$ is the singleton of the identity mapping, and $\lambda$ is chosen to be $\infty$. DIP, in its simple form, only uses the data from one source environment and the target covariates. This form of DIP is defined as

$$f^{(m)}_{\text{DIP}}(x) := x^\top \beta^{(m)}_{\text{DIP}} + \beta^{(m)}_{\text{DIP},0}, \qquad \beta^{(m)}_{\text{DIP}}, \beta^{(m)}_{\text{DIP},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim P^{(m)}} \left( Y - X^\top \beta - \beta_0 \right)^2 \quad \text{s.t.} \quad \mathbb{E}_{X \sim P^{(m)}_X} \left[ X^\top \beta \right] = \mathbb{E}_{\tilde{X} \sim \tilde{P}_X} \left[ \tilde{X}^\top \beta \right]. \tag{10}$$

For simplicity, we use the shorthand notation DIP(m) to refer to DIP(m)-mean.
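A finite-sample analogue of Equation (10) is an equality-constrained least-squares problem: the constraint says that $\beta$ is orthogonal to the difference between the source and target covariate means. The sketch below is one possible way to solve it, by restricting $\beta$ to the orthogonal complement of that difference; it is not the authors' implementation, and the function names (`dip_mean`, `ols`, `target_risk`) and the anticausal toy data are hypothetical.

```python
import numpy as np

def dip_mean(X_src, y_src, X_tar):
    """Least squares on source data subject to mean(X_src) @ beta == mean(X_tar) @ beta."""
    d = X_src.shape[1]
    delta = X_src.mean(axis=0) - X_tar.mean(axis=0)
    if np.linalg.norm(delta) < 1e-12:
        N = np.eye(d)  # no mean difference: the constraint is vacuous
    else:
        # Columns 1..d-1 of the complete QR factor span the complement of delta.
        Q, _ = np.linalg.qr(delta.reshape(-1, 1), mode="complete")
        N = Q[:, 1:]
    Xc = X_src - X_src.mean(axis=0)   # centering absorbs the free intercept
    yc = y_src - y_src.mean()
    gamma, *_ = np.linalg.lstsq(Xc @ N, yc, rcond=None)
    beta = N @ gamma
    beta0 = y_src.mean() - X_src.mean(axis=0) @ beta
    return beta, beta0

def ols(X, y):
    """Unconstrained least squares with intercept (source-only baseline)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def target_risk(beta, beta0, X, y):
    return float(np.mean((X @ beta + beta0 - y) ** 2))

# Hypothetical anticausal toy data: both covariates are noisy copies of Y,
# and only the second covariate is mean-shifted in the target.
rng = np.random.default_rng(0)

def draw(n, shift):
    y = rng.standard_normal(n)
    X = np.column_stack([y + rng.standard_normal(n),
                         y + rng.standard_normal(n) + shift])
    return X, y

X_src, y_src = draw(5000, 0.0)
X_tar, y_tar = draw(5000, 3.0)   # y_tar is used for evaluation only

beta_dip, beta0_dip = dip_mean(X_src, y_src, X_tar)
beta_src, beta0_src = ols(X_src, y_src)
risk_dip = target_risk(beta_dip, beta0_dip, X_tar, y_tar)
risk_src = target_risk(beta_src, beta0_src, X_tar, y_tar)
print(f"target risk, DIP-mean: {risk_dip:.3f}; source OLS: {risk_src:.3f}")
```

In this toy setting the matching constraint pushes the weight on the shifted covariate towards zero, which is the anticausal, no-label-shift regime where the text states that DIP is expected to help. The same code gives no such benefit by construction once the label distribution is also perturbed.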
The constraint in Equation (10) is called the DIP matching penalty.

Second, we introduce the conditional invariant penalty (CIP) estimator. Unlike DIP, which projects the source and target covariates to the same subspace, CIP directly uses the label information in multiple source environments to look for the conditionally invariant components. CIP only makes use of source data.

CIP-mean: the population conditional invariance penalty (CIP) estimator where the conditional mean is matched across multiple source environments.

$$f_{\mathrm{CIP}}(x) := x^\top \beta_{\mathrm{CIP}} + \beta_{\mathrm{CIP},0}, \qquad \beta_{\mathrm{CIP}}, \beta_{\mathrm{CIP},0} := \arg\min_{\beta, \beta_0} \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{(X,Y) \sim P^{(m)}} \left(Y - X^\top \beta - \beta_0\right)^2$$
$$\text{s.t.} \quad \mathbb{E}_{(X,Y) \sim P^{(m)}}\left[X^\top \beta \mid Y\right] = \mathbb{E}_{(X,Y) \sim P^{(1)}}\left[X^\top \beta \mid Y\right] \text{ a.s.}, \quad \forall m \in \{2, \ldots, M\}, \tag{11}$$

where the equality between the conditional expectations is in the sense of almost sure equality of random variables. Note that CIP naturally requires $M \geq 2$, because when $M = 1$ the CIP constraint becomes vacuous. For simplicity, we use the shorthand notation CIP to refer to CIP-mean.

CIP puts more regression weight on the conditionally invariant components from multiple source datasets via the conditional invariance constraint in Equation (11). The idea of a conditional invariance penalty in the context of anticausal learning has appeared in multiple papers with slightly different settings. Gong et al. (2016) introduced conditional transferable components, which are in fact conditionally invariant features. However, unlike the formulation above, Gong et al. (2016) propose to learn the conditional transferable components with only one source environment, which requires assumptions that are difficult to check. On the other hand, the algorithm from Heinze-Deml and Meinshausen (2021) learns the conditionally invariant features from a single source dataset if multiple observations of data points that share the same identifier are present. They use a conditional variance penalty to make their algorithm learn the conditionally invariant features in their specific datasets with identifiers.
The same idea can be applied if we replace identifiers with source environments.

Third, we introduce our new DA estimator, conditional invariant residual matching (CIRM).

CIRM(m)-mean: the population conditional invariant residual matching estimator that uses all source environments to compute CIP and the m-th source environment to perform risk minimization.

$$f^{(m)}_{\mathrm{CIRM}}(x) := x^\top \beta^{(m)}_{\mathrm{CIRM}} + \beta^{(m)}_{\mathrm{CIRM},0}, \qquad \beta^{(m)}_{\mathrm{CIRM}}, \beta^{(m)}_{\mathrm{CIRM},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim P^{(m)}} \left(Y - X^\top \beta - \beta_0\right)^2$$
$$\text{s.t.} \quad \mathbb{E}_{X \sim P^{(m)}_X}\left[\beta^\top \left(X - (X^\top \beta_{\mathrm{CIP}})\, \vartheta_{\mathrm{CIRM}}\right)\right] = \mathbb{E}_{\tilde{X} \sim \tilde{P}_X}\left[\beta^\top \left(\tilde{X} - (\tilde{X}^\top \beta_{\mathrm{CIP}})\, \vartheta_{\mathrm{CIRM}}\right)\right],$$
$$\vartheta_{\mathrm{CIRM}} := \frac{\mathbb{E}_{(X,Y) \sim P_{\mathrm{allsrc}}}\left[X \left(Y - \mathbb{E}[Y]\right)\right]}{\mathbb{E}_{(X,Y) \sim P_{\mathrm{allsrc}}}\left[\left(X^\top \beta_{\mathrm{CIP}} - \mathbb{E}[X^\top \beta_{\mathrm{CIP}}]\right)\left(Y - \mathbb{E}[Y]\right)\right]}, \tag{13}$$

with $P_{\mathrm{allsrc}}$ denoting the uniform mixture of all source distributions. For simplicity, we use the shorthand notation CIRM(m) to refer to CIRM(m)-mean.

At first glance, the CIRM estimator is a combination of DIP and CIP. CIRM first uses CIP to compute a linear combination of conditionally invariant components to serve as a proxy of the label $Y$. To tackle the label distribution perturbation, CIRM then performs the DIP-type matching on the residual obtained by regressing the covariates $X$ on this proxy. We can show that the joint distribution of the residual together with the covariates does not suffer from the label distribution perturbation. CIRM is designed to be applied in DA scenarios with label distribution perturbation. Note that the idea of using domain-invariant features to tackle label distribution shift was proposed previously in Li et al. (2019) and in Tachet des Combes et al. (2020), but it was not clear how to formulate the exact algorithm so that its success and failure conditions can be analyzed. The intuition behind the CIRM construction becomes clearer after we state Theorem 5.

The comparison of DA methods in the following sections is centered around DIP, CIP, CIRM and their variants. To keep track of these variants, we provide a summary of all DA methods that appear in this paper in Table 5 of Appendix A.1.
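To make the mean-matching constraint of Equation (10) concrete, the following is a minimal finite-sample sketch of DIP-mean in Python with NumPy. It is an illustration under stated assumptions, not the implementation used for the experiments in this paper: population expectations are replaced by sample means, the constraint is enforced exactly by restricting $\beta$ to the subspace orthogonal to the estimated mean shift, and the function names `dip_mean` and `simulate_anticausal`, together with the toy anticausal data model, are hypothetical.

```python
import numpy as np

def dip_mean(X_src, y_src, X_tar):
    """Least squares on source data subject to the mean-matching constraint
    mean(X_src) @ beta == mean(X_tar) @ beta (sample analogue of Eq. (10))."""
    mu_src = X_src.mean(axis=0)
    delta = mu_src - X_tar.mean(axis=0)          # estimated mean shift
    d = X_src.shape[1]
    norm = np.linalg.norm(delta)
    if norm < 1e-12:
        Q = np.eye(d)                            # no shift: the constraint is vacuous
    else:
        # Full QR of the unit shift direction: columns 2..d of q form an
        # orthonormal basis of the subspace orthogonal to delta.
        q, _ = np.linalg.qr((delta / norm).reshape(-1, 1), mode="complete")
        Q = q[:, 1:]
    # Centering absorbs the intercept; regress on the constrained coordinates.
    Xc = (X_src - mu_src) @ Q
    yc = y_src - y_src.mean()
    gamma, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    beta = Q @ gamma
    beta0 = y_src.mean() - mu_src @ beta
    return beta, beta0

def simulate_anticausal(n, shift, rng):
    """Hypothetical toy data: X1 and X2 are children of Y, X3 is pure noise,
    and every covariate receives the same mean shift."""
    y = rng.normal(0.0, np.sqrt(0.2), size=n)
    noise = rng.normal(0.0, np.sqrt(0.1), size=(n, 3))
    X = np.outer(y, [1.0, 1.0, 0.0]) + noise + shift
    return X, y

rng = np.random.default_rng(0)
X_src, y_src = simulate_anticausal(200_000, +1.0, rng)
X_tar, y_tar = simulate_anticausal(200_000, -1.0, rng)  # y_tar is used for evaluation only
beta, beta0 = dip_mean(X_src, y_src, X_tar)
target_mse = np.mean((y_tar - X_tar @ beta - beta0) ** 2)
print(beta, beta0, target_mse)
```

For this toy model the population solution of Equation (10) is $\beta = (2/7, 2/7, -4/7)^\top$ with zero intercept, so the printed estimate should be close to these values. Parametrizing $\beta$ through an orthonormal basis of the constraint's null space is one convenient design choice; a Lagrangian or penalized formulation with a large $\lambda$ would be an alternative.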
3.5 Simple motivating examples

Before we dive into theoretical comparisons of the DA methods, we go through three simple examples to illustrate the assumptions needed for DA methods to have low target risks. The simple examples have data generated via low-dimensional SCMs so that the DA estimators can be easily computed and understood.

3.5.1 Example 1: causal prediction

Example 1 has one source environment and one target environment. The data in the source and target environments are generated independently according to the following SCMs, with the causal diagram on the left and the structural equations on the right of Figure 2. Here the noise variables follow independent Gaussian distributions with mean zero and variance 0.1 for $X$ and variance 0.2 for $Y$, namely, $\varepsilon^{(1)}_{X_1}, \varepsilon^{(1)}_{X_2}, \varepsilon^{(1)}_{X_3}, \tilde{\varepsilon}_{X_1}, \tilde{\varepsilon}_{X_2}, \tilde{\varepsilon}_{X_3} \sim \mathcal{N}(0, 0.1)$ and $\varepsilon^{(1)}_Y, \tilde{\varepsilon}_Y \sim \mathcal{N}(0, 0.2)$.

$$\begin{aligned} X^{(1)}_1 &= \varepsilon^{(1)}_{X_1} + 1, & \tilde{X}_1 &= \tilde{\varepsilon}_{X_1} - 1, \\ X^{(1)}_2 &= \varepsilon^{(1)}_{X_2} + 1, & \tilde{X}_2 &= \tilde{\varepsilon}_{X_2} - 1, \\ X^{(1)}_3 &= \varepsilon^{(1)}_{X_3} + 1, & \tilde{X}_3 &= \tilde{\varepsilon}_{X_3} - 1, \\ Y^{(1)} &= X^{(1)}_1 + X^{(1)}_2 + \varepsilon^{(1)}_Y, & \tilde{Y} &= \tilde{X}_1 + \tilde{X}_2 + \tilde{\varepsilon}_Y. \end{aligned}$$

Figure 2: The causal diagram and structural equations for the source and target environments in Example 1: causal prediction.

The type of intervention is mean shift noise intervention. That is, the function $g$ in Equation (3) is taken to be $g : (a^{(1)}, \varepsilon^{(1)}) \mapsto a^{(1)} + \varepsilon^{(1)}$. The intervention is $a^{(1)} = (1, 1, 1, 0)^\top$ for the source environment and $\tilde{a} = (-1, -1, -1, 0)^\top$ for the target environment. This example is called causal prediction, because the covariates $X$ are parents of the label $Y$. In other words, we are predicting the effect from the causes, as illustrated in the causal diagram in Figure 2.

Given the source and target distributions, the population DA estimators in the previous subsection can be computed explicitly.
We obtain

$$\begin{bmatrix} \beta_{\mathrm{OLSTar}} \\ \beta_{\mathrm{OLSTar},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} \beta_{\mathrm{Causal}} \\ \beta_{\mathrm{Causal},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{OLSSrc}} \\ \beta^{(1)}_{\mathrm{OLSSrc},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{DIP}} \\ \beta^{(1)}_{\mathrm{DIP},0} \end{bmatrix} = \begin{bmatrix} 1/3 \\ 1/3 \\ -2/3 \\ 2 \end{bmatrix}.$$

Note that OLSSrc(1), Causal and OLSTar share the same estimate $\beta$, while DIP(1) does not. The corresponding population source and target risks are summarized in the first two rows of Table 1. DIP(1) has a larger target population risk than OLSSrc(1). In this example of causal prediction, using the additional target covariate information via DIP makes the target prediction performance worse. This is because the DIP matching penalty in Equation (10), which is not satisfied by the oracle estimator OLSTar, makes DIP too restrictive.

| Risk / Method | OLSTar (oracle) | Causal | OLSSrc(1) | DIP(1) | DIPAbs(1) |
|---|---|---|---|---|---|
| Ex 1, source risk $R^{(1)}$ | 0.200 | 0.200 | 0.200 | 0.333 | - |
| Ex 1, target risk $\tilde{R}$ | **0.200** | 0.200 | 0.200 | 16.333 | - |
| Ex 2, source risk $R^{(1)}$ | 2.600 | 0.200 | 0.040 | 0.086 | 0.044 |
| Ex 2, target risk $\tilde{R}$ | **0.040** | 0.200 | 2.600 | 0.086 | 0.667 |
| Ex 3, source risk $R^{(1)}$ | 0.200 | 0.200 | 0.040 | 0.066 | - |
| Ex 3, target risk $\tilde{R}$ | **0.040** | 1.200 | 0.200 | 4.066 | - |

Table 1: Population source and target risks in the three motivating examples (the lower the better). The oracle target risk is highlighted in bold. In Example 1 of causal prediction, DIP(1) performs worse than Causal and OLSSrc(1) in terms of target population risk. In Example 2 of anticausal prediction, DIP(1) performs better than Causal and OLSSrc(1), but it is still not as good as the oracle estimator. DIPAbs(1), although better than Causal in terms of population source risk, has a worse target population risk than Causal. In Example 3 of anticausal prediction with intervention on $Y$, DIP(1) performs worse than OLSSrc(1) in terms of target population risk.

3.5.2 Example 2: anticausal prediction

Example 2 has one source environment and one target environment. The data in the source and target environments are generated independently according to the following SCMs, with the causal diagram on the left and the structural equations on the right of Figure 3. Here the noise variables follow independent Gaussian distributions with mean zero and variance 0.1 for $X$ and variance 0.2 for $Y$, namely, $\varepsilon^{(1)}_{X_1}, \tilde{\varepsilon}_{X_1}, \varepsilon^{(1)}_{X_2}, \tilde{\varepsilon}_{X_2}, \varepsilon^{(1)}_{X_3}, \tilde{\varepsilon}_{X_3} \sim \mathcal{N}(0, 0.1)$ and $\varepsilon^{(1)}_Y, \tilde{\varepsilon}_Y \sim \mathcal{N}(0, 0.2)$. The type of intervention $g$ is mean shift noise intervention, which is the same as in Example 1. The intervention is $a^{(1)} = (1, 1, 1, 0)^\top$ for the source environment and $\tilde{a} = (-1, -1, -1, 0)^\top$ for the target environment.

$$\begin{aligned} X^{(1)}_1 &= Y^{(1)} + \varepsilon^{(1)}_{X_1} + 1, & \tilde{X}_1 &= \tilde{Y} + \tilde{\varepsilon}_{X_1} - 1, \\ X^{(1)}_2 &= Y^{(1)} + \varepsilon^{(1)}_{X_2} + 1, & \tilde{X}_2 &= \tilde{Y} + \tilde{\varepsilon}_{X_2} - 1, \\ X^{(1)}_3 &= \varepsilon^{(1)}_{X_3} + 1, & \tilde{X}_3 &= \tilde{\varepsilon}_{X_3} - 1, \\ Y^{(1)} &= \varepsilon^{(1)}_Y, & \tilde{Y} &= \tilde{\varepsilon}_Y. \end{aligned}$$

Figure 3: The causal diagram and structural equations for the source and target environments in Example 2: anticausal prediction.

Compared to Example 1, the main difference is that the causal direction between the covariates $X$ and the label $Y$ has changed. This example is called anticausal prediction, because the covariates $X$ are descendants of the label $Y$. We are predicting the cause from the effects, as illustrated in the causal diagram in Figure 3.

In addition to the DA estimators used in the previous example, we introduce one more variant of DIP, DIPAbs. It is a made-up estimator to show that an arbitrary choice of the function classes $\mathcal{U}, \mathcal{V}$ in the DIP formulation (9), without consideration of the data generation process, is in general a bad idea.

DIPAbs(m)-mean: the population DIP estimator where mean squared difference is used as distributional distance, $\mathcal{V}$ is the element-wise absolute value followed by a linear mapping, $\mathcal{U}$ is the singleton of the identity mapping and the regularization parameter $\lambda$ is chosen to be $\infty$. For the m-th source environment, it is defined as

$$f^{(m)}_{\mathrm{DIPAbs}}(x) := |x|^\top \beta^{(m)}_{\mathrm{DIPAbs}} + \beta^{(m)}_{\mathrm{DIPAbs},0}, \qquad \beta^{(m)}_{\mathrm{DIPAbs}}, \beta^{(m)}_{\mathrm{DIPAbs},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim P^{(m)}} \left(Y - |X|^\top \beta - \beta_0\right)^2 \quad \text{s.t.} \quad \mathbb{E}_{X \sim P^{(m)}_X}\left[|X|^\top \beta\right] = \mathbb{E}_{\tilde{X} \sim \tilde{P}_X}\left[|\tilde{X}|^\top \beta\right]. \tag{14}$$

The population estimators can be computed explicitly:

$$\begin{bmatrix} \beta_{\mathrm{OLSTar}} \\ \beta_{\mathrm{OLSTar},0} \end{bmatrix} = \begin{bmatrix} 2/5 \\ 2/5 \\ 0 \\ 4/5 \end{bmatrix}, \quad \begin{bmatrix} \beta_{\mathrm{Causal}} \\ \beta_{\mathrm{Causal},0} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{OLSSrc}} \\ \beta^{(1)}_{\mathrm{OLSSrc},0} \end{bmatrix} = \begin{bmatrix} 2/5 \\ 2/5 \\ 0 \\ -4/5 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{DIP}} \\ \beta^{(1)}_{\mathrm{DIP},0} \end{bmatrix} = \begin{bmatrix} 2/7 \\ 2/7 \\ -4/7 \\ 0 \end{bmatrix},$$

together with the DIPAbs(1) estimate defined by Equation (14). For the other four estimators, none of the estimated coefficient vectors $(\beta, \beta_0)$ fully agrees with that of the oracle OLSTar. The corresponding population source and target risks are summarized in the third and fourth rows of Table 1. DIP(1) improves upon Causal and OLSSrc(1) on the target population risk. However, its target population risk is not as small as that of the oracle estimator OLSTar. DIPAbs(1) also uses the target covariate information as DIP(1) does, but DIPAbs(1) performs worse than Causal or DIP(1) in terms of target population risk.

3.5.3 Example 3: anticausal prediction when Y is intervened on

Example 3 is also an anticausal prediction problem similar to Example 2, except that the label $Y$ is also intervened on. The data in the source and target environments are generated independently according to the following SCMs, with the causal diagram on the left and the structural equations on the right of Figure 4. Here the noise variables follow independent Gaussian distributions with $\varepsilon^{(1)}_{X_1}, \tilde{\varepsilon}_{X_1} \sim \mathcal{N}(0, 0.1)$, $\varepsilon^{(1)}_{X_2}, \tilde{\varepsilon}_{X_2} \sim \mathcal{N}(0, 0.1)$ and $\varepsilon^{(1)}_Y, \tilde{\varepsilon}_Y \sim \mathcal{N}(0, 0.2)$. The type of intervention is still mean shift noise intervention. The intervention is $a^{(1)} = (1, -1, 1)^\top$ for the source environment and $\tilde{a} = (-1, 1, -1)^\top$ for the target environment. Compared to Example 2, the main difference is that the intervention on $Y$ is nonzero, as illustrated in the causal diagram in Figure 4.

$$\begin{aligned} X^{(1)}_1 &= Y^{(1)} + \varepsilon^{(1)}_{X_1} + 1, & \tilde{X}_1 &= \tilde{Y} + \tilde{\varepsilon}_{X_1} - 1, \\ X^{(1)}_2 &= Y^{(1)} + \varepsilon^{(1)}_{X_2} - 1, & \tilde{X}_2 &= \tilde{Y} + \tilde{\varepsilon}_{X_2} + 1, \\ Y^{(1)} &= \varepsilon^{(1)}_Y + 1, & \tilde{Y} &= \tilde{\varepsilon}_Y - 1. \end{aligned}$$

Figure 4: The causal diagram and structural equations for the source and target environments in Example 3: anticausal prediction when Y is intervened on.

The population estimators can be computed explicitly:

$$\begin{bmatrix} \beta_{\mathrm{OLSTar}} \\ \beta_{\mathrm{OLSTar},0} \end{bmatrix} = \begin{bmatrix} 2/5 \\ 2/5 \\ -1/5 \end{bmatrix}, \quad \begin{bmatrix} \beta_{\mathrm{Causal}} \\ \beta_{\mathrm{Causal},0} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{OLSSrc}} \\ \beta^{(1)}_{\mathrm{OLSSrc},0} \end{bmatrix} = \begin{bmatrix} 2/5 \\ 2/5 \\ 1/5 \end{bmatrix}, \quad \begin{bmatrix} \beta^{(1)}_{\mathrm{DIP}} \\ \beta^{(1)}_{\mathrm{DIP},0} \end{bmatrix} = \begin{bmatrix} 0 \\ 2/3 \\ 1 \end{bmatrix}.$$

Note that DIP(1) has zero weight on the first coordinate but non-zero weight on the second because $X^{(1)}_2$ and $\tilde{X}_2$ share the same distribution. The corresponding population source and target risks are summarized in the last two rows of Table 1. In Example 3, when $Y$ is intervened on, DIP(1) is again worse than OLSSrc(1) on the target population risk.

3.5.4 Lessons from the simple motivating examples

The three simple motivating examples reveal three observations. First, even though DIP has a low target risk in the anticausal prediction setting (Example 2), DIP is not likely to outperform OLSSrc in the causal prediction setting (Example 1). The fact that the additional target covariate information is not useful in causal prediction problems was previously pointed out by Schölkopf et al. (2012). Consequently, DIP-type methods (Baktashmotlagh et al., 2013; Ganin et al., 2016), which make use of the additional target covariate information, can make target performance worse in causal prediction problems.

Second, not all DIP variants have low target risks in the anticausal prediction setting. DIPAbs(1), despite using the target covariate information as DIP(1) does, performs worse than Causal and DIP(1). In general, it is dangerous to treat DA methods as generic solutions that always work without consideration of the data generation process. We remark that DIPAbs(1) is a made-up method.
But one could imagine a case where the first part $\mathcal{V}$ in DIP(1) is set to be a neural network that can approximate a large class of functions. It is then no longer clear that the DIP matching penalty, which matches the source and target covariate distributions, always helps to improve the target population risk. This danger of blindly learning invariant representations was pointed out previously via a simple nonlinear data generation model by Zhao et al. (2019).

Third, when $Y$ is intervened on, DIP can perform worse than OLSSrc. Actually, none of the DA methods in Table 1 can handle the intervention on $Y$ well. The weakness of DIP in the presence of label perturbation has been observed previously in Zhao et al. (2019); Li et al. (2019); Tachet des Combes et al. (2020). In particular, Tachet des Combes et al. (2020) proposed the idea of using conditional invariant components for label shift correction. However, since their proposed method has no guarantees for estimating the conditional invariant components, it is not clear whether their methods have any guarantees for correcting the label shift.

The three examples are designed to demonstrate that the popular DA method DIP does not always outperform baseline methods such as OLSSrc or Causal. They also show that the assumption of both source and target data being generated from linear SCMs is not sufficient to guarantee that DIP has a low target risk. Additional assumptions on the data generation or on the intervention are needed. For example, knowing the causal direction of the data generation model and knowing whether the label distribution is perturbed are both crucial pieces of information. Based on these examples, we ask the following questions:

1. In addition to the linear SCM assumption, what other assumptions are needed for DIP to perform better than Causal or OLSSrc? If such assumptions exist, can one quantify the gap between the target risk achieved by DIP and the oracle target risk?

2. If there is label distribution perturbation like in Example 3, are there DA methods that can outperform OLSSrc and have target risk guarantees?

3. If the prediction direction is a mix of causal and anticausal, is domain adaptation still beneficial?

In the sequel, we address these questions one by one. The first question is addressed in Section 4.1. The second question is dealt with in Section 4.2 via the introduction of CIP and CIRM. The general solution to the third question remains open. Naive applications of DIP and CIRM are not optimal. We provide a partial answer to the third question in Section 4.3 and show that the mixed causal-anticausal DA problem can be reduced to the anticausal DA problem when the causal variables have already been identified.

4. Domain adaptation with theoretical guarantees

In this section, to answer the three questions in the last section with rigor, we establish target risk guarantees for the DA estimators DIP, CIP and CIRM in three settings. Subsection 4.1 focuses on the DIP performance in the anticausal DA setting without intervention on $Y$. Subsection 4.2 demonstrates the difficulty of DIP in the anticausal DA setting with interventions on $Y$ and then proves the advantage of CIP and CIRM over DIP when multiple source environments are available. Subsection 4.3 shows how domain adaptation is still possible in the mixed causal-anticausal DA setting.

4.1 Anticausal domain adaptation without intervention on Y

In this subsection, we study anticausal domain adaptation with the additional assumption that the intervention on the covariates $X$ is a mean shift noise intervention and there is no intervention on the label $Y$. In anticausal domain adaptation, all the covariates $X$ are descendants of the label $Y$ in the SCM. First, we derive the target risk bound for DIP under these assumptions with a single source environment. Then we show how to make use of more source environments to improve the performance of DIP.
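Finally, we discuss ways to relax the mean shift noise intervention assumption.

4.1.1 Target risk guarantees of DIP with a single source environment

Before we state the main theorem, we introduce another oracle estimator to simplify the theorem statement. DIPOracle minimizes the mean squared error on target data while it uses the same matching penalty as DIP.

DIPOracle(m)-mean: the population DIP estimator which uses target labels and the covariate distributions from the m-th source environment and the target environment.

$$f^{(m)}_{\mathrm{DIPOracle}}(x) := x^\top \beta^{(m)}_{\mathrm{DIPOracle}} + \beta^{(m)}_{\mathrm{DIPOracle},0}, \qquad \beta^{(m)}_{\mathrm{DIPOracle}}, \beta^{(m)}_{\mathrm{DIPOracle},0} := \arg\min_{\beta, \beta_0} \mathbb{E}_{(X,Y) \sim \tilde{P}} \left(Y - X^\top \beta - \beta_0\right)^2 \quad \text{s.t.} \quad \mathbb{E}_{X \sim P^{(m)}_X}\left[X^\top \beta\right] = \mathbb{E}_{\tilde{X} \sim \tilde{P}_X}\left[\tilde{X}^\top \beta\right]. \tag{15}$$

Next, we state the data generating assumptions needed for DIP to have target risk guarantees. Since a single source environment is considered in this subsection, without loss of generality, we assume that the first source environment is used.

Assumption 1 Each data point in the source environment is generated i.i.d. according to the distribution $P^{(1)}$ specified by the following SCM

$$\begin{bmatrix} X^{(1)} \\ Y^{(1)} \end{bmatrix} = \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix} \begin{bmatrix} X^{(1)} \\ Y^{(1)} \end{bmatrix} + \begin{bmatrix} a^{(1)}_X \\ a^{(1)}_Y \end{bmatrix} + \begin{bmatrix} \varepsilon^{(1)}_X \\ \varepsilon^{(1)}_Y \end{bmatrix},$$

and each data point in the target environment is generated i.i.d. according to the distribution $\tilde{P}$ specified by the same SCM except for the mean shift intervention term

$$\begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix} = \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix} \begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix} + \begin{bmatrix} \tilde{a}_X \\ \tilde{a}_Y \end{bmatrix} + \begin{bmatrix} \tilde{\varepsilon}_X \\ \tilde{\varepsilon}_Y \end{bmatrix},$$

where $I_d - B \in \mathbb{R}^{d \times d}$ is invertible, the prediction problem is anticausal, i.e. $\omega = 0$, $b \in \mathbb{R}^d$, the intervention terms $\begin{bmatrix} a^{(1)}_X \\ a^{(1)}_Y \end{bmatrix}$ and $\begin{bmatrix} \tilde{a}_X \\ \tilde{a}_Y \end{bmatrix}$ are vectors in $\mathbb{R}^{d+1}$, and the noise terms $\begin{bmatrix} \varepsilon^{(1)}_X \\ \varepsilon^{(1)}_Y \end{bmatrix}$ and $\begin{bmatrix} \tilde{\varepsilon}_X \\ \tilde{\varepsilon}_Y \end{bmatrix}$ share the same distribution with $\mathbb{E}[\varepsilon^{(1)}_X] = \mathbb{E}[\tilde{\varepsilon}_X] = 0$, $\mathbb{E}[\varepsilon^{(1)}_X \varepsilon^{(1)\top}_X] = \mathbb{E}[\tilde{\varepsilon}_X \tilde{\varepsilon}_X^\top] = \Sigma \in \mathbb{R}^{d \times d}$, $\mathbb{E}[\varepsilon^{(1)}_Y] = \mathbb{E}[\tilde{\varepsilon}_Y] = 0$ and $\mathbb{E}[(\varepsilon^{(1)}_Y)^2] = \mathbb{E}[\tilde{\varepsilon}_Y^2] = \sigma^2 \in \mathbb{R}$. Additionally, the noise terms on $X$ and $Y$ are uncorrelated: $\mathbb{E}[\varepsilon^{(1)}_Y \varepsilon^{(1)}_X] = \mathbb{E}[\tilde{\varepsilon}_Y \tilde{\varepsilon}_X] = 0$.

Note that Assumption 1 does not require $\Sigma$ to be diagonal, meaning that the noise terms of $X$ can be correlated.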
Consequently, this assumption allows for unobserved confounders affecting the $X$ variables to exist. Under the assumptions above and the assumption that there is no intervention on $Y$, we obtain the following theorem on the target risk of DIP.

Theorem 1 Under the data generation Assumption 1 and the assumption of no intervention on $Y$, i.e. $a^{(1)}_Y = \tilde{a}_Y = 0$, the target population risks of OLSTar, OLSSrc(1) and DIP(1)-mean satisfy

$$\tilde{R}\left(f_{\mathrm{OLSTar}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1} b}, \tag{16}$$

$$\tilde{R}\left(f^{(1)}_{\mathrm{OLSSrc}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1} b} + \frac{\left(\sigma^2 b^\top \Sigma^{-1} \left(a^{(1)}_X - \tilde{a}_X\right)\right)^2}{\left(1 + \sigma^2 b^\top \Sigma^{-1} b\right)^2}, \tag{17}$$

$$\tilde{R}\left(f^{(1)}_{\mathrm{DIP}}\right) = \tilde{R}\left(f^{(1)}_{\mathrm{DIPOracle}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1/2} G^{(1)}_{\mathrm{DIP}} \Sigma^{-1/2} b}, \tag{18}$$

where $G^{(1)}_{\mathrm{DIP}} = \Sigma^{1/2} Q^{(1)}_{\mathrm{DIP}} \left(Q^{(1)\top}_{\mathrm{DIP}} \Sigma Q^{(1)}_{\mathrm{DIP}}\right)^{-1} Q^{(1)\top}_{\mathrm{DIP}} \Sigma^{1/2}$ is a projection matrix with rank $d - 1$ and $Q^{(1)}_{\mathrm{DIP}} \in \mathbb{R}^{d \times (d-1)}$ is a matrix with columns formed by the vectors that complete the vector $u^{(1)}$ to an orthonormal basis. Here $u^{(1)} = \dfrac{a^{(1)}_X - \tilde{a}_X}{\left\| a^{(1)}_X - \tilde{a}_X \right\|_2}$ when the source and target distributions are not identical, i.e. $a^{(1)}_X \neq \tilde{a}_X$. When $a^{(1)}_X = \tilde{a}_X$, meaning that the source distribution equals the target distribution, then $u^{(1)} = 0$ and we have $\tilde{R}(f^{(1)}_{\mathrm{DIP}}) = \tilde{R}(f^{(1)}_{\mathrm{OLSSrc}}) = \tilde{R}(f_{\mathrm{OLSTar}})$.

The proof of Theorem 1 is provided in Appendix B.1. Comparing Equation (18) with Equation (16), the target population risk of DIP(1) is larger than that of OLSTar but lower than the target population risk of the Causal estimator (which equals $\sigma^2$). The target population risk of OLSSrc(1) depends on the magnitude of the difference in intervention $\| a^{(1)}_X - \tilde{a}_X \|_2$, while the target risk of DIP(1) is independent of that magnitude. Consequently, when the difference in intervention becomes large, DIP(1) can outperform OLSSrc(1).

Moreover, the first equality of Equation (18) shows that DIP(1) achieves the same target population risk as DIPOracle(1). DIPOracle(1) differs from OLSTar by only one linear constraint. The connection between DIP(1) and DIPOracle(1) intuitively explains why the DIP(1) target risk should be close to that of OLSTar.
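As a sanity check on the closed forms in Equations (16)-(18), the short Python sketch below evaluates them numerically. It assumes our reading of the theorem (in particular that $G^{(1)}_{\mathrm{DIP}}$ enters through $\Sigma^{-1/2} G^{(1)}_{\mathrm{DIP}} \Sigma^{-1/2} = Q (Q^\top \Sigma Q)^{-1} Q^\top$), and it plugs in the parameters of Example 2, taken to be $b = (1, 1, 0)^\top$, $\Sigma = 0.1\, I_3$, $\sigma^2 = 0.2$, $a^{(1)}_X = (1, 1, 1)^\top$ and $\tilde{a}_X = (-1, -1, -1)^\top$. The function name `theorem1_risks` is hypothetical.

```python
import numpy as np

def theorem1_risks(b, Sigma, sigma2, a_src, a_tar):
    """Target population risks of OLSTar, OLSSrc and DIP-mean
    following the closed forms of Equations (16)-(18)."""
    Sinv = np.linalg.inv(Sigma)
    s = sigma2 * b @ Sinv @ b
    r_oracle = sigma2 / (1.0 + s)                                   # Eq. (16)
    shift = a_src - a_tar
    r_src = r_oracle + (sigma2 * b @ Sinv @ shift) ** 2 / (1.0 + s) ** 2  # Eq. (17)
    norm = np.linalg.norm(shift)
    if norm == 0.0:
        r_dip = r_oracle                                            # identical environments
    else:
        # Columns 2..d of the full QR factor complete u to an orthonormal basis.
        q, _ = np.linalg.qr((shift / norm).reshape(-1, 1), mode="complete")
        Q = q[:, 1:]
        # Sigma^{-1/2} G Sigma^{-1/2} simplifies to Q (Q^T Sigma Q)^{-1} Q^T.
        M = Q @ np.linalg.inv(Q.T @ Sigma @ Q) @ Q.T
        r_dip = sigma2 / (1.0 + sigma2 * b @ M @ b)                 # Eq. (18)
    return r_oracle, r_src, r_dip

b = np.array([1.0, 1.0, 0.0])
Sigma = 0.1 * np.eye(3)
risks = theorem1_risks(b, Sigma, 0.2,
                       a_src=np.array([1.0, 1.0, 1.0]),
                       a_tar=np.array([-1.0, -1.0, -1.0]))
print(risks)  # approximately (0.040, 2.600, 0.0857)
```

The three values agree with the target risks of OLSTar, OLSSrc(1) and DIP(1) reported for Example 2 in Table 1, which supports the reading of the formulas used here.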
In fact, with additional assumptions on how the interventions $a^{(1)}_X$ and $\tilde{a}_X$ are positioned and simplifications on the covariance matrix $\Sigma$, we obtain the following corollary, which clearly highlights the difference between the target population risk of DIP and the oracle target population risk of OLSTar.

Corollary 2 In addition to the assumptions in Theorem 1, suppose $\Sigma = \frac{\sigma^2}{\rho} I_d$ with $\rho > 0$. Then

$$\tilde{R}\left(f_{\mathrm{OLSTar}}\right) = \frac{\sigma^2}{1 + \rho \|b\|_2^2}, \tag{19}$$

$$\tilde{R}\left(f^{(1)}_{\mathrm{OLSSrc}}\right) = \frac{\sigma^2}{1 + \rho \|b\|_2^2} + \frac{\rho^2 \left(u^{(1)\top} b\right)^2 \left\| a^{(1)}_X - \tilde{a}_X \right\|_2^2}{\left(1 + \rho \|b\|_2^2\right)^2}, \tag{20}$$

$$\tilde{R}\left(f^{(1)}_{\mathrm{DIP}}\right) = \frac{\sigma^2}{1 + \rho \|b\|_2^2 - \rho \left(u^{(1)\top} b\right)^2}, \tag{21}$$

where $u^{(1)} = \dfrac{a^{(1)}_X - \tilde{a}_X}{\left\| a^{(1)}_X - \tilde{a}_X \right\|_2}$ if $a^{(1)}_X \neq \tilde{a}_X$ and $u^{(1)} = 0$ otherwise. Additionally, if $a^{(1)}_X - \tilde{a}_X$ is generated randomly from the Gaussian distribution $\mathcal{N}(0, \tau^2 I_d)$, then for a constant $t$ satisfying $0 < t \leq \frac{d}{2}$, with probability at least $1 - \exp(-t/16) - 2\exp(-t/4)$, we have

$$\tilde{R}\left(f^{(1)}_{\mathrm{OLSSrc}}\right) \leq \frac{\sigma^2}{1 + \rho \|b\|_2^2} + \frac{\rho^2 \tau^2 t \|b\|_2^2}{2 \left(1 + \rho \|b\|_2^2\right)^2}, \qquad \tilde{R}\left(f^{(1)}_{\mathrm{DIP}}\right) \leq \frac{\sigma^2}{1 + \rho \left(1 - \frac{t}{d}\right) \|b\|_2^2}. \tag{22}$$

The proof of Corollary 2 is provided in Appendix B.2. Under the conditions of Corollary 2, Equation (21) shows that the gap between the target population risks of DIP(1) and OLSTar is smaller when the direction of the difference in intervention $u^{(1)}$ becomes less aligned with the vector $b$. When the intervention is generated randomly from a Gaussian distribution, the bound (22) shows that the gap between the target population risks of DIP(1) and OLSTar is small with high probability. The gap only comes from a term of order $1/d$ in the denominator of the target risk. So when the dimension $d$ is large, this difference in target population risk between DIP(1) and OLSTar becomes negligible.

As for OLSSrc, it is now clear from Corollary 2 that the target risk of OLSSrc(1) depends on the magnitude of the difference in intervention $\tau$. For brevity, we only provide the target risk upper bound of OLSSrc(1).
The lower bound should hold similarly up to constant factors, because the risks are derived with equality in Equations (20)-(21) and tight Gaussian concentration bounds are used to obtain the upper bounds. As the magnitude of the difference in intervention $\tau$ increases, OLSSrc(1) will eventually have a larger target risk than DIP(1).

The intuition behind the success of DIP(1) over OLSSrc(1) is that the DIP matching penalty in Equation (10) allows the DIP(1) estimator to equalize the source and target covariate interventions by projecting them to a common space. As a result, given the DIP matching penalty, computing least squares on the source data is the same as computing least squares on the target data. This observation also explains why the target risk of DIP(1) matches that of DIPOracle(1). With this intuition in mind, it is not hard to imagine extending the guarantees of DIP(1) to the generic form of DIP in Equation (9) under appropriate assumptions. Specifically, under Assumption 1, as long as the function class $\mathcal{V}$ is chosen such that the DIP matching penalty $D(v(X), v(\tilde{X}))$ ensures that the conditionals $v(X) \mid Y$ and $v(\tilde{X}) \mid \tilde{Y}$ have the same distribution, then no matter how $\mathcal{U}$ is chosen, the target risks of DIP and DIPOracle always match. Moreover, we know that the target risk of DIPOracle is close to the oracle target risk over the function class $\{u \circ v \mid u \in \mathcal{U}, v \in \mathcal{V}\}$. So the target risk of DIP in this extended setting is also close to the oracle one.

4.1.2 Benefit of more source environments for DIP

We show how to make use of more independent source environments to improve the performance of DIP. Here we consider $M$ ($M \geq 2$) source environments, where for each $m \in \{1, \ldots, M\}$ the m-th source environment is generated independently according to the source environment in Assumption 1, except for the unknown interventions $a^{(m)}_X$, and we still have $a^{(m)}_Y = 0$.
First, we state a corollary revealing that it is possible to pick the best source environment for DIP based on the lowest source risk. Second, we discuss the reason behind the performance improvement after picking the best source environment.

First, based on the proof of Theorem 1, we derive the following corollary on the source population risk.

Corollary 3 Under the data generation Assumption 1 and the assumption of no intervention on $Y$, i.e. $a^{(1)}_Y = \tilde{a}_Y = 0$, the source population risk of DIP(1)-mean as defined in Equation (2) satisfies

$$R^{(1)}\left(f^{(1)}_{\mathrm{DIP}}\right) = \tilde{R}\left(f^{(1)}_{\mathrm{DIP}}\right). \tag{23}$$

The proof of Corollary 3 is provided in Appendix B.3. According to Corollary 3, one can read off the target population risk directly from the source population risk. Since the choice of source environment 1 is arbitrary, the same result holds for any of the $M$ source environments. We have $R^{(m)}(f^{(m)}_{\mathrm{DIP}}) = \tilde{R}(f^{(m)}_{\mathrm{DIP}})$ for any $m \in \{1, \ldots, M\}$. Consequently, having more source environments allows one to apply DIP between each source environment and the target environment one by one, and then pick the estimator $f^{(m)}_{\mathrm{DIP}}$ with the lowest source population risk to reduce the target population risk.

Second, we reason about the type of source environment that reduces the target population risk the most. According to Equations (16) and (18) in Theorem 1, the projection matrix $G^{(1)}_{\mathrm{DIP}}$ is the term that makes the target population risks of DIP(1) and DIP(m) different ($m \neq 1$). If the vector $\Sigma^{-1/2} b$ is in the span of the projection matrix $G^{(1)}_{\mathrm{DIP}}$, then DIP(1) achieves the oracle target population risk. Otherwise, the target population risk of DIP(1) depends on the norm of the component of $\Sigma^{-1/2} b$ outside the span of the projection matrix $G^{(1)}_{\mathrm{DIP}}$.

The above intuition becomes clearer under the additional assumptions of Corollary 2. There we obtain a simple form for the projection matrix, $G^{(1)}_{\mathrm{DIP}} = I_d - u^{(1)} u^{(1)\top}$, where $u^{(1)} = \dfrac{a^{(1)}_X - \tilde{a}_X}{\left\| a^{(1)}_X - \tilde{a}_X \right\|_2}$ assuming $a^{(1)}_X \neq \tilde{a}_X$. Under the same assumptions, the denominator in the DIP target risk in Equation (18) becomes $1 + \rho \|b\|_2^2 - \rho \left(u^{(1)\top} b\right)^2$. If $a^{(1)}_X - \tilde{a}_X$ is orthogonal to $b$, then DIP(1) achieves the oracle target population risk. Otherwise, DIP(1) has a larger target population risk than OLSTar. With more source environments, it is more likely that one finds a source environment $m$ such that $a^{(m)}_X - \tilde{a}_X$ is closer to being orthogonal to $b$.

Based on the above intuition, we propose the following DIP variant that provides a weighting scheme to take advantage of multiple source environments.

DIPweigh-mean: the population DIP estimator that makes use of multiple source environments based on the source risks.

$$f_{\mathrm{DIPweigh}}(x) := \frac{1}{\sum_{m=1}^{M} e^{-\eta s_m}} \sum_{m=1}^{M} e^{-\eta s_m} \left(x^\top \beta^{(m)}_{\mathrm{DIP}} + \beta^{(m)}_{\mathrm{DIP},0}\right), \qquad s_m := R^{(m)}\left(f^{(m)}_{\mathrm{DIP}}\right), \tag{24}$$

where $\eta > 0$ is a constant. Choosing $\eta$ to be $\infty$ is equivalent to choosing the source estimator with the lowest source risk. DIPweigh weights all the source predictions based on the source risk of each source environment. Here we introduce the weighted version of DIP rather than directly selecting the source estimator with the lowest source risk. The main intuition is that in finite samples, averaging the predictions from several source environments with low source risks can take advantage of a larger sample size. For example, in the presence of a less related source environment with a very large sample size, one might still include it if the reduction in variance outweighs the increase in bias.

4.1.3 Failure scenarios of DIP

The target population risk guarantees of DIP in Theorem 1 rely on the anticausal data generation Assumption 1 and the assumption of no intervention on $Y$. We already showed in Example 1 that DIP cannot outperform OLSSrc in the causal prediction setting.
Even in the anticausal prediction setting, we identify two scenarios where DIP(1)-mean might have large target risk: when the DIP matching penalty is not well-suited for the underlying type of intervention that generates the data, and when there is intervention on $Y$.

Consider the following data generation model, where the source environment is generated from the following SCM with interventions on the variance

$$\begin{bmatrix} X^{(1)} \\ Y^{(1)} \end{bmatrix} = \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix} \begin{bmatrix} X^{(1)} \\ Y^{(1)} \end{bmatrix} + \begin{bmatrix} a^{(1)}_X \odot \varepsilon^{(1)}_X \\ \varepsilon^{(1)}_Y \end{bmatrix},$$

and the target environment is generated from the same SCM except that the intervention on $X$ takes the form $\tilde{a}_X \odot \tilde{\varepsilon}_X$. Here $\odot$ denotes the d-dimensional element-wise multiplication. This intervention is called variance shift noise intervention. The intervention terms $a^{(1)}_X$ and $\tilde{a}_X$ are fixed vectors in $\mathbb{R}^d$, and the noise terms are kept the same as in Assumption 1. Under the data generation model in this subsection, the matching penalty in DIP(1)-mean (10) is always satisfied because both the left and right hand sides are zero. Consequently, DIP(1)-mean becomes the same estimator as OLSSrc. Matching the mean between the source and target distributions in this case is no longer a good idea, because the intervention is on the variance rather than the mean.

To adapt to the new type of intervention, one can consider DIP-std, which puts the matching penalty on the standard deviations, DIP-std+, which puts the matching penalty on the means, standard deviations and 25% quantiles, or DIP-MMD, which uses more generic distributional matching via MMD. The exact formulations consist of replacing the DIP mean matching penalty in Equation (10) with the appropriate matching penalties, and they are formally introduced in Appendix A.1. If the assumptions are set up such that the intervention type agrees with the matching penalty in DIP, analyzing the theoretical performance of DIP-std, DIP-std+ or DIP-MMD under linear SCMs works similarly to analyzing that of DIP-mean.
Hence we leave out the theoretical guarantees of DIP-std, DIP-std+ and DIP-MMD for brevity. The empirical performance of DIP-std+ and DIP-MMD is shown through simulations and real data experiments in Section 5.

The second failure scenario of DIP is when there is intervention on $Y$. This failure scenario of DIP is already noticeable from Example 3 in Section 3.5.3. Here we provide a corollary that quantifies the additional target population risk incurred by DIP if there is intervention on $Y$.

Corollary 4 Under the data generation Assumption 1, the target population risks of OLSTar and DIP(1)-mean satisfy

$$\tilde{R}\left(f_{\mathrm{OLSTar}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1} b}, \qquad \tilde{R}\left(f^{(1)}_{\mathrm{DIP}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1/2} G_2 \Sigma^{-1/2} b} + \left(a^{(1)}_Y - \tilde{a}_Y\right)^2, \tag{25}$$

where $G_2 = \Sigma^{1/2} Q_2 \left(Q_2^\top \Sigma Q_2\right)^{-1} Q_2^\top \Sigma^{1/2}$ is a projection matrix with rank $d - 1$, and $Q_2 \in \mathbb{R}^{d \times (d-1)}$ is a matrix with columns formed by the vectors that complete the vector $\dfrac{a^{(1)}_X + a^{(1)}_Y b - \tilde{a}_X - \tilde{a}_Y b}{\left\| a^{(1)}_X + a^{(1)}_Y b - \tilde{a}_X - \tilde{a}_Y b \right\|_2}$ to an orthonormal basis.

The proof of Corollary 4 is provided in Appendix B.3. According to Equation (25) in Corollary 4, the target population risk has an extra term that depends on the difference between the target and source $Y$ interventions. The target population risk of DIP increases as the difference between the target and source $Y$ interventions increases. This corollary highlights the result that DIP has a large target risk when the intervention on $Y$ is large.

4.2 Anticausal domain adaptation with intervention on Y

In this subsection, we study anticausal domain adaptation where there is intervention on the label $Y$. As illustrated in Example 3 in Section 3.5.3 and in the last section, the intervention on $Y$ can cause DIP to have a large target population risk. In fact, allowing arbitrary interventions on both $X$ and $Y$ can also lead to unidentifiable cases where two data generation models result in the same target covariate distribution in anticausal domain adaptation.
If that is the case, any DA estimator based solely on the target covariate distribution will have large target risk. The following example illustrates one concrete unidentifiable case.

A simple unidentifiable example: The target data is generated from the following SCM

$$
\tilde Y = \tilde\varepsilon_Y + \tilde a_Y, \qquad
\tilde X = 2\tilde Y + \tilde\varepsilon_X + \tilde a_X,
$$

where $\tilde X$ is a 1-dimensional random variable, $\tilde\varepsilon_Y \sim N(0, 1)$ and $\tilde\varepsilon_X \sim N(0, 0.1)$. Assume we observe the target covariate distribution $\tilde X \sim N(3, 4.1)$, whose variance is $2^2 \cdot 1 + 0.1 = 4.1$ under the noise variances above. Without observing the target label distribution of $\tilde Y$, it is impossible to tell whether the intervention is $(\tilde a_X, \tilde a_Y) = (0, 1.5)$, $(1, 1)$ or $(3, 0)$. This is because all three interventions result in the same target covariate distribution. However, the conditional distributions $\tilde Y \mid \tilde X$ are different. Consequently, any estimator of the conditional mean $\mathbb{E}[\tilde Y \mid \tilde X]$ without access to the target label distribution cannot recover it correctly.

To make the anticausal domain adaptation problem with intervention on $Y$ tractable, additional assumptions on the structure of the interventions are needed. Here we adopt one type of assumption introduced by Gong et al. (2016); Heinze-Deml and Meinshausen (2021). This line of work assumes the existence of conditionally invariant components (CICs) across all source and target environments. That is, there exists an unknown transformation $T$ of the covariates such that the conditional distribution $T(X) \mid Y$ is invariant across source and target environments. In the following, we first prove target population risk guarantees for CIP and CIRM under the conditionally invariant components assumption. We show that the CIP and CIRM target risks are much less dependent on the $Y$ intervention than the DIP target risk. Finally we discuss extensions and variants of CIP and CIRM.
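The unidentifiable example at the beginning of this subsection can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name `target_moments` is hypothetical, and it assumes that $N(0, 1)$ and $N(0, 0.1)$ are parameterized by their variances. It computes the marginal distribution of $\tilde X$ and the linear conditional mean of $\tilde Y$ given $\tilde X$ for the three candidate interventions.

```python
# Illustrative check of the unidentifiable example (hypothetical helper names).
# SCM from the text: Y = eps_Y + a_Y,  X = 2*Y + eps_X + a_X.
VAR_EPS_Y = 1.0   # assumed to be the variance of eps_Y
VAR_EPS_X = 0.1   # assumed to be the variance of eps_X
COEF = 2.0        # coefficient of Y in the equation for X


def target_moments(a_x, a_y):
    """Return mean of X, variance of X, and slope/intercept of E[Y | X = x]."""
    mean_y = a_y
    mean_x = COEF * a_y + a_x
    var_x = COEF ** 2 * VAR_EPS_Y + VAR_EPS_X
    slope = COEF * VAR_EPS_Y / var_x          # Cov(X, Y) / Var(X)
    intercept = mean_y - slope * mean_x
    return mean_x, var_x, slope, intercept


interventions = [(0.0, 1.5), (1.0, 1.0), (3.0, 0.0)]
results = [target_moments(a_x, a_y) for a_x, a_y in interventions]

for (a_x, a_y), (mean_x, var_x, slope, intercept) in zip(interventions, results):
    cond_mean_at_3 = slope * 3.0 + intercept
    print(f"(a_X, a_Y) = ({a_x}, {a_y}): mean(X) = {mean_x:.2f}, "
          f"var(X) = {var_x:.2f}, E[Y | X = 3] = {cond_mean_at_3:.2f}")
```

All three interventions give the same mean and variance for $\tilde X$, while the conditional mean at $\tilde X = 3$ equals the respective $\tilde a_Y$, so it differs across the three cases.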
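4.2.1 Target risk guarantees of CIP and CIRM with multiple source environments and conditionally invariant components

We start by stating the data generation assumptions for CIP and CIRM to have target risk guarantees, namely an SCM and the existence of conditionally invariant components.

Assumption 2 There are $M$ ($M \ge 2$) source environments. Each data point in the $m$-th source environment is generated i.i.d. according to distribution $P^{(m)}$ specified by the following SCM

$$
\begin{bmatrix} X^{(m)} \\ Y^{(m)} \end{bmatrix}
= \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix}
\begin{bmatrix} X^{(m)} \\ Y^{(m)} \end{bmatrix}
+ \begin{bmatrix} a_X^{(m)} \\ a_Y^{(m)} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_X^{(m)} \\ \varepsilon_Y^{(m)} \end{bmatrix}.
$$

Each data point in the target environment is generated i.i.d. according to distribution $\tilde P$ specified with the same SCM except for the mean shift intervention term

$$
\begin{bmatrix} \tilde X \\ \tilde Y \end{bmatrix}
= \begin{bmatrix} B & b \\ \omega^\top & 0 \end{bmatrix}
\begin{bmatrix} \tilde X \\ \tilde Y \end{bmatrix}
+ \begin{bmatrix} \tilde a_X \\ \tilde a_Y \end{bmatrix}
+ \begin{bmatrix} \tilde\varepsilon_X \\ \tilde\varepsilon_Y \end{bmatrix},
$$

where $I_d - B$ is invertible, the prediction problem is anticausal ($\omega = 0$), the intervention terms $\begin{bmatrix} a_X^{(m)} \\ a_Y^{(m)} \end{bmatrix}$ and $\begin{bmatrix} \tilde a_X \\ \tilde a_Y \end{bmatrix}$ are vectors in $\mathbb{R}^{d+1}$, and the noise terms $\begin{bmatrix} \varepsilon_X^{(m)} \\ \varepsilon_Y^{(m)} \end{bmatrix}$ and $\begin{bmatrix} \tilde\varepsilon_X \\ \tilde\varepsilon_Y \end{bmatrix}$ share the same distribution with $\mathbb{E}[\varepsilon_X^{(m)}] = \mathbb{E}[\tilde\varepsilon_X] = 0$, $\mathbb{E}[\varepsilon_X^{(m)} \varepsilon_X^{(m)\top}] = \mathbb{E}[\tilde\varepsilon_X \tilde\varepsilon_X^\top] = \Sigma$, $\mathbb{E}[\varepsilon_Y^{(m)}] = \mathbb{E}[\tilde\varepsilon_Y] = 0$, and $\mathbb{E}[(\varepsilon_Y^{(m)})^2] = \mathbb{E}[\tilde\varepsilon_Y^2] = \sigma^2$. In addition, the noise terms on $X$ and $Y$ are uncorrelated, $\mathbb{E}[\varepsilon_Y^{(m)} \varepsilon_X^{(m)}] = \mathbb{E}[\tilde\varepsilon_Y \tilde\varepsilon_X] = 0$. The interventions do not span the whole space (to ensure the existence of conditionally invariant components),

$$
\dim \operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big) = p \le d - 1, \qquad (26)
$$

and the target $X$ intervention is in the span of the source $X$ interventions,

$$
\tilde a_X - a_X^{(1)} \in \operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big). \qquad (27)
$$

Compared to Assumption 1, Assumption 2 involves multiple source environments instead of one. Also, the way the source and target environments are generated up to Equation (26) is the same as in Assumption 1. Equation (26) assumes that the $X$ interventions do not span the whole covariate space $\mathbb{R}^d$. It ensures the existence of an invariant linear projection of the covariates given the label $Y$. More precisely, this assumption allows us to find a vector $v$ which is orthogonal to the span of $(I_d - B)^{-1}\big(a_X^{(2)} - a_X^{(1)}\big), \ldots, (I_d - B)^{-1}\big(a_X^{(M)} - a_X^{(1)}\big)$, so that the projection of the covariates $X^\top v$ has the same intervention term across source environments. This particular linear projection of the covariates becomes one conditionally invariant component. To make sure the same conditionally invariant component is also valid in the target environment, Equation (27) requires that the target $X$ intervention is in the span of the source $X$ interventions.

Given the data generation assumption above, we have the following theorem on the target risk of CIP and CIRM.

Theorem 5 Under the data generation Assumption 2, the population target risks of CIP-mean and CIRM(m)-mean for any $m \in \{1, \ldots, M\}$ satisfy

$$
R(f_{\mathrm{OLSTar}}) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1} b}, \qquad (28)
$$

$$
R(f_{\mathrm{CIP}}) = \frac{\sigma^2 + \Delta_Y}{1 + (\sigma^2 + \Delta_Y)\, b^\top \Sigma^{-1/2} G_{\mathrm{CIP}} \Sigma^{-1/2} b}
+ \frac{(\tilde a_Y - \bar a_Y)^2 - \Delta_Y}{\big[1 + (\sigma^2 + \Delta_Y)\, b^\top \Sigma^{-1/2} G_{\mathrm{CIP}} \Sigma^{-1/2} b\big]^2}, \qquad (29)
$$

$$
R\big(f^{(m)}_{\mathrm{CIRM}}\big) = \frac{\sigma^2}{1 + \sigma^2 b^\top \Sigma^{-1/2} G^{(m)}_{\mathrm{DIP}} \Sigma^{-1/2} b}
+ \frac{\big(a_Y^{(m)} - \tilde a_Y\big)^2}{\big[1 + \sigma^2 b^\top \Sigma^{-1/2} G^{(m)}_{\mathrm{DIP}} \Sigma^{-1/2} b\big]^2}, \qquad (30)
$$

where $\bar a_Y = \frac{1}{M}\sum_{j=1}^M a_Y^{(j)}$ and $\Delta_Y = \frac{1}{M}\sum_{j=1}^M \big(a_Y^{(j)} - \bar a_Y\big)^2$. $G_{\mathrm{CIP}} = \Sigma^{1/2} Q_{\mathrm{CIP}} \big(Q_{\mathrm{CIP}}^\top \Sigma Q_{\mathrm{CIP}}\big)^{-1} Q_{\mathrm{CIP}}^\top \Sigma^{1/2}$ is a projection matrix with rank $d - p$, where $Q_{\mathrm{CIP}} \in \mathbb{R}^{d \times (d-p)}$ is a matrix with columns formed by an orthonormal basis of the orthogonal complement of $\operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big)$. $G^{(m)}_{\mathrm{DIP}}$ is a projection matrix defined in the same way as $G^{(1)}_{\mathrm{DIP}}$ in Theorem 1.

The proof of Theorem 5 is provided in Appendix C.1. Comparing the target population risk of CIRM(m) in Equation (30) with that of DIP(m) in Equation (25), CIRM(m) reduces the dependency on the difference in $Y$ intervention $a_Y^{(m)} - \tilde a_Y$. This is because we always have $1 + \sigma^2 b^\top \Sigma^{-1/2} G^{(m)}_{\mathrm{DIP}} \Sigma^{-1/2} b > 1$. The target population risk of CIRM(m) when there is intervention on $Y$ in Equation (30) has an additional term depending on $a_Y^{(m)} - \tilde a_Y$ when compared to that of DIP(m) when there is no intervention on $Y$ in Equation (18). The additional term becomes close to zero when $\sigma^2 b^\top \Sigma^{-1/2} G^{(m)}_{\mathrm{DIP}} \Sigma^{-1/2} b$ is large.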
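As a small worked illustration of this comparison, the following display isolates the terms that depend on the $Y$ intervention in Equations (25) and (30). The shorthands $s^{(m)}$ and $\delta^{(m)}$ are introduced here only for readability and are not notation from the theorem; the numerical value of $s^{(m)}$ is a hypothetical example.

```latex
% Shorthands (assumptions of this illustration, not the paper's notation)
s^{(m)} := \sigma^2\, b^\top \Sigma^{-1/2} G^{(m)}_{\mathrm{DIP}} \Sigma^{-1/2} b \;\ge\; 0,
\qquad
\delta^{(m)} := a_Y^{(m)} - \tilde a_Y .

% Y-intervention terms of the two target risks
\text{DIP}^{(m)} \text{ (Equation (25))}: \quad \big(\delta^{(m)}\big)^2,
\qquad
\text{CIRM}^{(m)} \text{ (Equation (30))}: \quad \frac{\big(\delta^{(m)}\big)^2}{\big(1 + s^{(m)}\big)^2} .

% Hypothetical numbers: s^{(m)} = 9 and \delta^{(m)} = 2
\big(\delta^{(m)}\big)^2 = 4,
\qquad
\frac{\big(\delta^{(m)}\big)^2}{\big(1 + s^{(m)}\big)^2} = \frac{4}{100} = 0.04 .
```

In this hypothetical case the contribution of the $Y$ intervention to the target risk shrinks by a factor of $(1 + s^{(m)})^2 = 100$ when moving from DIP(m) to CIRM(m).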
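In fact, without additional assumptions beyond Assumption 2, the dependence on $a_Y^{(m)} - \tilde a_Y$ or similar terms is unavoidable for any DA estimator. Because Assumption 2 does not prevent $b$ from being in $\operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big)$, it can still result in a slightly unidentifiable example similar to the one presented at the beginning of Section 4.2. See Appendix D for ways to get rid of the $a_Y^{(m)} - \tilde a_Y$ dependence at the cost of a small additional target risk.

In the same spirit as Corollary 2, we present a corollary that puts additional assumptions on the positions of the interventions $a_X^{(m)}$ and $\tilde a_X$ to make the results in Theorem 5 easier to understand.

Corollary 6 In addition to Assumption 2, suppose $\Sigma = \frac{\sigma^2}{\rho} I_d$ with $\rho > 0$. Then

$$
R(f_{\mathrm{OLSTar}}) = \frac{\sigma^2}{1 + \rho \lVert b \rVert_2^2}, \qquad (31)
$$

$$
R(f_{\mathrm{CIP}}) = \frac{\sigma^2}{1 + \rho \lVert b \rVert_2^2 - \rho \lVert P_{\mathrm{CIP}}^\top b \rVert_2^2}
+ \frac{(\tilde a_Y - \bar a_Y)^2 - \Delta_Y}{\big[1 + \rho \lVert b \rVert_2^2 - \rho \lVert P_{\mathrm{CIP}}^\top b \rVert_2^2\big]^2}, \qquad (32)
$$

$$
R\big(f^{(m)}_{\mathrm{CIRM}}\big) = \frac{\sigma^2}{1 + \rho \lVert b \rVert_2^2 - \rho \big(u^{(m)\top} b\big)^2}
+ \frac{\big(\tilde a_Y - a_Y^{(m)}\big)^2}{\big[1 + \rho \lVert b \rVert_2^2 - \rho \big(u^{(m)\top} b\big)^2\big]^2}, \qquad (33)
$$

where $\bar a_Y = \frac{1}{M}\sum_{m=1}^M a_Y^{(m)}$ and $\Delta_Y = \frac{1}{M}\sum_{m=1}^M \big(a_Y^{(m)} - \bar a_Y\big)^2$ as in Theorem 5, $P_{\mathrm{CIP}} = Q_{\mathrm{CIP}}^{\perp} \in \mathbb{R}^{d \times p}$ is the matrix with columns formed by an orthonormal basis of $\operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big)$, and $u^{(m)} = \frac{a_X^{(m)} - \tilde a_X}{\lVert a_X^{(m)} - \tilde a_X \rVert_2}$.

Additionally, if $a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}, \xi$ are generated independently from the Gaussian distribution $N(0, \tau^2 I_d)$ and $\tilde a_X - a_X^{(1)} = P_{\mathrm{CIP}} \xi$, $d \ge 6M$, and $t$ is a constant with $0 < t \le d/2$, then with probability at least $\big(1 - 2\exp(-d/32) - 2\exp(-M/32)\big)\big(1 - \exp(-t/16) - 2\exp(-t/4)\big)$, we have

$$
\dim \operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big) = M - 1
$$

and additionally

$$
R(f_{\mathrm{CIP}}) \le \frac{\sigma^2}{1 + \rho \big(1 - \frac{3(M-1)}{d}\big) \lVert b \rVert_2^2}
+ \frac{(\tilde a_Y - \bar a_Y)^2 - \Delta_Y}{\big[1 + \rho \big(1 - \frac{3(M-1)}{d}\big) \lVert b \rVert_2^2\big]^2}, \qquad (34)
$$

$$
R(f_{\mathrm{CIP}}) \ge \frac{\sigma^2}{1 + \rho \big(1 - \frac{M-1}{3d}\big) \lVert b \rVert_2^2}
+ \frac{(\tilde a_Y - \bar a_Y)^2 - \Delta_Y}{\big[1 + \rho \big(1 - \frac{M-1}{3d}\big) \lVert b \rVert_2^2\big]^2}, \qquad (35)
$$

$$
R\big(f^{(m)}_{\mathrm{CIRM}}\big) \le \frac{\sigma^2}{1 + \rho \big(1 - \frac{t}{d}\big) \lVert b \rVert_2^2}
+ \frac{\big(\tilde a_Y - a_Y^{(m)}\big)^2}{\big[1 + \rho \big(1 - \frac{t}{d}\big) \lVert b \rVert_2^2\big]^2}. \qquad (36)
$$

The proof of Corollary 6 is provided in Appendix C.2.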
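The bounds (35) and (36) can be compared numerically in the special case where all $Y$ interventions are equal, so that the second term of each bound vanishes. The sketch below rests on that assumption and on our reading of the leading terms of (35) and (36); the function names and parameter values are hypothetical and chosen only to satisfy $d \ge 6M$, $0 < t \le d/2$ and $M \ge 3t + 1$.

```python
# Minimal sketch: leading terms of bounds (35) and (36) when all Y
# interventions are equal (assumption), so the second terms are zero.

def oracle_risk(sigma2, rho, b_norm2):
    """Target risk of OLSTar from Equation (31)."""
    return sigma2 / (1.0 + rho * b_norm2)


def cip_lower_bound(sigma2, rho, b_norm2, M, d):
    """Leading term of the CIP lower bound (35)."""
    return sigma2 / (1.0 + rho * (1.0 - (M - 1) / (3.0 * d)) * b_norm2)


def cirm_upper_bound(sigma2, rho, b_norm2, t, d):
    """Leading term of the CIRM upper bound (36)."""
    return sigma2 / (1.0 + rho * (1.0 - t / d) * b_norm2)


# Hypothetical values: d >= 6M, 0 < t <= d/2 and M >= 3t + 1 all hold.
sigma2, rho, b_norm2 = 1.0, 1.0, 10.0
d, M, t = 100, 16, 2

oracle = oracle_risk(sigma2, rho, b_norm2)
cip_lo = cip_lower_bound(sigma2, rho, b_norm2, M, d)
cirm_up = cirm_upper_bound(sigma2, rho, b_norm2, t, d)

print(f"oracle = {oracle:.4f}, CIRM upper = {cirm_up:.4f}, CIP lower = {cip_lo:.4f}")
```

With these values the CIRM upper bound lies between the oracle risk and the CIP lower bound, because $(M-1)/(3d) \ge t/d$ whenever $M \ge 3t + 1$.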
Under the assumptions of Corollary 6, it is clear that for $M$ and $d$ sufficiently large, with high probability, CIRM(m) has smaller target population risk than CIP when all $Y$ interventions are equal, $a_Y^{(1)} = \ldots = a_Y^{(M)} = \tilde a_Y$. This is because the last term in Equation (32) becomes zero and can be ignored, and $\frac{M-1}{3d} \ge \frac{t}{d}$ when $M \ge 3t + 1$. Intuitively, CIP only uses the conditionally invariant components (roughly $d - p$ coordinates) to build the estimator, while CIRM(m) takes advantage of the other coordinates of $X$ (roughly $d - 1$ coordinates).

The intuition behind CIRM becomes apparent after the theorem statement. CIRM(m) is indeed a combination of DIP(m) and CIP: it applies CIP in the first step to obtain the conditionally invariant component, which serves as a proxy of the label $Y$ to correct for the intervention on $Y$; then it applies DIP(m) with the corrected target covariate distribution to improve the target prediction performance. The conditionally invariant components identified by CIP are useful as a proxy of $Y$, but they alone do not predict $Y$ very well, especially when there are only a few of them. DIP has good target risk guarantees when there is no $Y$ intervention. When there is $Y$ intervention, CIRM uses CIP to correct for the perturbation of the $Y$ distribution and to create a residual which does not suffer from the intervention, and then applies DIP on the residual.

Since CIRM can be seen as applying DIP on the label-distribution-corrected source and target data, it is natural to introduce a weighted version of CIRM called CIRMweigh, similar to DIPweigh (24). A precise formulation of CIRMweigh can be found in Appendix A.1.

Under the assumptions of Corollary 6, we summarize the target risk upper bounds of all DA methods in this paper in Table 2. For brevity, we only present the target population risk upper bounds that hold with high probability when the interventions are generated i.i.d. from the Gaussian distribution $N(0, \tau^2 I_d)$.
The upper bounds are tight up to constant factors in the denominators because we first proved the exact target risks with equality and then applied Gaussian concentration to obtain the upper bounds. Consequently, the upper bounds serve the comparison purpose.

| Setting | Estimator | Target population risk upper bound (interventions under general position) |
|---|---|---|
| | OLSTar (oracle) | $\dfrac{\sigma^2}{1 + \rho \lVert b \rVert_2^2}$ |
| No intervention on $Y$ (Corollary 2) | OLSSrc(1) | $\dfrac{\sigma^2}{1 + \rho \lVert b \rVert_2^2} + \dfrac{c\,\tau^2 \lVert b \rVert_2^2}{\big[1 + \rho \lVert b \rVert_2^2\big]^2}$ |
| No intervention on $Y$ (Corollary 2) | DIP(1) | $\dfrac{\sigma^2}{1 + \rho \big(1 - \frac{c}{d}\big) \lVert b \rVert_2^2}$ |
| Intervention on $Y$ with CICs, $M$ sources (Corollary 6) | DIP(m) | $\dfrac{\sigma^2}{1 + \rho \big(1 - \frac{c}{d}\big) \lVert b \rVert_2^2} + \big(\tilde a_Y - a_Y^{(m)}\big)^2$ |
| Intervention on $Y$ with CICs, $M$ sources (Corollary 6) | CIP | $\dfrac{\sigma^2}{1 + \rho \big(1 - \frac{c(M-1)}{d}\big) \lVert b \rVert_2^2} + \dfrac{(\tilde a_Y - \bar a_Y)^2 - \Delta_Y}{\big[1 + \rho \big(1 - \frac{c(M-1)}{d}\big) \lVert b \rVert_2^2\big]^2}$ |
| Intervention on $Y$ with CICs, $M$ sources (Corollary 6) | CIRM(m) | $\dfrac{\sigma^2}{1 + \rho \big(1 - \frac{c}{d}\big) \lVert b \rVert_2^2} + \dfrac{\big(\tilde a_Y - a_Y^{(m)}\big)^2}{\big[1 + \rho \big(1 - \frac{c}{d}\big) \lVert b \rVert_2^2\big]^2}$ |

Table 2: Summary of target population risk for DA estimators for anticausal domain adaptation under the assumptions of Corollary 2 and Corollary 6. Here $c > 0$ is a generic constant to improve readability, which can be different for different methods. For simplicity, we only present the target population risk upper bounds holding with high probability when the interventions are generated i.i.d. Gaussian $N(0, \tau^2 I_d)$ and when both $M$ and $d$ are large. Under the assumptions of Corollary 2, the target risk of OLSSrc(1) depends on the magnitude of the difference in intervention $\tau$, while the other DA methods in the table do not: DIP(1) outperforms OLSSrc(1). Under the assumptions of Corollary 6, when intervention on $Y$ becomes large, DIP(m) is worse than CIP or CIRM(m). CIRM(m) slightly outperforms CIP.

4.2.2 On relaxing the assumptions needed for CIP and CIRM

We discuss two different relaxations of Assumption 2. First, the two assumptions (26) and (27) in Assumption 2 can be replaced by a single assumption, namely that there exists a non-zero vector $v \in \mathbb{R}^d$ such that $v^\top X^{(1)}, \ldots, v^\top X^{(M)}$ and $v^\top \tilde X$ all have the same distribution. This is also equivalent to saying that the projection of the covariates on the direction $v$ is a conditionally invariant component. Stating the CIC assumption as we did in Assumption 2 has the advantage of separating the assumptions on the source interventions from those on the target intervention. When the dimension $d$ is larger than the number of source environments $M$, the assumption (26) on the dimension of the span of the source $X$ interventions is always satisfied. In the case of $M < d$, the more source environments there are, the more likely it is that the target $X$ intervention is in the span of the source $X$ interventions.

Second, the mean shift noise intervention in Assumption 2 can be relaxed to other types of noise interventions. For example, the noise intervention can be the standard deviation shift as we did for DIP in Section 4.1.3. We may consider CIP-std, which puts the conditional invariance penalty on the standard deviations; CIP-std+, which puts the conditional invariance penalty on the means, standard deviations and 25% quantiles; or CIP-MMD, which uses generic distributional matching to make sure the conditional distribution of $v^\top X$ is invariant across source environments. CIRM can be extended similarly. We do not provide theoretical guarantees for these extensions and we only show their empirical performance through simulation and data experiments in Section 5.

4.3 Mixed-causal-anticausal domain adaptation

In this subsection, we consider the mixed-causal-anticausal DA setting where both causal and anticausal variables are present. In terms of the SCMs, neither the vector $b$ nor $\omega$ in Equation (3) is assumed to be zero. In terms of the graph associated with the SCM, some variables are descendants of the label $Y$ and some variables are ancestors of $Y$. As we have seen in Example 1 and Example 2 in Section 3.5, DA methods such as DIP perform worse than Causal or OLSSrc in the causal setting, while DIP can have smaller target population risk than Causal or OLSSrc in the anticausal setting.
When both causal and anticausal variables are present, it is no longer clear whether more sophisticated DA methods would be better than Causal or OLSSrc. To gain some intuition, we first show a simple mixed-causal-anticausal DA example to illustrate how the standard DIP, CIP and CIRM can have worse target risk than Causal. Then we introduce new algorithms to tackle the mixed-causal-anticausal DA problem.

4.3.1 A mixed-causal-anticausal DA example to illustrate the failure of the standard DIP, CIP and CIRM

Here we provide a simple mixed-causal-anticausal DA problem to illustrate the difficulty of the mixed-causal-anticausal setting. Example 4 is a DA problem with a large number of source environments ($M \ge 4$) and one target environment. The data in the source and target environments are generated independently according to the following SCMs, with the causal diagram on the left and the structural equations on the right of Figure 5.

Source environment $m$:

$$
\begin{aligned}
X_1^{(m)} &= \varepsilon_{X_1}^{(m)} + a_{X_1}^{(m)} \\
X_2^{(m)} &= Y^{(m)} + \varepsilon_{X_2}^{(m)} + a_{X_2}^{(m)} \\
X_3^{(m)} &= Y^{(m)} + \varepsilon_{X_3}^{(m)} \\
Y^{(m)} &= X_1^{(m)} + 0 + a_Y^{(m)}
\end{aligned}
$$

Target environment:

$$
\begin{aligned}
\tilde X_1 &= \tilde\varepsilon_{X_1} + 0 \\
\tilde X_2 &= \tilde Y + \tilde\varepsilon_{X_2} + 0 \\
\tilde X_3 &= \tilde Y + \tilde\varepsilon_{X_3} \\
\tilde Y &= \tilde X_1 + 0 + 0
\end{aligned}
$$

Figure 5: The causal diagram and structural equations for the $m$-th source ($m \in \{1, \ldots, M\}$) and target environments in Example 4.

Here the noise variables follow independent Gaussian distributions with $\varepsilon_{X_i}^{(m)}, \tilde\varepsilon_{X_i} \sim N(0, 0.1)$, $i \in \{1, \ldots, 3\}$. The noise variable of $Y$ is set to zero. The type of intervention is mean shift noise intervention. The variable $X_3$ is never intervened on, so $X_3$ is conditionally invariant. For the first source environment, the intervention is $a^{(1)} = (1, 1, 0, 1)^\top$. For $m \ge 2$, $a^{(m)} = \big(a_{X_1}^{(m)}, a_{X_2}^{(m)}, 0, a_Y^{(m)}\big)^\top$, where each coordinate is generated i.i.d. from $N(0, 10)$, and we ensure that the $M - 1$ interventions span a subspace of dimension 3. For the target environment, the intervention is $(0, 0, 0, 0)^\top$. Compared to Example 3, the inclusion of the causal covariates $X_1$ and $X_2$ makes the DA problem more complicated.
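To make the data generation of Example 4 concrete, the following sketch simulates the first source environment and the target environment and evaluates two simple linear predictors: one that uses only the causal parent $X_1$, and one that uses only the conditionally invariant child $X_3$. It assumes that $N(0, 0.1)$ denotes a variance of 0.1; the function names and the sample size are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_VAR = 0.1  # assumed: N(0, 0.1) is parameterized by its variance


def sample_env(n, a_x1, a_x2, a_y):
    """Draw n points from the Example 4 SCM with mean shifts (a_x1, a_x2, 0, a_y)."""
    eps = rng.normal(0.0, np.sqrt(NOISE_VAR), size=(n, 3))
    x1 = eps[:, 0] + a_x1
    y = x1 + a_y                    # the noise on Y is set to zero
    x2 = y + eps[:, 1] + a_x2
    x3 = y + eps[:, 2]              # X3 is never intervened on
    return np.column_stack([x1, x2, x3]), y


def risk(X, y, beta, beta0):
    """Mean squared error of the linear predictor x -> x @ beta + beta0."""
    return float(np.mean((y - X @ beta - beta0) ** 2))


n = 200_000
X_src, y_src = sample_env(n, 1.0, 1.0, 1.0)   # first source: a^(1) = (1, 1, 0, 1)
X_tar, y_tar = sample_env(n, 0.0, 0.0, 0.0)   # target: no intervention

predictors = {
    "use X1 (causal parent)": (np.array([1.0, 0.0, 0.0]), 0.0),
    "use X3 (invariant child)": (np.array([0.0, 0.0, 1.0]), 0.0),
}
risks = {
    name: (risk(X_src, y_src, beta, beta0), risk(X_tar, y_tar, beta, beta0))
    for name, (beta, beta0) in predictors.items()
}
for name, (r_src, r_tar) in risks.items():
    print(f"{name}: source-1 risk = {r_src:.3f}, target risk = {r_tar:.3f}")
```

The predictor based on $X_1$ pays for the label shift $a_Y^{(1)}$ in the first source environment but is exact in the target, whereas the predictor based on $X_3$ has a risk close to the noise variance in both environments.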
The population estimators when $M$ tends to $\infty$ can be computed explicitly. Writing each estimator as the stacked vector of coefficients and intercept,

$$
\begin{bmatrix} \beta_{\mathrm{OLSTar}} \\ \beta_{\mathrm{OLSTar},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} \beta_{\mathrm{Causal}} \\ \beta_{\mathrm{Causal},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} \beta_{\mathrm{SrcPool}} \\ \beta_{\mathrm{SrcPool},0} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} \beta_{\mathrm{CIP}} \\ \beta_{\mathrm{CIP},0} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} \beta^{(1)}_{\mathrm{CIRM}} \\ \beta^{(1)}_{\mathrm{CIRM},0} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix},
$$

together with the DIP(1) estimator $\big(\beta^{(1)}_{\mathrm{DIP}}, \beta^{(1)}_{\mathrm{DIP},0}\big)$, whose risks are reported in Table 3.

Note that with a large number of source environments, CIP and SrcPool both identify the conditionally invariant component $X_3$. The Causal estimator is the best because we set the noise variable on $Y$ to 0 and the intervention on $Y$ in the target to 0. DIP(1) does not find the Causal estimator because the $Y$ intervention makes it difficult to align the covariate interventions between source and target. CIRM(1) does not find the Causal estimator because, even if it can correct for the label intervention, matching on the causal covariate $X_1$ is not a good idea, for a reason similar to the one we have seen when applying DIP for causal prediction in Example 1. The corresponding population source and target risks are summarized in Table 3. We conclude that, unlike in the anticausal DA setting, in the presence of both causal and anticausal covariates, CIRM, which is based on the idea of domain invariant projection after label correction, no longer outperforms SrcPool.

| Risk \ Method | OLSTar (oracle) | Causal | SrcPool | DIP(1) | CIP | CIRM(1) |
|---|---|---|---|---|---|---|
| Ex 4, source risk $R^{(1)}$ | 1.000 | 1.000 | 0.100 | 0.000 | 0.100 | 0.000 |
| Ex 4, target risk $\tilde R$ | **0.000** | 0.000 | 0.100 | 4.000 | 0.100 | 1.000 |

Table 3: Population source and target risks in Example 4 (the lower the better). The oracle target risk is highlighted in bold. In Example 4 of mixed-causal-anticausal prediction with intervention on $Y$, DIP(1), CIP and CIRM(1) all perform worse than Causal in terms of target population risk.

One might argue that the superior performance of Causal over SrcPool is artificial because we set the target label intervention $\tilde a_Y$ to zero. The concern is valid in general. However, it is not difficult to observe that no matter how we set $\tilde a_Y$, there exist better estimators than SrcPool.
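In fact, if we know $X_1$ is the only causal variable, we can regress $Y$, $X_2$, $X_3$ on $X_1$ and consider the prediction problem on the residuals. The resulting prediction problem is anticausal with intervention on $Y$. Consequently, it can be solved with the DA methods discussed in Section 4.2. We formulate this idea precisely in the next subsection.

4.3.2 New DA methods for mixed-causal-anticausal domain adaptation

Instead of studying the most general mixed-causal-anticausal domain adaptation, we first restrict ourselves to the setting where the rough causal structure around $Y$ is known. That is, we know which covariates are the descendants of $Y$, denoted as $X_D$, and we also know which covariates are the parents of $Y$ or the parents of $X_D$ (denoted as $X_P$), as shown in Figure 6. Assuming the rough causal structure is known, we show that this mixed-causal-anticausal problem can be reduced to the familiar problem in the previous subsections. Assumption 3 makes the data generation requirements precise.

Figure 6: The causal diagram for our mixed-causal-anticausal domain adaptation setup.

Assumption 3 There are $M$ ($M \ge 1$) source environments. Each data point in the $m$-th source environment is generated i.i.d. according to distribution $P^{(m)}$ specified by the following SCM

$$
\begin{bmatrix} X_P^{(m)} \\ X_D^{(m)} \\ Y^{(m)} \end{bmatrix}
= \begin{bmatrix} B_P & 0 & 0 \\ B_{D\text{-}P} & B_D & b_D \\ \omega_P^\top & 0 & 0 \end{bmatrix}
\begin{bmatrix} X_P^{(m)} \\ X_D^{(m)} \\ Y^{(m)} \end{bmatrix}
+ \begin{bmatrix} a_{X,P}^{(m)} \\ a_{X,D}^{(m)} \\ a_Y^{(m)} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{X,P}^{(m)} \\ \varepsilon_{X,D}^{(m)} \\ \varepsilon_Y^{(m)} \end{bmatrix}.
$$

Each data point in the target environment is generated i.i.d. according to distribution $\tilde P$ specified with the same SCM except for the mean shift intervention term

$$
\begin{bmatrix} \tilde X_P \\ \tilde X_D \\ \tilde Y \end{bmatrix}
= \begin{bmatrix} B_P & 0 & 0 \\ B_{D\text{-}P} & B_D & b_D \\ \omega_P^\top & 0 & 0 \end{bmatrix}
\begin{bmatrix} \tilde X_P \\ \tilde X_D \\ \tilde Y \end{bmatrix}
+ \begin{bmatrix} \tilde a_{X,P} \\ \tilde a_{X,D} \\ \tilde a_Y \end{bmatrix}
+ \begin{bmatrix} \tilde\varepsilon_{X,P} \\ \tilde\varepsilon_{X,D} \\ \tilde\varepsilon_Y \end{bmatrix},
$$

where $B_P \in \mathbb{R}^{r \times r}$, $B_D \in \mathbb{R}^{(d-r) \times (d-r)}$ and $B_{D\text{-}P} \in \mathbb{R}^{(d-r) \times r}$ are unknown constant matrices such that $I_d - \begin{bmatrix} B_P & 0 \\ B_{D\text{-}P} & B_D \end{bmatrix}$ is invertible, $\omega_P \in \mathbb{R}^r$ and $b_D \in \mathbb{R}^{d-r}$ are unknown constant vectors, the intervention terms $\big(a_{X,P}^{(m)}, a_{X,D}^{(m)}, a_Y^{(m)}\big)$ and $\big(\tilde a_{X,P}, \tilde a_{X,D}, \tilde a_Y\big)$ are vectors in $\mathbb{R}^{d+1}$, and the noise terms $\big(\varepsilon_{X,P}^{(m)}, \varepsilon_{X,D}^{(m)}, \varepsilon_Y^{(m)}\big)$ and $\big(\tilde\varepsilon_{X,P}, \tilde\varepsilon_{X,D}, \tilde\varepsilon_Y\big)$ share the same distribution with

$$
\mathbb{E}\begin{bmatrix} \varepsilon_{X,P}^{(m)} \\ \varepsilon_{X,D}^{(m)} \end{bmatrix}
= \mathbb{E}\begin{bmatrix} \tilde\varepsilon_{X,P} \\ \tilde\varepsilon_{X,D} \end{bmatrix} = 0,
\qquad
\mathbb{E}\left[\begin{bmatrix} \varepsilon_{X,P}^{(m)} \\ \varepsilon_{X,D}^{(m)} \end{bmatrix}
\begin{bmatrix} \varepsilon_{X,P}^{(m)} \\ \varepsilon_{X,D}^{(m)} \end{bmatrix}^\top\right]
= \mathbb{E}\left[\begin{bmatrix} \tilde\varepsilon_{X,P} \\ \tilde\varepsilon_{X,D} \end{bmatrix}
\begin{bmatrix} \tilde\varepsilon_{X,P} \\ \tilde\varepsilon_{X,D} \end{bmatrix}^\top\right]
= \begin{bmatrix} \Sigma_P & 0 \\ 0 & \Sigma_D \end{bmatrix},
$$

$$
\mathbb{E}\big[\varepsilon_Y^{(m)}\big] = \mathbb{E}[\tilde\varepsilon_Y] = 0,
\qquad
\mathbb{E}\big[(\varepsilon_Y^{(m)})^2\big] = \mathbb{E}\big[\tilde\varepsilon_Y^2\big] = \sigma^2.
$$

Additionally, the noise terms on $X$ and $Y$ are uncorrelated, namely,

$$
\mathbb{E}\left[\varepsilon_Y^{(m)} \begin{bmatrix} \varepsilon_{X,P}^{(m)} \\ \varepsilon_{X,D}^{(m)} \end{bmatrix}\right]
= \mathbb{E}\left[\tilde\varepsilon_Y \begin{bmatrix} \tilde\varepsilon_{X,P} \\ \tilde\varepsilon_{X,D} \end{bmatrix}\right] = 0.
$$

Compared to the SCMs in Assumptions 1 and 2, Assumption 3 introduces additional covariates $X_P$, which are the parents of $Y$ or $X_D$ and are located at the first $r$ coordinates. In the most general DA problem, the information about which covariates are parents may not be available. The information about causal relationships can be obtained, for example, from domain experts or from the causal discovery literature. The causal discovery problem is well studied and we refer the readers to Glymour et al. (2019) for a review of causal discovery methods. We leave the most general mixed-causal-anticausal domain adaptation and the design of new DA methods to future work.

Based on Assumption 3, we can introduce the intermediate random variables

$$
\begin{aligned}
X_I^{(m)} &:= X_D^{(m)} - (I_{d-r} - B_D)^{-1} B_{D\text{-}P} X_P^{(m)} - (I_{d-r} - B_D)^{-1} b_D\, \omega_P^\top X_P^{(m)}, \\
Y_I^{(m)} &:= Y^{(m)} - \omega_P^\top X_P^{(m)}, \\
\tilde X_I &:= \tilde X_D - (I_{d-r} - B_D)^{-1} B_{D\text{-}P} \tilde X_P - (I_{d-r} - B_D)^{-1} b_D\, \omega_P^\top \tilde X_P, \\
\tilde Y_I &:= \tilde Y - \omega_P^\top \tilde X_P.
\end{aligned}
\qquad (37)
$$

The intermediate random variables can be seen as the residuals after regressing on the parent variables $X_P$.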
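The reduction behind Equation (37) can be checked numerically. The sketch below simulates one environment of the SCM in Assumption 3 with randomly drawn coefficients, forms the intermediate variables using the true coefficients, and verifies that they satisfy an anticausal SCM in which $X_P$ no longer appears. The dimensions, coefficient distributions and variable names are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 2, 5          # hypothetical sizes: r parent covariates, d - r descendants
k = d - r


def strictly_upper(size, scale=0.5):
    """Random strictly upper-triangular matrix, so that I - matrix is invertible."""
    return np.triu(rng.normal(0.0, scale, size=(size, size)), k=1)


B_P, B_D = strictly_upper(r), strictly_upper(k)
B_DP = rng.normal(size=(k, r))
b_D, omega_P = rng.normal(size=k), rng.normal(size=r)
a_P, a_D, a_Y = rng.normal(size=r), rng.normal(size=k), 0.7

n = 1000
eps_P = rng.normal(size=(n, r))
eps_D = rng.normal(size=(n, k))
eps_Y = rng.normal(size=n)

# Solve the SCM of Assumption 3 (samples are stored as rows).
X_P = np.linalg.solve(np.eye(r) - B_P, (a_P + eps_P).T).T
Y = X_P @ omega_P + a_Y + eps_Y
inv_D = np.linalg.inv(np.eye(k) - B_D)
X_D = (inv_D @ (B_DP @ X_P.T + np.outer(b_D, Y) + (a_D + eps_D).T)).T

# Intermediate variables from Equation (37).
X_I = X_D - (inv_D @ (B_DP + np.outer(b_D, omega_P)) @ X_P.T).T
Y_I = Y - X_P @ omega_P

# They should satisfy X_I = B_D X_I + b_D Y_I + a_D + eps_D and Y_I = a_Y + eps_Y.
rhs = (B_D @ X_I.T + np.outer(b_D, Y_I) + (a_D + eps_D).T).T
max_err_x = float(np.max(np.abs(X_I - rhs)))
max_err_y = float(np.max(np.abs(Y_I - (a_Y + eps_Y))))
print(f"max deviation in X_I equation: {max_err_x:.2e}, in Y_I equation: {max_err_y:.2e}")
```

Both deviations are at the level of floating-point round-off, which reflects that the parent covariates $X_P$ cancel exactly in the equations for $(X_I, Y_I)$.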
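We observe that the intermediate random variables satisfy the following SCMs

$$
\begin{bmatrix} X_I^{(m)} \\ Y_I^{(m)} \end{bmatrix}
= \begin{bmatrix} B_D & b_D \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} X_I^{(m)} \\ Y_I^{(m)} \end{bmatrix}
+ \begin{bmatrix} a_{X,D}^{(m)} \\ a_Y^{(m)} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_{X,D}^{(m)} \\ \varepsilon_Y^{(m)} \end{bmatrix},
\qquad
\begin{bmatrix} \tilde X_I \\ \tilde Y_I \end{bmatrix}
= \begin{bmatrix} B_D & b_D \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \tilde X_I \\ \tilde Y_I \end{bmatrix}
+ \begin{bmatrix} \tilde a_{X,D} \\ \tilde a_Y \end{bmatrix}
+ \begin{bmatrix} \tilde\varepsilon_{X,D} \\ \tilde\varepsilon_Y \end{bmatrix}.
$$

Using the intermediate random variables $\big(X_I^{(m)}, Y_I^{(m)}\big)$, the mixed-causal-anticausal DA problem can be reduced to the anticausal DA problem. Intuitively, it works as follows. First, regress $Y$ and $X_D$ on the first $r$ covariates $X_P$. Second, create a transformed dataset with new covariates and new labels from the corresponding residuals, in the same way as we define the intermediate random variables. Third, apply the DA methods in the anticausal DA setting to the transformed dataset. Finally, bring back the original covariates and labels in the final estimator.

Based on the above intuition, we introduce the following estimators for the mixed-causal-anticausal DA setting.

DIP⋆(m)-mean: the population domain invariant projection estimator for the mixed-causal-anticausal DA setting.

$$
f^{(m)}_{\mathrm{DIP}^\star}(x) := x^\top \begin{bmatrix} \gamma^{(m)} - \Gamma^{(m)} \beta^{(m)}_{\mathrm{DIP}^\star} \\ \beta^{(m)}_{\mathrm{DIP}^\star} \end{bmatrix} + \beta^{(m)}_{\mathrm{DIP}^\star,0},
$$

$$
\beta^{(m)}_{\mathrm{DIP}^\star}, \beta^{(m)}_{\mathrm{DIP}^\star,0} := \arg\min_{\beta, \beta_0} \; \mathbb{E}_{(X_P, X_D, Y) \sim P^{(m)}} \big(Y_I - X_I^\top \beta - \beta_0\big)^2
\quad \text{s.t.} \quad
\mathbb{E}_{(X_P, X_D) \sim P^{(m)}_X} \big[X_I^\top \beta\big] = \mathbb{E}_{(X_P, X_D) \sim \tilde P_X} \big[X_I^\top \beta\big], \qquad (38)
$$

where $Y_I := Y - X_P^\top \gamma^{(m)}$, $X_I := X_D - \Gamma^{(m)\top} X_P$, and the regression weights $\gamma^{(m)}$ and $\Gamma^{(m)}$ are defined as

$$
\gamma^{(m)}, \gamma^{(m)}_0 := \arg\min_{\gamma, \gamma_0 \in \mathbb{R}^r \times \mathbb{R}} \; \mathbb{E}_{(X_P, X_D, Y) \sim P^{(m)}} \big(Y - X_P^\top \gamma - \gamma_0\big)^2, \qquad (39)
$$

$$
\Gamma^{(m)}, \Gamma^{(m)}_0 := \arg\min_{\Gamma, \Gamma_0 \in \mathbb{R}^{r \times (d-r)} \times \mathbb{R}^{d-r}} \; \mathbb{E}_{(X_P, X_D) \sim P^{(m)}_X} \big\lVert X_D - \Gamma^\top X_P - \Gamma_0 \big\rVert_2^2. \qquad (40)
$$

Corollary 7 Under the data generation Assumption 3, $M = 1$ and the assumption of no intervention on $Y$, i.e. $a_Y^{(1)} = \tilde a_Y = 0$, the target population risks of OLSTar and DIP⋆(1)-mean satisfy

$$
R(f_{\mathrm{OLSTar}}) = \frac{\sigma^2}{1 + \sigma^2 b_D^\top \Sigma_D^{-1} b_D}, \qquad (41)
$$

$$
R\big(f^{(1)}_{\mathrm{DIP}^\star}\big) = \frac{\sigma^2}{1 + \sigma^2 b_D^\top \Sigma_D^{-1/2} G^{(1)}_{\mathrm{DIP}^\star} \Sigma_D^{-1/2} b_D}, \qquad (42)
$$

where $G^{(1)}_{\mathrm{DIP}^\star}$ is a projection matrix with rank $d - 1$.

The proof of Corollary 7 is provided in Appendix C.4. The target risk results in Corollary 7 are very similar to those in Theorem 1.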
This is because we reduce the mixed-causal-anticausal DA problem without intervention on $Y$ under Assumption 3 to the anticausal DA problem without intervention on $Y$ under Assumption 1. According to Equation (42), the target population risk of DIP⋆(1) is lower than the target risk of Causal (which equals $\sigma^2$).

Based on the intuition of DIP⋆, a similar strategy can be applied to extend CIP and CIRM to mixed-causal-anticausal DA problems. The precise formulations of the extensions CIP⋆ and CIRM⋆ are introduced in Appendix A.1. Just as Corollary 7 serves as the equivalent of Theorem 1 in the mixed-causal-anticausal DA setting, the equivalent of Theorem 5 can be established for CIP⋆ and CIRM⋆. Additionally, we can also introduce the weighted version of CIRM⋆, called CIRM⋆weigh, following the discussion of CIRMweigh. The details are omitted.

5. Numerical experiments

In this section, we numerically compare DA methods on simulated, synthetic and real datasets. The experiments are used to validate our theoretical results on finite-sample data, to illustrate DA failure modes when assumptions are violated, and to provide ideas for adapting DA methods to scenarios where not all assumptions are satisfied. Section 5.1 formulates the finite-sample DA estimators from the population DA ones to make sure our implementations are reproducible. Section 5.2 contains simulation experiments from deterministic or randomly generated linear SCMs. Section 5.3 discusses the performance of DA estimators on the MNIST dataset with synthetic interventions. Finally, Section 5.4 illustrates through a real data experiment that DA can be difficult in practice when not much domain knowledge about the data generating model is available. Our code to reproduce all the numerical experiments is publicly available in the GitHub repository https://github.com/yuachen/CausalDA.
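5.1 Finite-sample formulations and hyperparameter choices

Here we introduce the regularized formulations of the finite-sample DIP, CIP and CIRM. The DIP matching penalty in Equation (10) and the conditional invariance penalty in Equation (11) are enforced via regularization terms. The finite-sample versions of their variants can be formulated similarly after translating the constraints to regularization terms. For the sake of space, they are presented in Appendix A.2.

DIP(m)-mean-finite: the finite-sample formulation of the DIP(m)-mean estimator (10). The mean squared difference is used as distributional distance and is enforced via a regularization term,

$$
\hat f^{(m)}_{\mathrm{DIP}}(x) := x^\top \hat\beta^{(m)}_{\mathrm{DIP}} + \hat\beta^{(m)}_{\mathrm{DIP},0},
$$

$$
\hat\beta^{(m)}_{\mathrm{DIP}}, \hat\beta^{(m)}_{\mathrm{DIP},0} := \arg\min_{\beta, \beta_0} \; \frac{1}{n_m} \sum_{i=1}^{n_m} \big(y_i^{(m)} - x_i^{(m)\top} \beta - \beta_0\big)^2
+ \lambda_{\mathrm{match}} \left( \frac{1}{n_m} \sum_{i=1}^{n_m} x_i^{(m)\top} \beta - \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} \tilde x_i^\top \beta \right)^2, \qquad (43)
$$

where $\lambda_{\mathrm{match}}$ is a positive regularization parameter that controls the match between the covariate means of the source and target environments.

CIP-mean-finite: the finite-sample formulation of the CIP-mean estimator (11). The conditional mean is matched across source environments and is enforced via a regularization term,

$$
\hat f_{\mathrm{CIP}}(x) := x^\top \hat\beta_{\mathrm{CIP}} + \hat\beta_{\mathrm{CIP},0},
$$

$$
\hat\beta_{\mathrm{CIP}}, \hat\beta_{\mathrm{CIP},0} := \arg\min_{\beta, \beta_0} \; \frac{1}{M} \sum_{m=1}^{M} \frac{1}{n_m} \sum_{i=1}^{n_m} \big(y_i^{(m)} - x_i^{(m)\top} \beta - \beta_0\big)^2
+ \frac{\lambda_{\mathrm{CIP}}}{M^2} \sum_{m \ne k} \left( \frac{1}{n_m} \sum_{i=1}^{n_m} z^{(m)\top}_{\mathrm{cond},i} \beta - \frac{1}{n_k} \sum_{i=1}^{n_k} z^{(k)\top}_{\mathrm{cond},i} \beta \right)^2, \qquad (44)
$$

where

$$
z^{(k)}_{\mathrm{cond},i} = x_i^{(k)} - \frac{y_i^{(k)} - \bar y^{(k)}}{\frac{1}{n_k} \sum_{j=1}^{n_k} \big(y_j^{(k)} - \bar y^{(k)}\big)^2} \cdot \frac{1}{n_k} \sum_{j=1}^{n_k} \big(y_j^{(k)} - \bar y^{(k)}\big)\, x_j^{(k)},
\qquad
\bar y^{(k)} = \frac{1}{n_k} \sum_{j=1}^{n_k} y_j^{(k)},
$$

and $\lambda_{\mathrm{CIP}}$ is a positive regularization parameter that controls the strength of the conditional invariance penalty. In the finite-sample formulation, the conditional expectation in the constraint of the population formulation (11) is computed via regressing $X^\top \beta$ on $Y$. As a result, the $z^{(k)}_{\mathrm{cond},i}$ are the residuals after regressing $x_i^{(k)}$ on $y_i^{(k)}$.

CIRM(m)-mean-finite: the finite-sample formulation of the CIRM(m)-mean estimator (12). The residual after removing conditionally invariant components is matched between source and target environments. The matching is enforced via a regularization term.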
$$
\hat f^{(m)}_{\mathrm{CIRM}}(x) := x^\top \hat\beta^{(m)}_{\mathrm{CIRM}} + \hat\beta^{(m)}_{\mathrm{CIRM},0},
$$

$$
\hat\beta^{(m)}_{\mathrm{CIRM}}, \hat\beta^{(m)}_{\mathrm{CIRM},0} := \arg\min_{\beta, \beta_0} \; \frac{1}{n_m} \sum_{i=1}^{n_m} \big(y_i^{(m)} - x_i^{(m)\top} \beta - \beta_0\big)^2
+ \lambda_{\mathrm{match}} \left( \frac{1}{n_m} \sum_{i=1}^{n_m} z^{(m)\top}_{\mathrm{res},i} \beta - \frac{1}{\tilde n} \sum_{i=1}^{\tilde n} \tilde z_{\mathrm{res},i}^\top \beta \right)^2, \qquad (45)
$$

where $z^{(m)}_{\mathrm{res},i} = x_i^{(m)} - \big(x_i^{(m)\top} \hat\beta_{\mathrm{CIP}}\big)\, \hat\vartheta_{\mathrm{CIRM}}$ (and $\tilde z_{\mathrm{res},i}$ analogously with $\tilde x_i$), with $\hat\vartheta_{\mathrm{CIRM}}$ defined as

$$
\hat\vartheta_{\mathrm{CIRM}} = \frac{\frac{1}{M} \sum_{m=1}^{M} \frac{1}{n_m} \sum_{i=1}^{n_m} \big(y_i^{(m)} - \bar y^{(m)}\big)\, x_i^{(m)}}{\frac{1}{M} \sum_{m=1}^{M} \frac{1}{n_m} \sum_{i=1}^{n_m} \big(y_i^{(m)} - \bar y^{(m)}\big) \Big( x_i^{(m)\top} \hat\beta_{\mathrm{CIP}} - \frac{1}{n_m} \sum_{j=1}^{n_m} x_j^{(m)\top} \hat\beta_{\mathrm{CIP}} \Big)},
$$

and $\lambda_{\mathrm{match}}$ is a positive regularization parameter similar to the one defined in DIP. Just as the population CIRM depends on the population CIP, the finite-sample CIRM also depends on the finite-sample CIP with the same regularization parameter $\lambda_{\mathrm{CIP}}$.

Regularization parameter choices: Choosing the regularization parameters in finite-sample DA formulations, such as $\lambda_{\mathrm{match}}$ in the DIP formulation (43), is a difficult subject. Because the DA setting here assumes no access to any target labels, one cannot easily get good estimates of the target performance. The classical model selection strategies based on a validation set are no longer applicable in domain adaptation.

We make use of the fact that when DIP works as Theorem 1 predicts, the source population risk of DIP equals the target population risk, as shown in Corollary 3. The source finite-sample risk is then close to the target finite-sample risk up to finite-sample errors. So we would like to choose $\lambda_{\mathrm{match}}$ large enough that the DIP matching penalty takes effect, but not so large that the source finite-sample risk stops being reasonably small. We propose to choose the largest $\lambda_{\mathrm{match}}$ such that the source finite-sample risk is less than two times the source finite-sample risk obtained when $\lambda_{\mathrm{match}}$ is set to zero. The choice of two times is arbitrary here; the precise amount depends on the desired target finite-sample risk, the sample size and the variance of the source finite-sample risk estimate. The precise amount is specified for each DA dataset separately. The regularization parameter $\lambda_{\mathrm{match}}$ in DIPweigh is chosen similarly.

Next, we consider the $\lambda_{\mathrm{CIP}}$ choice in the CIP formulation (44).
Since CIP only uses source data and never touches the target data, we leave a small part of each source dataset out for validation and choose $\lambda_{\mathrm{CIP}}$ based on the validation source data. We choose $\lambda_{\mathrm{CIP}}$ so that the average source risk across source environments is small.

Finally, the finite-sample CIRM formulation (45) requires the choice of both $\lambda_{\mathrm{match}}$ and $\lambda_{\mathrm{CIP}}$. The $\lambda_{\mathrm{CIP}}$ in CIRM is the same as the $\lambda_{\mathrm{CIP}}$ in CIP. With $\lambda_{\mathrm{CIP}}$ fixed, we choose $\lambda_{\mathrm{match}}$ in CIRM as we did for DIP.

5.2 Linear SCM simulations

In this section, we numerically compare DA estimators via simulations from linear SCMs. First, we consider seven simulations with data generated via linear SCMs with mean shift noise interventions. The first three simulations aim at illustrating the results in Theorem 1 and Theorem 5. The last four show the performance of DA estimators when at least one assumption is misspecified. Second, we consider two simulations where the interventions are variance shift noise interventions. Through these two simulations, we demonstrate the necessity of adapting the DIP matching penalties according to the type of interventions. The simulation settings are summarized in Table 4.

| Sim. | # Src envs | Causal direction | Interv. X type | Interv. on Y? | Has CIC? | Better estimator(s) | Baseline estimator(s) |
|---|---|---|---|---|---|---|---|
| (i) | single | anticausal | mean shift | N | - | DIP(1) | OLSSrc(1) |
| (ii) | multiple | anticausal | mean shift | N | - | DIPweigh | OLSPool, DIP(1) |
| (iii) | multiple | anticausal | mean shift | Y | Y | CIRMweigh | OLSPool, DIPweigh |
| (iv) | single | causal | mean shift | N | - | - | OLSSrc(1) |
| (v) | single | mixed | mean shift | N | - | DIP⋆(1) | OLSSrc(1) |
| (vi) | multiple | anticausal | mean shift | Y | N | - | OLSPool |
| (vii) | multiple | mixed | mean shift | Y | Y | CIRM⋆weigh | OLSPool, CIRMweigh |
| (viii) | single | anticausal | var shift | N | - | DIP-std+, DIP-MMD | OLSSrc(1), DIP |
| (ix) | multiple | anticausal | var shift | Y | Y | CIRMweigh-std+, CIRMweigh-MMD | OLSPool |

Table 4: List of linear SCM simulations. The first three simulations are done under the correct assumptions according to the main theorems. Starting from Simulation (iv), some assumptions are modified to show the performance of DA methods under misspecified assumptions. The last two simulations are done with variance shift noise intervention. Y means yes, N means no, - means the information is irrelevant in that setting.

5.2.1 Linear SCM with mean shift noise interventions

We consider seven simulations with data generated via linear SCMs with mean shift noise interventions. In all experiments in this subsection, we use a large sample size $n = 5000$ for each environment and $\tilde n = 5000$ target test data points for the evaluation of the target risk, and we fix the regularization parameters $\lambda_{\mathrm{match}} = 10.0$ and $\lambda_{\mathrm{CIP}} = 1.0$ to focus the discussion on the choice of DA estimators.

(i) Single source anticausal DA without Y intervention: We consider three datasets of dimension $d = 3, 10, 20$. In each dataset, there is one source environment and one target environment generated according to Assumption 1. The matrix $B$ and the vector $b$ are specified as follows.

For $d = 3$, the matrix $B = 0$, $b = [1, 1, 3]$ and $\omega = 0$.

For $d = 10, 20$, the matrix $B$ is upper triangular with zero diagonal and with each entry drawn i.i.d. from $N(0, 0.25)$. Each entry of $b$ is drawn i.i.d. from $N(0, 1)$, and $\omega = 0$.

The interventions $a_X^{(1)}$ and $\tilde a_X$ are generated i.i.d. from $N(0, I_d)$ and stay the same for all data points in one environment. For each data point, $\varepsilon^{(1)}$ or $\tilde\varepsilon$ is generated i.i.d. from the standard Gaussian distribution $N(0, I_{d+1})$.

For each dataset, the matrix $B$ and the vector $b$ of the SCM are generated once; the noise and interventions are generated i.i.d. 10 times with source and target sample size $n = \tilde n = 5000$. The boxplot of the 10 target risks is reported for OLSTar, OLSSrc(1), DIP(1) and DIPOracle(1) in Figure 7. The target risk of OLSTar is shown as a red dashed horizontal line. The target risk of OLSSrc(1) is shown as a blue dashed horizontal line.
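The $d = 3$ case of simulation (i) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes the DIP(m)-mean-finite objective is least squares plus $\lambda_{\mathrm{match}}$ times the squared difference of mean predictions, which has a closed-form solution; the function names are hypothetical; and the two mean shifts are fixed, hypothetical vectors chosen for reproducibility, whereas the simulation described above draws them from $N(0, I_d)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 3
b = np.array([1.0, 1.0, 3.0])        # d = 3 setting from the text (B = 0, omega = 0)
a_src = np.array([1.0, -1.0, 0.5])   # hypothetical fixed source mean shift
a_tar = np.array([-1.0, 1.0, 0.0])   # hypothetical fixed target mean shift


def sample(num, a_x):
    """Anticausal SCM with B = 0: Y = eps_Y and X = b * Y + a_X + eps_X."""
    y = rng.normal(size=num)
    X = np.outer(y, b) + a_x + rng.normal(size=(num, d))
    return X, y


def fit_dip_mean(X, y, X_target, lam):
    """Least squares plus lam * (difference of mean predictions)^2.

    Setting lam = 0 gives ordinary least squares on the source (OLSSrc).
    """
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar
    delta = x_bar - X_target.mean(axis=0)           # source minus target mean
    A = Xc.T @ Xc / len(y) + lam * np.outer(delta, delta)
    beta = np.linalg.solve(A, Xc.T @ yc / len(y))
    return beta, y_bar - x_bar @ beta               # optimal intercept


def risk(X, y, beta, beta0):
    return float(np.mean((y - X @ beta - beta0) ** 2))


X_src, y_src = sample(n, a_src)
X_tar, _ = sample(n, a_tar)            # unlabeled target covariates used for fitting
X_test, y_test = sample(n, a_tar)      # labeled target data used only for evaluation

ols_risk = risk(X_test, y_test, *fit_dip_mean(X_src, y_src, X_tar, lam=0.0))
dip_risk = risk(X_test, y_test, *fit_dip_mean(X_src, y_src, X_tar, lam=10.0))
print(f"target risk: OLSSrc = {ols_risk:.4f}, DIP-mean (lambda = 10) = {dip_risk:.4f}")
```

For these hypothetical mean shifts, the population counterparts of the two target risks are approximately 0.099 for OLSSrc and 0.085 for DIP-mean, so the finite-sample estimates should show DIP-mean ahead by a small margin.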
The first three plots in Figure 7 show that the target risk of DIP(1) is similar to that of DIPOracle(1) and very close to that of OLSTar, as predicted by Theorem 1. The target risk of DA estimators depends on how the matrix B and the vector b are generated. To make the comparison less dependent on the randomness in the matrix B and the vector b, we complement the boxplot with a scatterplot that compares the target risks of DIP(1) and OLSSrc(1) over 100 random generations of the matrix B for d = 10. The last plot in Figure 7 shows that in 97 out of 100 random data generations, DIP(1) has lower target risk than OLSSrc(1).

[Figure 7 panels: boxplots for linear SCM with d = 3, d = 10 and d = 20; scatterplot against OLSSrc(1), 97% of points below the diagonal]

Figure 7: The target risk comparison in simulation (i) single source anticausal DA without Y intervention (the lower the better). In all three datasets (d = 3, 10, 20) in the left three plots, DIP(1) has lower target risk than OLSSrc(1). The target risk of DIP(1) is close to that of DIPOracle(1). The last plot shows that in 97 out of 100 random coefficient B data generations (d = 10), DIP(1) has lower target risk than OLSSrc(1).

(ii) Multiple source anticausal DA without Y intervention: We consider three datasets of covariate dimension d = 3, 10, 20. In each dataset, there are $M \geq 1$ source environments and one target environment generated according to Assumption 1. The matrix B, the vector b, the interventions a and the noises ε are generated as in simulation (i). The only difference is the availability of multiple ($M \geq 1$) source environments. The boxplot of the 10 target risks is reported for OLSTar, DIP(1) and DIPweigh with an increasing number of source environments in Figure 8. The constant ρ in the DIPweigh formulation (24) is fixed to be 1000.
To make the comparison less dependent on the randomness in the matrix B, we complement the boxplot with a scatterplot that compares the target risks of DIPweigh(M = 8) and DIP(1) over 100 random generations of the matrix B for d = 10. Figure 8 shows that in the anticausal DA setting without Y intervention, the more source environments there are, the lower the target risk DIPweigh can achieve.

[Figure 8 panels: boxplots of DIPweigh with M = 1, 3, 5, 7, 9 for linear SCM with d = 3, d = 10 and d = 20 (Mmax = 9); scatterplot against DIP(1), panel title "M = 7, 89% of pts below the diagonal"]

Figure 8: The target risk comparison in simulation (ii) multiple source anticausal DA without Y intervention (the lower the better). The left three plots show that with more source environments, DIPweigh performs better. The last plot shows that in 89 out of 100 simulations, DIPweigh with M = 8 has lower target risk than DIP(1).

(iii) Multiple source anticausal DA with Y intervention and with CICs: We fix the dimension d = 20 and the number of source environments M = 14. The source and target datasets are generated similarly to simulation (ii) with two exceptions:

1. There are interventions on Y for the target environment: $\tilde a_Y$ follows N(0, 1); it is generated once and then fixed for the target environment.
2. There are conditionally invariant components (CICs): the interventions on X only apply to the first 10 coordinates. That is, $a_X^{(m)}$ and $\tilde a_X$ have the last 10 coordinates equal to zero.

Note that we do not explicitly assume Equation (27) in Assumption 2. Instead, we expect that the interventions from a large number of source environments will span the subspace of vectors whose last 10 coordinates are zero, so that assumption (27) is satisfied.
The leftmost plot in Figure 9 compares the boxplots of the target risks of OLSTar, SrcPool, DIPweigh, CIP and CIRMweigh with a single generation of B and 10 random generations of the interventions and the noises. CIRMweigh has lower target risk than SrcPool and DIPweigh. The right three plots in Figure 9 show the scatterplots comparing the pairwise target risks in 100 random generations of B and the interventions for the following pairs: CIP vs DIPweigh, CIRMweigh vs SrcPool, and CIRMweigh vs DIPweigh. The percentages of points below the diagonal are reported in the titles. Both CIP and CIRM have lower target risk than DIPweigh. DIPweigh loses its target risk guarantees because of the intervention on Y. CIRMweigh also outperforms SrcPool. Note that due to finite sample errors, there are rare cases (about 5%) where CIRM does not outperform SrcPool or DIPweigh.

[Figure 9 panels: boxplot for linear SCM with d = 20, M = 14 (one box beyond the y-axis limit); scatterplot against DIPweigh, 88% of points below the diagonal; scatterplot against SrcPool, 94% of points below the diagonal; scatterplot against DIPweigh, 95% of points below the diagonal]

Figure 9: Target risk comparison in simulation (iii) multiple source anticausal DA with Y intervention and with CICs (the lower the better). CIRMweigh has smaller target risk than SrcPool. In the presence of Y intervention, the target risk of DIPweigh can be much larger than that of SrcPool. The scatterplots are for 100 random coefficient B data generations.

(iv) Single source causal DA without Y intervention: We consider three datasets of covariate dimension d = 3, 10, 20 similar to simulation (i) except that the prediction direction is changed from anticausal to causal: $\omega \neq 0$ and b = 0. Figure 10 shows that DIP(1) can have larger target risk than OLSSrc(1). This simulation result makes it clear that DIP is not useful for causal DA in general. Example 3 in Section 3.5.3 is not just a pathological failure example of DIP.
(v) Single source mixed DA without Y intervention: We consider two datasets of covariate dimension d = 10, 20 similar to simulation (i) except that the prediction direction is changed from anticausal to mixed causal-anticausal: the odd coordinates of b are set to zero and the even coordinates are nonzero. For ω, it is the contrary: the odd coordinates of ω are nonzero and the even ones are zero. The matrix B is generated according to Assumption 3 with one block zero. The left two plots in Figure 11 show boxplots of the target risks of OLSTar, OLSSrc(1), DIPOracle(1), DIP(1) and DIP◇(1) for a random generation of the matrix B. DIP◇(1) uses the ground-truth knowledge of the nonzero coordinates of b. In mixed causal-anticausal DA, DIP(1) has larger risk than OLSSrc(1). With the ground-truth knowledge of the causal variables, DIP◇(1) still has lower risk than OLSSrc(1). The right two plots in Figure 11 confirm the result via scatterplot comparisons of OLSSrc(1), DIPOracle(1), DIP(1) and DIP◇(1) over 100 runs.

[Figure 10 panels: boxplots for linear SCM with d = 3, d = 10 and d = 20; scatterplot against OLSSrc(1), 0% of points below the diagonal]

Figure 10: Target risk comparison in simulation (iv) single source causal domain adaptation without Y intervention (the lower the better). DIP(1) always has larger target risk than OLSSrc(1) in the causal domain adaptation problem over 100 random coefficient B data generations (d = 10).

[Figure 11 panels: boxplots for linear SCM with d = 10 and d = 20 (one box beyond the y-axis limit in each); scatterplot against OLSSrc(1), 12% of points below the diagonal; scatterplot against OLSSrc(1), 93% of points below the diagonal]

Figure 11: Target risk comparison in simulation (v) single source mixed DA without Y intervention (the lower the better). DIP(1) has larger target risk than OLSSrc(1) in the mixed causal-anticausal DA setting. DIP◇(1) with ground-truth causal variables outperforms OLSSrc(1).
The two scatterplots are over 100 random coefficient B data generations (d = 10).

(vi) Multiple source anticausal DA with Y intervention and without CICs: We consider a simulation (d = 20, M = 14) similar to simulation (iii) except that there are no conditionally invariant components (CICs). Figure 12 shows that without CICs, CIRMweigh no longer outperforms SrcPool: CIRMweigh has lower target risk than SrcPool in only 64 of 100 simulations. Interestingly, CIRMweigh still has a large chance of having smaller target risk than DIPweigh. This may be because, even though there are no pure conditionally invariant components, CIP may still be able to pick a combination of covariates that is relatively insensitive to X interventions.

[Figure 12 panels: four scatterplots: against SrcPool, 15% of points below the diagonal; against DIPweigh, 82% of points below the diagonal; against SrcPool, 64% of points below the diagonal; against DIPweigh, 83% of points below the diagonal]

Figure 12: Target risk comparison in simulation (vi) multiple source anticausal DA with Y intervention and without CICs (the lower the better). Without CICs, CIRMweigh no longer outperforms SrcPool.

(vii) Multiple source mixed DA with Y intervention and with CICs: We consider a simulation (d = 20, M = 14) similar to simulation (iii) except that the prediction direction is changed from anticausal to mixed causal-anticausal. The odd coordinates of b are set to zero and the even ones are nonzero. The even coordinates of ω are set to zero and the odd ones are nonzero. The first five even coordinates are set to be anticausal CICs. Additionally, we reduce the variance of $\varepsilon_Y^{(m)}$ to 0.01 to make the causal part important for prediction. Figure 13 shows that in mixed causal and anticausal DA, CIRMweigh has lower target risk than SrcPool only in 64 of 100 simulations.
However, with the true causal covariates, the oracle estimator CIRM◇weigh has lower target risk than SrcPool in 96 of 100 simulations.

[Figure 13 panels: four scatterplots: against SrcPool, 64% of points below the diagonal; SrcPool vs CIRM◇weigh, 96% of points below the diagonal; DIPweigh vs CIRM◇weigh, 81% of points below the diagonal; CIRMweigh vs CIRM◇weigh, 77% of points below the diagonal]

Figure 13: Target risk comparison in simulation (vii) multiple source mixed DA with Y intervention and with CICs (the lower the better). The scatterplots are over 100 random coefficient B data generations. CIRMweigh has lower target risk than SrcPool only in 64 of 100 simulations. However, knowing the indexes of the true causal covariates, the oracle estimator CIRM◇weigh has lower target risk than SrcPool in 96 of 100 simulations. CIRM◇weigh also outperforms DIPweigh and CIRMweigh.

5.2.2 Linear SCM with variance shift noise interventions

We consider two simulations with data generated via linear SCMs with variance shift noise interventions. Since DIP-std+ and DIP-MMD are involved, we have to adapt the regularization parameter choice strategy described at the beginning of Section 5. We apply all DIP variants with the regularization parameter λmatch ranging over the set $\{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$. We choose the largest λmatch such that the source risk is smaller than min(2r, r + 0.01) as the final parameter. Here r is the source risk of OLSSrc. Based on the average source risk, λCIP is still chosen to be 1.0. Since DIP-std+, DIP-MMD, CIRMweigh-std+ and CIRMweigh-MMD do not have closed-form solutions, we optimize these methods with gradient descent in PyTorch. Specifically, the Adam optimizer is used with step size $10^{-4}$ for 2000 epochs to ensure convergence.
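The selection rule for λmatch described above can be sketched as follows. This is only an illustration: `select_lambda_match` is a hypothetical helper, and the source risks in the usage example are made-up numbers, not results from the paper.

```python
def select_lambda_match(source_risk_by_lambda, baseline_risk):
    """Return the largest lambda whose source risk is below min(2r, r + 0.01),
    where r is the source risk of the unpenalized baseline (OLSSrc).
    Returns None if no candidate satisfies the threshold."""
    threshold = min(2.0 * baseline_risk, baseline_risk + 0.01)
    admissible = [
        lam for lam, src_risk in source_risk_by_lambda.items()
        if src_risk < threshold
    ]
    return max(admissible) if admissible else None


# Candidate grid 10^-5, ..., 10^5 as in the text.
grid = [10.0 ** k for k in range(-5, 6)]

# Hypothetical source risks that grow with the penalty strength.
r = 0.05
toy_source_risks = {lam: r + 0.002 * lam for lam in grid}

selected = select_lambda_match(toy_source_risks, baseline_risk=r)
print(selected)
```

Taking the largest admissible value reflects the idea that the matching penalty should be as strong as possible without visibly degrading the fit on the labeled source data.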
(viii) Single source anticausal DA without Y intervention + variance shift noise intervention: We consider a simulation (d = 10) similar to simulation (i) except that the type of intervention is changed from mean shift noise intervention to variance shift noise intervention. The intervention affects the variance of the noise: $g(a_X^{(1)}, \varepsilon_X^{(1)}) = a_X^{(1)} \odot \varepsilon_X^{(1)}$ (elementwise product). The interventions $a_X^{(1)}$ and $\tilde a_X$ are still generated i.i.d. from N(0, Id) and stay the same for all data points in one environment. The leftmost plot in Figure 14 shows the boxplot of the target risks of OLSTar, OLSSrc(1), DIP(1)-mean, DIP(1)-std+ and DIP(1)-MMD. DIP(1)-mean has the same target risk as OLSSrc(1), because the DIP matching penalty on the mean has no effect when the intervention is a variance shift noise intervention. DIP(1)-std+ and DIP(1)-MMD improve upon DIP(1)-mean. The right three plots in Figure 14 show scatterplots of 100 random coefficient B data generations to confirm this observation.

[Figure 14 panels: boxplot for linear SCM with d = 10; scatterplot OLSSrc(1) vs DIP(1)-mean, 63% of points below the diagonal; scatterplot OLSSrc(1) vs DIP(1)-std+, 90% of points below the diagonal; scatterplot against OLSSrc(1), 89% of points below the diagonal]

Figure 14: Target risk comparison in simulation (viii) single source anticausal domain adaptation with variance intervention and without Y intervention (the lower the better). Left: boxplots of the target risks. DIP-mean has the same target risk as Src[1]. DIP-std+ and DIP-MMD outperform Src[1]. Right: three scatterplots of 100 runs comparing DIP-mean, DIP-std+ and DIP-MMD with Src[1].

(ix) Multiple source anticausal DA with Y intervention + variance shift noise intervention: We consider a simulation (d = 20, M = 14) similar to simulation (iii) except that the type of intervention is changed from mean shift noise intervention to variance shift noise intervention. The variance shift noise intervention on X is as specified in simulation
(viii). The intervention on Y is still a mean shift noise intervention as in simulation (iii). The scatterplots in Figure 15 compare CIRMweigh-mean, CIRMweigh-std+, DIPweigh-MMD and CIRMweigh-MMD with SrcPool in 100 runs. CIRMweigh-std+ and CIRMweigh-MMD outperform SrcPool. CIRMweigh-mean does not outperform SrcPool, because it cannot handle the variance shift noise intervention. DIPweigh-MMD performs much worse than SrcPool because of the intervention on Y.

[Figure 15 panels: four scatterplots: SrcPool vs CIRMweigh-mean, 16% of points below the diagonal; SrcPool vs CIRMweigh-std+, 99% of points below the diagonal; SrcPool vs DIPweigh-MMD, 17% of points below the diagonal; SrcPool vs CIRMweigh-MMD, 77% of points below the diagonal]

Figure 15: Target risk comparison in simulation (ix) multiple source anticausal domain adaptation with variance intervention, with Y intervention and with conditionally invariant components (CICs) (the lower the better). CIRMweigh-std+ and CIRMweigh-MMD outperform SrcPool. CIRMweigh-mean does not outperform SrcPool, because it cannot handle the variance shift noise intervention. DIPweigh-MMD performs much worse than SrcPool because of the intervention on Y.

5.3 MNIST experiments with synthetic image interventions

In this section, we generate synthetic image classification datasets by modifying the images from the MNIST dataset (LeCun, 1998). Then we compare the performance of various DA methods on this digit classification task. The interventions are added directly on the image pixels. The DA methods are applied to the last-layer features of a CNN pre-trained on the original MNIST dataset.
Since our DA methods are only defined for regression problems in the previous sections, for this classification problem we adapt the loss function to the softmax loss and use a one-hot encoding of the label Y in the DIP, CIP and CIRM formulations. For the MNIST experiments with synthetic image interventions, a priori the linear SCM assumption is no longer valid and the interventions are not necessarily noise interventions. Here we show that DA methods such as DIP and CIRM still have target performance as predicted by our theorems despite the likely violation of several assumptions. All the methods in this subsection are implemented in PyTorch (Paszke et al., 2019) and are optimized with stochastic gradient descent. Specifically, the stochastic gradient descent (SGD) optimizer is used with step size (learning rate) $10^{-4}$, batch size 500 and 100 epochs.

5.3.1 MNIST with patch intervention

MNIST DA with patch intervention without Y intervention: We take the original MNIST dataset with 60000 training samples and create two synthetic datasets (source and target) as follows. For the source dataset, each training image is masked by the mask (a) in Figure 16. That is, for each training image, the pixels in the white region of the mask (a) are set to white (maximum pixel value). For the target dataset, each training image is masked by the mask (b) in Figure 16. For each experiment, we take a random 20% of samples from the source dataset as the source environment and a random 20% of samples from the target dataset as the target environment. The task is to predict the labels of the images in the target environment without observing any target labels. This experiment is repeated 10 times and we report the boxplot of the 10 target classification accuracies for each DA method. We apply the DA methods on the last-layer features of a pre-trained convolutional neural network (CNN).
The CNN has two convolutional layers and two fully connected layers, similar to the LeNet-5 architecture. It is pre-trained on the original MNIST dataset and reaches a test accuracy of 98.8% on the separate 10000 test images from the MNIST dataset. A priori, it is not clear what type of interventions on the last-layer features occur when we only know that the intervention is on the pixels. To fit into our theoretical and conceptual framework, these induced interventions on the last-layer features would need to be approximated by shift noise interventions. The following DA methods are applied:

Original: CNN model pre-trained on the original MNIST dataset without any modification.
Tar: oracle CNN model trained only on the target environment. It is similar to OLSTar because only the weights in the last layer are trained. We changed the name to Tar because the full model is a neural network.
Src(1): CNN model trained only on the source environment.
DIP(1): CNN model where DIP(1)-mean-finite is applied using the source label and the last-layer features of source and target environments.
DIP(1)-MMD: oracle CNN model where DIP(1)-MMD-finite is applied using the target label and the last-layer features of source and target environments.

Based on the discussion of the regularization parameter choice at the beginning of Section 5, we vary the regularization parameter λmatch over the set $\{10^{k}\}_{k=-5,\ldots,4}$, and we choose the largest λmatch such that the source accuracy has not dropped too much (in this case by no more than 1% of the source accuracy of Src) as the final regularization parameter. The left plot in Figure 17 shows that both DIP(1) and DIP(1)-MMD achieve better target accuracy than Original or Src(1).

MNIST DA with patch intervention with Y intervention: We take the original MNIST dataset with 60000 training samples and create 12 synthetic datasets (M = 11 source environments and one target environment) as follows.
For the m-th source dataset, each training image is masked by the m-th image (from left to right) in Figure 18. For the target dataset, each training or test image is masked by the rightmost image in Figure 18. The target dataset suffers an additional Y intervention: for digits (3, 4, 5, 6, 8, 9) in the target dataset, 80% of the images in the MNIST dataset are removed from the target dataset. For each experiment, we take a random 20% of samples from the source datasets as the 11 source environments and 20% of samples from the target dataset as the target environment.

[Figure 16 panels: (a) mask (a) on a 32 × 32 image; (b) mask (b) on a 32 × 32 image; (c) data samples masked by mask (a); (d) data samples masked by mask (b)]

Figure 16: The two interventions in the MNIST single source domain adaptation without Y intervention. From left to right, mask (a), mask (b) and the corresponding data samples.

[Figure 17 panels: two boxplots of target accuracy (%)]

Figure 17: Left: Target accuracy comparison in MNIST experiment with patch intervention and single source without Y intervention. Right: Target accuracy comparison in MNIST experiment with patch intervention and multiple sources with Y intervention.

This experiment is repeated 10 times and we report the boxplot of the 10 target classification accuracies for each DA method. As in the MNIST experiments above, we apply the DA methods on the last-layer features of the same pre-trained convolutional neural network (CNN). In addition to the methods in the MNIST experiments above, we also consider the following methods and their MMD variants:

DIPweigh: CNN model where DIPweigh-mean-finite is applied on the last-layer features.
CIP: CNN model where CIP-mean-finite is applied on the last-layer features.
CIRMweigh: CNN model where CIRM-mean-finite is applied on the last-layer features.

The regularization parameter λCIP is chosen to be 0.1 based on source risk.
For the regularization parameter λmatch in DIPweigh and CIRMweigh, we first identify the source environment index with the largest weight as the "best source". Then we vary λmatch over the set $\{10^{k}\}_{k=-5,\ldots,4}$, and we choose the largest λmatch such that the source accuracy of the best source does not drop too much (in this case by no more than 1% of the source accuracy of Src applied to the best source) as the final regularization parameter. The right plot in Figure 17 compares the target accuracies of the DA methods in this setting. We observe that CIRMweigh and CIRMweigh-MMD outperform SrcPool and Original. Due to the intervention on Y, the matching penalty of DIPweigh and DIPweigh-MMD is not useful and the two methods perform worse than SrcPool.

Figure 18: The 12 X interventions in the MNIST multiple source domain adaptation with Y intervention. From left to right, the first 11 are the source X interventions, the last is the target X intervention. The target environment suffers an additional Y intervention.

5.3.2 MNIST with rotation intervention

MNIST DA with rotation intervention without Y intervention: We take the original MNIST dataset with 60000 training samples and create three synthetic datasets (2 sources and 1 target) as follows. For each dataset, each training image is rotated counterclockwise by one of the angles {10°, 30°, 45°} as shown in Figure 19. The three datasets are named Rotation 10°, Rotation 30° and Rotation 45° respectively. For the first experiment, we use Rotation 10° as the source environment and Rotation 45° as the target environment. For the second experiment, we use Rotation 30° as the source environment and Rotation 45° as the target environment. For each experiment, we take a random 20% of samples from the source dataset as the source environment and a random 20% of samples from the target dataset as the target environment. The task is to predict the labels of the images in the target environment without observing any target labels.
Except for the change in the type of intervention, the other experimental settings are the same as in MNIST DA with patch intervention without Y intervention in Section 5.3.1. Figure 20a shows the boxplot of the target accuracies of Original, Tar, Src(1), DIP(1) and DIP(1)-MMD for the first experiment. DIP(1)-MMD achieves higher target accuracy than Src(1). Figure 20b shows the boxplot of 10 runs of Original, Tar, Src(1), DIP(1) and DIP(1)-MMD for the second experiment. DIP(1)-MMD achieves higher target accuracy than Src(1). Comparing Figures 20a and 20b, we also observe that the first experiment is a more difficult classification task than the second, as the accuracies achieved by our DA methods are lower in the first experiment.

[Figure 19 panels: (a) Rotation 10°; (b) Rotation 30°; (c) Rotation 45°; (d) samples 10°; (e) samples 30°; (f) samples 45°]

Figure 19: (a)(b)(c): the three interventions in the MNIST rotation intervention domain adaptation without Y intervention. (d)(e)(f): the corresponding data samples under rotation intervention.

MNIST DA with rotation intervention with Y intervention: We take the original MNIST dataset with 60000 training samples and create 5 synthetic datasets (M = 4 source environments and one target environment) as follows. For $m \in \{1, \ldots, 4\}$, in the m-th source dataset, each training image is rotated by $(m \times 15 - 30)^\circ$ clockwise. Each image in the target dataset is rotated by 30° clockwise. The target dataset suffers an additional Y intervention: for digits (3, 4, 5, 6, 8, 9) in the target dataset, 80% of the images in the MNIST dataset are removed from the target dataset. For each experiment, we take a random 20% of samples from the source datasets as the 4 source environments and 20% of samples from the target dataset as the target environment. Except for the change in the type of intervention, the other experimental settings are the same as in MNIST DA with patch intervention with Y intervention in Section 5.3.1. Figure 20c shows the boxplot of the target accuracies of Original, Tar, SrcPool, DIP(1), DIPweigh, CIP, CIRMweigh, DIP-MMD, CIP-MMD and CIRMweigh-MMD. CIP and CIP-MMD outperform SrcPool and Original. Due to the intervention on Y, the matching penalty of DIPweigh and DIPweigh-MMD is not useful and the two methods perform worse than SrcPool. CIP, CIRMweigh and the corresponding MMD variants outperform SrcPool.

[Figure 20 panels: three boxplots of target accuracy (%): (a) src: Rotation 10°, tar: Rotation 45°; (b) src: Rotation 30°, tar: Rotation 45°; (c) MNIST rotation interv. with Y interv.]

Figure 20: (a) Target accuracy comparison in MNIST experiment with rotation intervention and single source without Y intervention from Rotation 10° to Rotation 45°. (b) Target accuracy comparison in MNIST experiment with rotation intervention and single source without Y intervention from Rotation 30° to Rotation 45°. (c) Target accuracy comparison in MNIST experiment with rotation intervention and multiple sources with Y intervention.

5.3.3 MNIST with random translation intervention

We take the original MNIST dataset with 60000 training samples and create two synthetic datasets (source and target) as follows. For the source dataset, each training image is translated horizontally by a distance randomly selected from −0.2 image width to 0.2 image width as shown in Figure 21a. For the target dataset, each training image is translated vertically by a distance randomly selected from −0.2 image height to 0.2 image height as shown in Figure 21b. For each experiment, we take a random 20% of samples from the source dataset as the source environment and a random 20% of samples from the target dataset as the target environment.
The task is to predict the labels of the images in the target environment without observing any target labels. Except for the change in the type of intervention, the other experimental settings are the same as in MNIST DA with patch intervention without Y intervention in Section 5.3.1. Figure 21c shows the boxplot of the target accuracies of Original, Tar, Src(1), DIP(1) and DIP(1)-MMD. DIP(1)-MMD still performs better than Src(1), but it barely improves over Original. Since the intervention is random rather than fixed for each environment, intuitively it is more difficult for our DA methods to achieve high accuracy in this experiment than in the first two experiments, where the intervention is fixed for each environment.

[Figure 21 panels: (a) source samples after random horizontal translation; (b) target samples after random vertical translation; (c) boxplot of target accuracy (%), MNIST translation interv.]

Figure 21: (a) Source samples after random horizontal translation. (b) Target samples after random vertical translation. (c) Target accuracy comparison in MNIST experiment with random translation intervention and single source without Y intervention.

5.4 Experiments on Amazon review dataset with unknown interventions

In this subsection, we compare DA methods on the regression dataset Amazon Review Data (Ni et al., 2019). Unlike the simulated experiments or the MNIST experiments, neither the data generation process nor the type of interventions is known. This dataset contains product reviews and metadata from Amazon in the date range May 1996 - Oct 2018. In our experiment, we consider the task of predicting the review rating from the review text across various product categories (Automotive, Digital Music, Office Products etc.). For each review, the covariates X are the TF-IDF features generated from the raw review text; the label Y is the review rating (a score from 1 to 5). The different product categories are used as source and target environments.
We take the 15 product categories with the largest sample sizes, use 14 of them as source environments and leave the last one as the target environment. The sample size n is 10000 in each source or target environment. The dimension depends on the TF-IDF feature extractor. Here we use both unigrams and bigrams, and build the vocabulary with terms that have a document frequency not smaller than 0.008. This results in feature dimension d = 482. A linear model with ℓ2 regularization is used on top of the TF-IDF features to predict ratings. Without explicit knowledge about the causal structure or the type of interventions, a priori it is no longer clear which DA method is the best. We apply SrcPool and the advanced DA methods DIPweigh, CIP and CIRMweigh on the linear model. Figure 22 shows that the advanced DA methods do not outperform SrcPool except for target environment number 4. We did an additional experiment with a small portion of target labels revealed. We use this small portion of labeled target data to choose the best method out of SrcPool, DIPweigh, CIP and CIRMweigh. The last three boxes (labeled as best20, best40 and best60) in each subplot of Figure 22 show that with 20, 40 or 60 target labels revealed, the method selected on the labeled target data is never worse than SrcPool and can sometimes outperform it. We arrive at the conclusion that in general, without explicit knowledge about the causal structure or the type of interventions, it is ambitious to expect advanced DA methods to always outperform SrcPool. However, with a small portion of labeled target data, we show that DA methods still lead to substantial improvements. Our empirical observation agrees with the concurrent empirical study of DA methods by Wiles et al. (2021).
In a larger-scale experimental framework with many real and synthetic datasets, they find that DA methods can outperform SrcPool in some settings but there is no single best method across all settings. Hence, knowledge about the distribution shift is necessary for successful DA with guarantees.

[Figure 22 panels: target = 3, Cell_Phones_and_Accessories; target = 4, Digital_Music; target = 7, Luxury_Beauty]

Figure 22: Target risk in Amazon review data prediction experiment (the lower the better). Without explicit knowledge about the type of interventions, it is no longer clear which domain adaptation method is the best. From left to right, depending on the target environment choice, the best methods are CIP, DIPweigh and CIRMweigh, respectively.

6. Discussion

In this paper, we propose a theoretical framework using structural causal models (SCMs) to analyze and obtain insights on the prediction performance of several popular domain adaptation methods. First, we show that under the assumptions of anticausal prediction, linear SCM and no intervention on Y, the popular DA method DIP achieves a low target risk. This theoretical result is compatible with many previous empirical results on the usefulness of DIP for domain adaptation. Second, we derive several conditions under which DIP fails to achieve a low target risk: when the prediction problem is causal or mixed causal-anticausal, and when there is label distribution perturbation. To tackle these difficult DA scenarios, we design a new DA method called CIRM, among other modifications. The theoretical analysis is complemented with simulation analysis and real data experiments. The theoretical extension to nonlinear SCMs is a challenging future direction. However, our empirical results suggest that the linear SCM assumption provides useful insights even for cases where a linear SCM does not hold.
The real data experiments show that it can be beneficial to use DA methods when the anticausal prediction assumption is satisfied, but that it can also be dangerous to use DA methods blindly when little is known about the data generation process. Thus, prior knowledge of the underlying causal structure and the types of interventions is often crucial. This prior knowledge can come from domain experts or from causal studies on related datasets with the same variables. How to seamlessly combine causal studies about the causal structure with domain adaptation, with appropriate uncertainty quantification, is one promising future direction.

In the absence of prior information about the causal structure and the types and locations of interventions, one needs to select good DA methods. Assessing the quality of a model or algorithm is difficult in general, but it can be done with access to a small fraction of labeled target data. We show empirically that a small fraction of labeled target data substantially helps to select the best DA method. The minimal number of labeled target data points needed to guarantee that the best DA method is selected remains an open question.

Acknowledgments

Y. Chen and P. Bühlmann have received funding from the European Research Council under the Grant Agreement No 786461 (CausalStats - ERC-2017-ADG). They both acknowledge scientific interaction and exchange at ETH Foundations of Data Science. They also thank Domagoj Cevid, Christina Heinze-Deml, Wooseok Ha, Jinzhou Li, Nicolai Meinshausen, Armeen Taeb, Fanny Yang and Andrii Zadaianchuk for fruitful discussions and for helpful suggestions on presentation and writing.

Appendix A. Summary of DA methods in this paper

In this section, we provide a summary of the DA methods presented in this paper. Methods that do not fit in the main paper due to space limitations are formally introduced here.
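A.1 Population DA methods

First, we formulate the DIP variants that adapt to other types of intervention as mentioned in Section 4.1.3.

DIP(m)-std: the population DIP estimator where the difference between the source and target standard deviations is used as distributional distance. For the m-th source environment, it is defined as
\[
\begin{aligned}
f^{(m)}_{\text{DIP-std}}(x) &:= x^\top \beta^{(m)}_{\text{DIP-std}} + \beta^{(m)}_{\text{DIP-std},0}, \\
\beta^{(m)}_{\text{DIP-std}},\ \beta^{(m)}_{\text{DIP-std},0} &:= \arg\min_{\beta,\beta_0}\ \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad \text{s.t. } \operatorname{Var}_{X\sim\mathcal{P}^{(m)}_X}\left[X^\top\beta\right] = \operatorname{Var}_{X\sim\widetilde{\mathcal{P}}_X}\left[X^\top\beta\right].
\end{aligned}
\tag{46}
\]

DIP(m)-std+: the population DIP estimator where the differences between the source and target means, standard deviations and 25% quantiles are used as distributional distance,
\[
\begin{aligned}
f^{(m)}_{\text{DIP-std+}}(x) &:= x^\top \beta^{(m)}_{\text{DIP-std+}} + \beta^{(m)}_{\text{DIP-std+},0}, \\
\beta^{(m)}_{\text{DIP-std+}},\ \beta^{(m)}_{\text{DIP-std+},0} &:= \arg\min_{\beta,\beta_0}\ \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad \text{s.t. } \mathbb{E}_{X\sim\mathcal{P}^{(m)}_X}\left[X^\top\beta\right] = \mathbb{E}_{X\sim\widetilde{\mathcal{P}}_X}\left[X^\top\beta\right], \\
&\qquad\phantom{\text{s.t. }} \operatorname{Var}_{X\sim\mathcal{P}^{(m)}_X}\left[X^\top\beta\right] = \operatorname{Var}_{X\sim\widetilde{\mathcal{P}}_X}\left[X^\top\beta\right], \\
&\qquad\phantom{\text{s.t. }} \psi_{25\%}\left(X^{(m)\top}\beta\right) = \psi_{25\%}\left(\widetilde{X}^\top\beta\right),
\end{aligned}
\tag{47}
\]
where $\psi_{25\%}$ is the 25% quantile function, which takes a random variable and returns its 25% quantile.

DIP(m)-MMD: the population DIP estimator where the maximum mean discrepancy (MMD) is used as distributional distance,
\[
\begin{aligned}
f^{(m)}_{\text{DIP-MMD}}(x) &:= x^\top \beta^{(m)}_{\text{DIP-MMD}} + \beta^{(m)}_{\text{DIP-MMD},0}, \\
\beta^{(m)}_{\text{DIP-MMD}},\ \beta^{(m)}_{\text{DIP-MMD},0} &:= \arg\min_{\beta,\beta_0}\ \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad \text{s.t. } D_{\text{MMD},\mathcal{H}}\left(X^{(m)\top}\beta,\ \widetilde{X}^\top\beta\right) = 0,
\end{aligned}
\tag{48}
\]
where the maximum mean discrepancy (MMD) (Gretton et al., 2012) with respect to the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ between two random variables $Z_1$ and $Z_2$ is defined as
\[
D_{\text{MMD},\mathcal{H}}(Z_1, Z_2) = \sup_{h\in\mathcal{H}} \left|\mathbb{E}\left[h(Z_1)\right] - \mathbb{E}\left[h(Z_2)\right]\right|.
\]
By default, the RKHS with Gaussian kernel is used throughout this paper.

Second, we introduce the weighted version of CIRM following DIPweigh (24).

CIRMweigh-mean: the population CIRM estimator that weights the source environments based on the source risks. It is defined as follows
\[
f_{\text{CIRMweigh}}(x) := \frac{1}{\sum_{m=1}^M e^{-\eta s_m}} \sum_{m=1}^M e^{-\eta s_m}\left(x^\top\beta^{(m)}_{\text{CIRM}} + \beta^{(m)}_{\text{CIRM},0}\right), \qquad s_m := R^{(m)}\left(f^{(m)}_{\text{CIRM}}\right).
\tag{49}
\]
Here η > 0 is a constant.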
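The risk-based weighting in Equation (49) can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `risk_weighted_predict` and its argument layout are assumptions made for illustration, and the source risks are taken as given numbers rather than estimated.

```python
import numpy as np

def risk_weighted_predict(x, betas, intercepts, source_risks, eta=1.0):
    """Combine M linear source predictors with weights proportional to exp(-eta * s_m).

    x: (n, d) covariates; betas: (M, d); intercepts: (M,); source_risks: (M,).
    All names are hypothetical; only the weighting rule follows Equation (49).
    """
    s = np.asarray(source_risks, dtype=float)
    # Shifting by the minimum leaves the normalized weights unchanged
    # and avoids underflow for large eta.
    w = np.exp(-eta * (s - s.min()))
    w /= w.sum()
    # Column m holds the prediction of the m-th source estimator.
    preds = x @ np.asarray(betas, dtype=float).T + np.asarray(intercepts, dtype=float)
    return preds @ w, w

if __name__ == "__main__":
    betas = np.array([[1.0, 0.0], [0.0, 1.0]])
    intercepts = np.array([0.0, 1.0])
    x = np.array([[2.0, 3.0]])
    for eta in (0.0, 1.0, 1e4):
        pred, w = risk_weighted_predict(x, betas, intercepts, [0.5, 0.1], eta=eta)
        print(eta, w, pred)
```

With η = 0 the weights are uniform, and as η grows the weight concentrates on the source estimator with the smallest source risk.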
Choosing η to be is equivalent to choosing the source estimator with the lowest source risk. The rational behind the use of source risks to weigh the environments follows from the corollary below. Corollary 8 Under the data generation Assumption 2, the m-th source population risk (2) of CIRM(m)-mean satisfies R(m) f(m) CIRM = R f(m) CIRM + a(m) Y ea Y 2 1 + σ2b Σ 1 2 G(m) DIP Σ 1 The proof of Corollary 8 is provided in Appendix C.3. Comparing Corollary 8 with Corollary 3, the source risk of CIRM is no longer exactly equal to its target risk. The source risk has an additional term that depends on the intervention on Y . However, in the case where Σ has eigenvalues much smaller than σ, this additional term is negligible. In these scenarios, the source risk of CIRM still constitutes a good approximation of the target risk of CIRM. It is still possible to apply CIRM for each m {1, . . . , M} and pick the source environment with the lowest source population risk in order to reduce the target population risk. Based on the above ideas, we introduce the following weighted version of CIRM. Third we introduce the CIP and CIRM extensions to deal with mixed-causal-anticausal DA problems in Section 4.3. CIP -mean: the population conditional invariance penalty estimator for the mixed causal anticausal DA setting. f CIP (x) := x γ Γ βCIP βCIP + β(m) CIP ,0 βCIP , βCIP ,0 := arg min β,β0 k=1 E(XP,XD,Y ) P(k) YI XI β β0 2 s.t. E(XP,XD,Y ) P(m) h XI β | YI = y i = E(XP,XD,Y ) P(1) h XI β | YI = y i , y R, m {1, . . . , M} , (51) Domain adaptation under structural causal models where YI := Y X P γ , XI := XD Γ XP, γ , γ 0 := arg min γ,γ0 Rr R E(XP,XD,Y ) Pallsrc Y X P γ γ0 2 , Γ , Γ 0 := arg min Γ,Γ0 Rr (d r) Rd r E(XP,XD) Pallsrc X XD Γ XP Γ0 2 and Pallsrc denotes the uniform mixture of all source distribution. CIRM (m)-mean: the population conditional invariant residual matching estimator using m-th source environment for the mixed causal anticausal DA setting. 
f(m) CIRM (x) := x " γ Γ β(m) CIRM β(m) CIRM + β(m) CIRM ,0 β(m) CIRM , β(m) CIRM ,0 := arg min β,β0 E(XP,XD,Y ) P(m) YI X I β β0 2 s.t. E (XP,XD) P(m) X h β XI X I βCIP ϑCIRM i = E(XP,XD) e PX h β XI X I βCIP ϑCIRM i , (53) where YI := Y X P γ , XI := XD Γ XP with γ and Γ defined in Equation (52) and ϑCIRM := E(XP,XD,Y ) Pallsrc [XI (YI E[YI])] E(XP,XD,Y ) Pallsrc X I βCIP E[X I βCIP ] (YI E[YI]) , (54) with Pallsrc denote the uniform mixture of all source distributions. Finally, we summarize in Table 5 all the population DA methods introduced in this paper. A.2 Finite-sample formulation of DA methods We introduce the finite-sample formulations of the population DA methods in the previous subsection so that they can be implemented to reproduce the results in the numerical experiments in Section 5. The finite-sample DA formulations are summarized in Table 6. The finite-sample formulations of DIP-mean, CIP-mean and CIRM-mean are introduced in Section 5.1. Here we state the finite-sample formulations of weighted variants, matching penalty and mixed-causal-anticausal variants. To formulate the finite-sample versions of DIPweigh-mean (24) and CIRMweigh-mean (49), it is sufficient to replace the corresponding population estimators and the population source risks with finite-sample ones. The mixed-causal-anticausal variants of DIP, CIP and CIRM only requires two additional regressions. They are omitted for the sake of space. Next, we introduce the finite-sample formulations of the matching penalty variants for DIP. Chen and B uhlmann Made-ups to assist explanation Weighted variants Matching penalty variants Mixed variants DIP(m) (10) DIPAbs(m) (14) DIPOracle(m) (15) DIPweigh (24) DIP-std (46) DIP-std+ (47) DIP-MMD (48) DIP (m) (38) DIP weigh CIP (11) N/A N/A CIP-std CIP-std+ CIP-MMD CIP (51) CIRM(m) (12) N/A CIRMweigh (49) CIRM-std CIRM-std+ CIRM-MMD CIRM (m) (53) CIRM weigh Table 5: Summary of population DA methods introduced in this paper. 
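The matching penalty variants of CIP and CIRM can be formulated similarly as those for DIP. They are omitted for the sake of space.

DIP(m)-std-finite: the finite-sample formulation of the DIP(m)-std estimator (46). The difference between the source and target standard deviations is used as distributional distance,
\[
\begin{aligned}
\hat f^{(m)}_{\text{DIP-std}}(x) &:= x^\top \hat\beta^{(m)}_{\text{DIP-std}} + \hat\beta^{(m)}_{\text{DIP-std},0}, \\
\hat\beta^{(m)}_{\text{DIP-std}},\ \hat\beta^{(m)}_{\text{DIP-std},0} &:= \arg\min_{\beta,\beta_0}\ \frac{1}{n_m}\sum_{i=1}^{n_m}\left(y^{(m)}_i - x^{(m)\top}_i\beta - \beta_0\right)^2 + \lambda_{\text{match}}\left(\delta^{(m)} - \tilde\delta\right)^2,
\end{aligned}
\tag{55}
\]
where $\delta^{(m)}$ is the standard deviation of $x^{(m)\top}_1\beta, x^{(m)\top}_2\beta, \ldots, x^{(m)\top}_{n_m}\beta$, and $\tilde\delta$ is the standard deviation of $\tilde x_1^\top\beta, \tilde x_2^\top\beta, \ldots, \tilde x_{\tilde n}^\top\beta$.

DIP(m)-std+-finite: the finite-sample DIP estimator where the differences between the source and target means, standard deviations and 25% quantiles are used as distributional distance, V is linear and U is the singleton of identity mapping. For the m-th source environment, it is defined as
\[
\begin{aligned}
\hat f^{(m)}_{\text{DIP-std+}}(x) &:= x^\top \hat\beta^{(m)}_{\text{DIP-std+}} + \hat\beta^{(m)}_{\text{DIP-std+},0}, \\
\hat\beta^{(m)}_{\text{DIP-std+}},\ \hat\beta^{(m)}_{\text{DIP-std+},0} &:= \arg\min_{\beta,\beta_0}\ \frac{1}{n_m}\sum_{i=1}^{n_m}\left(y^{(m)}_i - x^{(m)\top}_i\beta - \beta_0\right)^2 \\
&\qquad + \lambda_{\text{match}}\left(\mu^{(m)} - \tilde\mu\right)^2 + \lambda_{\text{match}}\left(\delta^{(m)} - \tilde\delta\right)^2 + \lambda_{\text{match}}\left(\psi^{(m)}_{25\%} - \tilde\psi_{25\%}\right)^2,
\end{aligned}
\tag{56}
\]
where $\mu^{(m)}, \delta^{(m)}, \psi^{(m)}_{25\%}$ are respectively the mean, standard deviation and 25% quantile of $x^{(m)\top}_1\beta, \ldots, x^{(m)\top}_{n_m}\beta$, and $\tilde\mu, \tilde\delta, \tilde\psi_{25\%}$ are respectively the mean, standard deviation and 25% quantile of $\tilde x_1^\top\beta, \ldots, \tilde x_{\tilde n}^\top\beta$.

DIP(m)-MMD-finite: the finite-sample DIP estimator where the maximum mean discrepancy is used as distributional distance, V is linear and U is the singleton of identity mapping. For the m-th source environment, it is defined as
\[
\begin{aligned}
\hat f^{(m)}_{\text{DIP-MMD}}(x) &:= x^\top \hat\beta^{(m)}_{\text{DIP-MMD}} + \hat\beta^{(m)}_{\text{DIP-MMD},0}, \\
\hat\beta^{(m)}_{\text{DIP-MMD}},\ \hat\beta^{(m)}_{\text{DIP-MMD},0} &:= \arg\min_{\beta,\beta_0}\ \frac{1}{n_m}\sum_{i=1}^{n_m}\left(y^{(m)}_i - x^{(m)\top}_i\beta - \beta_0\right)^2 + \lambda_{\text{match}}\, D_{\text{MMD},\mathcal{H}}\left(Z^{(m)}, \widetilde Z\right),
\end{aligned}
\tag{57}
\]
where $Z^{(m)} = \{x^{(m)\top}_1\beta, \ldots, x^{(m)\top}_{n_m}\beta\}$ is the set of predicted responses in the m-th source environment, similarly $\widetilde Z = \{\tilde x_1^\top\beta, \ldots, \tilde x_{\tilde n}^\top\beta\}$, and the maximum mean discrepancy (MMD) (Gretton et al., 2012) with respect to the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ between these two sets is
\[
D_{\text{MMD},\mathcal{H}}\left(Z^{(m)}, \widetilde Z\right) = \sup_{h\in\mathcal{H}} \left|\frac{1}{n_m}\sum_{i=1}^{n_m} h\left(x^{(m)\top}_i\beta\right) - \frac{1}{\tilde n}\sum_{i=1}^{\tilde n} h\left(\tilde x_i^\top\beta\right)\right|.
\]
In all experiments, the RKHS with Gaussian kernel is used.

| Original (finite) | Weighted variants | Matching penalty variants | Mixed variants |
| DIP-mean (43) | DIPweigh-mean | DIP-std (55), DIP-std+ (56), DIP-MMD (57) | DIP (m)-mean, DIP weigh-mean |
| CIP-mean (44) | N/A | CIP-std, CIP-std+, CIP-MMD | CIP -mean |
| CIRM-mean (45) | CIRMweigh-mean | CIRM-std, CIRM-std+, CIRM-MMD | CIRM (m)-mean, CIRM weigh-mean |

Table 6: Summary of finite-sample formulations of the DA methods discussed in this paper. The suffixes -finite of the finite-sample DA methods are omitted in the table for brevity. The matching penalty variants of CIP and CIRM can be formulated similarly as those for DIP. The mixed-causal-anticausal variants of DIP, CIP and CIRM only require two additional regressions. They are omitted for the sake of space.

Appendix B. Proofs related to Theorem 1

In this section, we prove Theorem 1 and related corollaries.

B.1 Proof of Theorem 1

At a high level, the proof of Theorem 1 goes by connecting the target population risk of DIP(1) with that of DIPOracle(1), and then bounding the target population risk of DIPOracle(1) as a regularized version of OLSTar.

Using the linear SCM assumption in Assumption 1, each data point in the source environment is generated i.i.d. from the following equation
\[
X^{(1)} = B X^{(1)} + b\, Y^{(1)} + a^{(1)}_X + \varepsilon^{(1)}_X, \qquad Y^{(1)} = \varepsilon^{(1)}_Y.
\tag{58}
\]
Define $H = (I_d - B)^{-1}$. Solving for $X^{(1)}$ in Equation (58), we have
\[
X^{(1)} = H b\, \varepsilon^{(1)}_Y + H a^{(1)}_X + H \varepsilon^{(1)}_X.
\tag{59}
\]
For $(\beta, \beta_0) \in \mathbb{R}^d \times \mathbb{R}$, the residual takes the following form
\[
Y^{(1)} - \beta^\top X^{(1)} - \beta_0 = \left(1 - \beta^\top H b\right)\varepsilon^{(1)}_Y - \left(\beta^\top H a^{(1)}_X + \beta_0\right) - \beta^\top H \varepsilon^{(1)}_X.
\]
Taking expectation and using the fact that the noise has zero mean, we obtain
\[
\mathbb{E}\left[\left(Y^{(1)} - \beta^\top X^{(1)} - \beta_0\right)^2\right] = \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\beta^\top H a^{(1)}_X + \beta_0\right)^2 + \beta^\top H \Sigma H^\top \beta.
\tag{60}
\]
Similarly, we obtain the target expected residual
\[
\mathbb{E}\left[\left(\widetilde Y - \beta^\top \widetilde X - \beta_0\right)^2\right] = \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\beta^\top H \tilde a_X + \beta_0\right)^2 + \beta^\top H \Sigma H^\top \beta.
\tag{61}
\]

Risk of OLSTar: Using the expression for the target residual, the OLSTar estimator in Equation (5) becomes the solution of the following quadratic program
\[
\min_{\beta,\beta_0}\ \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\beta^\top H \tilde a_X + \beta_0\right)^2 + \beta^\top H \Sigma H^\top \beta.
\tag{62}
\]
Solving the quadratic program and using the matrix inversion lemma, we obtain
\[
\beta_{\text{OLSTar}} = \sigma^2 (I_d - B)^\top\left(\Sigma + \sigma^2 b b^\top\right)^{-1} b = (I_d - B)^\top \frac{\sigma^2\Sigma^{-1} b}{1 + \sigma^2 b^\top\Sigma^{-1} b}, \qquad
\beta_{\text{OLSTar},0} = -\beta_{\text{OLSTar}}^\top H \tilde a_X.
\]
The target population risk of OLSTar is
\[
\widetilde R\left(f_{\text{OLSTar}}\right) = \sigma^2 - \sigma^4 b^\top\left(\Sigma + \sigma^2 b b^\top\right)^{-1} b = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1} b},
\tag{63}
\]
where the last equality follows from the matrix inversion lemma.

Risk of OLSSrc(1): The minimization problem of OLSSrc can be solved similarly by changing the target variables to source ones in Equation (62). We obtain
\[
\beta^{(1)}_{\text{OLSSrc}} = \sigma^2 (I_d - B)^\top\left(\Sigma + \sigma^2 b b^\top\right)^{-1} b = (I_d - B)^\top \frac{\sigma^2\Sigma^{-1} b}{1 + \sigma^2 b^\top\Sigma^{-1} b}, \qquad
\beta^{(1)}_{\text{OLSSrc},0} = -\beta^{(1)\top}_{\text{OLSSrc}} H a^{(1)}_X.
\]
Because of the difference in the intercept term, the target population risk of OLSSrc has one additional term
\[
\widetilde R\left(f^{(1)}_{\text{OLSSrc}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1} b} + \frac{\left(\sigma^2 b^\top\Sigma^{-1}\left(a^{(1)}_X - \tilde a_X\right)\right)^2}{\left(1 + \sigma^2 b^\top\Sigma^{-1} b\right)^2}.
\tag{64}
\]

Risk of DIP(1): Using Equation (59), for any $\beta \in \mathbb{R}^d$, we have
\[
\beta^\top X^{(1)} = \beta^\top H b\, \varepsilon^{(1)}_Y + \beta^\top H a^{(1)}_X + \beta^\top H \varepsilon^{(1)}_X.
\]
Together with the similar equation on the target, we obtain
\[
\mathbb{E}\left[\beta^\top X^{(1)}\right] = \mathbb{E}\left[\beta^\top \widetilde X\right] + \beta^\top H\left(a^{(1)}_X - \tilde a_X\right).
\tag{65}
\]
Combining Equations (60) and (65), we observe that the DIP(1) problem (10) becomes a constrained quadratic program
\[
\min_{\beta,\beta_0}\ \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\beta^\top H a^{(1)}_X + \beta_0\right)^2 + \beta^\top H \Sigma H^\top \beta
\quad \text{s.t. } \beta^\top H\left(a^{(1)}_X - \tilde a_X\right) = 0.
\tag{66}
\]
Note that because of the constraint, this quadratic program is the same for the source data and for the target data. As a consequence, we have
\[
\widetilde R\left(f^{(1)}_{\text{DIP}}\right) = \widetilde R\left(f^{(1)}_{\text{DIPOracle}}\right).
\]
Now we solve the constrained quadratic program in Equation (66). First, we observe that the minimization over $\beta_0$ can be solved easily. Second, since $H$ is invertible, we can reparametrize the original problem with $\gamma = H^\top\beta$ and obtain the following problem
\[
\min_{\gamma\in\mathbb{R}^d}\ \left(1 - \gamma^\top b\right)^2\sigma^2 + \gamma^\top\Sigma\gamma
\quad \text{s.t. } \gamma^\top\left(a^{(1)}_X - \tilde a_X\right) = 0.
\tag{67}
\]
Define
\[
u = \frac{a^{(1)}_X - \tilde a_X}{\left\|a^{(1)}_X - \tilde a_X\right\|_2}.
\]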
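Using Gram-Schmidt orthogonalization, we can complete the vector $u$ to form an orthonormal basis $(u, q_1, \ldots, q_{d-1})$. Let $Q_{\text{DIP}} \in \mathbb{R}^{d\times(d-1)}$ be the matrix whose i-th column is $q_i$. Then the mapping $\zeta \mapsto Q_{\text{DIP}}\zeta$ constitutes a bijection between $\mathbb{R}^{d-1}$ and the set $\{\gamma\in\mathbb{R}^d \mid \gamma^\top u = 0\}$. This bijection allows us to transform the constrained quadratic program (67) into the following unconstrained one
\[
\min_{\zeta\in\mathbb{R}^{d-1}}\ \left(1 - \zeta^\top Q_{\text{DIP}}^\top b\right)^2\sigma^2 + \zeta^\top Q_{\text{DIP}}^\top\Sigma Q_{\text{DIP}}\zeta.
\]
Solving the quadratic program by setting the gradient to zero, we obtain the minimizer
\[
\zeta = \sigma^2\left(Q_{\text{DIP}}^\top\left(\sigma^2 b b^\top + \Sigma\right)Q_{\text{DIP}}\right)^{-1} Q_{\text{DIP}}^\top b.
\]
Transforming the variables back, we have
\[
\begin{aligned}
\beta^{(1)}_{\text{DIP}} &= \sigma^2 (I_d - B)^\top Q_{\text{DIP}}\left(Q_{\text{DIP}}^\top\left(\Sigma + \sigma^2 b b^\top\right)Q_{\text{DIP}}\right)^{-1} Q_{\text{DIP}}^\top b
= (I_d - B)^\top \frac{\sigma^2 Q_{\text{DIP}}\left(Q_{\text{DIP}}^\top\Sigma Q_{\text{DIP}}\right)^{-1} Q_{\text{DIP}}^\top b}{1 + \sigma^2 b^\top Q_{\text{DIP}}\left(Q_{\text{DIP}}^\top\Sigma Q_{\text{DIP}}\right)^{-1} Q_{\text{DIP}}^\top b}, \\
\beta^{(1)}_{\text{DIP},0} &= -\beta^{(1)\top}_{\text{DIP}} H a^{(1)}_X, \\
\beta^{(1)}_{\text{DIPOracle}} &= \beta^{(1)}_{\text{DIP}}, \\
\beta^{(1)}_{\text{DIPOracle},0} &= -\beta^{(1)\top}_{\text{DIPOracle}} H \tilde a_X.
\end{aligned}
\]
Consequently, the target population risk can be obtained by replacing $b$ with $Q_{\text{DIP}}^\top b$ and $\Sigma$ with $Q_{\text{DIP}}^\top\Sigma Q_{\text{DIP}}$ in Equation (63):
\[
\widetilde R\left(f^{(1)}_{\text{DIP}}\right) = \widetilde R\left(f^{(1)}_{\text{DIPOracle}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1/2} G_{\text{DIP}}\Sigma^{-1/2} b},
\]
where $G_{\text{DIP}} = \Sigma^{1/2} Q_{\text{DIP}}\left(Q_{\text{DIP}}^\top\Sigma Q_{\text{DIP}}\right)^{-1} Q_{\text{DIP}}^\top\Sigma^{1/2}$ is a projection matrix with rank $d-1$.

B.2 Proof of Corollary 2

Equations (19), (20) and (21) follow directly by plugging $\Sigma = \frac{\sigma^2}{\rho} I_d$ into the corresponding equations in Theorem 1. For the high probability bound, we use several tail inequalities. Let $a = a^{(1)}_X - \tilde a_X$. Since $a$ is generated randomly from $\mathcal{N}(0, \tau^2 I_d)$, for a fixed vector $v\in\mathbb{R}^d$, $a^\top v$ follows $\mathcal{N}(0, \tau^2 v^\top v)$. The standard Gaussian tail bound on $a^\top v$ gives, for $t > 0$,
\[
\mathbb{P}\left(\left|a^\top v\right| \ge \tau\|v\|_2\, t\right) \le 2\exp\left(-t^2/2\right).
\tag{68}
\]
Since $\|a\|_2^2/\tau^2$ follows a chi-square distribution with $d$ degrees of freedom, the standard chi-square tail bound (see e.g. Laurent and Massart (2000)) gives, for $t > 0$,
\[
\mathbb{P}\left(\|a\|_2^2 \le \tau^2 d\left(1 - \frac{t}{\sqrt d}\right)\right) \le \exp\left(-t^2/8\right).
\tag{69}
\]
Combining Equations (68) and (69), with probability at least $1 - \exp(-t^2/8) - 2\exp(-t^2/2)$, we have
\[
\frac{\left(a^\top v\right)^2}{\|a\|_2^2} \le \frac{t^2\|v\|_2^2}{d\left(1 - t/\sqrt d\right)}.
\]
For a constant $t$ satisfying $0 < t \le \frac{\sqrt d}{2}$, we have
\[
\frac{\left(a^\top v\right)^2}{\|a\|_2^2} \le \frac{2t^2\|v\|_2^2}{d}.
\]
Plugging the above high probability bounds into Equations (20) and (21) with $v = b$, with probability at least $1 - \exp(-t^2/8) - 2\exp(-t^2/2)$, we have
\[
\widetilde R\left(f^{(1)}_{\text{OLSSrc}}\right) \le \frac{\sigma^2}{1 + \rho\|b\|_2^2} + \frac{\rho^2\tau^2 t^2\|b\|_2^2}{\left(1 + \rho\|b\|_2^2\right)^2},
\quad\text{and}\quad
\widetilde R\left(f^{(1)}_{\text{DIP}}\right) \le \frac{\sigma^2}{1 + \rho\left(1 - \frac{2t^2}{d}\right)\|b\|_2^2}.
\]
We conclude and obtain the form needed in the corollary by a change of variable from $2t^2$ to $t$.

B.3 Proof of Corollary 3

Corollary 3 shows that the source population risk of DIP(1) is the same as the target population risk of DIP(1). For this, it suffices to observe that the source expected residual (60) and the target expected residual (61) only differ by the term $\beta^\top H a^{(1)}_X$ and the term $\beta^\top H \tilde a_X$. Since the DIP constraint enforces $\beta^\top H\left(a^{(1)}_X - \tilde a_X\right) = 0$ as shown in Equation (66), we obtain that
\[
R^{(1)}\left(f^{(1)}_{\text{DIP}}\right) = \widetilde R\left(f^{(1)}_{\text{DIP}}\right).
\tag{70}
\]

B.4 Proof of Corollary 4

Using the linear SEM assumption in Assumption 1 and the additional intervention on $Y$, each data point in the source environment is generated i.i.d. from the following equation
\[
X^{(1)} = B X^{(1)} + b\, Y^{(1)} + a^{(1)}_X + \varepsilon^{(1)}_X, \qquad Y^{(1)} = a^{(1)}_Y + \varepsilon^{(1)}_Y.
\tag{71}
\]
For $(\beta,\beta_0)\in\mathbb{R}^d\times\mathbb{R}$, the source residual takes the following form
\[
Y^{(1)} - \beta^\top X^{(1)} - \beta_0 = \left(1 - \beta^\top H b\right)\varepsilon^{(1)}_Y + \left(1 - \beta^\top H b\right) a^{(1)}_Y - \beta^\top H a^{(1)}_X - \beta_0 - \beta^\top H\varepsilon^{(1)}_X.
\]
The DIP(1) problem (10) becomes a constrained quadratic program
\[
\begin{aligned}
\min_{\beta,\beta_0}\ & \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\left(1 - \beta^\top H b\right) a^{(1)}_Y - \beta^\top H a^{(1)}_X - \beta_0\right)^2 + \beta^\top H\Sigma H^\top\beta \\
\text{s.t. } & \beta^\top H\left(a^{(1)}_X + a^{(1)}_Y b\right) = \beta^\top H\left(\tilde a_X + \tilde a_Y b\right).
\end{aligned}
\tag{72}
\]
Note that because of the intervention on $Y$, unlike in Theorem 1, this quadratic program is no longer the same for DIP(1) and DIPOracle(1). However, the constrained quadratic program (72) can be solved in the same way as in Appendix B.1 around Equation (66) by introducing
\[
u = \frac{a^{(1)}_X + a^{(1)}_Y b - \tilde a_X - \tilde a_Y b}{\left\|a^{(1)}_X + a^{(1)}_Y b - \tilde a_X - \tilde a_Y b\right\|_2},
\]
and the matrix $Q_2\in\mathbb{R}^{d\times(d-1)}$ whose columns are the vectors that complete $u$ to an orthonormal basis of $\mathbb{R}^d$.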
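The null-space reparametrization used in Appendix B.1 and again here can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, a QR factorization stands in for Gram-Schmidt, and the dimensions, seed and covariance are arbitrary choices. It solves the constrained program in the variable $\gamma = H^\top\beta$ and compares the attained value with the closed form $\sigma^2/(1 + \sigma^2 b^\top Q(Q^\top\Sigma Q)^{-1}Q^\top b)$.

```python
import numpy as np

def orthonormal_complement(u):
    """Return Q (d x (d-1)) whose orthonormal columns span the complement of u."""
    d = u.shape[0]
    # Reduced QR of [u | I_d]: the first column is +/- u / ||u||,
    # the remaining d-1 columns complete it to an orthonormal basis.
    q_full, _ = np.linalg.qr(np.column_stack([u, np.eye(d)]))
    return q_full[:, 1:]

def solve_constrained(b, Sigma, sigma2, Q):
    """Minimize sigma2 * (1 - g @ b)**2 + g @ Sigma @ g over g = Q @ zeta."""
    A = Q.T @ (Sigma + sigma2 * np.outer(b, b)) @ Q
    zeta = sigma2 * np.linalg.solve(A, Q.T @ b)  # zero-gradient condition
    gamma = Q @ zeta
    value = sigma2 * (1.0 - gamma @ b) ** 2 + gamma @ Sigma @ gamma
    return gamma, value

def closed_form_risk(b, Sigma, sigma2, Q):
    """Closed form obtained via the matrix inversion lemma."""
    kappa = b @ Q @ np.linalg.solve(Q.T @ Sigma @ Q, Q.T @ b)
    return sigma2 / (1.0 + sigma2 * kappa)

# Arbitrary illustrative problem instance.
rng = np.random.default_rng(0)
d, sigma2 = 5, 1.5
b = rng.normal(size=d)
u = rng.normal(size=d)            # stands in for the intervention direction
root = rng.normal(size=(d, d))
Sigma = root @ root.T + np.eye(d)  # positive definite noise covariance

Q = orthonormal_complement(u)
gamma, value = solve_constrained(b, Sigma, sigma2, Q)
print(abs(gamma @ u), value, closed_form_risk(b, Sigma, sigma2, Q))
```

The first printed number is the constraint violation $|\gamma^\top u|$, and the last two are the numerically attained minimum and the closed-form value.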
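Following the rest of the proof in Appendix B.1, we obtain
\[
\begin{aligned}
\beta^{(1)}_{\text{DIP}} &= (I_d - B)^\top \frac{\sigma^2 Q_2\left(Q_2^\top\Sigma Q_2\right)^{-1} Q_2^\top b}{1 + \sigma^2 b^\top Q_2\left(Q_2^\top\Sigma Q_2\right)^{-1} Q_2^\top b}, \\
\beta^{(1)}_{\text{DIP},0} &= \left(1 - \beta^{(1)\top}_{\text{DIP}} H b\right) a^{(1)}_Y - \beta^{(1)\top}_{\text{DIP}} H a^{(1)}_X.
\end{aligned}
\tag{73}
\]
Note that the intercept $\beta^{(1)}_{\text{DIP},0}$ has an extra term due to the intervention $a^{(1)}_Y$ on $Y$. For $(\beta,\beta_0)\in\mathbb{R}^d\times\mathbb{R}$, the target residual takes the following form
\[
\widetilde Y - \beta^\top\widetilde X - \beta_0 = \left(1 - \beta^\top H b\right)\tilde\varepsilon_Y + \left(1 - \beta^\top H b\right)\tilde a_Y - \beta^\top H\tilde a_X - \beta_0 - \beta^\top H\tilde\varepsilon_X.
\]
Plugging $\left(\beta^{(1)}_{\text{DIP}}, \beta^{(1)}_{\text{DIP},0}\right)$ of Equation (73) into the above residual, we obtain the population target risk, which has an extra term that depends on $a^{(1)}_Y - \tilde a_Y$:
\[
\widetilde R\left(f^{(1)}_{\text{DIP}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1/2} G_2\Sigma^{-1/2} b} + \left(a^{(1)}_Y - \tilde a_Y\right)^2,
\]
where $G_2 = \Sigma^{1/2} Q_2\left(Q_2^\top\Sigma Q_2\right)^{-1} Q_2^\top\Sigma^{1/2}$ is a projection matrix with rank $d-1$.

Appendix C. Proofs related to Theorem 5

In this section, we prove Theorem 5 and related corollaries.

C.1 Proof of Theorem 5

Using the linear SEM assumption in Assumption 2, each data point in the m-th source environment is generated i.i.d. from the following equation
\[
X^{(m)} = B X^{(m)} + b\, Y^{(m)} + a^{(m)}_X + \varepsilon^{(m)}_X, \qquad Y^{(m)} = a^{(m)}_Y + \varepsilon^{(m)}_Y.
\tag{74}
\]
Define $H = (I_d - B)^{-1}$. For $(\beta,\beta_0)\in\mathbb{R}^d\times\mathbb{R}$, the residual takes the following form
\[
Y^{(m)} - \beta^\top X^{(m)} - \beta_0 = \left(1 - \beta^\top H b\right)\left(a^{(m)}_Y + \varepsilon^{(m)}_Y\right) - \left(\beta^\top H a^{(m)}_X + \beta_0\right) - \beta^\top H\varepsilon^{(m)}_X.
\tag{75}
\]
The target residual has a similar form
\[
\widetilde Y - \beta^\top\widetilde X - \beta_0 = \left(1 - \beta^\top H b\right)\left(\tilde a_Y + \tilde\varepsilon_Y\right) - \left(\beta^\top H\tilde a_X + \beta_0\right) - \beta^\top H\tilde\varepsilon_X.
\tag{76}
\]

Risk of OLSTar: Using the expression for the target residual, the OLSTar estimator in Equation (5) becomes the solution of the following quadratic program
\[
\min_{\beta,\beta_0}\ \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\left(1 - \beta^\top H b\right)\tilde a_Y - \beta^\top H\tilde a_X - \beta_0\right)^2 + \beta^\top H\Sigma H^\top\beta.
\]
Solving the quadratic program by setting the gradient to zero, we obtain
\[
\beta_{\text{OLSTar}} = \sigma^2 (I_d - B)^\top\left(\Sigma + \sigma^2 b b^\top\right)^{-1} b, \qquad
\beta_{\text{OLSTar},0} = \left(1 - \beta_{\text{OLSTar}}^\top H b\right)\tilde a_Y - \beta_{\text{OLSTar}}^\top H\tilde a_X.
\]
Despite the difference in the intercept term, the corresponding target risk is the same as in Theorem 1:
\[
\widetilde R\left(f_{\text{OLSTar}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1} b}.
\]

Risk of CIP: Using the SEM (74), the constraint of CIP in Equation (11) becomes
\[
\beta^\top H a^{(m)}_X = \beta^\top H a^{(1)}_X, \quad \forall m\in\{2,\ldots,M\}.
\]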
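Together with the residual expression in Equation (75), the CIP objective (11) is equivalent to
\[
\begin{aligned}
\min_{\beta,\beta_0}\ & \left(1 - \beta^\top H b\right)^2\sigma^2 + \frac{1}{M}\sum_{m=1}^M\left(\left(1 - \beta^\top H b\right) a^{(m)}_Y - \beta^\top H a^{(m)}_X - \beta_0\right)^2 + \beta^\top H\Sigma H^\top\beta \\
\text{s.t. } & \beta^\top H\left(a^{(m)}_X - a^{(1)}_X\right) = 0, \quad \forall m\in\{2,\ldots,M\}.
\end{aligned}
\tag{77}
\]
First, the one-dimensional quadratic program in $\beta_0$ can be solved easily by setting the derivative to zero, and we obtain
\[
\beta_0 = \frac{1}{M}\sum_{m=1}^M\left[\left(1 - \beta^\top H b\right) a^{(m)}_Y - \beta^\top H a^{(m)}_X\right].
\]
Plugging the expression of $\beta_0$ back into Equation (77), the minimization over $\beta$ becomes
\[
\begin{aligned}
\min_{\beta}\ & \left(1 - \beta^\top H b\right)^2\left(\sigma^2 + \Delta_Y\right) + \beta^\top H\Sigma H^\top\beta \\
\text{s.t. } & \beta^\top H\left(a^{(m)}_X - a^{(1)}_X\right) = 0, \quad \forall m\in\{2,\ldots,M\},
\end{aligned}
\qquad\text{where } \Delta_Y = \frac{1}{M}\sum_{m=1}^M\left(a^{(m)}_Y - \bar a_Y\right)^2 \text{ and } \bar a_Y = \frac{1}{M}\sum_{k=1}^M a^{(k)}_Y.
\tag{78}
\]
Second, since $H$ is invertible, we can re-parametrize the optimization over $\beta$ with $\gamma = H^\top\beta$ as follows
\[
\min_{\gamma\in\mathbb{R}^d}\ \left(1 - \gamma^\top b\right)^2\left(\sigma^2 + \Delta_Y\right) + \gamma^\top\Sigma\gamma
\quad\text{s.t. } \gamma^\top\left(a^{(m)}_X - a^{(1)}_X\right) = 0, \quad \forall m\in\{2,\ldots,M\}.
\tag{79}
\]
Let $P\in\mathbb{R}^{d\times p}$ be the matrix formed with an orthonormal basis of the p-dimensional subspace $\operatorname{span}\left(a^{(2)}_X - a^{(1)}_X, \ldots, a^{(M)}_X - a^{(1)}_X\right)$. Let $Q_{\text{CIP}}\in\mathbb{R}^{d\times(d-p)}$ be the matrix with columns formed by completing the columns of $P$ to an orthonormal basis of $\mathbb{R}^d$ via Gram-Schmidt orthogonalization. The mapping $\zeta\mapsto Q_{\text{CIP}}\zeta$ constitutes a bijection between $\mathbb{R}^{d-p}$ and the set $\{\gamma\in\mathbb{R}^d \mid P^\top\gamma = 0\}$. With this change of variable, the constrained optimization in Equation (79) is equivalent to the unconstrained one
\[
\min_{\zeta\in\mathbb{R}^{d-p}}\ \left(1 - \zeta^\top Q_{\text{CIP}}^\top b\right)^2\left(\sigma^2 + \Delta_Y\right) + \zeta^\top Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\zeta.
\]
Solving the unconstrained quadratic program by setting the gradient to zero, we obtain the minimizer
\[
\zeta = \left(\sigma^2 + \Delta_Y\right)\left(Q_{\text{CIP}}^\top\left(\left(\sigma^2 + \Delta_Y\right) b b^\top + \Sigma\right)Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top b
= \frac{\left(\sigma^2 + \Delta_Y\right)\left(Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top b}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top Q_{\text{CIP}}\left(Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top b},
\]
where the last equality uses the matrix inversion lemma. Finally, transforming the variables back, the CIP estimator is
\[
\begin{aligned}
\beta_{\text{CIP}} &= \frac{\left(\sigma^2 + \Delta_Y\right)(I_d - B)^\top Q_{\text{CIP}}\left(Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top b}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top Q_{\text{CIP}}\left(Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top b}, \\
\beta_{\text{CIP},0} &= \left(1 - \beta_{\text{CIP}}^\top H b\right)\bar a_Y - \beta_{\text{CIP}}^\top H a^{(m)}_X,
\end{aligned}
\tag{80}
\]
where, by the CIP constraint, $\beta_{\text{CIP}}^\top H a^{(m)}_X$ takes the same value for every $m$. According to the form of the target residual (76), the target population risk for $(\beta,\beta_0)$ takes the following form
\[
\left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\left(1 - \beta^\top H b\right)\tilde a_Y - \beta^\top H\tilde a_X - \beta_0\right)^2 + \beta^\top H\Sigma H^\top\beta.
\]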
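As a worked step, the substitution of the CIP estimator (80) into the target risk expression above can be sketched as follows. The shorthand $s$, $w$, $\kappa$ and $\gamma$ is introduced here for illustration only and is not part of the original notation. The sketch assumes that the CIP constraint also matches the target covariate intervention, i.e. $\beta_{\text{CIP}}^\top H\tilde a_X = \beta_{\text{CIP}}^\top H a^{(1)}_X$, so that the middle term reduces to $(1 - \gamma^\top b)^2(\tilde a_Y - \bar a_Y)^2$.

```latex
\begin{aligned}
s &:= \sigma^2 + \Delta_Y, \qquad
w := Q_{\mathrm{CIP}}\left(Q_{\mathrm{CIP}}^\top \Sigma Q_{\mathrm{CIP}}\right)^{-1} Q_{\mathrm{CIP}}^\top b, \qquad
\kappa := b^\top w = w^\top \Sigma w, \\
\gamma &:= H^\top \beta_{\mathrm{CIP}} = \frac{s\, w}{1 + s\kappa}, \qquad
1 - \gamma^\top b = \frac{1}{1 + s\kappa}, \qquad
\gamma^\top \Sigma \gamma = \frac{s^2 \kappa}{(1 + s\kappa)^2}, \\
\text{target risk} &= \left(1 - \gamma^\top b\right)^2 \sigma^2
 + \left(1 - \gamma^\top b\right)^2 \left(\tilde a_Y - \bar a_Y\right)^2
 + \gamma^\top \Sigma \gamma \\
&= \frac{\sigma^2 + \left(\tilde a_Y - \bar a_Y\right)^2 + s^2 \kappa}{(1 + s\kappa)^2}
 = \frac{s}{1 + s\kappa} + \frac{\left(\tilde a_Y - \bar a_Y\right)^2 - \Delta_Y}{(1 + s\kappa)^2}.
\end{aligned}
```

The last equality uses $s/(1 + s\kappa) = (s + s^2\kappa)/(1 + s\kappa)^2$ together with $s = \sigma^2 + \Delta_Y$.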
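Plugging the CIP estimator into the equation above, we obtain the target population risk of CIP
\[
\widetilde R\left(f_{\text{CIP}}\right) = \frac{\sigma^2 + \Delta_Y}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b} + \frac{\left(\tilde a_Y - \bar a_Y\right)^2 - \Delta_Y}{\left(1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b\right)^2},
\]
where $G_{\text{CIP}} = \Sigma^{1/2} Q_{\text{CIP}}\left(Q_{\text{CIP}}^\top\Sigma Q_{\text{CIP}}\right)^{-1} Q_{\text{CIP}}^\top\Sigma^{1/2}$ is a projection matrix with rank $d-p$.

Risk of CIRM: First we derive $\vartheta_{\text{CIRM}}$ in Equation (13). From the CIP expression in Equation (80) and the SEM (74), we obtain the following expression for $X^{(m)\top}\beta_{\text{CIP}}$:
\[
X^{(m)\top}\beta_{\text{CIP}} = \frac{\left(\sigma^2 + \Delta_Y\right)\left(Y^{(m)}\, b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b + \varepsilon^{(m)\top}_X\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b + a^{(m)\top}_X\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b\right)}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}.
\]
The deviation from its expectation is
\[
X^{(m)\top}\beta_{\text{CIP}} - \mathbb{E}\left[X^{(m)\top}\beta_{\text{CIP}}\right] = \frac{\left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}\,\varepsilon^{(m)}_Y + \frac{\left(\sigma^2 + \Delta_Y\right)\varepsilon^{(m)\top}_X\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}.
\]
Plugging the above equation into Equation (13), together with
\[
X^{(m)} = H b\, Y^{(m)} + H a^{(m)}_X + H\varepsilon^{(m)}_X
\tag{81}
\]
and $Y^{(m)} - \mathbb{E}\left[Y^{(m)}\right] = \varepsilon^{(m)}_Y$, we obtain
\[
\vartheta_{\text{CIRM}} = \frac{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}{\left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}\, H b.
\tag{82}
\]
Note that the vector $\vartheta_{\text{CIRM}}$ is collinear with the $Y$ component in Equation (81). We remark that the idea of CIRM is to use $\widetilde X^\top\beta_{\text{CIP}}$ as a proxy for the unobserved $\widetilde Y$ to remove the $\widetilde Y$ part in the covariates, so that the DIP matching ideas can still be applied. With the $\vartheta_{\text{CIRM}}$ expression (82) and the $\beta_{\text{CIP}}$ expression (80), the scalar factor in (82) equals $1/\left(\beta_{\text{CIP}}^\top H b\right)$, and the LHS of the CIRM constraint (12) becomes
\[
\begin{aligned}
X^{(m)} - \left(X^{(m)\top}\beta_{\text{CIP}}\right)\vartheta_{\text{CIRM}}
&= \left(Y^{(m)} - X^{(m)\top}\beta_{\text{CIP}}\,\frac{1 + \left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}{\left(\sigma^2 + \Delta_Y\right) b^\top\Sigma^{-1/2} G_{\text{CIP}}\Sigma^{-1/2} b}\right) H b + H a^{(m)}_X + H\varepsilon^{(m)}_X \\
&= -\frac{\beta_{\text{CIP}}^\top H a^{(m)}_X + \beta_{\text{CIP}}^\top H\varepsilon^{(m)}_X}{\beta_{\text{CIP}}^\top H b}\, H b + H a^{(m)}_X + H\varepsilon^{(m)}_X \\
&\overset{(i)}{=} -\frac{\beta_{\text{CIP}}^\top H a^{(1)}_X + \beta_{\text{CIP}}^\top H\varepsilon^{(m)}_X}{\beta_{\text{CIP}}^\top H b}\, H b + H a^{(m)}_X + H\varepsilon^{(m)}_X,
\end{aligned}
\tag{83}
\]
where the last equality (i) follows from the fact that the constraint in CIP (11) forces
\[
\beta_{\text{CIP}}^\top H a^{(m)}_X = \beta_{\text{CIP}}^\top H a^{(1)}_X, \quad \forall m\in\{1,\ldots,M\}.
\]
An expression similar to Equation (83) can be obtained for the target environment by taking into account that $\tilde a_X\in\operatorname{span}\left(a^{(1)}_X, \ldots, a^{(M)}_X\right)$:
\[
\widetilde X - \left(\widetilde X^\top\beta_{\text{CIP}}\right)\vartheta_{\text{CIRM}} = -\frac{\beta_{\text{CIP}}^\top H a^{(1)}_X + \beta_{\text{CIP}}^\top H\tilde\varepsilon_X}{\beta_{\text{CIP}}^\top H b}\, H b + H\tilde a_X + H\tilde\varepsilon_X.
\tag{84}
\]
Multiplying Equations (83) and (84) by $\beta^\top$ and taking expectation, we obtain a simplified form of the CIRM constraint (12) as follows
\[
\beta^\top H a^{(m)}_X = \beta^\top H\tilde a_X.
\]
Note that the CIRM constraint effectively matches the covariate interventions, as the DIP constraint does when $Y$ is not intervened on. As a consequence, given $\beta_{\text{CIP}}$ and $\vartheta_{\text{CIRM}}$, the CIRM(m) estimator (12) is very similar to DIP(m) (66), except that the intervention on $Y$ still appears in the residual. The CIRM(m) estimator (12) can be written as follows
\[
\begin{aligned}
\min_{\beta,\beta_0}\ & \left(1 - \beta^\top H b\right)^2\sigma^2 + \left(\left(1 - \beta^\top H b\right) a^{(m)}_Y - \beta^\top H a^{(m)}_X - \beta_0\right)^2 + \beta^\top H\Sigma H^\top\beta \\
\text{s.t. } & \beta^\top H\left(a^{(m)}_X - \tilde a_X\right) = 0.
\end{aligned}
\tag{85}
\]
After solving for $\beta_0$, the quadratic program for $\beta$ is exactly the same as in the DIP proof in Appendix B.1 around Equation (66). Thus we obtain
\[
\begin{aligned}
\beta^{(m)}_{\text{CIRM}} &= (I_d - B)^\top\frac{\sigma^2 Q^{(m)}_{\text{DIP}}\left(Q^{(m)\top}_{\text{DIP}}\Sigma Q^{(m)}_{\text{DIP}}\right)^{-1} Q^{(m)\top}_{\text{DIP}} b}{1 + \sigma^2 b^\top Q^{(m)}_{\text{DIP}}\left(Q^{(m)\top}_{\text{DIP}}\Sigma Q^{(m)}_{\text{DIP}}\right)^{-1} Q^{(m)\top}_{\text{DIP}} b}, \\
\beta^{(m)}_{\text{CIRM},0} &= \left(1 - \beta^{(m)\top}_{\text{CIRM}} H b\right) a^{(m)}_Y - \beta^{(m)\top}_{\text{CIRM}} H a^{(m)}_X,
\end{aligned}
\tag{86}
\]
and the corresponding CIRM target population risk is
\[
\widetilde R\left(f^{(m)}_{\text{CIRM}}\right) = \frac{\sigma^2}{1 + \sigma^2 b^\top\Sigma^{-1/2} G^{(m)}_{\text{DIP}}\Sigma^{-1/2} b} + \frac{\left(a^{(m)}_Y - \tilde a_Y\right)^2}{\left(1 + \sigma^2 b^\top\Sigma^{-1/2} G^{(m)}_{\text{DIP}}\Sigma^{-1/2} b\right)^2},
\]
where $Q^{(m)}_{\text{DIP}}$ and $G^{(m)}_{\text{DIP}}$ are defined in the same way as $Q^{(1)}_{\text{DIP}}$ and $G^{(1)}_{\text{DIP}}$ in Theorem 1. $G^{(m)}_{\text{DIP}} = \Sigma^{1/2} Q^{(m)}_{\text{DIP}}\left(Q^{(m)\top}_{\text{DIP}}\Sigma Q^{(m)}_{\text{DIP}}\right)^{-1} Q^{(m)\top}_{\text{DIP}}\Sigma^{1/2}$ is a projection matrix. $Q^{(m)}_{\text{DIP}}\in\mathbb{R}^{d\times(d-1)}$ is the matrix with columns formed by the vectors that complete the vector $u$ to an orthonormal basis, where
\[
u = \begin{cases}\dfrac{a^{(m)}_X - \tilde a_X}{\left\|a^{(m)}_X - \tilde a_X\right\|_2}, & \text{if } a^{(m)}_X \ne \tilde a_X, \\[2ex] 0, & \text{otherwise.}\end{cases}
\]

C.2 Proof of Corollary 6

Equations (31), (32) and (33) follow directly by plugging $\Sigma = \frac{\sigma^2}{\rho} I_d$ and zero intervention on $Y$ into the corresponding formulas in Theorem 5. Recall that $P_{\text{CIP}}\in\mathbb{R}^{d\times p}$ is the matrix with columns formed by an orthonormal basis of $\operatorname{span}\left(a^{(2)}_X - a^{(1)}_X, \ldots, a^{(M)}_X - a^{(1)}_X\right)$. Let $A\in\mathbb{R}^{d\times(M-1)}$ be the matrix whose (m-1)-th column is $a^{(m)}_X - a^{(1)}_X$. According to the assumption that $a^{(2)}_X - a^{(1)}_X$, . . .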
a(M) X a(1) X are generated independently from the standard Gaussian distribution, 1 d A A follows a multivariate Wishart distribution and its eigenvalue tail bound is well known (see e.g. Chapter 6 in Wainwright (2019)). We have exp dδ2/2 . exp dδ2/2 . Chen and B uhlmann It implies with probability at least 1 2 exp( dδ2/2), for δ < 1 d , we have λmin A A > d 2 and λmax A A < 3d This also implies that with the same probability A A is full rank and p = M 1. For a fixed vector v Rd, we have 2 = v A A A 1 A v. a(m) X a(1) X v is a sum of (M 1) i.i.d. Gaussian random variables with mean 0 and variance v v. Using the standard chi-square tail bound, we have 2 (M 1)v v 1 + t 2 (M 1)v v 1 t Combining the high probability bounds above, we have, with probability at least 1 2 exp dδ2/2 2 exp t2/8 , 3d v v P CIPv 2 given that δ < 1 2 q d and t M 1 2 . Note that it is possible to ensure 1 2 exp dδ2/2 2 exp t2/8 to be close to 1 when d and M are both large and M d. Taking t = M 1 2 and d = 6M, this probability is larger than 1 2 exp( d/32) 2 exp( M/32). The high probability bound for u b 2 can be obtained similarly as in the proof of Corollary 2. According to the proof of Corollary 2 in Appendix B.2, for 0 < t d/2, with probability at least 1 exp( t/16) 2 exp( t/4), we have Putting the two high probability bound together, we conclude Corollary 6. Domain adaptation under structural causal models C.3 Proof of Corollary 8 This corollary follows from the proof of Theorem 5 in Appendix C.1. According to Equation (85), the m-th source population risk can be written as 1 β Hb 2 σ2 + 1 β Hb a(m) Y β Ha(m) X β0 2 + β HΣH β. Similarly, the target population risk can be written as 1 β Hb 2 σ2 + 1 β Hb ea Y β Hea X β0 2 + β HΣH β. The CIRM constraint ensures that β H a(m) X ea X = 0 according to Equation (85). Hence the only difference between the source population risk and target population risk lies in the a(m) Y and ea Y dependent terms. 
Plugging the CIRM solution (86) into the source and target population risks, we obtain R(m) f(m) CIRM = R f(m) CIRM + 1 β(m) CIRM Hb 2 a(m) Y ea Y 2 = R f(m) CIRM + a(m) Y ea Y 2 1 + σ2b Σ 1 2 G(m) DIPΣ 1 C.4 Proof of Corollary 7 Risk of DIP : Since X(m) P is uncorrelated with ε(m) Y , regressing Y (m) on X(m) P gives back the coefficients in the data generation SCM. Solving Equation (39), we obtain Similarly, since X(m) P is uncorrelated with X(m) D , regressing of X(m) D on X(m) P gives back the coefficients in the data generation SCM. Solving Equation (40), we obtain Γ(m) = (Id r BD) 1 b Dω P + (Id r BD) 1 Bd-p. Use the definition of intermediate random variables in Equation (37), we observe that these variables satisfy the anticausal data generation Assumption 1 " X(m) I Y (m) I = BD b D 0 0 " X(m) I Y (m) I " a(m) X,D a(m) Y " ε(m) X,D ε(m) Y " e XI e YI = BD b D 0 0 " e XI e YI + ea X,D ea Y + eεX,D eεY Together with the assumption of no intervention on Y , we can apply Theorem 1 on the intermediate random variables to obtain the target risk of DIP . Chen and B uhlmann Risk of OLSTar: The data generation Assumption 3 gives the following equations e XP = HPea X,P + HPeεX,P e XD = HDb DY + HDBd-p e XP + HDea X,D + HDeεX,D e Y = ω P e XP + eεY , where HP = (Ir BP) 1, HD = (Id r BD) 1. The target population risk for (βP, βD, β0) becomes E e Y β P e XP β D e XD β0 2 = 1 β DHDωD 2 σ2 + E ω P e XP β P e XP β DHDb Dω P e XP β DHDBd-p e XP 2 + β DHDea X,D + β0 2 + β DHDΣDH DβD. The minimization on βP and β0 can be solved easily and it remains a quadratic program on βD. Since this quadratic program is similar to that in the proof of Theorem 1, we conclude by the referring to the part where we solve the quadratic program in Appendix B.1. Appendix D. Additional CIRM variants: RII, RIIRMI, CIRMI The target population risk of CIP and CIRM estimator discussed above still have dependence on a(1) Y ea Y . 
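As we have explained after Theorem 5, this is because Assumption 2 does not rule out the unidentifiable scenario. When we add additional assumptions such as $b\notin\operatorname{span}\left(a^{(2)}_X - a^{(1)}_X, \ldots, a^{(M)}_X - a^{(1)}_X\right)$, we show that there are new methods which take advantage of this assumption to get rid of the $a^{(1)}_Y - \tilde a_Y$ dependence in the risk. We introduce three new estimators.

RII-mean: the population residual invariance and independence estimator where the mean is matched across environments,
\[
\begin{aligned}
f_{\text{RII}}(x) &:= x^\top\beta_{\text{RII}} + \beta_{\text{RII},0}, \\
\beta_{\text{RII}},\ \beta_{\text{RII},0} &:= \arg\min_{\beta,\beta_0}\ \frac{1}{M}\sum_{k=1}^M \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(k)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad\text{s.t. } \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left[Y - X^\top\beta\right] = \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(1)}}\left[Y - X^\top\beta\right] \\
&\qquad\text{and } \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left[\left(Y - \mathbb{E}_{Y\sim\mathcal{P}^{(m)}_Y}[Y]\right)\left(Y - X^\top\beta\right)\right] = 0, \quad \forall m\in\{2,\ldots,M\}.
\end{aligned}
\tag{87}
\]
The idea of matching the residual has appeared in the invariant causal prediction paper (Peters et al., 2016) and in the anchor regression paper (Rothenhäusler et al., 2021). The residual independence penalty is also the core idea in invariant causal prediction (Peters et al., 2016) and in Greenfeld and Shalit (2020).

RIIRMI(m)-mean: the population residual invariant independent residual matching estimator with additional residual independence using the m-th source environment,
\[
\begin{aligned}
f^{(m)}_{\text{RIIRMI}}(x) &:= x^\top\beta^{(m)}_{\text{RIIRMI}} + \beta^{(m)}_{\text{RIIRMI},0}, \\
\beta^{(m)}_{\text{RIIRMI}},\ \beta^{(m)}_{\text{RIIRMI},0} &:= \arg\min_{\beta,\beta_0}\ \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad\text{s.t. } \mathbb{E}_{X\sim\mathcal{P}^{(m)}_X}\left[\beta^\top\left(X - \left(X^\top\beta_{\text{RII}}\right)\vartheta_{\text{RIIRMI}}\right)\right] = \mathbb{E}_{X\sim\widetilde{\mathcal{P}}_X}\left[\beta^\top\left(X - \left(X^\top\beta_{\text{RII}}\right)\vartheta_{\text{RIIRMI}}\right)\right] \\
&\qquad\text{and } \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left[\left(Y - \mathbb{E}_{Y\sim\mathcal{P}^{(m)}_Y}[Y]\right)\left(Y - X^\top\beta\right)\right] = 0,
\end{aligned}
\tag{88}
\]
where, with $\mathcal{P}_{\text{source}}$ denoting the uniform mixture of all source distributions,
\[
\vartheta_{\text{RIIRMI}} := \frac{\mathbb{E}_{(X,Y)\sim\mathcal{P}_{\text{source}}}\left[X\left(Y - \mathbb{E}[Y]\right)\right]}{\mathbb{E}_{Y\sim\mathcal{P}_{\text{source},Y}}\left[\left(Y - \mathbb{E}[Y]\right)^2\right]}.
\tag{89}
\]

CIRMI(m)-mean: the population conditional invariant residual matching estimator with additional residual independence using the m-th source environment,
\[
\begin{aligned}
f^{(m)}_{\text{CIRMI}}(x) &:= x^\top\beta^{(m)}_{\text{CIRMI}} + \beta^{(m)}_{\text{CIRMI},0}, \\
\beta^{(m)}_{\text{CIRMI}},\ \beta^{(m)}_{\text{CIRMI},0} &:= \arg\min_{\beta,\beta_0}\ \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left(Y - X^\top\beta - \beta_0\right)^2 \\
&\qquad\text{s.t. } \mathbb{E}_{X\sim\mathcal{P}^{(m)}_X}\left[\beta^\top\left(X - \left(X^\top\beta_{\text{CIP}}\right)\vartheta_{\text{CIRMI}}\right)\right] = \mathbb{E}_{X\sim\widetilde{\mathcal{P}}_X}\left[\beta^\top\left(X - \left(X^\top\beta_{\text{CIP}}\right)\vartheta_{\text{CIRMI}}\right)\right] \\
&\qquad\text{and } \mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left[\left(Y - \mathbb{E}_{Y\sim\mathcal{P}^{(m)}_Y}[Y]\right)\left(Y - X^\top\beta\right)\right] = 0,
\end{aligned}
\tag{90}
\]
where, with $\mathcal{P}_{\text{source}}$ denoting the uniform mixture of all source distributions,
\[
\vartheta_{\text{CIRMI}} := \frac{\mathbb{E}_{(X,Y)\sim\mathcal{P}_{\text{source}}}\left[X\left(Y - \mathbb{E}[Y]\right)\right]}{\mathbb{E}_{(X,Y)\sim\mathcal{P}_{\text{source}}}\left[\left(X^\top\beta_{\text{CIP}} - \mathbb{E}\left[X^\top\beta_{\text{CIP}}\right]\right)\left(Y - \mathbb{E}[Y]\right)\right]}.
\tag{91}
\]
Note that a common feature of the three estimators above is that they all have the residual independence constraint of the form
\[
\mathbb{E}_{(X,Y)\sim\mathcal{P}^{(m)}}\left[\left(Y - \mathbb{E}_{Y\sim\mathcal{P}^{(m)}_Y}[Y]\right)\left(Y - X^\top\beta\right)\right] = 0.
\]
This constraint takes advantage of the anticausal prediction setting and restricts the estimator to be ignorant of the intervention on $Y$.

Theorem 9 Under data generation Assumption 2 and the additional assumption
\[
b\notin\operatorname{span}\left(a^{(2)}_X - a^{(1)}_X, \ldots, a^{(M)}_X - a^{(1)}_X\right),
\tag{92}
\]
the population target risks of RII-mean, RIIRMI(1)-mean and CIRMI(1)-mean satisfy
\[
\widetilde R\left(f_{\text{RII}}\right) = \frac{b^\top\left(I_d - P_4 P_4^\top\right)\Sigma^{1/2}\left(I_d - G_4\right)\Sigma^{1/2}\left(I_d - P_4 P_4^\top\right) b}{\left\|\left(I_d - P_4 P_4^\top\right) b\right\|_2^4},
\tag{93}
\]
where $P_4\in\mathbb{R}^{d\times p}$ equals $P_{\text{CIP}}$, the matrix formed by an orthonormal basis of the p-dimensional subspace $\operatorname{span}\left(a^{(2)}_X - a^{(1)}_X, \ldots, a^{(M)}_X - a^{(1)}_X\right)$, $G_4 = \Sigma^{1/2} Q_4\left(Q_4^\top\Sigma Q_4\right)^{-1} Q_4^\top\Sigma^{1/2}$ is a projection matrix of rank $d-p-1$, and $Q_4\in\mathbb{R}^{d\times(d-p-1)}$ is the matrix with columns formed by completing the columns of $P_4$ and $\frac{\left(I_d - P_4 P_4^\top\right) b}{\left\|\left(I_d - P_4 P_4^\top\right) b\right\|_2}$ to an orthonormal basis of $\mathbb{R}^d$ via Gram-Schmidt orthogonalization;
\[
\widetilde R\left(f^{(m)}_{\text{RIIRMI}}\right) = \widetilde R\left(f^{(m)}_{\text{CIRMI}}\right) = \frac{b^\top\left(I_d - u u^\top\right)\Sigma^{1/2}\left(I_d - G_5\right)\Sigma^{1/2}\left(I_d - u u^\top\right) b}{\left\|\left(I_d - u u^\top\right) b\right\|_2^4},
\tag{94}
\]
where $u = \frac{a^{(m)}_X - \tilde a_X}{\left\|a^{(m)}_X - \tilde a_X\right\|_2}$, $G_5 = \Sigma^{1/2} Q_5\left(Q_5^\top\Sigma Q_5\right)^{-1} Q_5^\top\Sigma^{1/2}$ is a projection matrix of rank $d-2$, and $Q_5\in\mathbb{R}^{d\times(d-2)}$ is the matrix with columns formed by completing $u$ and $v = \frac{\left(I_d - u u^\top\right) b}{\left\|\left(I_d - u u^\top\right) b\right\|_2}$ to an orthonormal basis of $\mathbb{R}^d$ via Gram-Schmidt orthogonalization.

We present a corollary that puts additional assumptions on how the interventions are positioned to make the results in Theorem 9 easier to understand.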
Corollary 10 In addition to Assumption 2 and the assumption in Equation (92), assume $\Sigma = \frac{\sigma^2}{\rho} I_d$ with $\rho > 0$. Then

$$R(f_{\mathrm{RII}}) = \frac{\sigma^2}{\rho \|b\|_2^2 - \rho \|P_4^\top b\|_2^2}, \tag{95}$$

$$R(f^{(1)}_{\mathrm{RIIRMI}}) = R(f^{(1)}_{\mathrm{CIRMI}}) = \frac{\sigma^2}{\rho \|b\|_2^2 - \rho\, (u^\top b)^2}. \tag{96}$$

Additionally, if $a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}, \xi$ are generated independently from the standard Gaussian distribution and $\tilde{a}_X - a_X^{(1)} = P_{\mathrm{CIP}}\, \xi$, then with high probability we have $\dim \operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big) = M - 1$,

$$\frac{\sigma^2}{\rho \big(1 - \frac{M-1}{3d}\big) \|b\|_2^2} \le R(f_{\mathrm{RII}}) \le \frac{\sigma^2}{\rho \big(1 - \frac{c(M-1)}{d}\big) \|b\|_2^2}, \tag{97}$$

$$R(f^{(m)}_{\mathrm{RIIRMI}}) = R(f^{(m)}_{\mathrm{CIRMI}}) \le \frac{\sigma^2}{\rho \big(1 - \frac{c}{d}\big) \|b\|_2^2}, \tag{98}$$

where c is a constant.

Table 7: Summary of target population risk for different estimators for anticausal domain adaptation under the assumptions of Corollary 10. Columns: estimator; target population risk upper bound for interventions in general position, with intervention on Y (Corollary 10, M sources). Here c is a constant. For simplicity, we only compare the target population risk upper bounds that hold with high probability when the interventions are generated i.i.d. Gaussian.

The target population risks of RII, RIIRMI and CIRMI are compared with that of CIRM under the assumptions of Corollary 10. Under the additional assumption (92), the new DA methods RIIRMI and CIRMI effectively remove the dependence on $\tilde{a}_Y - a_Y^{(m)}$ when compared to CIRM. However, RIIRMI and CIRMI have slightly worse target population risk when there is no intervention on the label, that is, when $a_Y^{(1)} = a_Y^{(2)} = \cdots = \tilde{a}_Y = 0$.

D.1 Proof of Theorem 9

Risk of RII: Using the SEM (74) and the residual expression (75), the RII objective (87) becomes

$$\min_{\beta, \beta_0}\ \big(1 - \beta^\top H b\big)^2 \sigma^2 + \frac{1}{M} \sum_{m=1}^{M} \Big(\big(1 - \beta^\top H b\big)\, a_Y^{(m)} - \beta^\top H a_X^{(m)} - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta$$

$$\text{s.t. } \big(1 - \beta^\top H b\big)\big(a_Y^{(m)} - a_Y^{(1)}\big) - \beta^\top H \big(a_X^{(m)} - a_X^{(1)}\big) = 0, \quad m \in \{2, \ldots, M\}, \text{ and } \big(1 - \beta^\top H b\big)\, \sigma^2 = 0.$$

The second constraint above ensures that $1 - \beta^\top H b = 0$.
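The step from the residual independence constraint to $1 - \beta^\top H b = 0$ can be spelled out as follows. This is only a sketch: it assumes, in line with the SEM (74) and the objective above, that $X^{(m)} = H\big(b Y^{(m)} + a_X^{(m)} + \varepsilon_X^{(m)}\big)$, that $\varepsilon_X^{(m)}$ is independent of $Y^{(m)}$, and that $\operatorname{Var}(Y^{(m)}) = \sigma^2 > 0$.

```latex
\begin{align*}
Y^{(m)} - X^{(m)\top}\beta
  &= \big(1 - \beta^\top H b\big)\, Y^{(m)}
     - \beta^\top H a_X^{(m)}
     - \beta^\top H \varepsilon_X^{(m)}, \\
\mathbb{E}\Big[\big(Y^{(m)} - \mathbb{E}[Y^{(m)}]\big)\big(Y^{(m)} - X^{(m)\top}\beta\big)\Big]
  &= \big(1 - \beta^\top H b\big)\, \operatorname{Var}\big(Y^{(m)}\big)
   = \big(1 - \beta^\top H b\big)\, \sigma^2 .
\end{align*}
```

Since $\sigma^2 > 0$, setting this covariance to zero forces $\beta^\top H b = 1$. The mean-matching constraint then loses its $a_Y$ term and reduces to $\beta^\top H \big(a_X^{(m)} - a_X^{(1)}\big) = 0$.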
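This observation allows us to simplify the RII minimization problem above and obtain

$$\min_{\beta, \beta_0}\ \frac{1}{M} \sum_{m=1}^{M} \Big(0 - \beta^\top H a_X^{(m)} - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta \quad \text{s.t. } \beta^\top H \big(a_X^{(m)} - a_X^{(1)}\big) = 0,\ m \in \{2, \ldots, M\}, \text{ and } \beta^\top H b = 1. \tag{99}$$

The objective (99) is a quadratic program with linear constraints. First, the one-dimensional quadratic program in $\beta_0$ can be solved easily by setting the derivative to zero:

$$\beta_0 = -\frac{1}{M} \sum_{m=1}^{M} \beta^\top H a_X^{(m)} = -\beta^\top H a_X^{(1)},$$

where the last equality follows from the first constraint in Equation (99). Second, since H is invertible, we can re-parametrize the optimization over $\beta$ with $\gamma = H^\top \beta$ as follows:

$$\min_{\gamma \in \mathbb{R}^d}\ \gamma^\top \Sigma \gamma \tag{100}$$

$$\text{s.t. } \gamma^\top \big(a_X^{(m)} - a_X^{(1)}\big) = 0,\ m \in \{2, \ldots, M\}, \text{ and } \gamma^\top b = 1. \tag{101}$$

Let $P_4$ be the matrix with columns formed by an orthonormal basis of the p-dimensional subspace $\operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big)$. Define

$$v := \frac{\big(I_d - P_4 P_4^\top\big) b}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2}.$$

The vector v is well defined because $\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2 \neq 0$ by assumption (92). By construction, we have $P_4^\top v = 0$ and also

$$\left(\frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2}\right)^{\!\top} b = 1. \tag{102}$$

Let $Q_4 \in \mathbb{R}^{d \times (d - p - 1)}$ be the matrix with columns formed by completing the columns of $P_4$ and $v$ to an orthonormal basis of $\mathbb{R}^d$ via Gram-Schmidt orthogonalization. Because of Equation (102), the map

$$\zeta \mapsto \frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2} + Q_4 \zeta$$

constitutes a bijection between $\mathbb{R}^{d - p - 1}$ and the set $\big\{\gamma \in \mathbb{R}^d \mid P_4^\top \gamma = 0,\ \gamma^\top b = 1\big\}$. With this change of variable, the constrained quadratic program in Equation (100) is equivalent to the following unconstrained one:

$$\min_{\zeta \in \mathbb{R}^{d - p - 1}}\ \left(\frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2} + Q_4 \zeta\right)^{\!\top} \Sigma \left(\frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2} + Q_4 \zeta\right).$$

Solving the unconstrained quadratic program by setting the gradient to zero, we obtain the minimizer

$$\zeta^\star = -\big(Q_4^\top \Sigma Q_4\big)^{-1} Q_4^\top \Sigma\, \frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2}.$$

Transforming the variables back, the RII estimator is

$$\beta_{\mathrm{RII}} = (I_d - B)^\top \Sigma^{-1/2} (I_d - G_4)\, \Sigma^{1/2}\, \frac{v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2}, \tag{104}$$

$$\beta_{\mathrm{RII},0} = -\beta_{\mathrm{RII}}^\top H a_X^{(1)},$$

where $G_4 = \Sigma^{1/2} Q_4 \big(Q_4^\top \Sigma Q_4\big)^{-1} Q_4^\top \Sigma^{1/2}$ is a projection matrix of rank $d - p - 1$. According to the form of the target residual (76), the target population risk for $(\beta, \beta_0)$ takes the following form:

$$\big(1 - \beta^\top H b\big)^2 \sigma^2 + \Big(\big(1 - \beta^\top H b\big)\, \tilde{a}_Y - \beta^\top H \tilde{a}_X - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta.$$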
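Plugging the RII estimator into the equation above, because $\beta_{\mathrm{RII}}^\top H b = 1$ and $\tilde{a}_X - a_X^{(1)} \in \operatorname{span}\big(a_X^{(2)} - a_X^{(1)}, \ldots, a_X^{(M)} - a_X^{(1)}\big)$, we obtain the target population risk of RII:

$$R(f_{\mathrm{RII}}) = \frac{v^\top \Sigma^{1/2} (I_d - G_4)\, \Sigma^{1/2} v}{\big\|\big(I_d - P_4 P_4^\top\big) b\big\|_2^2}.$$

Risk of RIIRMI: We start by deriving $\vartheta_{\mathrm{RIIRMI}}$ defined in Equation (89). It captures the correlation between X and Y. Using the SEM (74), we obtain

$$\vartheta_{\mathrm{RIIRMI}} = H b. \tag{105}$$

Note that, just as in CIRM, the vector $\vartheta_{\mathrm{RIIRMI}}$ is also collinear with the Y component in the covariate expression in Equation (81). So the idea of RIIRMI is very similar to that of CIRM: it uses $\tilde{X}^\top \beta_{\mathrm{RII}}$ as a proxy for the unobserved $\tilde{Y}$ to remove the $\tilde{Y}$ part of the covariates, so that the DIP matching based ideas can still be applied. With the $\vartheta_{\mathrm{RIIRMI}}$ expression (105) and the $\beta_{\mathrm{RII}}$ expression (104), the LHS of the first RIIRMI constraint (88) becomes

$$\begin{aligned}
X^{(m)} - \big(X^{(m)\top} \beta_{\mathrm{RII}}\big)\, \vartheta_{\mathrm{RIIRMI}}
&= \big(Y^{(m)} - X^{(m)\top} \beta_{\mathrm{RII}}\big)\, H b + H a_X^{(m)} + H \varepsilon_X^{(m)} \\
&= \Big(\big(1 - \beta_{\mathrm{RII}}^\top H b\big)\, Y^{(m)} - \beta_{\mathrm{RII}}^\top H a_X^{(m)} - \beta_{\mathrm{RII}}^\top H \varepsilon_X^{(m)}\Big)\, H b + H a_X^{(m)} + H \varepsilon_X^{(m)} \\
&\overset{(i)}{=} \Big(0 - \beta_{\mathrm{RII}}^\top H a_X^{(1)} - \beta_{\mathrm{RII}}^\top H \varepsilon_X^{(m)}\Big)\, H b + H a_X^{(m)} + H \varepsilon_X^{(m)},
\end{aligned} \tag{106}$$

where the last equality (i) follows from the two constraints in the RII estimator (87). A similar expression can be obtained for the target environment by taking into account that $\tilde{a}_X \in \operatorname{span}\big(a_X^{(1)}, \ldots, a_X^{(M)}\big)$:

$$\tilde{X} - \big(\tilde{X}^\top \beta_{\mathrm{RII}}\big)\, \vartheta_{\mathrm{RIIRMI}} = -\Big(\beta_{\mathrm{RII}}^\top H a_X^{(1)} + \beta_{\mathrm{RII}}^\top H \tilde{\varepsilon}_X\Big)\, H b + H \tilde{a}_X + H \tilde{\varepsilon}_X. \tag{107}$$

Multiplying Equations (106) and (107) by $\beta^\top$ and taking expectations, we obtain a simplified form of the first RIIRMI constraint (12):

$$\beta^\top H a_X^{(m)} = \beta^\top H \tilde{a}_X.$$

Note that the first RIIRMI constraint is the same as the one in DIP or CIRM. With the above observation, the RIIRMI(1) objective (88) becomes (with m = 1)

$$\min_{\beta, \beta_0}\ \Big(0 - \beta^\top H a_X^{(m)} - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta \quad \text{s.t. } \beta^\top H \big(a_X^{(m)} - \tilde{a}_X\big) = 0 \text{ and } \beta^\top H b = 1. \tag{108}$$

This objective is similar to the one (85) in the proof of CIRM, with the only difference that there is one additional constraint $\beta^\top H b = 1$. This objective is also similar to the one (99) in the proof of RII, with the difference that there is only one constraint of the type $\beta^\top H \big(a_X^{(m)} - \tilde{a}_X\big) = 0$.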
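The equality-constrained quadratic programs of the type (100)-(101) and (108) can be checked numerically. The following is a minimal sketch, not the authors' code. It works in the reparametrized variable $\gamma = H^\top \beta$, solves the program once through its KKT linear system and once through the projection form used in (104), and compares the two. The function names, the random test instance, and the use of a QR factorization in place of Gram-Schmidt orthogonalization are assumptions made for illustration.

```python
import numpy as np


def solve_qp_kkt(Sigma, P, b):
    """Minimise g^T Sigma g subject to P^T g = 0 and b^T g = 1 (KKT system)."""
    d = b.shape[0]
    C = np.column_stack([P, b])            # constraint normals, d x (p + 1)
    k = C.shape[1]
    e = np.zeros(k)
    e[-1] = 1.0                            # right-hand side (0, ..., 0, 1)
    # Stationarity: 2 Sigma g + C lam = 0; feasibility: C^T g = e.
    kkt = np.block([[2.0 * Sigma, C], [C.T, np.zeros((k, k))]])
    rhs = np.concatenate([np.zeros(d), e])
    return np.linalg.solve(kkt, rhs)[:d]


def solve_qp_projection(Sigma, P, b):
    """Same program via the projection form mirroring Equation (104)."""
    d = b.shape[0]
    r = b - P @ (P.T @ b)                  # (I - P P^T) b
    c = np.linalg.norm(r)                  # nonzero when b is outside span(P)
    v = r / c
    # Complete [P, v] to an orthonormal basis; Q holds the added columns.
    basis, _ = np.linalg.qr(np.column_stack([P, v]), mode="complete")
    Q = basis[:, P.shape[1] + 1:]
    # I - Q (Q^T Sigma Q)^{-1} Q^T Sigma = Sigma^{-1/2} (I - G) Sigma^{1/2}
    proj = np.eye(d) - Q @ np.linalg.solve(Q.T @ Sigma @ Q, Q.T @ Sigma)
    return proj @ v / c


# Hypothetical test instance: random covariance, directions and b.
rng = np.random.default_rng(0)
d, p = 6, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)            # symmetric positive definite
P, _ = np.linalg.qr(rng.standard_normal((d, p)))  # orthonormal d x p basis
b = rng.standard_normal(d)

gamma_kkt = solve_qp_kkt(Sigma, P, b)
gamma_proj = solve_qp_projection(Sigma, P, b)
risk = gamma_proj @ Sigma @ gamma_proj     # optimal value gamma^T Sigma gamma

# Isotropic covariance s * I: the optimal value should be
# s / (||b||^2 - ||P^T b||^2).
s = 0.5
gamma_iso = solve_qp_projection(s * np.eye(d), P, b)
risk_iso = s * (gamma_iso @ gamma_iso)
risk_iso_closed_form = s / (b @ b - np.sum((P.T @ b) ** 2))
```

With a single direction `P = u[:, None]` the same functions cover the program (108). In the isotropic case the optimal value takes the form $s / (\|b\|_2^2 - \|P^\top b\|_2^2)$, which is the structure appearing in (95) and (96).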
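Following the proof of RII to solve the quadratic program with linear constraints, we define

$$v := \frac{\big(I_d - u u^\top\big) b}{\big\|\big(I_d - u u^\top\big) b\big\|_2}, \quad \text{where } u = \frac{a_X^{(m)} - \tilde{a}_X}{\big\|a_X^{(m)} - \tilde{a}_X\big\|_2},$$

and let $Q_5 \in \mathbb{R}^{d \times (d - 2)}$ be the matrix with columns formed by completing $u$ and $v$ to an orthonormal basis of $\mathbb{R}^d$ via Gram-Schmidt orthogonalization. Then the RIIRMI estimator is

$$\beta^{(m)}_{\mathrm{RIIRMI}} = (I_d - B)^\top \Sigma^{-1/2} (I_d - G_5)\, \Sigma^{1/2}\, \frac{\big(I_d - u u^\top\big) b}{\big\|\big(I_d - u u^\top\big) b\big\|_2^2}, \tag{109}$$

$$\beta^{(m)}_{\mathrm{RIIRMI},0} = -\beta^{(m)\top}_{\mathrm{RIIRMI}} H a_X^{(m)},$$

where $G_5 = \Sigma^{1/2} Q_5 \big(Q_5^\top \Sigma Q_5\big)^{-1} Q_5^\top \Sigma^{1/2}$ is a projection matrix of rank $d - 2$. According to the form of the target residual (76), the target population risk for $(\beta, \beta_0)$ takes the following form:

$$\big(1 - \beta^\top H b\big)^2 \sigma^2 + \Big(\big(1 - \beta^\top H b\big)\, \tilde{a}_Y - \beta^\top H \tilde{a}_X - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta.$$

Plugging the RIIRMI(1) estimator into the equation above, because $\beta^{(m)\top}_{\mathrm{RIIRMI}} H b = 1$ and $\beta^\top H a_X^{(m)} = \beta^\top H \tilde{a}_X$, we obtain the target population risk of RIIRMI(1):

$$R(f^{(m)}_{\mathrm{RIIRMI}}) = \frac{b^\top \big(I_d - u u^\top\big)\, \Sigma^{1/2} (I_d - G_5)\, \Sigma^{1/2} \big(I_d - u u^\top\big)\, b}{\big\|\big(I_d - u u^\top\big) b\big\|_2^4}.$$

Risk of CIRMI: The proof for CIRMI follows easily by combining parts of the proofs for CIRM and RIIRMI. The CIRMI(1) estimator has one additional constraint compared to the CIRM(1) estimator in Equation (85):

$$\min_{\beta, \beta_0}\ \big(1 - \beta^\top H b\big)^2 \sigma^2 + \Big(\big(1 - \beta^\top H b\big)\, a_Y^{(1)} - \beta^\top H a_X^{(1)} - \beta_0\Big)^2 + \beta^\top H \Sigma H^\top \beta \quad \text{s.t. } \beta^\top H \big(a_X^{(1)} - \tilde{a}_X\big) = 0 \text{ and } 1 - \beta^\top H b = 0. \tag{110}$$

In fact, the above optimization problem is exactly the same as that of RIIRMI in Equation (108). Consequently, the CIRMI solution is the same as that of RIIRMI:

$$\beta^{(1)}_{\mathrm{CIRMI}} = \beta^{(1)}_{\mathrm{RIIRMI}}, \tag{111}$$

$$\beta^{(1)}_{\mathrm{CIRMI},0} = \beta^{(1)}_{\mathrm{RIIRMI},0}, \qquad R(f^{(1)}_{\mathrm{CIRMI}}) = R(f^{(1)}_{\mathrm{RIIRMI}}).$$

D.2 Proof of Corollary 10

The proof of Corollary 10 follows similarly to that of Corollary 6 in Appendix C.2.

References

M Amini and Patrick Gallinari. Semi-supervised learning with an explicit label-error model for misclassified data. In Proceedings of the 18th International Joint Conferences on Artificial Intelligence, pages 555-560, 2003.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.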
Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.

Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769-776, 2013.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2007.

Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pages 129-136, 2010.

John Blitzer, Sham Kakade, and Dean Foster. Domain adaptation with coupled subspaces. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 173-181. JMLR Workshop and Conference Proceedings, 2011.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, 1998.

Peter Bühlmann. Invariance, causality and robustness. Statistical Science, 35(3):404-426, 2020.

Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pages 11190-11201, 2019.

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542-542, 2009.

Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In International Conference on Algorithmic Learning Theory, pages 308-323. Springer, 2011.

Corinna Cortes and Mehryar Mohri.
Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103-126, 2014.

Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169-178, 2015.

John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981-995, 2007.

Li Fei-Fei. Imagenet: crowdsourcing, benchmarking & other cool things. In CMU VASC Seminar, volume 16, pages 18-25, 2010.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050, 2017.

Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. A unified view of label shift estimation. In Advances in Neural Information Processing Systems, volume 33, pages 3290-3300. Curran Associates, Inc., 2020.

Muhammad Ghifary, David Balduzzi, W Bastiaan Kleijn, and Mengjie Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1414-1430, 2016.

Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524, 2019.

Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation.
In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066-2073. IEEE, 2012.

Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839-2848, 2016.

Ian Goodfellow, Patrick McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7):56-66, 2018.

Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In 2011 International Conference on Computer Vision, pages 999-1006. IEEE, 2011.

Daniel Greenfeld and Uri Shalit. Robust learning with the Hilbert-Schmidt independence criterion. In International Conference on Machine Learning, pages 3759-3768. PMLR, 2020.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. Machine Learning, 110(2):303-348, 2021.

Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Algorithms and theory for multiple-source adaptation. In Advances in Neural Information Processing Systems, pages 8246-8256, 2018.

Peter J Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, pages 73-101, 1964.

Fredrik D Johansson, David Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 527-536. PMLR, 2019.

Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pages 5468-5479. PMLR, 2020.

Beatrice Laurent and Pascal Massart.
Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Yitong Li, Michael Murias, Samantha Major, Geraldine Dawson, and David Carlson. On target shift in adversarial domain adaptation. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 616-625. PMLR, 2019.

Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pages 3122-3130. PMLR, 2018.

Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10869-10879, 2018.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory, COLT, 2009.

Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.

Nicolai Meinshausen. Causality from a distributional robustness point of view. In 2018 IEEE Data Science Workshop (DSW), pages 6-10. IEEE, 2018.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10-18, 2013.

Jerzy Neyman. Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Roczniki Nauk Rolniczych, 10:1-51, 1923.

Jianmo Ni, Jiacheng Li, and Julian McAuley.
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188-197, 2019.

Kamal Nigam, Andrew McCallum, and Tom Mitchell. Semi-supervised text classification using EM. Semi-Supervised Learning, pages 33-56, 2006.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2009.

Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2010.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026-8037, 2019.

Judea Pearl. Causality: models, reasoning and inference, volume 29. Springer, 2000.

Judea Pearl and Elias Bareinboim. External validity: From do-calculus to transportability across populations. Statistical Science, pages 579-595, 2014.

Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1406-1415, 2019.

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947-1012, 2016.

Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2009.

Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang.
Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pages 10877-10887, 2018.

Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing, 1996.

Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory. arXiv preprint arXiv:2004.11829, 2020.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1):1309-1342, 2018.

Dominik Rothenhäusler, Peter Bühlmann, Nicolai Meinshausen, et al. Causal dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. Annals of Statistics, 47(3):1688-1722, 2019.

Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):215-246, 2021.

Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pages 459-466, 2012.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Amos Storkey.
When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, pages 3-28, 2009.

Masashi Sugiyama and Motoaki Kawanabe. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT Press, 2012.

Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. Advances in Neural Information Processing Systems, 33, 2020.

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135-153, 2018.

Olivia Wiles, Sven Gowal, Florian Stimberg, Sylvestre Alvise-Rebuffi, Ira Ktena, and Taylan Cemgil. A fine-grained analysis on distribution shift. arXiv preprint arXiv:2110.11328, 2021.

Garrett Wilson and Diane J Cook. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST), 11(5):1-46, 2020.

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687-10698, 2020.

Zikang Yuan, Dongfu Zhu, Cheng Chi, Jinhui Tang, Chunyuan Liao, and Xin Yang. Visual-inertial state estimation with pre-integration correction for robust mobile augmented reality. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1410-1418, 2019.

Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pages 7523-7532. PMLR, 2019.

Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.