# optimizationbased_causal_estimation_from_heterogeneous_environments__c211b4a1.pdf Journal of Machine Learning Research 25 (2024) 1-44 Submitted 8/21; Revised 12/22; Published 4/24 Optimization-based Causal Estimation from Heterogeneous Environments Mingzhang Yin mingzhang.yin@warrington.ufl.edu Warrington College of Business University of Florida Gainesville, FL, 32611, USA Yixin Wang yixinw@umich.edu Department of Statistics University of Michigan Ann Arbor, MI, 48109, USA David M. Blei david.blei@columbia.edu Department of Computer Science and Department of Statistics Columbia University New York, NY, 10027, USA Editor: Pradeep Ravikumar This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes Co Co, an optimization algorithm that bridges the gap between pure prediction and causal inference. Co Co leverages the recently-proposed idea of environments (Peters et al., 2016; Arjovsky et al., 2019), datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments and ones that exhibit sufficient heterogeneity Co Co maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, Co Co provides more accurate estimates of the causal model and more accurate predictions under interventions. Keywords: Causal estimation, Robust prediction, Constrained optimization, Directional derivative, Interventional data 1. Introduction We consider the problem posed in Peters et al. (2016). An outcome y is generated according to a linear structural equation model (SEM) based on a set of covariates x (Pearl, 2009), y (β ) x + ε, x ε. (1) c 2024 Mingzhang Yin, Yixin Wang, David M. Blei. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1028.html. Yin, Wang, Blei Here, x are the causal parents of the outcome y, which are independent of the unobserved noise ε, and the causal coefficient β represents the direct causal effects of x . In practice, we might measure a potentially large set of covariates x = {x1, , xp}, such that the causes x are a subset of x. More precisely, there is a subset S {1, 2, , p} with x = x S, but we do not know this set of causes S. Our goal is to infer the causal relationship between each covariate xj and the outcome in the data generating process (DGP), and estimate their direct causal effects. The challenge to solving this problem is spurious association. Though the causes x S are assumed to be exogenous, the other observed covariates might depend on the noise term with x\S ε. The dependency may arise because of confounding, where there is an omitted common cause of y and x\S, i.e., y ϵ x\S; the dependency may also arise from a collider, reverse causality, or selection bias, e.g., y x\S or y x1 x2, x1, x2 x\S. It leads to spurious association because the covariates x\S will not be the genuine causal parents of the outcome in the DGP, even though they correlate with the outcome. In the face of spurious association, typical inference methods, such as regression based on empirical risk minimization (ERM) or maximum likelihood estimation (MLE), will lead to a biased estimate of the causal effect and an incorrect interpretation of the causal relationships (Angrist et al., 1996; Peters et al., 2016; Efron, 2020). From the predictive perspective, a model that captures spurious association will not generalize to non-i.i.d. data under perturbations of the DGP (Arjovsky et al., 2019; Rothenh ausler et al., 2021). We will develop constrained causal optimization (Co Co), an optimization-based method to solve the spurious association problem. The key idea behind Co Co is to leverage datasets from multiple environments (Peters et al., 2016). The environments are a set of heterogeneous DGPs. In each, the causal mechanism of the outcome generation remains invariant but the distribution of the covariates changes from environment to environment. While classical ML methods cannot distinguish the direct causes from spurious association, simultaneously analyzing data from multiple environments allows us to triangulate on the correct causal coefficients. This work builds on recent research about multi-environment estimation, beginning with the seminal work of Peters et al. (2016) and continuing with the risk-based algorithms of Arjovsky et al. (2019). The method developed in this paper builds and improves on the risk-based algorithms. We will describe the method here and derive it in the subsequent sections. To begin, consider a single data-generating distribution p(x, y) = p(x)p(y | x). For instance, the conditional distribution of the outcome could come from Eq. (1). Consider a predictor ˆy(x; α) with coefficients α and define a non-negative loss function to measure the fidelity of a prediction, ℓ: Y Y R (e.g., squared loss). Finally, define the risk of the predictor to be the expectation of the loss relative to the data-generating distribution, R(α) = Ex,y p(x,y)[ℓ(ˆy(x; α), y)]. (2) Classical machine learning (ML) provides methods that analyze data from p(x, y) to minimize the risk. The pure prediction methods leverage all information in data, causal and non-causal, to reduce the predictive error quantified by the risk function. The spurious associations improve predictions, but they bias the resulting estimates of the coefficients away from the true causal coefficients. Optimization-based Causal Estimation We consider distributions of data from a set of environments E. The data from environment e form a joint distribution pe(x, y) = pe(x)p(y | x). In this joint, the distribution of covariates pe(x) changes from environment to environment, but the conditional expectation of the outcome given its causal parents E(y | Pa(y)) is the same across environments, governed by the same SEM. Notice each environment is associated with its own risk Re(α), since the risk is an expectation with respect to the per-environment distribution. Given a set of datasets from multiple environments, the Co Co algorithm solves the following optimization, αcoco = arg min α 1 |E| Re(α) α 2 , (3) where is the Hadamard product and each environment s risk Re(α) can be approximated by its empirical estimate. As we describe below, this objective stems from the idea of the directional derivative (Rudin et al., 1964; Marban, 1969), and its role in the first-order conditions for the optimizer of each environment s risk function. Under invariance assumptions, the solution to Eq. (3) is the intersection across environments of all points that minimize each term Re(α) α 2. When the environments are sufficiently heterogenous that is, when there is enough variety in different pe(x) and outcome noise this optimization is solved by the causal coefficients. Eq. (3) is the basic optimization problem behind Co Co. The rest of this paper presents its theoretical foundations, the assumptions under which its solution is the true causal coefficients, algorithms building on and solving Eq. (3) from multi-environment data, and studies about the performance of Co Co on several simulated and real datasets, both with linear predictors and with nonlinear predictors. When compared to classical ML and Invariant Risk Minimization (IRM) (Arjovsky et al., 2019)-related methods, Co Co improves the estimation accuracy of causal coefficients and the predictive accuracy in new environments. Broadly, Co Co represents progress towards multi-environment optimization for causal estimation, and it helps explain the empirical the success of IRM when data is linear-Gaussian or linear-Bernoulli. Compared to IRM, Co Co requires fewer environments to identify causal coefficients and enjoys a more stable training procedure. Practically, Co Co is compatible with general graph structures, can be used with flexible ML tools such as deep neural networks, scales well to high-dimensional problems, and is easy to implement. The paper is organized as follows. 2 situates this paper in the broader landscape of the research literature on multi-environment analysis. 3 describes the methodology and theoretical basis for Co Co. 4 presents the connections between Co Co and IRM in mathematical detail. 5 studies the identification properties in detail, particularly in the setting of a linear SEM, and 6 extends Co Co to nonlinear models. 7 presents a study of Co Co with synthetic, semi-synthetic and real-world data. 2. Related work Formulating the goals of data analysis, and their relationship to optimization, has long been discussed in the research literature in statistics and machine learning. As early as Wright (1921), researchers recognized that the goals of prediction and estimation are not Yin, Wang, Blei always aligned in optimization. Shmueli et al. (2010) and Efron (2020) provide insightful discussions about the differences between prediction problems, association problems, and causal estimation problems. This paper builds on the body of research around causal estimation with multiple environments. Methodologies developed in this area rest on the invariance property of causality, also known as autonomy (Haavelmo, 1944), modularity (Sch olkopf et al., 2012) and stability (Dawid et al., 2010). In a pioneering work, Peters et al. (2016) uses statistical hypothesis testing to estimate causal structures by exploiting invariance across multiple environments. Subsequent research extends this idea, for example, to nonlinear models (Heinze-Deml et al., 2018) and sequential data (Pfister et al., 2019). See B uhlmann et al. (2020) for a comprehensive review. To improve its flexibility and scalability, researchers have begun to use optimization over multi-environment data for causal estimation (Rothenh ausler et al., 2019, 2021). One influential paper along these lines is Arjovsky et al. (2019), which introduces invariant risk minimization (IRM) as a method that adapts modern predictive models to this task. The objective function of IRM includes an additional penalty term to empirical risk function, which encourages a predictor to be invariant across environments. The work presented here provides a contribution to optimization-based causal estimation. Since its introduction, IRM has been extended in several ways. It has been formulated as game theory problem (Ahuja et al., 2020), combined with meta-learning methods (Bae et al., 2021), applied to reinforcement learning (Zhang et al., 2020), and applied to causal inference (Shi et al., 2020; Lu et al., 2021). Its conditions and limitations have been studied (Rosenfeld et al., 2020; Kamath et al., 2021; Guo et al., 2021). Besides IRM, other objectives also aim to improve predictive accuracy by encouraging invariance of empirical risks across environments. These objectives are often built on equal noise variance assumption, which is a strong version of invariance (Peters and B uhlmann, 2014). Some objectives regularize the variance of the empirical risks (Xie et al., 2020; Krueger et al., 2020; Heinze-Deml and Meinshausen, 2021) and control the worst case risk across training environments (Sagawa et al., 2019). When the condition of equal noise variance is met, such regularizations can be combined with Co Co. The idea of invariance and environments has also been adapted to causal discovery (Tian and Pearl, 2001; Yu et al., 2019a,b; Brouillard et al., 2020; Mooij et al., 2020; M uller et al., 2020). Invariance enables the discovery of causal structures within Markov equivalence classes (Ghassami, 2020), which cannot be reconstructed from traditional single environment data (Spirtes et al., 2000). Nevertheless, existing methods might encounter problems of limited model flexibility and high computational cost. For example, some methods rely on linear data generating process (Ghassami et al., 2018; Huang et al., 2019, 2020), require regression over multiple subsets of covariates (Ghassami et al., 2017), and/or involve multiple independence testings (Ghassami et al., 2017; Huang et al., 2020; Brouillard et al., 2020). Co Co provides an optimization-based method to assist causal discovery by identifying direct causes for observed variables. Lastly, Co Co is loosely related to the variable selection literature (Hastie et al., 2009). It can be viewed as selecting causal variables or learning a causal representation from the observed covariates, but the motivation and the targeted problems are different. Variable selection methods, such as LASSO (Tibshirani, 1996) and best subset selection (Bertsimas Optimization-based Causal Estimation et al., 2016; Yin et al., 2020), focus on the situations where the number of data points is small compared to the number of the observed covariates. The model selection under the traditional variable selection setting can generally recover the true sparsity pattern asymptotically (Zhao and Yu, 2006). In contrast, the estimation bias problem due to the spurious association considered in this paper cannot be solved by increasing the number of data points even to infinity. Unlike classic regularized regression with the regularizers often imposing shrinkage on the parameter space (Golub et al., 1999), the proposed method imposes regularizations in the derivative space with the restricted derivative directions. 3. Causal optimization from heterogeneous environments We discuss Co Co, a method that estimates causal effects via optimization. In 3.1, we set up the problem and assumptions. In 3.2, we introduce an idealized optimization objective, which produces causal coefficients as the solution, but is intractable. In 3.3, we derive a relaxed objective function; it is tractable with observable data but contains extra solutions besides the causal coefficients. In 3.4, we aggregate the relaxed objective over multiple environments to whittle down the set of optima to the causal coefficients. 3.1 Setup and assumptions Consider an observed multi-environment dataset. Denote E as a discrete set of environments. Each environment e E specifies a DGP similar to Eq. (1), ye β xe + ϵe, xe pe(xe 1, , xe p). (4) We absorb the intercept term into xe and β and do not write it explicitly. Comparing to Eq. (1), here βS = β , βj = 0 for j S, and β\S = 0 where S {1, 2, , p}. It means only xe S = x are the true causes for the set S. The SEM in Eq. (4) includes all the observed covariates to reflect the fact that we do not know the set of causes S a priori. We overload the notation to denote β as the causal coefficient and call the set S as the causal structure. There could be potential endogeneity with x ϵ. For each environment, the observed data De = (Xe, Ye) consists of ne i.i.d. data points, where Ye Rne are the outcomes, and Xe = [Xe 1, , Xe p] Rne p are the covariates. Each column Xe j Rne, j {1, 2, , p}, contains the observations of the j-th covariate for ne units. Assumptions for each environment. First, we specify the assumptions for the data in each environment. For notational simplicity, we suppress the superscript e for now and state the assumptions for any environment e E. The support set and magnitude of the causal coefficients β are unknown. For the covariates x, we do not specify the DGP and allow an arbitrary functional form of the joint distribution p(x). We will specify the additional assumptions on the joint distribution for identification in 5. We assume the noise ϵ is zero-mean, and the covariates and noise have finite variance. The noise term is assumed to be independent of the observed direct causes x S. To summarize, the assumptions for each environment are Assumption 1 (i) Linear DGP as Eq. (4); (ii) Moment conditions: E[ϵ] = 0, Var[ϵ], Var[xj] < for all j {1, 2, , p}; (iii) Independence: the observed direct causes x S ϵ. Yin, Wang, Blei Remark. Assumption 1 (i) assumes linearity; we will study the nonlinear causal models in 6. Assumption 1 (ii) is a standard regularity assumption. Assumption 1 (iii) assumes x S ϵ, where the noise ϵ incorporates the unobserved causes of y as long as they are independent of the observed direct causes x S. The noise and the observed covariates x\S, however, can be dependent, i.e., x\S ϵ. Hence, Assumption 1 (iii) does not imply that the covariates x as a whole satisfy the back-door criterion (Pearl, 2009). Accordingly, standard inference strategies, such as regression adjustment for all covariates, may produce biased causal estimates (Elwert and Winship, 2014). Our goal is to develop a method that can automatically identify the set S as the causal structure, estimate the causal coefficient β, and make prediction based on the causes of the outcome. Assumption 1 (iii) distinguishes the setting of this paper from that of instrumental variables (IV) which allows the dependency between x S and ϵ (Angrist et al., 1996). However, we allow the index of endogenous covariates in x to be unknown; such information is usually required by the IV methods such as the two-stage least square. Assumption 1 (iii) is a common assumption in the invariance-based causal inference (Arjovsky et al., 2019, Theorem 9) (Rojas-Carulla et al., 2018; Pfister et al., 2019; Krueger et al., 2020), and causal discovery literature (Peters et al., 2016, Assumption 1)(Ghassami et al., 2017; Yu et al., 2019a; Brouillard et al., 2020). Assumptions across environments. Now, we consider the assumption across environments. The key property of causality we exploit from multi-environment data is the invariance (Peters et al., 2016; Arjovsky et al., 2019). Invariance means that conditional on the same value of the direct causes, the expectation of the outcome is the same across environments. Assumption 2 (Invariance) The index set of direct causes of ye is the same across environments. And given a possible value c of the direct causes, E[ye|Pa(ye) = c] = E[ye |Pa(ye ) = c], (5) for all e, e E. The invariance condition in Eq. (5) is weaker than that in Peters et al. (2016), which requires strong invariance over the conditional outcome distribution p(ye|Pa(ye) = c) = p(ye |Pa(ye ) = c). (6) When the distributions of covariates xe and noise ϵe change across environments, we call environments E heterogeneous. The heterogeneity will be important to the method we derive. 3.2 Idealized causal optimization We start by considering the data set of a single environment, such as one that is generated by Eq. (1). Suppose for the moment that we do not know the causal coefficients β, but we do know which covariates are direct causes of the outcome, i.e., the set S. A key observation is that among the models that share the true causal structure (i.e. the support set of coefficients as S), the causal model (i.e. the model with causal coefficients) is the best predictive model. This observation, though conceptually straightforward, makes the goals of Optimization-based Causal Estimation causal estimation and optimization consistent. We can then obtain the causal coefficients by solving a constrained optimization problem. Lemma 1 (Causal Optimality) Under Assumptions 1, for the squared risk function R(α) = E[(1/2)(ˆy(x; α) y)2], and linear predictor ˆy(x; α) = α x, the following constrained optimization problem s.t. αj = 0, j / S (7) has the causal coefficients α = β as the unique solution. The proof is in Appendix A. Lemma 1 is the ideal situation where the set of causes S in Eq. (7) is known. Of course, in practice, we do not know which covariates are causal and which are not. This paper aims to build on this idealized optimization problem in Eq. (7) to construct a tractable objective for causal estimation. We first introduce a tractable objective with observed data. Then we aggregate this objective over multiple environments to isolate the causal coefficients. 3.3 Derivation of a tractable objective In this section, we derive an optimization objective for causal estimation. It only involves the observable data and is a relaxation of the idealized optimization in Eq. (7). In this relaxation, the causal coefficient β is one of the optima, though it is not the only one. We will first review directional derivatives and feasible directions, which are used to characterize the extreme points of the idealized optimization in Eq. (7). We then relax the optimization problem to a practical one and characterize its extreme points. Directional derivatives and feasible directions. The directional derivative Dv in the direction of a unit-length vector v is defined as the change rate of a function in that direction (Rudin et al., 1964). It can be computed as the inner product of the gradient and the direction vector Dv R(α) := limt 0(R(α + tv) R(α))/t = R(α), v , where , denotes an inner product. Directional derivatives provide the first-order condition for the optima of a constrained optimization. Denote the constraints in Eq. (7) as gj(α) = αj = 0 for j / S. (Recall S is the support set of β, the indices of the non-zero causal coefficients.) The points that satisfy these constraints form a surface in Rp. To find a stationary point while obeying to the constraints, an algorithm should search the parameters in the directions tangent to this surface rather than an arbitrary direction. Hence, the tangent directions are called the feasible direction. The first-order condition for optimality is that the directional derivative in all feasible directions vanishes (Marban, 1969). Intuitively, satisfying the condition means the algorithm cannot improve the objective value, at least locally, without violating the constraints. Optimality condition for causal optimization. We derive the feasible directions for the optimization problem in Eq. (7). We then derive an important property of the causal coefficients from the first-order condition of the directional derivative. Yin, Wang, Blei Geometrically, the feasible directions are perpendicular to the normal vectors of the surface defined by the constraints. The normal vectors point in the directions that violate the constraints at the maximum rate, which can be computed by taking the gradient of the constraints dgj(α)/dα = ej, j / S. (8) The feasible directions are perpendicular to the vector space spanned by the basis vectors in Eq. (8). Therefore, all the feasible directions form a linear space U = span{ej : j S}. The first-order condition for the causal optimization in Eq. (7) requires the directional derivative vanishing in all the feasible directions, that is, Dv = 0 for each v U. This condition can be equivalently stated using the basis vectors of U as Dej R(α) = R(α), ej = 0, for j S. (9) This is because any v U is a linear combination of the basis {ej}j S. More compactly, since βj = 0 for j S, Eq. (9) can be written equivalently as R(α) β 2 = 0 (10) where is the Hadamard product. Traditional empirical risk minimization (ERM) searches for α with R(α) = 0, or equivalently R(α) 1 2 = 0. A problem of the gradient-based ERM is its greedy nature: the optimization follows a derivative pointing at the steepest descent direction. Such greedy optimization may leverage the spurious associations and fails to converge to the causal coefficient. In comparison, Eq. (10) reveals that the key to establish the optimality for the causal model is to regularize the derivative s direction by restricting it to the subspace U. The regularization of the derivative direction distinguishes Co Co from the classic regression regularizations that often control the parameter magnitudes. The first order condition in Eq. (10) is an explicit condition that the causal coefficients β satisfy. That s to say, solving minα R(α) β 2 would return the causal coefficients for convex risk functions. However, there is no free lunch. Similar to the idealized causal optimization in Eq. (7) with an unknown set S, the condition in Eq. (10) includes β, the unknown causal coefficients. Hence Eq. (10) is not a tractable objective yet. Relaxing the first-order condition. Lemma 1 states that α = β is an optimum of the problem in Eq. (7). Therefore, the vector β must satisfy the first-order condition of Eq. (10). Plugging α = β into Eq. (10), we have R(β) β 2 = 0. This fact reveals that the causal coefficient β is an optimum of the following optimization problem, min α R(α) α 2 . (11) Notice that Eq. (11) is an optimization problem based entirely on the observational data. It states that the causal coefficient β satisfies β arg min α R(α) α 2 . (12) Optimization-based Causal Estimation Eq. (11) is an important step towards an optimization-based causal estimation. The derivation based on the directional derivative theory also reveals an intrinsic connection and distinction between the Co Co objective and the invariance risk minimization (IRM) objective (Arjovsky et al., 2019), which we will discuss in 4. We call the set of points that minimizes Eq. (11) the plausible set F. We have shown that causal coefficients belong to it. What other points are in F? For each j {1, 2, . . . , p} the objective of Eq. (11) is minimized when either αj = 0 or R(α)j = 0. Accordingly, two additional points are the all-zero vector 0 and the ERM solution for which R(α) = 0. Finally, the plausible set contains the points in-between these solutions, those that set the parameters to zero at a subset of indices, and set the elements of the gradient vector to zero at the remaining indices. For convex risk functions, there are at most 2p such solutions, one for each subset; note that the causal coefficient β is one of them. In summary, we began with a constrained optimization in Eq. (7). Its single optimum is the causal coefficient, but it requires the knowledge of which covariates are causal. We relaxed that optimization to Eq. (11), which only relies on the observable data. The causal coefficient remains to be an optimum of Eq. (11), but there are other solutions too, those between (and including) the 0-vector and the ERM solution. Hence, solving Eq. (11) alone does not identify the causal coefficient yet. In the next section, we will restore identifiability by appealing to the invariance property of causality under interventions (Peters et al., 2016; Arjovsky et al., 2019; Sch olkopf et al., 2021). With data from multiple environments each one coming from a different intervention on the DGP we can define an optimization problem that whittles down the plausible set to one that only contains the causal coefficient. 3.4 Optimization with multiple environments In this section, we leverage multi-environment data to restore the uniqueness of the causal coefficient as the solution of an optimization problem. Narrowing down the optima set by environments. As discussed in 3.3, the causal coefficient is nonidentifiable by optimizing Eq. (11) with i.i.d. data. We turn to the setting where we observe data from multiple environments to achieve identifiability. We assume the invariance property as in Eq. (5) across environments and that the environments are heterogeneous. Heterogeneous environments can be constructed by (hard) interventions which actively fix a variable at a specific value during the data generation. They can also be constructed by (soft) interventions where the changes in the DGP are passively observed rather than actively executed (Eberhardt and Scheines, 2007). For example, the heterogeneity might come from varied physical factors such as space and time. When studying the effect of health measurements on the chance of cancer, the environments can be different hospitals from which the data are collected (Winkler et al., 2019). Consider the relaxed causal optimization problem in Eq. (11). Due to the invariance of the conditional E(Y | Pa(Y )), for each environment, its feasible set include the same causal coefficient β. But the other optima that utilize the spurious associations will be different across environments because the joint distribution of the covariates pe(x) is not invariant. Yin, Wang, Blei (a) IRM optima (Env.#1) (b) IRM optima (Env.#2) (c) IRM optima (Two envs.) (d) Co Co optima (Two envs.) (e) IRM & Co Co optima (Two envs.) Figure 1: Geometry of the analytic optima sets for the IRM regularization and Co Co objectives in the 3D space. The causal coefficient is β = (3, 2, 0). (a),(b): the optima of the IRM regularization for each environment form a 3D quadric surface; (c): the optima of the IRM regularization with two environments is the intersection of the two surfaces, which forms two elliptic curves; (d): The optima of Co Co objective is a discrete finite set for each environment. The Co Co optima over two environments is the intersection consisting of the zero point and the causal coefficient (the overlap of the black triangle and blue star points); (e): The top view of the IRM regularization and Co Co optima for the two environments. The optima set by Co Co (α {(0, 0, 0), (3, 2, 0)}) is a strict subset of that by the IRM regularization (the dashed orange elliptic curves). Better viewed in color. Optimization-based Causal Estimation Algorithm 1 Co Co with known exogenous variables input : Data De = {Ye, Xe}, Xe Rne p; the risk function Re for each environment e E; the set of known non-descendant variables C; the predictor f( ). output : Coefficient estimation α with causal interpretation. Initialize α randomly while not converged do for e in E do Compute the gradient of the empirical risk: i=1 Re(α; ye i , ˆye i ), ˆye i = f(xe i; α) Set α = α (1 1C) + 1C Compute the optimization objective: Le(α) = ge(α) α 2 end Update α α η e E Le(α) with step size η end The invariance property motivates us to aggregate the optimization problems across environments, min α f E(α) := 1 Re(α) α 2 . (13) Eq. (13) is the Co Co objective. Denote the optima of each single environment objective as Fe := arg minα Re(α) α 2. The optima of the Co Co objective Eq. (13) consist of the intersection of all Fes, FE := arg min α f E(α) = \ as long as the intersection is not empty. Assumption 1,2 guarantee the nonemptiness because the causal coefficient β Fe for all e. Because FE is expressed as an intersection, its size shrinks with an increasing number of environments, i.e., |FE1| |FE2| if E2 E1. The multiple environments and heterogeneity therein induce differences between the optima sets of each environment and, as a result, narrow down the optima set FE of the Co Co objective. For instance, the sets Fe and FE are visualized with an example DGP in Fig. 1. Fig. 1 (d) illustrates how two environments and the intersection of single-environment optima sets help with the identification of the Co Co objective. Removing the non-informative solution from the optima set. While the environments help remove the points except for the causal coefficient from the optima of Co Co objective, the zero vector remains a solution. We propose two modifications of the objective to avoid the zero vector being an optimum. Yin, Wang, Blei We first consider the scenario when certain covariate Xj is known as an exogenous variable independent of the random noise of the outcome, i.e., Xj ϵ. Under Assumption 1, any ancestor variable of the outcome is a proper exogenous variable. With such information, we can modify the Co Co objective Eq. (13) to be min α 1 |E| e E Re(α) α] 2 , (14) where αj = 1 and αj = αj for j = j . In other words, the elements in α corresponding to the known exogenous variables are fixed as one. The following lemma demonstrates the properties of Eq. (14) s optima set. Lemma 2 The optima set of Eq. (14) contains the causal coefficient and is a subset of Eq. (13) s optima set. If βj = 0, The vector 0 is not an optimum of Eq. (14) almost surely. The proof of Lemma 2 is in Appendix A. When there is more than one known exogenous variable, we generalize α as α = α (1 1C) + 1C where C is the set of known exogenous variables. It further reduces the number of non-causal optima for Eq. (14). The algorithm is summarized in Alg. 1. We will discuss what conditions guarantee its output to be the causal coefficients in 5 when the environments are sufficiently heterogeneous. We find in theory ( 5) and simulation ( 7) that the optima set of Eq. (14) can shrink to the causal coefficient. In the scenarios when no variables are known as exogenous, an alternative method is to use the risk function as an additive regularization of Eq. (13), min α 1 |E| Re(α) α 2 + λr Re(α) , (15) where λr 0 controls the regularization strength. Note that when there is only a single environment |E| = 1, the ERM solution minimizes both terms in Eq. (15), so the minimizers of Eq. (15) are identical to the minimizers of the ERM. The mechanism behind the regularizer in Eq. (15) is that, though the causal coefficient and the 0 vector both minimize Eq. (13), the risk of the 0 vector is higher than that of the causal coefficient. So the risk function introduces inductive bias to disfavor the 0 vector. To see this, for the DGP in Eq. (4) and the squared risk, under Assumption 1, the risk Re(0) = β E[xx ]β + E[ϵ2] + 2E[x βϵ], and E[x βϵ] = E[x S βSϵ] = 0 since β\S = 0 and x S ϵ; and we have Re(β) = E[ϵ2]. Thus, Re(0) Re(β), which means the risk regularizer in Eq. (15) would encourage the convergence to the causal coefficient β rather than the 0 vector. Optimizing Eq. (15) with the risk regularizer yields a predictive model that is robust to perturbations of spurious associations. However, the risk regularizer may introduce potential estimation bias when the focus is causal estimation. In practice, this bias can be reduced by annealing the strength parameter λr during the optimization. We summarize the complete algorithm in Alg. 2. 4. Connections to invariant risk minimization Arjovsky et al. (2019) introduces IRM that can learn robust representation in the presence of spurious associations between the covariates and outcome. In particular, IRM considers a Optimization-based Causal Estimation predictor f(x; α) : Rp 7 R with parameter α. In a setting similar to Co Co , it considers a set of heterogeneous environments E and a risk function Re(α; ˆy) for each e E (for notational clarity, we explicitly write the prediction ˆy in the notation of the risk function). Based on the intuition that invariant predictor induces invariant features, IRM proposes the following objective to find an invariant model e E Re(α; w(f(xe i; α))) (16) s.t. w arg min w Re(α; w(f(xe i; α))), for all e E, where w( ) is a mapping from the range of f( ) to ˆy. For tractable computation, Arjovsky et al. (2019) further introduces the IRMv1 objective: h Re(α; f(xe i; α)) | {z } Empirical risk +λ w|w=1.0Re(α; w f(xe i; α)) 2 | {z } IRM regularization where λ > 0 and w is simplified as a dummy scalar variable. The IRMv1 objective in Eq. (17) consists of an empirical risk term and an IRM regularization term that encourages invariance. Since the empirical risk does not produce the correct causal estimate, we focus on analyzing the IRM regularization term. Analytic and geometric connections. The IRM regularization and Co Co are connected by the constrained optimization problem in Eq. (7). Specifically, the IRM regularization can be derived from optimization in Eq. (7) for several common models. In 3.3, we use the directional derivative to obtain the set of feasible directions U = span{ej : j S} of the idealized causal optimization and the first-order optimality condition in Eq. (10). An interesting observation is that the causal parameter itself is a feasible direction with β U. Since β is a feasible direction and is also an optimum of the optimization in Eq. (7), Lemma 1 implies that the directional derivative with direction β vanishes at the causal coefficient β, which leads to the following lemma. Lemma 3 For any partition P of the set {1, 2, , p}, the optimality of the causal coefficient β is an optimum of A P ( R(α)A, αA )2. (18) Specifically, β is an optimum of min α ( R(α), α )2. (19) Lemma 3 gives the general form of optimization that admits causal optimality. The proof of Lemma 3 is in Appendix A. When the outcome model is Linear-Gaussian or Linear-Bernoulli, minimizing the IRM regularization is equivalent to solving Eq. (19). To see this, suppose a linear DGP as in Eq. (4), linear predictor ˆye = α xe, and a squared risk function, then w|w=1.0Re(α; wα xe) 2 = (E[(ye ˆye)α xe])2 = ( Re(α; ˆy), α )2. (20) Yin, Wang, Blei ERM IRM regularization Co Co minα αRe(α) 1 2 minα( αRe(α) α)2 minα αRe(α) α 2 Table 1: Comparison of the ERM (by gradient methods), IRM regularization (for the linear Gaussian and Bernoulli models), and Co Co objectives for one environment. The left side of Eq. (20) is the invariant regularization term and the right side is Eq. (19). Similarly, suppose the outcome is generated by ye Bernoulli(σ(β xe)), the predictor is ˆye = σ(α xe) where σ(x) = 1/(1+exp( x)) is the sigmoid function, and the risk function is the cross entropy loss Re(α; ˆye) = E[ye log(ˆye) + (1 ye) log(1 ˆye)]. Then1 w|w=1.0Re(α; σ(wα xe)) 2 = (E[(ˆye ye)α xe])2 = ( Re(α; ˆy), α )2. (21) The comparison is summarized in Table 1. The connections in Eqs. (19) to (21) reveal that IRM regularization implicitly imposes a first order condition based on the directional derivative with the direction α, which offers an explanation to the mechanism behind IRMv1. Lemma 3 and Eqs. (19) to (21) can prove that, for linear Gaussian or Bernoulli models, the causal coefficient β is a minimizer of the IRM regularization for one environment. The data from multiple environments narrow down the optima set to the causal coefficient under the invariance assumption. The additional empirical risk term in the IRMv1 objective plays the role of further regularization by encouraging the solutions with high predictive performance. The connection also illustrates the sub-optimality of the IRM regularization. Geometrically, the IRM regularization, rewritten as the inner product between the gradient and parameter vectors, only considers a single feasible direction β for the constrained optimization problem Eq. (7). In contrast, Co Co finds the optimum across all the feasible directions that form a (p |S|) dimensional space U. The spectrum in Eq. (18) shows that the objective of Co Co corresponds to the finest partition, imposing the strongest constraint, and has the smallest optima set. Algebraically, we have p R(α) α 2 2 ( R(α), α )2 by the Cauchy Schwarz inequality. The inequality indicates that IRMv1 potentially minimizes a loose lower bound. The following proposition shows that the Co Co objective is guaranteed to have a better identification result than the IRM regularization. Proposition 4 For linear-Gaussian or linear-Bernoulli outcome generating distribution, if minimizing the IRM regularization in Eq. (19) identifies the causal coefficient (excluding point 0) over environments E, then minimizing the Co Co objective Eq. (11) will identify the causal coefficient (excluding point 0). The inverse statement does not hold. Because of an excessive number of optima in the IRM regularization for a single environment, it has a high requirement on the number of environments and the type of 1. The form Re(α; σ(wα xe)) follows the IRM implementation of binary classification loss = mean nll(logits * scale, y) at https://github.com/facebookresearch/Invariant Risk Minimization/blob/main/code/ colored_mnist/main.py. Optimization-based Causal Estimation heterogeneity to sieve out the causal coefficient, or relies on the causal model to have the lowest empirical risk among all the optima. These requirements are especially challenging for high-dimensional problems. In a nutshell, connecting IRM regularization with causal optimization in Eq. (7) indicates the potential looseness of IRMv1 objective in the linear Gaussian and linear-Bernoulli DGPs. Another potential issue with the IRM regularization is that it might not optimize in a feasible direction for other types of DGPs, for example, with a categorical outcome. A case study. We present a brief case study to illustrate the connections. We consider a linear-Gaussian DGP as the Case 1 of Table 3. The two environments correspond to parameters (m(1) 1 , m(1) 2 , γ(1)) = (2, 0.5, 2), (m(2) 1 , m(2) 2 , γ(2)) = (3, 1, 0.5) in the DGP. The invariant causal coefficient is β = (3, 2, 0). Fig. 1 visualizes the optima set of Co Co and the IRM regularization where all the optima are computed analytically. Fig. 1 (a) (b) show the set of optima for IRM regularization in each of the two environments; each set forms a quadric surface. When considering the two environments together, the IRMv1 optima are the intersection of two 3D surfaces in Fig. 1 (c). Geometrically, as shown in Fig. 1 (e), the intersection forms two continuous elliptic curves consisting of an infinite number of points. The IRMv1 objective cannot identify the causal coefficient unless β has the lowest empirical risk among all the optima of the IRM regularization, which, as we will see, is not the case here. In contrast, Fig. 1 (d) shows the optima set of Co Co that contains only a finite number of points per environment. With two environments, the optima of Co Co becomes the zero point and the causal coefficient β. The causal coefficient can be identified by the modified objectives in Eqs. (14) and (15) that remove the zero point. We examine this case study empirically by implementing Co Co in Eq. (14) and IRMv1 in Eq. (17) with a large sample size N = 105 of each environment. Co Co objective with j = 1 converges to ˆαCo Co = (3.001, 2.004, 0.000) with random initialization. IRMv1 with λ = 1 converges to ˆαIRM 1 = ( 0.842, 0.390, 0.621) with initialization by a random seed and converges to ˆαIRM 2 = (2.932, 1.847, 0.017) with an idealized initialization manually set at the true causal coefficient β = (3, 2, 0). Furthermore, ˆαIRM 2 has the empirical risk as 0.97634 and the IRM regularization as 0.00028, while those of β are 0.99852 and 0.00036, respectively. Thus, among the optima of the IRM regularization for the two environments, the causal coefficient β does not have the lowest risk, and the minimizer of the IRMv1 objective is a biased causal estimate. We provide detailed computation and additional results with different tuning parameters of IRMv1 for this case study in the Appendix C.1. 5. Identification with heterogeneous environments Now we establish the causal identification for Co Co with the linear SEM in Eq. (4). Identification requires the causal quantity of interest to be expressed as a functional of the observed data distribution. This functional is also known as the causal identification strategy. In the context of Co Co, we consider the functional that maps the joint distributions p(xe, ye) over a set of environments to the risk function, then to the optima of Co Co objective. To establish identification, we prove that the optimum of the Co Co objective is unique and equals the causal coefficient of interest. In 5.1 and 5.2, we present two assumptions on the environment heterogeneity that guarantee the identification. In 5.3, we provide two sufficient conditions implying the identification assumption. Yin, Wang, Blei Algorithm 2 Co Co with risk regularization input : Data De = {Ye, Xe}, Xe Rne p, the risk function Re for each environment e E; predictor fα( ); regularizer coefficients λr, λw; annealing scheme ANNEAL( ) output : Predictor fα( ) that is robust to interventions Initialize α randomly while not converged do for e in E do Compute the gradient of the empirical risk: i=1 Re(α; ˆye i ), ˆye i = f(xe i; α) Compute: Le(α) = ge(α) α 2 (Optional step:) add weak condition to the objective: Le(α) += λw( ge(α), α )2 Add risk function as a regularization term: Le(α) += λr 1 ne ( i=1 Re(α; ˆye i )) end Update α α η α P e E Le(α) with step size η λr ANNEAL(λr) end 5.1 Identification of the causal coefficients The causal identification of an optimization-based approach is to demonstrate the existence and uniqueness of its optimum, and it equals the causal coefficient. By fully characterizing the optima set of the modified Co Co objective in Eq. (14), we explore the assumption on the environments that bestows identification. The optimum of the Co Co objective has a unique characteristic. It is a point α for which Re(α)H = 0 and α\H = 0 with certain H {1, 2, , p}. Denote the risk function with a subset of covariates xe H as the predictor by Re H(αH) = 1 2E[(y α Hxe H)2]. We have the following proposition. Lemma 5 For the linear DGP as Eq. (4), a linear predictor and squared risk, if Re(α)H = 0 and α\H = 0, then Re H(αH) = 0; if Re H(αH) = 0, then α = (αH, α\H = 0) satisfies Re(α)H = 0. Lemma 5 demonstrates that a Co Co optimum is an ERM minimizer over a subset of covariates. When restricted to this subset of covariates, the ERM minimizer is shared by all the environments. Lemma 5 offers an interpretation of Co Co optimum. To keep the notation consistent, denote C as the set of exogenous covariates and S as the unknown set of direct causes for the outcome. For any set H with C H {1, 2, , p}, we fit a regression model on Xe H in Optimization-based Causal Estimation each environment, and collect the regression coefficients as {ˆαe H}e E. We call the set H an invariant set, if the estimates ˆαe H = ˆαe H := ˆαH, e, e E. (22) If H is an invariant set, we define a length p vector as an invariant vector by equating it to ˆαH when restricting to the set H and padding it with zeros at other elements. By Lemma 5, the optima of the objective Eq. (14) consist of all the invariant vectors. Based on the interpretation of Co Co optimum, we introduce the following assumption for sufficiently heterogeneous environments. Assumption 3 (Effectiveness) There is only one invariant vector (defined in Eq. (22)) for all the sets H with C H {1, 2, , p}. C is the set of known exogenous variables. Then we have the following identification result. Theorem 6 Under Assumptions 1, 2 and 3, assuming the Gram matrix We = E[xe(xe) ] 0 for all e E, the causal coefficient β is identifiable, and is given by β = arg min α 1 |E| e E Re(α) α] 2 , (23) which is the optimum of the objective in Eq. (14). The invariance Assumption 2 ensures the existence of a point that reaches the minimum value of the objective Eq. (14), and this point is the causal coefficient, while the effectiveness Assumption 3 induces uniqueness of the solution. The following corollary provides practical guidance in collecting multi-environment data. Corollary 7 Assume the environment sets E1 E2, then in Theorem 6, (i) if the invariance Assumption 2 holds for E2, it holds for E1; (ii) if the effectiveness Assumption 3 holds for E1, it holds for E2. 5.2 Identifying the effects of the exogenous variables The effectiveness assumption can be checked with the observed data, but it needs regression over the power sets of covariates across the environments, which can be computationally expensive. In practice, we might be interested in estimating the causal effect of specific variables instead of the whole causal coefficient. This motivates us to explore identification results under a relaxed Assumption 3. Suppose the goal is to estimate the causal effect of a known exogenous variable xj . For example, such a variable could be the standard treatment variable under the unconfoundedness assumption (Imbens and Rubin, 2015). Though xj is assumed exogenous, i.e.,xj ϵ, regressing over xj alone may produce biased estimation because xj may correlate with other causes x S\j in the DGP. We adopt the following notations. Recall that S is the set of direct causes, C is the known exogenous variables, j C, and H is a set C H {1, 2, , p}. Denote WAB to be a sub-matrix of the Gram matrix W = E[xx ] with rows in the index set A and columns Yin, Wang, Blei in the index set B. For the environments E = {e1, , em}, denote WE H R(m |H|) |H| as a stacking matrix that stacks W e HH by row, i.e. WE H := [W e1 HH|W e2 HH| |W em HH] . Assumption 3 excludes a non-causal invariant vector when regressing over a set of covariates xe H. Here, we analyze the property of an invariant vector. By Proposition 5, the invariant vector can be computed by zeroing out the gradient αHRH(αH) = W e HH(αH βH) W e HHcβHc se H, (24) where Hc is the complement of H in {1, 2, , p} and se j := E[xe jϵe] = cov(xe jϵe). Denote a stacking vector θE H Rm |H| as θE H := [W e1 HHcβHc + se1 H, , W em HHcβHc + sem H ] . By setting the gradient in Eq. (24) to zero and by Proposition 5, if WE Hδ = θE H (25) holds for δ = 0, H is an invariant set with invariant vector α, ˆαH = βH + δ, ˆα\H = 0. This violates Assumption 3 because ˆα and β are two different invariant vectors that both minimize the Co Co objective. We formulate the negating statement as the following assumption. Assumption 4 (Weak effectiveness) H = S, C H {1, 2, , p}, if the stacking vector θE H = 0, it is not in the column space of the stacking matrix WE H, i.e., θE H / C(WE H). In sum, we have shown ( Assumption 4) ( Assumption 3), thus the contrapositive implies (Assumption 3) (Assumption 4). On the other hand, cases exist when Assumption 4 hold but Assumption 3 doesn t, so Assumption 4 is a weaker assumption. For example, if two subsets of causes Si S, i = 1, 2, C Si, and xe Si xe \Si, then both S1 and S2 are invariant sets when E[xe \Si] = 0. Hence it violates Assumption 3 due to the existence of multiple invariant sets. But in such a case, Assumption 4 can still hold because setting H = Si in Eq. (25) makes θE H = 0 and δ = 0. The Co Co optimum might be (βS1, 0) and (βS2, 0), which share the common unbiased part βC. With the weak effectiveness assumption, we have the following identification result. Theorem 8 Under Assumptions 1, 2 and 4, the causal effect of x C on y is identifiable and is given by the optimum of objective Eq. (14). That is, for α arg min α 1 |E| e E Re(α) α] 2 , (26) α C = βC, where β is the causal coefficient. Theorem 6 guarantees the identification of the whole causal coefficient vector β. In comparison, Theorem 8 identifies the effects of the known exogenous variables βC based on a weaker effectiveness assumption. In the next section, we leverage the explicit analytic form of Eq. (25) and discuss several scenarios where weak effectiveness can be guaranteed. 5.3 Sufficient conditions for identification We show two approaches that ensure sufficient heterogeneity of the environments. One is by actively taking do-interventions, and the other is by passively checking the rank conditions. Optimization-based Causal Estimation Effectiveness by do-interventions. Consider the do-intervention (Pearl, 2009) as Xe j ae j, ae j p(a), j Ie, where e is environment index, e E, Ie is the index set of intervened variables, and p(a) is a continuous distribution defined on R. do(Xe j = ae j) means all the samples of Xj in the environment e are fixed at the constant ae j during the data generation. Suppose the intervention on one variable is independent of other interventions and variables. The following corollary gives a sufficient condition that guarantees Assumption 4. Corollary 9 For the linear SEM in Eq. (4) and predictor in Lemma 1, and for dointerventions, suppose j {1, 2, , p}, e, e E s.t. Ie = Ie = {j}, and k C, E[Xk] = 0, then the optimum α of the optimization in Eq. (14) satisfies α C = βC, where β are the causal coefficients. Corollary 9 is closely related to the identification results of ICP in Peters et al. (2016, Theorem 2). Comparing to Corollary 9, ICP asks for one less do-intervention on each covariate, but it crucially relies on the invariance assumption in Eq. (6) that is stronger than Assumption 2 of Co Co. Effectiveness by rank checking. In many cases, we collect data from environments where the heterogeneity is introduced not by the manual interventions but by the natural factors such as spatial and temporal differences. We propose a checking method to ensure a sufficient but necessary condition for the Assumption 4, which is easy to compute. As 5.2, denote C P = {1, 2, , p} as the known exogenous variables. For each environment e E, the Gram matrix using the observed data is We = E[(Xe)T Xe] Rp p and the submatrix We CP is the rows of We with index in C. Let WE CP R(|E| |C|) p be a matrix that stacks We CP by the rows for all e E. Assumption 4 assumes the linear system WE Hδ = θE H for any δ = 0. A key observation is that for j C, E[Xjϵ] = 0 due to exogeneity, so se C = 0 for the gradient computed in Eq. (24). Therefore, Assumption 4 is guaranteed if δ = 0, WE CHδ = WE CHcβHc, which can be further guaranteed if the homogeneous linear system WE CPv = 0 only has trivial solution for the variable v. This linear system only depends on observed data. Then we have the following corollary. Corollary 10 Assumption 4 is guaranteed if the linear system WE CPv = 0 only has the trivial solution v = 0. In practice, we first collect data from environments E where the heterogeneity may come from do-interventions or soft interventions that change the distribution of the covariates (Eberhardt and Scheines, 2007). Then we check Corollary 10 by equivalently checking if the matrix WE CP has full column rank. If the rank is full, we proceed to Algorithm 1 with environment set E; otherwise, we collect data from a new environment. The algorithm is summarized in Alg. 3. 6. Extensions to the nonlinear model So far, we focus on the linear SEMs. Here we generalize these results to a nonlinear SEM and a predictor that maps linear combinations of covariates to the outcome, i.e. y = f(Ax). For Yin, Wang, Blei Algorithm 3 Co Co with heterogeneity checking input : The environmental set E, the set of known non-descendant variable C output : Coefficients α with causal interpretation Initial E = repeat Randomly choose a valid intervention Let E E {e} with e as the index of new environment Collect data in environment e, De = {ye i , xe i}ne i=1 Update WE CP until rank(WE CP) = p; Run Algorithm 1 with {E, C, {De}e E, {Re( )}e E} where Re( ) is the risk function. example, the fully connected neural network is a special case in such a functional form. The generalization of the algorithm to the nonlinear model closely follows the derivation in 3. The first step is to build a constrained optimization problem similar to causal optimality in Lemma 1 and show that it admits the causal coefficient as a solution. The analysis presented in 3.3 based on the directional derivative can then be applied to nonlinear models. Suppose we have a collection of environments E, and for each e E, we observe i.i.d. data for variables (xe, ye), xe Rp, ye R. Suppose the underlying DGP is ye f( B xe S; γ ) + ϵe (27) where S {1, 2, , p}, ϵe xe S and E[ϵe] = 0. f : RK R is an arbitrary function mapping with causal parameters β = ( B , γ ) where B RK |S| and γ RM. When K = 1 and f( ) is an identity mapping, Eq. (27) reduces to the linear SEM. Eq. (27) can represent a flexible DGP that generates the outcome by a fully-connected neural network, where K and B are the width and weights of the first hidden layer, respectively. Assume the nonlinear predictor is ˆye = f(Bxe; γ), (28) where B RK p, γ RM and α = (B, γ) are the parameters to optimize. We can re-write B xe S = B Λxe where Λ R|S| p has the i-th row as e i if i S and as 0 p if i / S. Let B = B Λ where the j-th column of B is 0K if j / S. Then for square error Re(α) we have the following proposition. Proposition 11 (Causal Optimality, Nonlinear) With the DGP in Eq. (27) and with Assumptions 1 (ii) (iii), the causal coefficient α = (B, γ) = (B , γ ) is an optima of the following problem min B,γ R(B, γ), ˆy = f(Bx; γ) s.t. Bkj = 0 if B kj = 0, 1 k K, 1 j p γm = 0 if γ m = 0, 1 m M. Optimization-based Causal Estimation Proposition 11 greenlights the analysis in 3. The Co Co objective in Eq. (15) can then be used for nonlinear models in Eqs. (27) and (28), min B,γ 1 |E| Re(α) α 2 + λr Re(α) , (30) where α = (B, γ). When B = B , the multiplication Bx zeros out the non-causal covariates x\S so that the prediction ˆy is independent of the spurious covariates. In situations when the causes and non-causes are not disentangled in the covariate space, there could be a representation z(x) of the covariates where z S(x) are the causes of the outcome and z\S(x) are the non-causes. Based on the representation z, the DGP becomes ye f( B z S(xe); γ ) + ϵe, and the predictor is ˆye = f(Bz(xe; γ1); γ2) with parameters α = (B, γ1, γ2). As will be shown in 7.3 and 7.4, empirically we find Co Co in such situations can produce predictions not depending on the spurious associations. For example, 7.3 studies the Color-MNIST data set where the input is a digit image x R14 14 2 with 14 14 pixels and 2 color channels. Across the data points, the causal variables (the digit pixels) change their positions in the image and color channels. The position change makes the selection of causes infeasible in the covariate space but possible in the representation space. We find that for such cases, Co Co can still learn a robust predictor relying on the causal information such as the digit pixels. We leave the theoretical analysis of causal representation learning by Co Co as an interesting future direction. In the nonlinear regime, due to the high flexibility of the predictor, identification in the high-dimensional parameter space can be difficult. Different parameterizations can represent a similar mapping from the input to the output. Thus the same data generation might correspond to an equivalent class of points in the parameter space (Heinze-Deml et al., 2018; Christiansen et al., 2020). Consequently, the optima of Co Co may not be unique. However, the optima points enjoy robust predictive properties. Proposition 12 (Local optimality) Suppose α minimizes Co Co objective Eq. (13) with f E(α ) = 0. Suppose a new environment ι satisfies pι(x, y) = P e E wepe(x, y), P e E we = 1, then απ Rι(α)|α=α = 0, π = supp(α ). Proposition 11 and Assumption 2 guarantee that the causal coefficient β satisfies f E(β) = 0 in Eq. (13). Therefore, if there exists another global optima α under these assumptions, it satisfies the condition f E(α ) = 0 in Proposition 12. Proposition 12 shows that if the data distribution of a new environment is a mixture of those for the training environments, the optimum by Co Co already minimizes the predictive risk of this new environment locally for its nonzero elements. This means Co Co optima can transfer the predictive accuracy, quantified by the risk value, from the training environments to their mixtures. As we will show in empirical studies, a predictor optimized by Co Co generalizes the predictive accuracy to non-i.i.d. data in the new environments where the spurious associations change. 7. Empirical studies In the empirical study, we aim to answer the following questions: (1) Can Co Co accurately estimate causal effects when some covariates are spuriously associated with the outcome? Yin, Wang, Blei Figure 2: The graphs for the simulation studies in 7.1. The case ID of each graph is in the rectangle box. The blue arrow represents a path whose parameter varies across environments, the blue circle of a covariate means its distribution given parents varies across environments, and the blue circle of outcome means the variance of its additive noise varies across environments. Invariance Eq. (5) holds in all cases. The shaded nodes are the variables that are observed. (2) Can Co Co make an accurate prediction in new environments by relying on causal information? (3) How sensitive is Co Co to its assumptions and tuning parameter? (4) Are the empirical results aligned with theoretical analysis? To answer these questions, we study Co Co and its comparable methods on simulated data, semi-synthetic data, and real data. Code implementations for the empirical studies are available at https://github. com/mingzhang-yin/Co Co. 7.1 Causal inference on the synthetic data In this section, we apply Co Co for causal inference with linear synthetic data. Consider the scenario where we know one exogenous variable, but the other variables are of unknown status; some might be spurious variables, some might be direct causes, some might be neither. As discussed in 3, running ERM with such data will end up with biased estimates of the causal coefficients. But if we have data from multiple environments, we can use Co Co to estimate the causal coefficients. We generate data from five different graphs in Fig. 2 with SEMs in Table 3. In cases 1 to 5, variables x3, x4, x4, x4, x2 are spuriously associated with the outcome, respectively. The mapping from the causes to the outcome is linear with additive noise. We specify x1 as a known exogenous variable (for the use of the method in Alg. 1) and run Co Co, IRM, and ERM to estimate the causal coefficients. To generate data from different environments, we set the parameter γe in SEMs of Table 3 with environment-specific parameters me 1, me 2 Unif(0, 1), me Unif(1, 2). We generate two environments with γe {0.5, 2.0}, each environment with 10,000 data points. As required, the DGPs leave the causal coefficient invariant. Optimization-based Causal Estimation Table 2: The mean absolute error of the estimations for causal parameters β (lower the better). The estimate by Co Co (this paper) is close to the true causal coefficients across the DGPs. Co Co has a more accurate estimation comparing to RVP (Xie et al., 2020), V-REx (Krueger et al., 2020), Causal Dantzig (Rothenh ausler et al., 2019), IRM (Arjovsky et al., 2019) and ERM. The reported mean and the standard deviation in the parentheses are computed across ten independent trials. Case 1 2 3 4 5 ERM 0.31 (0.06) 0.16 (0.00) 0.32 (0.00) 0.19 (0.03) 0.38 (0.01) V-REx 0.16 (0.06) 0.11 (0.01) 0.44 (0.01) 0.13 (0.04) 0.06 (0.10) RVP 0.10 (0.04) 0.10 (0.01) 0.43 (0.01) 0.11 (0.04) 0.05 (0.04) Dantzig 0.54 (0.62) 3.23 (2.64) 4.95 (3.06) 0.43 (0.05) 0.20 (0.01) IRMv1 2.12 (0.70) 0.01 (0.00) 0.02 (0.01) 2.17 (0.65) 0.72 (0.35) Co Co 0.01 (0.00) 0.02 (0.01) 0.01 (0.01) 0.01 (0.01) 0.01 (0.00) The five graphs test different scenarios: (1) independent causes; (2) observed mediator; (3) observed confounder and mediator; (4) observed confounder and unobserved mediator; (5) unobserved confounding with an omitted common cause. The data generation covers the following circumstances: variables except the outcome might be generated from nonlinear models (Case 2, 3); the distribution of the causes of the outcome might shift across environments (Case 1, 4); the variance of the outcome distribution conditional on its causes might vary across environments (Case 3). Whether a method can produce accurate estimation in all of these situations reflects its generalizability. We evaluate with the mean absolute error (MAE) between the estimation α and the true coefficients β. We compare Co Co with ERM, IRM (Arjovsky et al., 2019), V-REx (Krueger et al., 2020), RVP (Xie et al., 2020) and Causal Dantzig (Rothenh ausler et al., 2019). The properties and assumptions of these methods are summarized in Table 5 in Appendix B for comparison. For the algorithms with tuning parameter λ, we report the best result for IRM with λ {2, 20, 200}, for V-REx and RVP with λ {10, 102, 103, 104}. We choose stepsize from {0.01, 0.1} that produces the lowest objective for each method. For all methods, the algorithm is considered to converge if the mean absolute difference between the parameters in consecutive iterations is less than 10 3 and the total iterations are over 104. The results are presented in Table 2. The mean MAE shows that the estimates of pure prediction ERM are biased when the covariates have spurious associations with the outcome. IRMv1 with proper hyper-parameter performs well in cases 2, 3, while it has large MAEs in other cases. It suggests the IRMv1 performance is affected by a limited number of environments and the type of intervention. The MAE by IRM has a high standard deviation across random trials, indicating IRM converges to different global optima. This echoes the discussion in 4 that the IRMv1 objective is over-relaxed and has excessive non-causal optima. V-REx and RVP perform better than ERM except in Case 3 when the variance of the exogenous noise of outcome is not invariant. This means their performance largely relies Yin, Wang, Blei on whether a strong assumption of invariance in Eq. (6) is satisfied. Causal Dantzig has large MAE in these cases which might be because the DGPs do not satisfy the inner-product invariance it requires (Rothenh ausler et al., 2019). In comparison to these methods, Co Co estimates have the lowest or equally lowest error uniformly over all the cases. The MAEs with four environments are presented in Fig. 8 in Appendix. Table 3: SEMs for the simulation study in 7.1. The environments are indexed by e. γe, me 1, me 2, me are fixed scalars in an environment. Case 1 Case 2 Case 3 xe 2 N(me 2, (γe)2) xe 1 N(me 1, (γe)2) ye 3xe 1 + 2xe 2 + N(0, 1) xe 3 γeye + N(0, (γe)2) xe 2 N(1, (1 xe 1 xe 2 + Unif( 1, 1) xe 3 sin(xe 1) + N(0, (1 ye 2xe 1 + 1.5xe 3 + N(0, 1) xe 4 γeye + N(0, 1) xe 2 N(1, (1 xe 1 xe 2 + Unif( 1, 1) xe 3 sin(xe 1) + N(0, (1 ye 2xe 1 + xe 2 + 1.5xe 3 + N(0, (γe)2) xe 4 γeye + N(0, 1) Case 4 Case 5 xe 2 N(1, (1 xe 1 xe 2 + Unif(0, me) xe 3 xe 1 + xe 2 + N(0, (1 ye xe 2 + 2xe 3 + N(0, 1) xe 4 γeye + N(0, 1) xe 1 N(1, 1 ye 2xe 1 + ϵe xe 2 0.5γeϵe + (0.5 + γe)xe 1 + N(0, 1) As an ablation study, we replace the elementwise produce in Co Co objective by the inner product and minimize P e E( Re(α), α )2. We call this method Naive-Co Co, which is an intermediate between the Co Co objective Eq. (14) and the IRMv1 objective in Eq. (20). Naive Co Co has high MAEs 1.17, 0.53, 0.51, 1.31, 0.03 across the five cases, respectively. It means the key element for the Co Co objective is the Hadamard product and the norm derived from optimizing in all the feasible directions ( 3.3), instead of knowing a variable is exogenous. To test the model checking method proposed in 5.3, we generate heterogeneous data with γe Unif(0, 5) for each environment. When the set of known non-descendant of outcome is C = {1}, Cases 1, 4, 5 pass the checking condition with the number of environments 3, 3, 2 respectively, while cases 2, 3 do not pass. We further check the nonidentifiable case in Appendix C.2. It cannot pass the checking step with any number of environments. In practice, Co Co can accurately estimate the causal coefficients with two environments, as Optimization-based Causal Estimation shown in Table 2. It validates our analysis in 5.3 that the checking step is a sufficient condition for identification but is not a necessary condition. Model Mismatch. Using data from Case 5, we further study the performance of ERM and Co Co when the predictive model does not exactly match the data-generating model. The data in Case 5 is generated from a linear model. We compare two predictors, one is a linear model that matches the DGP, and the other is a nonlinear neural network. Both models are trained with ERM and Co Co. In Case 5, X1 is the cause, and X2 is a predictive but non-causal covariate. As shown in Fig. 3, when the model is specified correctly as linear, the predictor trained by ERM fail to generalize to new values of X2, but the model trained by Co Co can generalize to any input (X1, X2). When the predictor is nonlinear, differing from the DGP, ERM prediction is only accurate for the data interpolating the training points. In contrast, Co Co can make accurate predictions for the inputs with new values of spurious covariate X2. From the robust prediction view, Co Co prediction can generalize to new environments by avoiding the spurious association for both linear and nonlinear predictors. (a) ERM, linear (b) Co Co, linear (c) ERM, nonlinear (d) Co Co, nonlinear Figure 3: Prediction accuracy for Co Co and ERM with linear and nonlinear predictors. The heatmap is the prediction error (ˆy E[y|x])2, the x-axis, y-axis are the values of input x1 and x2. The orange points are training data from two environments. Co Co has better out-of-sample generalization with a wider region of low error (blue region) than ERM for both linear and nonlinear predictors. 7.2 Gaussian mixture example We study a multi-class classification problem with a categorical outcome when the inputs contain non-causal covariates. We modify a Gaussian mixture model (GMM) to simulate the data set. The observed covariates are (xe 1, xe 2) and the outcome is ye, where e is the environment index. For each environment e, the data are generated with SEM xe 1 PK k=1 1 K N(µk, I) ye Categorical(p1, , p K) xe 2 (1 pe)δue ye + peδue k1, (31) Yin, Wang, Blei where pk = N(xe 1; µk, I)/ PK k =1 N(xe 1; µk , I), k1 Multinomial(1/K, , 1/K). Among the covariates, the mapping from the causes xe 1 to label ye is invariant across all e, whereas xe 2 is predictive to ye due to spurious associations. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 Noise scale (a) Noise scale 1 2 3 4 5 6 7 8 Number of Environments (b) # of environments 0 20 40 60 80 1/ r (c) Regularization strength Figure 4: The change of test predictive error of Co Co with different levels of invariance, number of environments, and the hyperparameter. The dashed line is the ERM error rate for reference. The error bar is the standard deviation over 5 trials. 0 1000 2000 3000 4000 5000 Iteration 0 5000 10000 15000 20000 25000 30000 Iteration Accuracy(noised label) Co Co_Training Co Co_Testing IRM_Training IRM_Testing ERM_Training ERM_Testing (b) Colored MNIST Figure 5: Trace plot of training and testing accuracy for Co Co, IRM and ERM on GMM and Colored MNIST data. In panel (b), the accuracy is measured on predicting the noised label ye. Co Co has the highest prediction accuracy in a new environment. The observed data are generated from Eq. (31). The GMM component centers µk = 1.5Kek RK. To generate the non-causal covariates xe 2, we first generate K random vectors {ue k}K k=1 with ue k Q[k/2] i=1 U(0, 1) for environment e. Then for a data point in the component ye, xe 2 equals ue ye with probability 1 pe and equals a random vector from {ue k}K k=1 otherwise. By doing so, xe 2 is associated with ye, but the association varies across environments when ue 1:K changes with environment e. The DGPs that generate the environments are characterized by the values of ue 1:K and pe. We set the training environments with K = 5 and pe {0.01, 0.02, , 0.05} in Eq. (31). For a validation/test environment f we generate a new set of {uf k}K k=1 and set pf = 0. Optimization-based Causal Estimation We evaluate the test performance by averaging the accuracy over ten testing environments. If the predictor learns to predict based on the causes xe 1 instead of the spurious covariates xe 2, it can accurately predict ye in training and testing environments. We use a fully connected neural network with two hidden layers as the predictor and train the model with Alg. 2. For both Co Co and IRM, the penalty weight is chosen from ten values equally spaced from 1 to 100 on a log-scale using the validation environments. The weight on the empirical risk term is reduced to 0 after 5k iterations. The results are shown in Figs. 4 to 6 and Table 4. Fig. 5 is the trace plot for the predictive accuracy. The testing accuracy increases for all the methods in the early stage of training but drops in the later stage for ERM and IRM. We hypothesize that ERM and IRM first improve the prediction by utilizing all the covariates, including the causal ones. But in the later stage of training, it relies more heavily on the spurious associations to boost the training accuracy, which harms the accuracy at the test time. Fig. 6 shows the weight matrix that connects the input and the first hidden layer. The model trained by Co Co zeros out the weights associated with the spurious covariates x2 (the right block), aligned with the analysis in Proposition 11. In comparison, the weights obtained by IRM and ERM are mostly nonzero, passing the spurious association from the input to the subsequent hidden layers and outputs. Table 4 summarizes the numerical results. Co Co and V-REx have the highest prediction accuracy in the test environments. In this example, the strong invariance Eq. (6) is satisfied due to the DGP in Eq. (31), which meets V-REx assumption. Hence, V-REx has a high accuracy close to Co Co. When the data satisfies the strong invariance, it is potentially beneficial to add the V-REx regularization of equal noise variance to the Co Co objective. 0 1 2 3 4 5 6 7 Input layer First hidden layer 0 1 2 3 4 5 6 7 Input layer First hidden layer 0 1 2 3 4 5 6 7 Input layer First hidden layer Figure 6: The heatmap for the first layer weight matrix of the neural networks trained by Co Co, IRM and ERM on the GMM data. The matrix dimension is 10 8 where the input dimension is 8, and the first hidden layer dimension is 10. In the input, the first five elements are x1, and the last three elements are x2. Comparing to IRM and ERM, Co Co solution has the weights related to non-causal input x2 (the right block) close to 0. Sensitivity to the assumption and hyper-parameter. In Fig. 4, we study how Co Co performs if the invariance assumption is violated and the hyperparameters change. In Yin, Wang, Blei panel (a), we construct the training environments by changing the cluster centers {µk}K k=1 in Eq. (31) to {µk + ϵe k}K k=1, ϵe k N(0, σ2IK). The noise ϵe k changes the mapping from the covariates xe 1 to the label ye across the environments, deviating from the invariance assumption. The noise scale σ2 reflects the magnitude of deviation. Panel (a) shows that the testing predictive accuracy increases as the invariance tends to hold. Under a moderate deviation from the invariance assumption, the test prediction by Co Co remains more accurate than that by ERM. In panel (b), we study how the test accuracy varies with the number of environments M in training. Training environments are constructed by setting the vectors ue k Q[k/2] i=1 U(0, 1) for all k and pe {0.01, 0.02, , 0.01M} in Eq. (31). We find a growing number of environments reduces the testing error monotonically due to the increased environments heterogeneity. In panel (c), we study how the testing error changes with the penalty weight λr in Co Co objective Eq. (15). When λr is large, the objective is close to the empirical risk, and the test error is high; when λr is small, the parameters might collapse to the point 0. Between the two extremes, Co Co can learn a model that makes robust predictions in new environments. 7.3 Colored MNIST (CMNIST) CMNIST is a semi-synthetic data set for binary classification introduced in Arjovsky et al. (2019). Based on the MNIST data set, the image of hand-written digits 0-4 and 5-9 are labeled as y = 0 and y = 1, respectively. For each environment, the outcome ye is generated with probability 0.75 as y and with probability 0.25 as 1 y. We call y the clean labels and ye the noised labels. The digit is colored green with probability pe if ye = 1 and with probability 1 pe if ye = 0; if not colored green, it is colored red. The DGPs across the environments differ in the value of pe. Environments are constructed for the training with pe {0.1, 0.2}, for the validation with pe = 0.5 and for the testing with pe = 0.9. The predictor takes the colored digit image as the input and the noised label ye as the target. The input x R14 14 2 where the image has 14 14 pixels and two color channels. The relationship between the digit shape and ye is genuinely causal, while the relationship between the color and ye is spurious. The goal is to learn a predictor that makes the predictions based on the digit shape rather than the color. A predictor using color information cannot accurately predict the noised label ye at the test time and the clean label y during training and testing. Empirical results. We optimize the predictor, a fully connected neural network with two hidden layers, by Alg. 2. For the algorithms in 7.1, we compare with ERM, IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2020) that can generalize to nonlinear models. More baseline results are reported in Table 7 in Appendix. The weight of the risk term for Co Co and V-REx objectives is chosen on the validation environment from 2 {10 1, , 10 5}, and is reduced by a factor of 10 after 15k iterations (half of the total iterations). We use Adam optimizer (Kingma and Ba, 2014) with learning rate 10 4. For IRM, we reduce the learning rate of the public code to 10 4 to ensure stability over long iterations, and we use all the other hyperparameters and annealing strategies provided in the author s code 2. 2. https://github.com/facebookresearch/Invariant Risk Minimization Optimization-based Causal Estimation Table 4: Predictive accuracy in training and testing environments on the GMM, CMNIST, and Wildlife data. For GMM, the Oracle results are the predictions with causes xe 1 instead of (xe 1, xe 2). For CMNIST, the prediction accuracy is reported for both clean label y and noised labels ye; the Oracle is the same function fitted on grey-scale images by ERM. GMM CMNIST Wildlife Train Test Train ( y) Test ( y) Test (ye) Train Test ERM 99.4 51.0 75.8 44.4 31.1 99.6 58.4 IRM 95.9 75.9 81.4 70.3 46.5 83.4 84.9 V-REx 92.6 91.4 75.2 49.5 31.8 96.2 67.3 Co Co (this paper) 91.9 91.6 93.0 92.9 74.7 86.1 85.2 Random Guess 20 20 50 50 50 50 50 Oracle 92.3 91.8 99.3 97.9 74.8 - - Fig. 9 in Appendix D visualizes the weight matrix that connects the input x and the first hidden layer. The weight learned by ERM connects the hidden layer to all the inputs, hence encoding non-causal information in the latent representation. In contrast, the weight learned by Co Co sets multiple columns close to zero and is symmetric over the two color channels, potentially removing the dependence of the hidden layer on some non-causal pixels. The numerical results are shown in Table 4. The trace plot for the noised label is in Fig. 5 (b) and the trace plot for the clean label is in Fig. 7 (a) of Appendix D. The results show that ERM predicts labels with accuracy close to 1 in the training but has the lowest accuracy at the testing. The reason might be that its prediction largely depends on the color information rather than the digit shape, whereas the association between color and label ye changes from training to testing environments. The testing accuracy of IRM increases in the early stage of training but drops in the later stage. We hypothesize that the model at first improves the prediction by utilizing all the information including that of digit shape, but later it relies more on the color information, which reduces the accuracy at the test time. Similar patterns appear in the prediction of clean label y, as shown in Fig. 7. 7.4 Natural image classification In this example, following Cloudera (2020), we adapt the i Wild Cam 2019 dataset (Beery et al., 2019) that contains wildlife images taken in the wild. The images are collected from different cameras, each at one of 143 locations. The goal is to classify coyotes and raccoons in images. The data collected in different locations and time usually follow different distributions due to varying physical factors such as landscape, season, vegetation, illumination conditions, etc. Therefore, the images taken from different cameras can be considered data from heterogeneous environments. The physical factors reflected in the image background might be predictive of the species but due to spurious association. Our goal is to learn a predictor Yin, Wang, Blei that can make accurate predictions in a new environment by training on the data from a limited number of environments. This goal can be achieved if the predictor manages to recognize coyotes and raccoons based on animal pixels instead of using spurious associations from the backgrounds. Based on the setting of Cloudera (2020), we use images from two locations as the training data and images from another location as the test data. We use images from an additional location as the validation data. The inputs are 512-dimensional features extracted from Res Net18 (He et al., 2016), a pre-trained model on the Image Net dataset (Deng et al., 2009). The predictor is a fully connected neural network with one hidden layer of size 10. Co Co is trained by Alg. 2. In this example, we find that adding the weak condition (19) in Alg. 2 with weight λw improves convergence stability with random initialization3. The hyperparameters are selected on the validation environment. We set the weak condition weight λw = 10 4 and the risk regularizer weight λr = 1. λr is reduced to 10 5 after 100 epochs. The risk regularizer is an inductive bias to encourage nonzero solutions. After the optimizer is sufficiently away from the zero point, annealing the risk regularizer prevents the algorithm from minimizing the objective by reducing the risk function, hence preventing it from using the spurious association. Co Co is compared with ERM, IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2020). All methods are trained by ADAM with a learning rate 10 3. The result is summarized in Table 4 and Fig. 7 (b) in Appendix D. It demonstrates that ERM has high accuracy in the training but low accuracy during testing. Co Co accuracy is slightly higher than IRM and is much higher than V-REx and ERM. Compared to ERM, prediction by Co Co has a slight drop in training accuracy but has significantly higher testing accuracy. Co Co has the smallest performance gap between training and testing, indicating that it largely avoids predicting animal labels via information from image backgrounds, i.e., information that varies across environments. 8. Conclusion This paper formulates causal estimation as a constrained optimization problem. Applying directional derivative methods to this optimization problem, we propose the Co Co objective, a computationally tractable optimization method for estimating causal coefficients with datasets from multiple environments. Theoretically, we discussed the necessary and sufficient conditions by which the causal coefficients are identified by optimizing the Co Co objective. We discuss the mathematical connection between Co Co and IRM. In empirical studies, we find that Co Co produces accurate causal estimation and distributionally robust predictions. Co Co is applicable to high-dimensional data and to linear and nonlinear models. Looking ahead, we think several problems are worthy of further exploration. One direction is to consider the situations when there is an unobserved confounder as a direct cause. In this case, a potential approach is to connect environments and instrumental variables. Another direction is to further understand the interplay between the type of interventions, the number of environments, and the identification of causal coefficients, especially for nonlinear models. Such understanding can enable causal estimation with a minimal number of environments by methods like active learning. 3. Adding a general condition in Eq. (18) to the strong condition of Eq. (11) does not change the optima set, but it may improve the smoothness of the optimization landscape. Optimization-based Causal Estimation Acknowledgments The authors thank the Editor, Action Editors, and anonymous reviewers for helpful and constructive comments. Mingzhang Yin thanks the support from the Data Science Institute and the Irving Institute for Cancer Dynamics, Columbia University. The authors acknowledge the fundings from Warrington College of Business, Office of Naval Research under grant number N00014-23-1-2590, the National Science Foundation under Grant No. 2231174 and No. 2310831, NSF IIS-2127869, NSF DMS-2311108, ONR N000142412243, and Simons Foundation. Appendix A. Proofs In this section, we present proofs for the results in the main paper. First, we prove the causal optimality results of the proposed optimization problems. Proof [Lemma 1] Let the random vector x = (x1, , xp) denote the covariates. The expected mean square error is =E[(α x β x ϵ)2] =(α β) E[xx ](α β) 2E[(α β) xϵ] + E[ϵ2]. Since supp(α) supp(β), the (α β) x is a linear combination of the true causes as P j supp(β)(αj βj)xj which is independent of ϵ by the SEM, thus E[(α β) xϵ] = 0. Since E[xx ] is assumed to be positive definite, the unique optima of the square error is α = β. Proof [Lemma 2] Recall that Xj is a known exogenous variable with Xj ϵ. We first prove that the optima set of the modified Eq. (14) is a subset of that for the non-modified Eq. (13), but it still keeps the causal coefficient. This is because Re(α)j = 0 implies Re(α)j αj = 0, hence a minimizer of Eq. (14) must be a minimizer of Eq. (13). Next we prove that the causal coefficient β minimizes Eq. (14) to zero. It is sufficient to show that Re(β)j = 0 since we already know Re(β) β 2 = 0. It is true because for the DGP in Eq. (4), Re(β)j = E[xj ϵ] = 0. Finally, we show that the vector 0 is not an optimum of Eq. (14) almost surely when βj = 0. The zero vector minimizes Eq. (14) if and only if Re(0)j = 0. For the linear model, it requires P j S E[xe jxe j ]βj = 0 for all e E. By the independent causal mechanisms principle (Sch olkopf, 2019; Sch olkopf et al., 2021), the causal coefficients β (the mechanism) are independent of the distribution of causes x S. It means for the zero vector to be an optimum, βS has to fall in the intersection of |E| hyperplanes in R|S| which has measure zero, hence the probability is zero. Proof [Lemma 3] 3.3 shows the feasible directions U = span{ej : j S}. Therefore, the causal parameter β itself is a feasible direction with β U. The first order condition implies the optima of the Eq. (7) sets the directional derivative to zero in the direction of β, i.e., R(α), β = 0. By Lemma 1, plugging β into α produces Yin, Wang, Blei R(β), β = 0, which means β arg min α ( R(α), α )2. (32) Similarly, any partition P of the set {1, 2, , p} gives a necessary condition that admits β arg min α A P ( R(α)A, αA )2. (33) Proof [Proposition 4] The statement follows directly from the Cauchy Schwarz inequality X e E p Re(α) α 2 2 X e E ( Re(α), α )2 0 Proof [Lemma 5] By Eq. (37), the sub-vector of the gradient is R(α)H = WH(α β) s H (34) By Eq. (24), αHRH(αH) = WHH(αH βH) WHHcβHc s H. (35) By Eq. (34), suppose α is a Co Co objective optima with R(α)H = 0 and α\H = 0, α also satisfies αHRH(αH), so its nonzero elements are the ERM solution when regressing with input x H. On the other hand, suppose x H is ERM solution with input x H, α = (αH, 0) will be a Co Co objective optima by Eq. (35). The following proofs are for the identification results in main paper 5. Proof [Theorem 6] Let se j = E[Xe j ϵ] = cov(Xe j , ϵ), se = (se 1, , se p)T . By the data generating process, se j = 0 for j {1, , K}. Let ge(α) = Re(α) α] 2 , f(α) = 1 e E ge(α). (36) where f(α) is Co Co objective. Direct computation shows Re(α) = W e(α β) se (37) Notice f(α) 0 and by the structual equation model, due to independence of the exogenous noise ϵ and causes Pa(Y ), we have se β = 0. Hence for α = β, f(α ) = 0. This guarantees the existence of a solution as causal coefficient β. To prove the identification, it is sufficient to prove that for all α = α , f(α) > 0. We use proof by contradiction. Optimization-based Causal Estimation Let H = supp( α) and Hc as its compoment set in {1, 2, , p}. We assume f(α) = 0 and α = β and deduce a contradiction. Since f(α) = 0, for all e, ge(α) = 0. Since ge(α) = Re(α) α, it means Re(α)H = 0, for all e. However, by the characterization of the feasible set in Section 3.3, Assumption A2) implies that there does not exist α Rp, α = β, such that Re(α)H = 0, e E. Otherwise, the set H is an invariant set and both α and β are invariant estimations, which violates Assumption A2). Hence for α = β, there exists an environment e E with Re (α)H = 0. This yields a contradiction. Proof [Corollary 7] The claim i) is trivial. To see why the claim ii) holds, we prove its equivalent contrapositive statement. If assumption (A2) does not hold for E2, it means there exists α, Re(α)H = 0 for all e E2, which also applies to all e E1 since E1 E2. Hence the assumption (A2) does not hold for E1. Proof [Theorem 8] We prove the statement by contradiction. Using notations in Eq. (36), suppose for α , f(α ) = 0 and α C = βC. Then let set H = supp(α ) C. Since f(α ) = 0, we have Re(α )H = 0, for all e. By Eq. (24), this means W e HH(α H βH) = W e HHcβHc + se H, e (38) Denoting δ = α H βH, we have δ = 0. Then Eq. (38) contradicts with the assumption (A2 ). Proof [Corollary 9] Since the do intervention satisfies validity assumption (A1), to prove the identification of the treatment effect for the variable of interest, say xj , by Theorem 8 it is sufficient to show that it satisfies the weak effectiveness assumption (A2 ). We suppose the environments E violate assumption (A2 ), that is H {1, 2, , p}, δ R|H|, δ = 0 such that WE Hδ = θE H, and yield a contradiction. By For notation convenience, denote δσ(j) as the element of δ associated with the column of WE H that consists of the elements {W e ij; e E, i H}; here σ : H 7 {1, 2, , |H|} is a bijection by definition. Suppose E[Xj ] = 0. Consider the set of variables Sδ = {j : j H, xσ(j) = 0} {j : j Hc, βj = 0}. Since δ = 0, the set Sδ is non-empty. We consider the youngest node Xj, j Sδ that there is no direct path from Xj to any other node with index in Sδ. There are two possible cases. When j H Sδ, for e, e E with Ie = Ie = {j}, we have the linear equation W e j Hδ = θe j in the linear system WE Hδ = θE H as k H δσ(k)E[Xe j Xe k] = X t Hc βt E[Xe j Xe t ] + E[Xe j ϵe]. (39) For the do intervention Xe j ae j, we have E[Xe j Xe k] = ae j E[Xe k] = ae jµe k, and since Xj is the youngest node, Eq. (39) becomes δσ(j)ae j + X k H,k =j δσ(k)µk = X t Hc βtµt (40) Yin, Wang, Blei for environment e , we have δσ(j)ae j + X k H,k =j δσ(k)µk = X t Hc βtµt (41) Since ae j = ae j , Eqs. (40), (41) are inconsistent and yield a contradiction. When j Hc Sδ, analogously we have the linear equation W e j Hδ = θe j as k H δσ(k)Wj k = X t Hc,t =j βt Wj t + βjµj ae j (42) and equation W e j Hδ = θe j as k H δσ(k)Wj k = X t Hc,t =j βt Wj t + βjµ jae j (43) Since ae j = ae j , Eqs. (42), (43) yield a contradiction. Proof [Corollary 10] Suppose Assumption 4 does not hold, then by Eq. (25) there exists H {1, 2, , p}, C H and δ R|H|, δ = 0, such that We HHδ + We HHc( βHc) = se H for all e E. Since xi ϵ, for i C, we know se i = 0. Letting v H = δ, v Hc = βHc, we have WE CPv = WE CHv H + WE CHcv Hc = s E C = 0, (44) which cannot pass the checking step since v = 0. Proof [Proposition 11] By the construction of Λ, B = AΛ is a matrix where the j-th column B j = 0 if j / S. Similar to the proof of Lemma 1, we can compute the L2 risk as =E[(fγ(Bx) fγ (B x) ϵ)2] =E[(fγ(Bx) fγ (B x))2] 2E[((fγ(Bx) fγ (B x))ϵ] + E[ϵ2]. Due to the constraints, Bj = B j = 0, Bx ϵ, B x ϵ, therefore the second term is zero. Then the L2 risk reaches its minimum as E[ϵ2] when B = B , γ = γ . Proof [Proposition 12] Since f E(α ) = 0, e E || Re(α ) α ||2 = 0. Let π = supp(α ), we have ( Re(α ))π = 0. This means for j π, 0 = αj Re(α ) = Z pe(x, y) αj L(y, ˆy(x; α))|α=α dxdy Optimization-based Causal Estimation αj Ex,y pγ(x,y)L(y, ˆy(x; α)) αj L(y, ˆy(x; α))dxdy. Plug in α = α we have αj Rγ(α)|α=α = 0. Appendix B. A Summary of algorithms on multiple environments We summarize the properties of Co Co and several representative causal algorithms that leverage data from multiple environments in Table 5. Table 5: Comparing causal algorithms with multiple environments. Gnr. interv.: allow a general type of intervention as long as the invariance in Eq. (5) or Eq. (6) is satisfied. nl. model: has been applied to the nonlinear predictive function. scalability: computational efficiency in scaling up to high dimensional problems. uneq. variance: allow the variance of exogenous noise of the outcome to vary across environments. unm. cf.: allow unmeasured confounding between the true causes and the outcome. gnr. interv. nl. model scalability uneq. variance unm. cf. Co Co (this paper) À IRM (Arjovsky et al., 2019) À V-REx (Krueger et al., 2020) À RVP (Xie et al., 2020) À group-DRO (Sagawa et al., 2019) À ICP (Peters et al., 2016) Ä Causal Dantzig (Rothenh ausler et al., 2019) À LRE (Ghassami et al., 2017) Ä MC (Ghassami et al., 2018) Ä Appendix C. Analytic case studies This section include two concrete cases. One analytically demonstrate how optimizationbased methods can estimate causal coefficients, and compare Co Co, IRM and ERM. The other case is an instance for ineffective interventions. Yin, Wang, Blei C.1 An example of optimization-based estimation To illustrative the discussion in 4, we study ERM, IRM and Co Co on a specific example. The DGP follows the Case 1 of Table 3, xe 2 N(me 2, (γe)2) xe 1 N(me 1, (γe)2) ye 3xe 1 + 2xe 2 + N(0, 1) xe 3 γeye + N(0, (γe)2). We consider the DGP of two environments corresponding to parameters (m(1) 1 , m(1) 2 , γ(1)) = (2, 0.5, 2), (m(2) 1 , m(2) 2 , γ(2)) = (3, 1, 0.5). Consider the predictor ˆy = α xe and the risk function Re(α) = 1 2E[(ye α xe)2]. The gradient of the risk function is αRe(α) = (α1 3)E[x2 1] + (α2 2)E[x1x2] + α3E[x1x3], (α2 2)E[x2 2] + (α1 3)E[x1x2] + α3E[x2x3], α3E[x2 3] + (α1 3)E[x1x3] + (α2 2)E[x2x3] γ . And the moments are E[x1x3] = 3γ(m2 1 + γ2) + 2γ(m1m2), E[x2x3] = 3γ(m1m2) + 2γ(m2 2 + γ2), E[x2 1] = m2 1 + γ2, E[x2 2] = m2 2 + γ2, E[x1x2] = m1m2, E[y2] = 9E[x2 1] + 12E[x1x2] + 4E[x2 2] + 1, E[x2 3] = γ2(E[y2] + 1). The Co Co optima for each environment is given by solving α with the system of equations α1(α1 3)E[x2 1] + α1(α2 2)E[x1x2] + α1α3E[x1x3] = 0, α2(α2 2)E[x2 2] + α2(α1 3)E[x1x2] + α2α3E[x2x3] = 0, α2 3E[x2 3] + α3(α1 3)E[x1x3] + α3(α2 2)E[x2x3] γα3 = 0. The optima of the IRM regularization for each environment forms the quadric surface α1(α1 3)E[x2 1] + α1(α2 2)E[x1x2] + α1α3E[x1x3] + α2(α2 2)E[x2 2] + α2(α1 3)E[x1x2] + α2α3E[x2x3] + α2 3E[x2 3] + α3(α1 3)E[x1x3] + α3(α2 2)E[x2x3] γα3 = 0. Empirically, we sample 105 data points of each environment. The ERM solution for the two environments is αERM = (2.815, 1.778, 0.043). The empirical solutions of IRMv1 with various λ values are in Table 6, which is obtained with a fixed random seed. The stepsize is Optimization-based Causal Estimation λ Random initialization Idealized initialization at β 0.1 (-0.372, -0.104, 0.547) (2.923, 1.842, 0.019) 1 (-0.842, -0.390, 0.621) (2.932, 1.847, 0.017) 10 (-0.877, -0.411, 0.626) (2.933, 1.848, 0.017) 1000 (-0.830, -0.300, 0.613) (2.954, 1.896, 0.012) Table 6: The convergence points of IRMv1 for the case study. 0.01. We consider random initialization or an idealized initialization at the causal coefficient β. For λ {0.1, 1, 10}, both initializations converges within 3 104 iterations. For a large λ = 1000, the convergence is slow with 8.8 105 iterations for the random initialization and 1.7 105 iterations for the initialization at β, likely due to the weak regularization of the empirical risk and the identification issue as discussed in 4. The observation that the idealized initialization does not stay at β indicates β does not have the lowest empirical risk among the optima set of the IRM regularization. For λ = + , the optimization will stay at β for the idealized initialization since β belongs to the optima set of the IRM regularization, but such initialization is not feasible in practice. The empirical solution of Co Co is (3.001, 2.004, 0.000) which is close to the true causal coefficient (3, 2, 0). C.2 An example of the ineffective intervention Consider environments indexed by γe {1, 2, 3}, and SEM as: xe 2 N(0, (γe/2)2) xe 1 xe 2 + U( γe, γe) + 1 ye 2x1 + 1.5x2 + N(0, 1) xe 3 0.5 ye + N(0, 1). The predictor is linear as: ˆye(α) = α1xe 1 + α2xe 2 + α3xe 3. (46) Ideally, we want to identify the causal coefficients β = (2, 1.5, 0). However, in this example, a straightforward calculation shows the point ˆα = (1.6, 1.2, 0.4) minimizes the risk function E[(1/2)(ye ˆye)2] for each environment. This means Re(α)|α=ˆα = 0, and hence ˆα minimizes the objective Eq. (14). Both β and ˆα belong to the set of optima of objective (14), which cannot be distinguished under given interventions. Loosely speaking, β and ˆα are equally invariant over these three environments. Appendix D. Additional simulation results This section contains experimental results additional to 7 in the main paper. Fig. 8 contain the MAEs for the linear synthetic data ( 7.1). It shows the mean and minimal MAEs over ten random trials for two and four environments. Fig. 9 shows the weight parameters learned Yin, Wang, Blei Table 7: Test accuracy (percent) for noised label ye on the Colored MNIST. Results are reported from the original papers. The baseline methods compared with Co Co (this paper) are V-IRMG, F-IRMG (Ahuja et al., 2020), Re Bias (Bahng et al., 2020), Dec Aug (Bai et al., 2021), MM-REx (Krueger et al., 2020), PAIR (Chen et al., 2022), Fishr (Rame et al., 2022). V-IRMG F-IRMG Re Bias Dec Aug MM-REx PAIR Fishr Co Co Test accu. 49.06(3.4) 59.91(2.7) 29.40(0.3) 69.60(2.0) 66.1(1.5) 68.4(1.1) 73.8(1.0) 74.7(0.3) by Co Co and ERM on CMNIST ( 7.3), which provides explanations of why Co Co could learn a causal representation. Table 7 contains more baseline results for comparison on the Color MNIST data. Fig. 7 shows the trace plot of the predictive accuracy for the clean label y prediction on the CMNIST data ( 7.3) and for the prediction on the i Wild Cam data ( 7.4). 0 5000 10000 15000 20000 25000 30000 Iteration Accuracy(clean label) (a) Colored MNIST 250 500 750 1000 1250 1500 1750 2000 Iteration Co Co-training Co Co-testing IRM-training IRM-testing ERM-training ERM-testing (b) Wildlife Figure 7: Trace plot of training and testing accuracy for Co Co, IRM and ERM on Color MNIST and Wildlife data. In panel (a), the accuracy is measured on predicting the clean label y. Co Co has high accuracy in both training and testing environments. Optimization-based Causal Estimation (a) Two environments (b) Four environments Figure 8: The boxplot for the mean absolute error (MAE) of the estimations for causal parameters β (lower the better) over two (left) and four environments (right). Co Co estimation is close to the true causal coefficients across all settings. Co Co has a more accurate estimation comparing to RVP (Xie et al., 2020), V-REx (Krueger et al., 2020), IRM (Arjovsky et al., 2019) and ERM. Each case is run with ten independent trials. (a) Weights by Co Co (b) Weights by Co Co (stacked color channels) (c) Weights by ERM Figure 9: The weight matrix that connects the input and the first hidden layer for the CMNIST data. (a) The weights learned by Co Co set multiple columns close to zero, removing the dependence of the hidden layer on a set of pixels. (b) Stacking the weights corresponding to the two image channels (the left and right half of (a)). Co Co selects the same pixels over two color channels hence removing the color information from the hidden layer. (c) The weights by ERM connect the hidden layer to all the inputs, so the prediction depends on the spurious association. Yin, Wang, Blei Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145 155. PMLR, 2020. Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434): 444 455, 1996. Martin Arjovsky, L eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Jun-Hyun Bae, Inchul Choi, and Minho Lee. Meta-learned invariant risk minimization. ar Xiv preprint ar Xiv:2103.12947, 2021. Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning, pages 528 539. PMLR, 2020. Haoyue Bai, Rui Sun, Lanqing Hong, Fengwei Zhou, Nanyang Ye, Han-Jia Ye, S-H Gary Chan, and Zhenguo Li. Decaug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6705 6713, 2021. Sara Beery, Dan Morris, and Pietro Perona. The i Wild Cam 2019 challenge dataset. ar Xiv preprint ar Xiv:1907.07617, 2019. Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813, 2016. Philippe Brouillard, S ebastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33, 2020. Peter B uhlmann et al. Invariance, causality and robustness. Statistical Science, 35(3): 404 426, 2020. Yongqiang Chen, Kaiwen Zhou, Yatao Bian, Binghui Xie, Kaili Ma, Yonggang Zhang, Han Yang, Bo Han, and James Cheng. Pareto invariant risk minimization. ar Xiv preprint ar Xiv:2206.07766, 2022. Rune Christiansen, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. A causal framework for distribution generalization. ar Xiv e-prints, pages ar Xiv 2006, 2020. Cloudera. Causality for machine learning, 2020. URL https://ff13.fastforwardlabs.com. A Philip Dawid, Vanessa Didelez, et al. Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview. Statistics Surveys, 4:184 231, 2010. Optimization-based Causal Estimation Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255. Ieee, 2009. Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981 995, 2007. Bradley Efron. Prediction, estimation, and attribution. Journal of the American Statistical Association, 115(530):636 655, 2020. Felix Elwert and Christopher Winship. Endogenous selection bias: The problem of conditioning on a collider variable. Annual review of sociology, 40:31 53, 2014. Amir Emad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In Neural Information Processing Systems, pages 3015 3025, 2017. Amiremad Ghassami. Causal discovery beyond Markov equivalence. Ph D thesis, University of Illinois at Urbana-Champaign, 2020. Amir Emad Ghassami, Negar Kiyavash, Biwei Huang, and Kun Zhang. Multi-domain causal structure learning in linear systems. In Neural Information Processing Systems, pages 6269 6279, 2018. Gene H Golub, Per Christian Hansen, and Dianne P O Leary. Tikhonov regularization and total least squares. SIAM journal on matrix analysis and applications, 21(1):185 194, 1999. Ruocheng Guo, Pengchuan Zhang, Hao Liu, and Emre Kiciman. Out-of-distribution prediction with invariant risk minimization: The limitation and an effective fix. ar Xiv preprint ar Xiv:2101.07732, 2021. Trygve Haavelmo. The probability approach in econometrics. Econometrica: Journal of the Econometric Society, pages iii 115, 1944. Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. Machine Learning, 110(2):303 348, 2021. Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018. Yin, Wang, Blei Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning, pages 2901 2910. PMLR, 2019. Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery from multiple data sets with non-identical variable sets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10153 10161, 2020. Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015. Pritish Kamath, Akilesh Tangella, Danica J Sutherland, and Nathan Srebro. Does invariant risk minimization capture invariance? ar Xiv preprint ar Xiv:2101.01134, 2021. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). ar Xiv preprint ar Xiv:2003.00688, 2020. Chaochao Lu, Yuhuai Wu, Jo se Miguel Hern andez-Lobato, and Bernhard Sch olkopf. Nonlinear invariant risk minimization: A causal approach. ar Xiv preprint ar Xiv:2102.12353, 2021. Jorge A Marban. Directional derivatives in classical optimization. Ph D thesis, University of Florida, 1969. Joris M Mooij, Sara Magliacane, and Tom Claassen. Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21(99):1 108, 2020. Jens M uller, Robert Schmier, Lynton Ardizzone, Carsten Rother, and Ullrich K othe. Learning robust models using the principle of independent causal mechanisms. ar Xiv preprint ar Xiv:2010.07167, 2020. J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009. Jonas Peters and Peter B uhlmann. Identifiability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219 228, 2014. Jonas Peters, Peter B uhlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947 1012, 2016. Niklas Pfister, Peter B uhlmann, and Jonas Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264 1276, 2019. Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pages 18347 18377. PMLR, 2022. Optimization-based Causal Estimation Mateo Rojas-Carulla, Bernhard Sch olkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1):1309 1342, 2018. Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. ar Xiv preprint ar Xiv:2010.05761, 2020. Dominik Rothenh ausler, Peter B uhlmann, and Nicolai Meinshausen. Causal dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. The Annals of Statistics, 47(3):1688 1722, 2019. Dominik Rothenh ausler, Nicolai Meinshausen, Peter B uhlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):215 246, 2021. Walter Rudin et al. Principles of mathematical analysis, volume 3. Mc Graw-hill New York, 1964. Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019. Bernhard Sch olkopf. Causality for machine learning. ar Xiv preprint ar Xiv:1911.10500, 2019. Bernhard Sch olkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. ar Xiv preprint ar Xiv:1206.6471, 2012. Bernhard Sch olkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612 634, 2021. Claudia Shi, Victor Veitch, and David Blei. Invariant representation learning for treatment effect estimation. ar Xiv preprint ar Xiv:2011.12379, 2020. G. Shmueli et al. To explain or to predict? Statistical Science, 25(3):289 310, 2010. Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. Springer, 2000. Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 512 521, 2001. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996. Julia K Winkler, Christine Fink, Ferdinand Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof, Luc Thomas, Aimilios Lallas, Andreas Blum, Wilhelm Stolz, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA dermatology, 155(10):1135 1141, 2019. Yin, Wang, Blei Sewall Wright. Correlation and causation. Journal of agricultural research, 20:557 580, 1921. Chuanlong Xie, Fei Chen, Yue Liu, and Zhenguo Li. Risk variance penalization: From distributional robustness to causality. ar Xiv preprint ar Xiv:2006.07544, 2020. Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic Best Subset Selection by Gradient-Based Optimization. ar Xiv e-prints, 2020. Kui Yu, Lin Liu, and Jiuyong Li. Learning markov blankets from multiple interventional data sets. IEEE transactions on neural networks and learning systems, 31(6):2005 2019, 2019a. Kui Yu, Lin Liu, Jiuyong Li, Wei Ding, and Thuc Duy Le. Multi-source causal feature selection. IEEE transactions on pattern analysis and machine intelligence, 42(9):2240 2256, 2019b. Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In International Conference on Machine Learning, pages 11214 11224. PMLR, 2020. Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541 2563, 2006.