# optimizationbased_causal_estimation_from_heterogeneous_environments__c211b4a1.pdf

Journal of Machine Learning Research 25 (2024) 1-44 Submitted 8/21; Revised 12/22; Published 4/24

Optimization-based Causal Estimation from Heterogeneous Environments

Mingzhang Yin mingzhang.yin@warrington.ufl.edu Warrington College of Business University of Florida Gainesville, FL, 32611, USA

Yixin Wang yixinw@umich.edu Department of Statistics University of Michigan Ann Arbor, MI, 48109, USA

David M. Blei david.blei@columbia.edu Department of Computer Science and Department of Statistics Columbia University New York, NY, 10027, USA

Editor: Pradeep Ravikumar

This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association with the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes Co Co, an optimization algorithm that bridges the gap between pure prediction and causal inference. Co Co leverages the recently-proposed idea of environments (Peters et al., 2016; Arjovsky et al., 2019), datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments and ones that exhibit suﬃcient heterogeneity Co Co maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its eﬀectiveness on simulated and real datasets. Compared to classical ML and existing methods, Co Co provides more accurate estimates of the causal model and more accurate predictions under interventions. Keywords: Causal estimation, Robust prediction, Constrained optimization, Directional derivative, Interventional data

1. Introduction

We consider the problem posed in Peters et al. (2016). An outcome y is generated according to a linear structural equation model (SEM) based on a set of covariates x (Pearl, 2009),

y (β ) x + ε, x ε. (1)

c 2024 Mingzhang Yin, Yixin Wang, David M. Blei.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/21-1028.html.

Yin, Wang, Blei

Here, x are the causal parents of the outcome y, which are independent of the unobserved noise ε, and the causal coeﬃcient β represents the direct causal eﬀects of x . In practice, we might measure a potentially large set of covariates x = {x1, , xp}, such that the causes x are a subset of x. More precisely, there is a subset S {1, 2, , p} with x = x S, but we do not know this set of causes S. Our goal is to infer the causal relationship between each covariate xj and the outcome in the data generating process (DGP), and estimate their direct causal eﬀects. The challenge to solving this problem is spurious association. Though the causes x S are assumed to be exogenous, the other observed covariates might depend on the noise term with x\S ε. The dependency may arise because of confounding, where there is an omitted common cause of y and x\S, i.e., y ϵ x\S; the dependency may also arise from a collider, reverse causality, or selection bias, e.g., y x\S or y x1 x2, x1, x2 x\S. It leads to spurious association because the covariates x\S will not be the genuine causal parents of the outcome in the DGP, even though they correlate with the outcome. In the face of spurious association, typical inference methods, such as regression based on empirical risk minimization (ERM) or maximum likelihood estimation (MLE), will lead to a biased estimate of the causal eﬀect and an incorrect interpretation of the causal relationships (Angrist et al., 1996; Peters et al., 2016; Efron, 2020). From the predictive perspective, a model that captures spurious association will not generalize to non-i.i.d. data under perturbations of the DGP (Arjovsky et al., 2019; Rothenh ausler et al., 2021). We will develop constrained causal optimization (Co Co), an optimization-based method to solve the spurious association problem. The key idea behind Co Co is to leverage datasets from multiple environments (Peters et al., 2016). The environments are a set of heterogeneous DGPs. In each, the causal mechanism of the outcome generation remains invariant but the distribution of the covariates changes from environment to environment. While classical ML methods cannot distinguish the direct causes from spurious association, simultaneously analyzing data from multiple environments allows us to triangulate on the correct causal coeﬃcients. This work builds on recent research about multi-environment estimation, beginning with the seminal work of Peters et al. (2016) and continuing with the risk-based algorithms of Arjovsky et al. (2019). The method developed in this paper builds and improves on the risk-based algorithms. We will describe the method here and derive it in the subsequent sections. To begin, consider a single data-generating distribution p(x, y) = p(x)p(y | x). For instance, the conditional distribution of the outcome could come from Eq. (1). Consider a predictor ˆy(x; α) with coeﬃcients α and deﬁne a non-negative loss function to measure the ﬁdelity of a prediction, ℓ: Y Y R (e.g., squared loss). Finally, deﬁne the risk of the predictor to be the expectation of the loss relative to the data-generating distribution,

R(α) = Ex,y p(x,y)[ℓ(ˆy(x; α), y)]. (2)

Classical machine learning (ML) provides methods that analyze data from p(x, y) to minimize the risk. The pure prediction methods leverage all information in data, causal and non-causal, to reduce the predictive error quantiﬁed by the risk function. The spurious associations improve predictions, but they bias the resulting estimates of the coeﬃcients away from the true causal coeﬃcients.

Optimization-based Causal Estimation

We consider distributions of data from a set of environments E. The data from environment e form a joint distribution pe(x, y) = pe(x)p(y | x). In this joint, the distribution of covariates pe(x) changes from environment to environment, but the conditional expectation of the outcome given its causal parents E(y | Pa(y)) is the same across environments, governed by the same SEM. Notice each environment is associated with its own risk Re(α), since the risk is an expectation with respect to the per-environment distribution. Given a set of datasets from multiple environments, the Co Co algorithm solves the following optimization,

αcoco = arg min α 1 |E|

Re(α) α 2 , (3)

where is the Hadamard product and each environment s risk Re(α) can be approximated by its empirical estimate. As we describe below, this objective stems from the idea of the directional derivative (Rudin et al., 1964; Marban, 1969), and its role in the ﬁrst-order conditions for the optimizer of each environment s risk function. Under invariance assumptions, the solution to Eq. (3) is the intersection across environments of all points that minimize each term Re(α) α 2. When the environments are suﬃciently heterogenous that is, when there is enough variety in diﬀerent pe(x) and outcome noise this optimization is solved by the causal coeﬃcients.

Eq. (3) is the basic optimization problem behind Co Co. The rest of this paper presents its theoretical foundations, the assumptions under which its solution is the true causal coeﬃcients, algorithms building on and solving Eq. (3) from multi-environment data, and studies about the performance of Co Co on several simulated and real datasets, both with linear predictors and with nonlinear predictors. When compared to classical ML and Invariant Risk Minimization (IRM) (Arjovsky et al., 2019)-related methods, Co Co improves the estimation accuracy of causal coeﬃcients and the predictive accuracy in new environments. Broadly, Co Co represents progress towards multi-environment optimization for causal estimation, and it helps explain the empirical the success of IRM when data is linear-Gaussian or linear-Bernoulli. Compared to IRM, Co Co requires fewer environments to identify causal coeﬃcients and enjoys a more stable training procedure. Practically, Co Co is compatible with general graph structures, can be used with ﬂexible ML tools such as deep neural networks, scales well to high-dimensional problems, and is easy to implement. The paper is organized as follows. 2 situates this paper in the broader landscape of the research literature on multi-environment analysis. 3 describes the methodology and theoretical basis for Co Co. 4 presents the connections between Co Co and IRM in mathematical detail. 5 studies the identiﬁcation properties in detail, particularly in the setting of a linear SEM, and 6 extends Co Co to nonlinear models. 7 presents a study of Co Co with synthetic, semi-synthetic and real-world data.

2. Related work

Formulating the goals of data analysis, and their relationship to optimization, has long been discussed in the research literature in statistics and machine learning. As early as Wright (1921), researchers recognized that the goals of prediction and estimation are not

Yin, Wang, Blei

always aligned in optimization. Shmueli et al. (2010) and Efron (2020) provide insightful discussions about the diﬀerences between prediction problems, association problems, and causal estimation problems. This paper builds on the body of research around causal estimation with multiple environments. Methodologies developed in this area rest on the invariance property of causality, also known as autonomy (Haavelmo, 1944), modularity (Sch olkopf et al., 2012) and stability (Dawid et al., 2010). In a pioneering work, Peters et al. (2016) uses statistical hypothesis testing to estimate causal structures by exploiting invariance across multiple environments. Subsequent research extends this idea, for example, to nonlinear models (Heinze-Deml et al., 2018) and sequential data (Pﬁster et al., 2019). See B uhlmann et al. (2020) for a comprehensive review. To improve its ﬂexibility and scalability, researchers have begun to use optimization over multi-environment data for causal estimation (Rothenh ausler et al., 2019, 2021). One inﬂuential paper along these lines is Arjovsky et al. (2019), which introduces invariant risk minimization (IRM) as a method that adapts modern predictive models to this task. The objective function of IRM includes an additional penalty term to empirical risk function, which encourages a predictor to be invariant across environments. The work presented here provides a contribution to optimization-based causal estimation. Since its introduction, IRM has been extended in several ways. It has been formulated as game theory problem (Ahuja et al., 2020), combined with meta-learning methods (Bae et al., 2021), applied to reinforcement learning (Zhang et al., 2020), and applied to causal inference (Shi et al., 2020; Lu et al., 2021). Its conditions and limitations have been studied (Rosenfeld et al., 2020; Kamath et al., 2021; Guo et al., 2021). Besides IRM, other objectives also aim to improve predictive accuracy by encouraging invariance of empirical risks across environments. These objectives are often built on equal noise variance assumption, which is a strong version of invariance (Peters and B uhlmann, 2014). Some objectives regularize the variance of the empirical risks (Xie et al., 2020; Krueger et al., 2020; Heinze-Deml and Meinshausen, 2021) and control the worst case risk across training environments (Sagawa et al., 2019). When the condition of equal noise variance is met, such regularizations can be combined with Co Co. The idea of invariance and environments has also been adapted to causal discovery (Tian and Pearl, 2001; Yu et al., 2019a,b; Brouillard et al., 2020; Mooij et al., 2020; M uller et al., 2020). Invariance enables the discovery of causal structures within Markov equivalence classes (Ghassami, 2020), which cannot be reconstructed from traditional single environment data (Spirtes et al., 2000). Nevertheless, existing methods might encounter problems of limited model ﬂexibility and high computational cost. For example, some methods rely on linear data generating process (Ghassami et al., 2018; Huang et al., 2019, 2020), require regression over multiple subsets of covariates (Ghassami et al., 2017), and/or involve multiple independence testings (Ghassami et al., 2017; Huang et al., 2020; Brouillard et al., 2020). Co Co provides an optimization-based method to assist causal discovery by identifying direct causes for observed variables. Lastly, Co Co is loosely related to the variable selection literature (Hastie et al., 2009). It can be viewed as selecting causal variables or learning a causal representation from the observed covariates, but the motivation and the targeted problems are diﬀerent. Variable selection methods, such as LASSO (Tibshirani, 1996) and best subset selection (Bertsimas

Optimization-based Causal Estimation

et al., 2016; Yin et al., 2020), focus on the situations where the number of data points is small compared to the number of the observed covariates. The model selection under the traditional variable selection setting can generally recover the true sparsity pattern asymptotically (Zhao and Yu, 2006). In contrast, the estimation bias problem due to the spurious association considered in this paper cannot be solved by increasing the number of data points even to inﬁnity. Unlike classic regularized regression with the regularizers often imposing shrinkage on the parameter space (Golub et al., 1999), the proposed method imposes regularizations in the derivative space with the restricted derivative directions.

3. Causal optimization from heterogeneous environments

We discuss Co Co, a method that estimates causal eﬀects via optimization. In 3.1, we set up the problem and assumptions. In 3.2, we introduce an idealized optimization objective, which produces causal coeﬃcients as the solution, but is intractable. In 3.3, we derive a relaxed objective function; it is tractable with observable data but contains extra solutions besides the causal coeﬃcients. In 3.4, we aggregate the relaxed objective over multiple environments to whittle down the set of optima to the causal coeﬃcients.

3.1 Setup and assumptions

Consider an observed multi-environment dataset. Denote E as a discrete set of environments. Each environment e E speciﬁes a DGP similar to Eq. (1),

ye β xe + ϵe, xe pe(xe 1, , xe p). (4)

We absorb the intercept term into xe and β and do not write it explicitly. Comparing to

Eq. (1), here βS = β , βj = 0 for j S, and β\S = 0 where S {1, 2, , p}. It means only xe S = x are the true causes for the set S. The SEM in Eq. (4) includes all the observed covariates to reﬂect the fact that we do not know the set of causes S a priori. We overload the notation to denote β as the causal coeﬃcient and call the set S as the causal structure. There could be potential endogeneity with x ϵ. For each environment, the observed data De = (Xe, Ye) consists of ne i.i.d. data points, where Ye Rne are the outcomes, and Xe = [Xe 1, , Xe p] Rne p are the covariates. Each column Xe j Rne, j {1, 2, , p}, contains the observations of the j-th covariate for ne units.

Assumptions for each environment. First, we specify the assumptions for the data in each environment. For notational simplicity, we suppress the superscript e for now and state the assumptions for any environment e E. The support set and magnitude of the causal coeﬃcients β are unknown. For the covariates x, we do not specify the DGP and allow an arbitrary functional form of the joint distribution p(x). We will specify the additional assumptions on the joint distribution for identiﬁcation in 5. We assume the noise ϵ is zero-mean, and the covariates and noise have ﬁnite variance. The noise term is assumed to be independent of the observed direct causes x S. To summarize, the assumptions for each environment are

Assumption 1 (i) Linear DGP as Eq. (4); (ii) Moment conditions: E[ϵ] = 0, Var[ϵ], Var[xj] < for all j {1, 2, , p}; (iii) Independence: the observed direct causes x S ϵ.

Yin, Wang, Blei

Remark. Assumption 1 (i) assumes linearity; we will study the nonlinear causal models in 6. Assumption 1 (ii) is a standard regularity assumption. Assumption 1 (iii) assumes x S ϵ, where the noise ϵ incorporates the unobserved causes of y as long as they are independent of the observed direct causes x S. The noise and the observed covariates x\S, however, can be dependent, i.e., x\S ϵ. Hence, Assumption 1 (iii) does not imply that the covariates x as a whole satisfy the back-door criterion (Pearl, 2009). Accordingly, standard inference strategies, such as regression adjustment for all covariates, may produce biased causal estimates (Elwert and Winship, 2014). Our goal is to develop a method that can automatically identify the set S as the causal structure, estimate the causal coeﬃcient β, and make prediction based on the causes of the outcome.

Assumption 1 (iii) distinguishes the setting of this paper from that of instrumental variables (IV) which allows the dependency between x S and ϵ (Angrist et al., 1996). However, we allow the index of endogenous covariates in x to be unknown; such information is usually required by the IV methods such as the two-stage least square. Assumption 1 (iii) is a common assumption in the invariance-based causal inference (Arjovsky et al., 2019, Theorem 9) (Rojas-Carulla et al., 2018; Pﬁster et al., 2019; Krueger et al., 2020), and causal discovery literature (Peters et al., 2016, Assumption 1)(Ghassami et al., 2017; Yu et al., 2019a; Brouillard et al., 2020).

Assumptions across environments. Now, we consider the assumption across environments. The key property of causality we exploit from multi-environment data is the invariance (Peters et al., 2016; Arjovsky et al., 2019). Invariance means that conditional on the same value of the direct causes, the expectation of the outcome is the same across environments.

Assumption 2 (Invariance) The index set of direct causes of ye is the same across environments. And given a possible value c of the direct causes,

E[ye|Pa(ye) = c] = E[ye |Pa(ye ) = c], (5)

for all e, e E.

The invariance condition in Eq. (5) is weaker than that in Peters et al. (2016), which requires strong invariance over the conditional outcome distribution

p(ye|Pa(ye) = c) = p(ye |Pa(ye ) = c). (6)

When the distributions of covariates xe and noise ϵe change across environments, we call environments E heterogeneous. The heterogeneity will be important to the method we derive.

3.2 Idealized causal optimization

We start by considering the data set of a single environment, such as one that is generated by Eq. (1). Suppose for the moment that we do not know the causal coeﬃcients β, but we do know which covariates are direct causes of the outcome, i.e., the set S. A key observation is that among the models that share the true causal structure (i.e. the support set of coeﬃcients as S), the causal model (i.e. the model with causal coeﬃcients) is the best predictive model. This observation, though conceptually straightforward, makes the goals of

Optimization-based Causal Estimation

causal estimation and optimization consistent. We can then obtain the causal coeﬃcients by solving a constrained optimization problem.

Lemma 1 (Causal Optimality) Under Assumptions 1, for the squared risk function R(α) = E[(1/2)(ˆy(x; α) y)2], and linear predictor ˆy(x; α) = α x, the following constrained optimization problem

s.t. αj = 0, j / S (7)

has the causal coeﬃcients α = β as the unique solution.

The proof is in Appendix A.

Lemma 1 is the ideal situation where the set of causes S in Eq. (7) is known. Of course, in practice, we do not know which covariates are causal and which are not. This paper aims to build on this idealized optimization problem in Eq. (7) to construct a tractable objective for causal estimation. We ﬁrst introduce a tractable objective with observed data. Then we aggregate this objective over multiple environments to isolate the causal coeﬃcients.

3.3 Derivation of a tractable objective

In this section, we derive an optimization objective for causal estimation. It only involves the observable data and is a relaxation of the idealized optimization in Eq. (7). In this relaxation, the causal coeﬃcient β is one of the optima, though it is not the only one. We will ﬁrst review directional derivatives and feasible directions, which are used to characterize the extreme points of the idealized optimization in Eq. (7). We then relax the optimization problem to a practical one and characterize its extreme points.

Directional derivatives and feasible directions. The directional derivative Dv in the direction of a unit-length vector v is deﬁned as the change rate of a function in that direction (Rudin et al., 1964). It can be computed as the inner product of the gradient and the direction vector Dv R(α) := limt 0(R(α + tv) R(α))/t = R(α), v , where , denotes an inner product. Directional derivatives provide the ﬁrst-order condition for the optima of a constrained optimization. Denote the constraints in Eq. (7) as gj(α) = αj = 0 for j / S. (Recall S is the support set of β, the indices of the non-zero causal coeﬃcients.) The points that satisfy these constraints form a surface in Rp. To ﬁnd a stationary point while obeying to the constraints, an algorithm should search the parameters in the directions tangent to this surface rather than an arbitrary direction. Hence, the tangent directions are called the feasible direction. The ﬁrst-order condition for optimality is that the directional derivative in all feasible directions vanishes (Marban, 1969). Intuitively, satisfying the condition means the algorithm cannot improve the objective value, at least locally, without violating the constraints.

Optimality condition for causal optimization. We derive the feasible directions for the optimization problem in Eq. (7). We then derive an important property of the causal coeﬃcients from the ﬁrst-order condition of the directional derivative.

Yin, Wang, Blei

Geometrically, the feasible directions are perpendicular to the normal vectors of the surface deﬁned by the constraints. The normal vectors point in the directions that violate the constraints at the maximum rate, which can be computed by taking the gradient of the constraints

dgj(α)/dα = ej, j / S. (8)

The feasible directions are perpendicular to the vector space spanned by the basis vectors in Eq. (8). Therefore, all the feasible directions form a linear space U = span{ej : j S}. The ﬁrst-order condition for the causal optimization in Eq. (7) requires the directional derivative vanishing in all the feasible directions, that is, Dv = 0 for each v U. This condition can be equivalently stated using the basis vectors of U as

Dej R(α) = R(α), ej = 0, for j S. (9)

This is because any v U is a linear combination of the basis {ej}j S. More compactly, since βj = 0 for j S, Eq. (9) can be written equivalently as

R(α) β 2 = 0 (10)

where is the Hadamard product. Traditional empirical risk minimization (ERM) searches for α with R(α) = 0, or equivalently R(α) 1 2 = 0. A problem of the gradient-based ERM is its greedy nature: the optimization follows a derivative pointing at the steepest descent direction. Such greedy optimization may leverage the spurious associations and fails to converge to the causal coeﬃcient. In comparison, Eq. (10) reveals that the key to establish the optimality for the causal model is to regularize the derivative s direction by restricting it to the subspace U. The regularization of the derivative direction distinguishes Co Co from the classic regression regularizations that often control the parameter magnitudes. The ﬁrst order condition in Eq. (10) is an explicit condition that the causal coeﬃcients β satisfy. That s to say, solving minα R(α) β 2 would return the causal coeﬃcients for convex risk functions. However, there is no free lunch. Similar to the idealized causal optimization in Eq. (7) with an unknown set S, the condition in Eq. (10) includes β, the unknown causal coeﬃcients. Hence Eq. (10) is not a tractable objective yet.

Relaxing the ﬁrst-order condition. Lemma 1 states that α = β is an optimum of the problem in Eq. (7). Therefore, the vector β must satisfy the ﬁrst-order condition of Eq. (10). Plugging α = β into Eq. (10), we have R(β) β 2 = 0. This fact reveals that the causal coeﬃcient β is an optimum of the following optimization problem,

min α R(α) α 2 . (11)

Notice that Eq. (11) is an optimization problem based entirely on the observational data. It states that the causal coeﬃcient β satisﬁes

β arg min α R(α) α 2 . (12)

Optimization-based Causal Estimation

Eq. (11) is an important step towards an optimization-based causal estimation. The derivation based on the directional derivative theory also reveals an intrinsic connection and distinction between the Co Co objective and the invariance risk minimization (IRM) objective (Arjovsky et al., 2019), which we will discuss in 4. We call the set of points that minimizes Eq. (11) the plausible set F. We have shown that causal coeﬃcients belong to it. What other points are in F? For each j {1, 2, . . . , p} the objective of Eq. (11) is minimized when either αj = 0 or R(α)j = 0. Accordingly, two additional points are the all-zero vector 0 and the ERM solution for which R(α) = 0. Finally, the plausible set contains the points in-between these solutions, those that set the parameters to zero at a subset of indices, and set the elements of the gradient vector to zero at the remaining indices. For convex risk functions, there are at most 2p such solutions, one for each subset; note that the causal coeﬃcient β is one of them. In summary, we began with a constrained optimization in Eq. (7). Its single optimum is the causal coeﬃcient, but it requires the knowledge of which covariates are causal. We relaxed that optimization to Eq. (11), which only relies on the observable data. The causal coeﬃcient remains to be an optimum of Eq. (11), but there are other solutions too, those between (and including) the 0-vector and the ERM solution. Hence, solving Eq. (11) alone does not identify the causal coeﬃcient yet. In the next section, we will restore identiﬁability by appealing to the invariance property of causality under interventions (Peters et al., 2016; Arjovsky et al., 2019; Sch olkopf et al., 2021). With data from multiple environments each one coming from a diﬀerent intervention on the DGP we can deﬁne an optimization problem that whittles down the plausible set to one that only contains the causal coeﬃcient.

3.4 Optimization with multiple environments

In this section, we leverage multi-environment data to restore the uniqueness of the causal coeﬃcient as the solution of an optimization problem.

Narrowing down the optima set by environments. As discussed in 3.3, the causal coeﬃcient is nonidentiﬁable by optimizing Eq. (11) with i.i.d. data. We turn to the setting where we observe data from multiple environments to achieve identiﬁability. We assume the invariance property as in Eq. (5) across environments and that the environments are heterogeneous. Heterogeneous environments can be constructed by (hard) interventions which actively ﬁx a variable at a speciﬁc value during the data generation. They can also be constructed by (soft) interventions where the changes in the DGP are passively observed rather than actively executed (Eberhardt and Scheines, 2007). For example, the heterogeneity might come from varied physical factors such as space and time. When studying the eﬀect of health measurements on the chance of cancer, the environments can be diﬀerent hospitals from which the data are collected (Winkler et al., 2019). Consider the relaxed causal optimization problem in Eq. (11). Due to the invariance of the conditional E(Y | Pa(Y )), for each environment, its feasible set include the same causal coeﬃcient β. But the other optima that utilize the spurious associations will be diﬀerent across environments because the joint distribution of the covariates pe(x) is not invariant.

Yin, Wang, Blei

(a) IRM optima (Env.#1)

(b) IRM optima (Env.#2)

(c) IRM optima (Two envs.)

(d) Co Co optima (Two envs.)

(e) IRM & Co Co optima (Two envs.)

Figure 1: Geometry of the analytic optima sets for the IRM regularization and Co Co objectives in the 3D space. The causal coeﬃcient is β = (3, 2, 0). (a),(b): the optima of the IRM regularization for each environment form a 3D quadric surface; (c): the optima of the IRM regularization with two environments is the intersection of the two surfaces, which forms two elliptic curves; (d): The optima of Co Co objective is a discrete ﬁnite set for each environment. The Co Co optima over two environments is the intersection consisting of the zero point and the causal coeﬃcient (the overlap of the black triangle and blue star points); (e): The top view of the IRM regularization and Co Co optima for the two environments. The optima set by Co Co (α {(0, 0, 0), (3, 2, 0)}) is a strict subset of that by the IRM regularization (the dashed orange elliptic curves). Better viewed in color.

Optimization-based Causal Estimation

Algorithm 1 Co Co with known exogenous variables

input : Data De = {Ye, Xe}, Xe Rne p; the risk function Re for each environment e E; the set of known non-descendant variables C; the predictor f( ). output : Coeﬃcient estimation α with causal interpretation. Initialize α randomly while not converged do

for e in E do

Compute the gradient of the empirical risk:

i=1 Re(α; ye i , ˆye i ), ˆye i = f(xe i; α)

Set α = α (1 1C) + 1C Compute the optimization objective:

Le(α) = ge(α) α 2

end Update α α η

e E Le(α) with step size η end

The invariance property motivates us to aggregate the optimization problems across environments,

min α f E(α) := 1

Re(α) α 2 . (13)

Eq. (13) is the Co Co objective. Denote the optima of each single environment objective as Fe := arg minα Re(α) α 2. The optima of the Co Co objective Eq. (13) consist of the intersection of all Fes,

FE := arg min α f E(α) = \

as long as the intersection is not empty. Assumption 1,2 guarantee the nonemptiness because the causal coeﬃcient β Fe for all e. Because FE is expressed as an intersection, its size shrinks with an increasing number of environments, i.e., |FE1| |FE2| if E2 E1. The multiple environments and heterogeneity therein induce diﬀerences between the optima sets of each environment and, as a result, narrow down the optima set FE of the Co Co objective. For instance, the sets Fe and FE

are visualized with an example DGP in Fig. 1. Fig. 1 (d) illustrates how two environments and the intersection of single-environment optima sets help with the identiﬁcation of the Co Co objective.

Removing the non-informative solution from the optima set. While the environments help remove the points except for the causal coeﬃcient from the optima of Co Co objective, the zero vector remains a solution. We propose two modiﬁcations of the objective to avoid the zero vector being an optimum.

Yin, Wang, Blei

We ﬁrst consider the scenario when certain covariate Xj is known as an exogenous variable independent of the random noise of the outcome, i.e., Xj ϵ. Under Assumption 1, any ancestor variable of the outcome is a proper exogenous variable. With such information, we can modify the Co Co objective Eq. (13) to be

min α 1 |E|

e E Re(α) α] 2 , (14)

where αj = 1 and αj = αj for j = j . In other words, the elements in α corresponding to the known exogenous variables are ﬁxed as one. The following lemma demonstrates the properties of Eq. (14) s optima set.

Lemma 2 The optima set of Eq. (14) contains the causal coeﬃcient and is a subset of Eq. (13) s optima set. If βj = 0, The vector 0 is not an optimum of Eq. (14) almost surely.

The proof of Lemma 2 is in Appendix A. When there is more than one known exogenous variable, we generalize α as α = α (1 1C) + 1C where C is the set of known exogenous variables. It further reduces the number of non-causal optima for Eq. (14). The algorithm is summarized in Alg. 1. We will discuss what conditions guarantee its output to be the causal coeﬃcients in 5 when the environments are suﬃciently heterogeneous. We ﬁnd in theory ( 5) and simulation ( 7) that the optima set of Eq. (14) can shrink to the causal coeﬃcient. In the scenarios when no variables are known as exogenous, an alternative method is to use the risk function as an additive regularization of Eq. (13),

min α 1 |E|

Re(α) α 2 + λr Re(α) , (15)

where λr 0 controls the regularization strength. Note that when there is only a single environment |E| = 1, the ERM solution minimizes both terms in Eq. (15), so the minimizers of Eq. (15) are identical to the minimizers of the ERM. The mechanism behind the regularizer in Eq. (15) is that, though the causal coeﬃcient and the 0 vector both minimize Eq. (13), the risk of the 0 vector is higher than that of the causal coeﬃcient. So the risk function introduces inductive bias to disfavor the 0 vector. To see this, for the DGP in Eq. (4) and the squared risk, under Assumption 1, the risk Re(0) = β E[xx ]β + E[ϵ2] + 2E[x βϵ], and E[x βϵ] = E[x S βSϵ] = 0 since β\S = 0 and x S ϵ; and we have Re(β) = E[ϵ2]. Thus, Re(0) Re(β), which means the risk regularizer in Eq. (15) would encourage the convergence to the causal coeﬃcient β rather than the 0 vector. Optimizing Eq. (15) with the risk regularizer yields a predictive model that is robust to perturbations of spurious associations. However, the risk regularizer may introduce potential estimation bias when the focus is causal estimation. In practice, this bias can be reduced by annealing the strength parameter λr during the optimization. We summarize the complete algorithm in Alg. 2.

4. Connections to invariant risk minimization

Arjovsky et al. (2019) introduces IRM that can learn robust representation in the presence of spurious associations between the covariates and outcome. In particular, IRM considers a

Optimization-based Causal Estimation

predictor f(x; α) : Rp 7 R with parameter α. In a setting similar to Co Co , it considers a set of heterogeneous environments E and a risk function Re(α; ˆy) for each e E (for notational clarity, we explicitly write the prediction ˆy in the notation of the risk function). Based on the intuition that invariant predictor induces invariant features, IRM proposes the following objective to ﬁnd an invariant model

e E Re(α; w(f(xe i; α))) (16)

s.t. w arg min w Re(α; w(f(xe i; α))), for all e E,

where w( ) is a mapping from the range of f( ) to ˆy. For tractable computation, Arjovsky et al. (2019) further introduces the IRMv1 objective:

h Re(α; f(xe i; α)) | {z } Empirical risk

+λ w|w=1.0Re(α; w f(xe i; α)) 2 | {z } IRM regularization

where λ > 0 and w is simpliﬁed as a dummy scalar variable. The IRMv1 objective in Eq. (17) consists of an empirical risk term and an IRM regularization term that encourages invariance. Since the empirical risk does not produce the correct causal estimate, we focus on analyzing the IRM regularization term.

Analytic and geometric connections. The IRM regularization and Co Co are connected by the constrained optimization problem in Eq. (7). Speciﬁcally, the IRM regularization can be derived from optimization in Eq. (7) for several common models. In 3.3, we use the directional derivative to obtain the set of feasible directions U = span{ej : j S} of the idealized causal optimization and the ﬁrst-order optimality condition in Eq. (10). An interesting observation is that the causal parameter itself is a feasible direction with β U. Since β is a feasible direction and is also an optimum of the optimization in Eq. (7), Lemma 1 implies that the directional derivative with direction β vanishes at the causal coeﬃcient β, which leads to the following lemma.

Lemma 3 For any partition P of the set {1, 2, , p}, the optimality of the causal coeﬃcient β is an optimum of

A P ( R(α)A, αA )2. (18)

Speciﬁcally, β is an optimum of

min α ( R(α), α )2. (19)

Lemma 3 gives the general form of optimization that admits causal optimality. The proof of Lemma 3 is in Appendix A. When the outcome model is Linear-Gaussian or Linear-Bernoulli, minimizing the IRM regularization is equivalent to solving Eq. (19). To see this, suppose a linear DGP as in Eq. (4), linear predictor ˆye = α xe, and a squared risk function, then w|w=1.0Re(α; wα xe) 2 = (E[(ye ˆye)α xe])2

= ( Re(α; ˆy), α )2. (20)

Yin, Wang, Blei

ERM IRM regularization Co Co

minα αRe(α) 1 2 minα( αRe(α) α)2 minα αRe(α) α 2

Table 1: Comparison of the ERM (by gradient methods), IRM regularization (for the linear Gaussian and Bernoulli models), and Co Co objectives for one environment.

The left side of Eq. (20) is the invariant regularization term and the right side is Eq. (19). Similarly, suppose the outcome is generated by ye Bernoulli(σ(β xe)), the predictor is ˆye = σ(α xe) where σ(x) = 1/(1+exp( x)) is the sigmoid function, and the risk function is the cross entropy loss Re(α; ˆye) = E[ye log(ˆye) + (1 ye) log(1 ˆye)]. Then1

w|w=1.0Re(α; σ(wα xe)) 2 = (E[(ˆye ye)α xe])2

= ( Re(α; ˆy), α )2. (21)

The comparison is summarized in Table 1. The connections in Eqs. (19) to (21) reveal that IRM regularization implicitly imposes a ﬁrst order condition based on the directional derivative with the direction α, which oﬀers an explanation to the mechanism behind IRMv1. Lemma 3 and Eqs. (19) to (21) can prove that, for linear Gaussian or Bernoulli models, the causal coeﬃcient β is a minimizer of the IRM regularization for one environment. The data from multiple environments narrow down the optima set to the causal coeﬃcient under the invariance assumption. The additional empirical risk term in the IRMv1 objective plays the role of further regularization by encouraging the solutions with high predictive performance. The connection also illustrates the sub-optimality of the IRM regularization. Geometrically, the IRM regularization, rewritten as the inner product between the gradient and parameter vectors, only considers a single feasible direction β for the constrained optimization problem Eq. (7). In contrast, Co Co ﬁnds the optimum across all the feasible directions that form a (p |S|) dimensional space U. The spectrum in Eq. (18) shows that the objective of Co Co corresponds to the ﬁnest partition, imposing the strongest constraint, and has the smallest optima set. Algebraically, we have p R(α) α 2 2 ( R(α), α )2 by the Cauchy Schwarz inequality. The inequality indicates that IRMv1 potentially minimizes a loose lower bound. The following proposition shows that the Co Co objective is guaranteed to have a better identiﬁcation result than the IRM regularization.

Proposition 4 For linear-Gaussian or linear-Bernoulli outcome generating distribution, if minimizing the IRM regularization in Eq. (19) identiﬁes the causal coeﬃcient (excluding point 0) over environments E, then minimizing the Co Co objective Eq. (11) will identify the causal coeﬃcient (excluding point 0). The inverse statement does not hold.

Because of an excessive number of optima in the IRM regularization for a single environment, it has a high requirement on the number of environments and the type of

1. The form Re(α; σ(wα xe)) follows the IRM implementation of binary classiﬁcation loss = mean nll(logits * scale, y) at https://github.com/facebookresearch/Invariant Risk Minimization/blob/main/code/ colored_mnist/main.py.

Optimization-based Causal Estimation

heterogeneity to sieve out the causal coeﬃcient, or relies on the causal model to have the lowest empirical risk among all the optima. These requirements are especially challenging for high-dimensional problems. In a nutshell, connecting IRM regularization with causal optimization in Eq. (7) indicates the potential looseness of IRMv1 objective in the linear Gaussian and linear-Bernoulli DGPs. Another potential issue with the IRM regularization is that it might not optimize in a feasible direction for other types of DGPs, for example, with a categorical outcome.

A case study. We present a brief case study to illustrate the connections. We consider a linear-Gaussian DGP as the Case 1 of Table 3. The two environments correspond to parameters (m(1) 1 , m(1) 2 , γ(1)) = (2, 0.5, 2), (m(2) 1 , m(2) 2 , γ(2)) = (3, 1, 0.5) in the DGP. The invariant causal coeﬃcient is β = (3, 2, 0). Fig. 1 visualizes the optima set of Co Co and the IRM regularization where all the optima are computed analytically.

Fig. 1 (a) (b) show the set of optima for IRM regularization in each of the two environments; each set forms a quadric surface. When considering the two environments together, the IRMv1 optima are the intersection of two 3D surfaces in Fig. 1 (c). Geometrically, as shown in Fig. 1 (e), the intersection forms two continuous elliptic curves consisting of an inﬁnite number of points. The IRMv1 objective cannot identify the causal coeﬃcient unless β has the lowest empirical risk among all the optima of the IRM regularization, which, as we will see, is not the case here. In contrast, Fig. 1 (d) shows the optima set of Co Co that contains only a ﬁnite number of points per environment. With two environments, the optima of Co Co becomes the zero point and the causal coeﬃcient β. The causal coeﬃcient can be identiﬁed by the modiﬁed objectives in Eqs. (14) and (15) that remove the zero point. We examine this case study empirically by implementing Co Co in Eq. (14) and IRMv1 in Eq. (17) with a large sample size N = 105 of each environment. Co Co objective with j = 1 converges to ˆαCo Co = (3.001, 2.004, 0.000) with random initialization. IRMv1 with λ = 1 converges to ˆαIRM 1 = ( 0.842, 0.390, 0.621) with initialization by a random seed and converges to ˆαIRM 2 = (2.932, 1.847, 0.017) with an idealized initialization manually set at the true causal coeﬃcient β = (3, 2, 0). Furthermore, ˆαIRM 2 has the empirical risk as 0.97634 and the IRM regularization as 0.00028, while those of β are 0.99852 and 0.00036, respectively. Thus, among the optima of the IRM regularization for the two environments, the causal coeﬃcient β does not have the lowest risk, and the minimizer of the IRMv1 objective is a biased causal estimate. We provide detailed computation and additional results with diﬀerent tuning parameters of IRMv1 for this case study in the Appendix C.1.

5. Identiﬁcation with heterogeneous environments

Now we establish the causal identiﬁcation for Co Co with the linear SEM in Eq. (4). Identiﬁcation requires the causal quantity of interest to be expressed as a functional of the observed data distribution. This functional is also known as the causal identiﬁcation strategy. In the context of Co Co, we consider the functional that maps the joint distributions p(xe, ye) over a set of environments to the risk function, then to the optima of Co Co objective. To establish identiﬁcation, we prove that the optimum of the Co Co objective is unique and equals the causal coeﬃcient of interest. In 5.1 and 5.2, we present two assumptions on the environment heterogeneity that guarantee the identiﬁcation. In 5.3, we provide two suﬃcient conditions implying the identiﬁcation assumption.

Yin, Wang, Blei

Algorithm 2 Co Co with risk regularization

input : Data De = {Ye, Xe}, Xe Rne p, the risk function Re for each environment e E; predictor fα( ); regularizer coeﬃcients λr, λw; annealing scheme ANNEAL( ) output : Predictor fα( ) that is robust to interventions Initialize α randomly while not converged do

for e in E do

Compute the gradient of the empirical risk:

i=1 Re(α; ˆye i ), ˆye i = f(xe i; α)

Compute: Le(α) = ge(α) α 2 (Optional step:) add weak condition to the objective:

Le(α) += λw( ge(α), α )2

Add risk function as a regularization term:

Le(α) += λr 1 ne (

i=1 Re(α; ˆye i ))

end Update α α η

α P e E Le(α) with step size η λr ANNEAL(λr) end

5.1 Identiﬁcation of the causal coeﬃcients

The causal identiﬁcation of an optimization-based approach is to demonstrate the existence and uniqueness of its optimum, and it equals the causal coeﬃcient. By fully characterizing the optima set of the modiﬁed Co Co objective in Eq. (14), we explore the assumption on the environments that bestows identiﬁcation. The optimum of the Co Co objective has a unique characteristic. It is a point α for which Re(α)H = 0 and α\H = 0 with certain H {1, 2, , p}. Denote the risk function with a subset of covariates xe H as the predictor by Re H(αH) = 1

2E[(y α Hxe H)2]. We have the following proposition.

Lemma 5 For the linear DGP as Eq. (4), a linear predictor and squared risk, if Re(α)H = 0 and α\H = 0, then Re H(αH) = 0; if Re H(αH) = 0, then α = (αH, α\H = 0) satisﬁes Re(α)H = 0.

Lemma 5 demonstrates that a Co Co optimum is an ERM minimizer over a subset of covariates. When restricted to this subset of covariates, the ERM minimizer is shared by all the environments. Lemma 5 oﬀers an interpretation of Co Co optimum. To keep the notation consistent, denote C as the set of exogenous covariates and S as the unknown set of direct causes for the outcome. For any set H with C H {1, 2, , p}, we ﬁt a regression model on Xe H in

Optimization-based Causal Estimation

each environment, and collect the regression coeﬃcients as {ˆαe H}e E. We call the set H an invariant set, if the estimates

ˆαe H = ˆαe H := ˆαH, e, e E. (22)

If H is an invariant set, we deﬁne a length p vector as an invariant vector by equating it to ˆαH when restricting to the set H and padding it with zeros at other elements. By Lemma 5, the optima of the objective Eq. (14) consist of all the invariant vectors. Based on the interpretation of Co Co optimum, we introduce the following assumption for suﬃciently heterogeneous environments.

Assumption 3 (Eﬀectiveness) There is only one invariant vector (deﬁned in Eq. (22)) for all the sets H with C H {1, 2, , p}. C is the set of known exogenous variables.

Then we have the following identiﬁcation result.

Theorem 6 Under Assumptions 1, 2 and 3, assuming the Gram matrix We = E[xe(xe) ] 0 for all e E, the causal coeﬃcient β is identiﬁable, and is given by

β = arg min α 1 |E|

e E Re(α) α] 2 , (23)

which is the optimum of the objective in Eq. (14).

The invariance Assumption 2 ensures the existence of a point that reaches the minimum value of the objective Eq. (14), and this point is the causal coeﬃcient, while the eﬀectiveness Assumption 3 induces uniqueness of the solution. The following corollary provides practical guidance in collecting multi-environment data.

Corollary 7 Assume the environment sets E1 E2, then in Theorem 6, (i) if the invariance Assumption 2 holds for E2, it holds for E1; (ii) if the eﬀectiveness Assumption 3 holds for E1, it holds for E2.

5.2 Identifying the eﬀects of the exogenous variables

The eﬀectiveness assumption can be checked with the observed data, but it needs regression over the power sets of covariates across the environments, which can be computationally expensive. In practice, we might be interested in estimating the causal eﬀect of speciﬁc variables instead of the whole causal coeﬃcient. This motivates us to explore identiﬁcation results under a relaxed Assumption 3. Suppose the goal is to estimate the causal eﬀect of a known exogenous variable xj . For example, such a variable could be the standard treatment variable under the unconfoundedness assumption (Imbens and Rubin, 2015). Though xj is assumed exogenous, i.e.,xj ϵ, regressing over xj alone may produce biased estimation because xj may correlate with other causes x S\j in the DGP. We adopt the following notations. Recall that S is the set of direct causes, C is the known exogenous variables, j C, and H is a set C H {1, 2, , p}. Denote WAB to be a sub-matrix of the Gram matrix W = E[xx ] with rows in the index set A and columns

Yin, Wang, Blei

in the index set B. For the environments E = {e1, , em}, denote WE H R(m |H|) |H| as a stacking matrix that stacks W e HH by row, i.e. WE H := [W e1 HH|W e2 HH| |W em HH] . Assumption 3 excludes a non-causal invariant vector when regressing over a set of covariates xe H. Here, we analyze the property of an invariant vector. By Proposition 5, the invariant vector can be computed by zeroing out the gradient

αHRH(αH) = W e HH(αH βH) W e HHcβHc se H, (24)

where Hc is the complement of H in {1, 2, , p} and se j := E[xe jϵe] = cov(xe jϵe). Denote a stacking vector θE H Rm |H| as θE H := [W e1 HHcβHc + se1 H, , W em HHcβHc + sem H ] . By setting the gradient in Eq. (24) to zero and by Proposition 5, if

WE Hδ = θE H (25)

holds for δ = 0, H is an invariant set with invariant vector α, ˆαH = βH + δ, ˆα\H = 0. This violates Assumption 3 because ˆα and β are two diﬀerent invariant vectors that both minimize the Co Co objective. We formulate the negating statement as the following assumption.

Assumption 4 (Weak eﬀectiveness) H = S, C H {1, 2, , p}, if the stacking vector θE H = 0, it is not in the column space of the stacking matrix WE H, i.e., θE H / C(WE H).

In sum, we have shown ( Assumption 4) ( Assumption 3), thus the contrapositive implies (Assumption 3) (Assumption 4). On the other hand, cases exist when Assumption 4 hold but Assumption 3 doesn t, so Assumption 4 is a weaker assumption. For example, if two subsets of causes Si S, i = 1, 2, C Si, and xe Si xe \Si, then both S1 and S2 are invariant sets when E[xe \Si] = 0. Hence it violates Assumption 3 due to the existence of multiple invariant sets. But in such a case, Assumption 4 can still hold because setting H = Si in Eq. (25) makes θE H = 0 and δ = 0. The Co Co optimum might be (βS1, 0) and (βS2, 0), which share the common unbiased part βC. With the weak eﬀectiveness assumption, we have the following identiﬁcation result.

Theorem 8 Under Assumptions 1, 2 and 4, the causal eﬀect of x C on y is identiﬁable and is given by the optimum of objective Eq. (14). That is, for

α arg min α 1 |E|

e E Re(α) α] 2 , (26)

α C = βC, where β is the causal coeﬃcient.

Theorem 6 guarantees the identiﬁcation of the whole causal coeﬃcient vector β. In comparison, Theorem 8 identiﬁes the eﬀects of the known exogenous variables βC based on a weaker eﬀectiveness assumption. In the next section, we leverage the explicit analytic form of Eq. (25) and discuss several scenarios where weak eﬀectiveness can be guaranteed.

5.3 Suﬃcient conditions for identiﬁcation

We show two approaches that ensure suﬃcient heterogeneity of the environments. One is by actively taking do-interventions, and the other is by passively checking the rank conditions.

Optimization-based Causal Estimation

Eﬀectiveness by do-interventions. Consider the do-intervention (Pearl, 2009) as

Xe j ae j, ae j p(a), j Ie,

where e is environment index, e E, Ie is the index set of intervened variables, and p(a) is a continuous distribution deﬁned on R. do(Xe j = ae j) means all the samples of Xj in the environment e are ﬁxed at the constant ae j during the data generation. Suppose the intervention on one variable is independent of other interventions and variables. The following corollary gives a suﬃcient condition that guarantees Assumption 4.

Corollary 9 For the linear SEM in Eq. (4) and predictor in Lemma 1, and for dointerventions, suppose j {1, 2, , p}, e, e E s.t. Ie = Ie = {j}, and k C, E[Xk] = 0, then the optimum α of the optimization in Eq. (14) satisﬁes α C = βC, where β are the causal coeﬃcients.

Corollary 9 is closely related to the identiﬁcation results of ICP in Peters et al. (2016, Theorem 2). Comparing to Corollary 9, ICP asks for one less do-intervention on each covariate, but it crucially relies on the invariance assumption in Eq. (6) that is stronger than Assumption 2 of Co Co.

Eﬀectiveness by rank checking. In many cases, we collect data from environments where the heterogeneity is introduced not by the manual interventions but by the natural factors such as spatial and temporal diﬀerences. We propose a checking method to ensure a suﬃcient but necessary condition for the Assumption 4, which is easy to compute. As 5.2, denote C P = {1, 2, , p} as the known exogenous variables. For each environment e E, the Gram matrix using the observed data is We = E[(Xe)T Xe] Rp p

and the submatrix We CP is the rows of We with index in C. Let WE CP R(|E| |C|) p be a matrix that stacks We CP by the rows for all e E. Assumption 4 assumes the linear system WE Hδ = θE H for any δ = 0. A key observation is that for j C, E[Xjϵ] = 0 due to exogeneity, so se C = 0 for the gradient computed in Eq. (24). Therefore, Assumption 4 is guaranteed if δ = 0, WE CHδ = WE CHcβHc, which can be further guaranteed if the homogeneous linear system WE CPv = 0 only has trivial solution for the variable v. This linear system only depends on observed data. Then we have the following corollary.

Corollary 10 Assumption 4 is guaranteed if the linear system WE CPv = 0 only has the trivial solution v = 0.

In practice, we ﬁrst collect data from environments E where the heterogeneity may come from do-interventions or soft interventions that change the distribution of the covariates (Eberhardt and Scheines, 2007). Then we check Corollary 10 by equivalently checking if the matrix WE CP has full column rank. If the rank is full, we proceed to Algorithm 1 with environment set E; otherwise, we collect data from a new environment. The algorithm is summarized in Alg. 3.

6. Extensions to the nonlinear model

So far, we focus on the linear SEMs. Here we generalize these results to a nonlinear SEM and a predictor that maps linear combinations of covariates to the outcome, i.e. y = f(Ax). For

Yin, Wang, Blei

Algorithm 3 Co Co with heterogeneity checking input : The environmental set E, the set of known non-descendant variable C output : Coeﬃcients α with causal interpretation Initial E = repeat

Randomly choose a valid intervention Let E E {e} with e as the index of new environment Collect data in environment e, De = {ye i , xe i}ne i=1 Update WE CP until rank(WE CP) = p; Run Algorithm 1 with {E, C, {De}e E, {Re( )}e E} where Re( ) is the risk function.

example, the fully connected neural network is a special case in such a functional form. The generalization of the algorithm to the nonlinear model closely follows the derivation in 3. The ﬁrst step is to build a constrained optimization problem similar to causal optimality in Lemma 1 and show that it admits the causal coeﬃcient as a solution. The analysis presented in 3.3 based on the directional derivative can then be applied to nonlinear models. Suppose we have a collection of environments E, and for each e E, we observe i.i.d. data for variables (xe, ye), xe Rp, ye R. Suppose the underlying DGP is

ye f( B xe S; γ ) + ϵe (27)

where S {1, 2, , p}, ϵe xe S and E[ϵe] = 0. f : RK R is an arbitrary function mapping with causal parameters β = ( B , γ ) where B RK |S| and γ RM. When K = 1 and f( ) is an identity mapping, Eq. (27) reduces to the linear SEM. Eq. (27) can represent a ﬂexible DGP that generates the outcome by a fully-connected neural network, where K and B are the width and weights of the ﬁrst hidden layer, respectively. Assume the nonlinear predictor is

ˆye = f(Bxe; γ), (28)

where B RK p, γ RM and α = (B, γ) are the parameters to optimize. We can re-write B xe S = B Λxe where Λ R|S| p has the i-th row as e i if i S and as 0 p if i / S. Let B = B Λ where the j-th column of B is 0K if j / S. Then for square error Re(α) we have the following proposition.

Proposition 11 (Causal Optimality, Nonlinear) With the DGP in Eq. (27) and with Assumptions 1 (ii) (iii), the causal coeﬃcient α = (B, γ) = (B , γ ) is an optima of the following problem

min B,γ R(B, γ), ˆy = f(Bx; γ)

s.t. Bkj = 0 if B kj = 0, 1 k K, 1 j p

γm = 0 if γ m = 0, 1 m M.

Optimization-based Causal Estimation

Proposition 11 greenlights the analysis in 3. The Co Co objective in Eq. (15) can then be used for nonlinear models in Eqs. (27) and (28),

min B,γ 1 |E|

Re(α) α 2 + λr Re(α) , (30)

where α = (B, γ). When B = B , the multiplication Bx zeros out the non-causal covariates x\S so that the prediction ˆy is independent of the spurious covariates. In situations when the causes and non-causes are not disentangled in the covariate space, there could be a representation z(x) of the covariates where z S(x) are the causes of the outcome and z\S(x) are the non-causes. Based on the representation z, the DGP becomes ye f( B z S(xe); γ ) + ϵe, and the predictor is ˆye = f(Bz(xe; γ1); γ2) with parameters α = (B, γ1, γ2). As will be shown in 7.3 and 7.4, empirically we ﬁnd Co Co in such situations can produce predictions not depending on the spurious associations. For example, 7.3 studies the Color-MNIST data set where the input is a digit image x R14 14 2 with 14 14 pixels and 2 color channels. Across the data points, the causal variables (the digit pixels) change their positions in the image and color channels. The position change makes the selection of causes infeasible in the covariate space but possible in the representation space. We ﬁnd that for such cases, Co Co can still learn a robust predictor relying on the causal information such as the digit pixels. We leave the theoretical analysis of causal representation learning by Co Co as an interesting future direction. In the nonlinear regime, due to the high ﬂexibility of the predictor, identiﬁcation in the high-dimensional parameter space can be diﬃcult. Diﬀerent parameterizations can represent a similar mapping from the input to the output. Thus the same data generation might correspond to an equivalent class of points in the parameter space (Heinze-Deml et al., 2018; Christiansen et al., 2020). Consequently, the optima of Co Co may not be unique. However, the optima points enjoy robust predictive properties.

Proposition 12 (Local optimality) Suppose α minimizes Co Co objective Eq. (13) with f E(α ) = 0. Suppose a new environment ι satisﬁes pι(x, y) = P e E wepe(x, y), P e E we = 1, then απ Rι(α)|α=α = 0, π = supp(α ).

Proposition 11 and Assumption 2 guarantee that the causal coeﬃcient β satisﬁes f E(β) = 0 in Eq. (13). Therefore, if there exists another global optima α under these assumptions, it satisﬁes the condition f E(α ) = 0 in Proposition 12. Proposition 12 shows that if the data distribution of a new environment is a mixture of those for the training environments, the optimum by Co Co already minimizes the predictive risk of this new environment locally for its nonzero elements. This means Co Co optima can transfer the predictive accuracy, quantiﬁed by the risk value, from the training environments to their mixtures. As we will show in empirical studies, a predictor optimized by Co Co generalizes the predictive accuracy to non-i.i.d. data in the new environments where the spurious associations change.

7. Empirical studies

In the empirical study, we aim to answer the following questions: (1) Can Co Co accurately estimate causal eﬀects when some covariates are spuriously associated with the outcome?

Yin, Wang, Blei

Figure 2: The graphs for the simulation studies in 7.1. The case ID of each graph is in the rectangle box. The blue arrow represents a path whose parameter varies across environments, the blue circle of a covariate means its distribution given parents varies across environments, and the blue circle of outcome means the variance of its additive noise varies across environments. Invariance Eq. (5) holds in all cases. The shaded nodes are the variables that are observed.

(2) Can Co Co make an accurate prediction in new environments by relying on causal information? (3) How sensitive is Co Co to its assumptions and tuning parameter? (4) Are the empirical results aligned with theoretical analysis? To answer these questions, we study Co Co and its comparable methods on simulated data, semi-synthetic data, and real data. Code implementations for the empirical studies are available at https://github. com/mingzhang-yin/Co Co.

7.1 Causal inference on the synthetic data

In this section, we apply Co Co for causal inference with linear synthetic data. Consider the scenario where we know one exogenous variable, but the other variables are of unknown status; some might be spurious variables, some might be direct causes, some might be neither. As discussed in 3, running ERM with such data will end up with biased estimates of the causal coeﬃcients. But if we have data from multiple environments, we can use Co Co to estimate the causal coeﬃcients.

We generate data from ﬁve diﬀerent graphs in Fig. 2 with SEMs in Table 3. In cases 1 to 5, variables x3, x4, x4, x4, x2 are spuriously associated with the outcome, respectively. The mapping from the causes to the outcome is linear with additive noise. We specify x1 as a known exogenous variable (for the use of the method in Alg. 1) and run Co Co, IRM, and ERM to estimate the causal coeﬃcients. To generate data from diﬀerent environments, we set the parameter γe in SEMs of Table 3 with environment-speciﬁc parameters me 1, me 2 Unif(0, 1), me Unif(1, 2). We generate two environments with γe {0.5, 2.0}, each environment with 10,000 data points. As required, the DGPs leave the causal coeﬃcient invariant.

Optimization-based Causal Estimation

Table 2: The mean absolute error of the estimations for causal parameters β (lower the better). The estimate by Co Co (this paper) is close to the true causal coeﬃcients across the DGPs. Co Co has a more accurate estimation comparing to RVP (Xie et al., 2020), V-REx (Krueger et al., 2020), Causal Dantzig (Rothenh ausler et al., 2019), IRM (Arjovsky et al., 2019) and ERM. The reported mean and the standard deviation in the parentheses are computed across ten independent trials.

Case 1 2 3 4 5

ERM 0.31 (0.06) 0.16 (0.00) 0.32 (0.00) 0.19 (0.03) 0.38 (0.01)

V-REx 0.16 (0.06) 0.11 (0.01) 0.44 (0.01) 0.13 (0.04) 0.06 (0.10)

RVP 0.10 (0.04) 0.10 (0.01) 0.43 (0.01) 0.11 (0.04) 0.05 (0.04)

Dantzig 0.54 (0.62) 3.23 (2.64) 4.95 (3.06) 0.43 (0.05) 0.20 (0.01)

IRMv1 2.12 (0.70) 0.01 (0.00) 0.02 (0.01) 2.17 (0.65) 0.72 (0.35)

Co Co 0.01 (0.00) 0.02 (0.01) 0.01 (0.01) 0.01 (0.01) 0.01 (0.00)

The ﬁve graphs test diﬀerent scenarios: (1) independent causes; (2) observed mediator; (3) observed confounder and mediator; (4) observed confounder and unobserved mediator; (5) unobserved confounding with an omitted common cause. The data generation covers the following circumstances: variables except the outcome might be generated from nonlinear models (Case 2, 3); the distribution of the causes of the outcome might shift across environments (Case 1, 4); the variance of the outcome distribution conditional on its causes might vary across environments (Case 3). Whether a method can produce accurate estimation in all of these situations reﬂects its generalizability. We evaluate with the mean absolute error (MAE) between the estimation α and the true coeﬃcients β. We compare Co Co with ERM, IRM (Arjovsky et al., 2019), V-REx (Krueger et al., 2020), RVP (Xie et al., 2020) and Causal Dantzig (Rothenh ausler et al., 2019). The properties and assumptions of these methods are summarized in Table 5 in Appendix B for comparison. For the algorithms with tuning parameter λ, we report the best result for IRM with λ {2, 20, 200}, for V-REx and RVP with λ {10, 102, 103, 104}. We choose stepsize from {0.01, 0.1} that produces the lowest objective for each method. For all methods, the algorithm is considered to converge if the mean absolute diﬀerence between the parameters in consecutive iterations is less than 10 3 and the total iterations are over 104.

The results are presented in Table 2. The mean MAE shows that the estimates of pure prediction ERM are biased when the covariates have spurious associations with the outcome. IRMv1 with proper hyper-parameter performs well in cases 2, 3, while it has large MAEs in other cases. It suggests the IRMv1 performance is aﬀected by a limited number of environments and the type of intervention. The MAE by IRM has a high standard deviation across random trials, indicating IRM converges to diﬀerent global optima. This echoes the discussion in 4 that the IRMv1 objective is over-relaxed and has excessive non-causal optima. V-REx and RVP perform better than ERM except in Case 3 when the variance of the exogenous noise of outcome is not invariant. This means their performance largely relies

Yin, Wang, Blei

on whether a strong assumption of invariance in Eq. (6) is satisﬁed. Causal Dantzig has large MAE in these cases which might be because the DGPs do not satisfy the inner-product invariance it requires (Rothenh ausler et al., 2019). In comparison to these methods, Co Co estimates have the lowest or equally lowest error uniformly over all the cases. The MAEs with four environments are presented in Fig. 8 in Appendix.

Table 3: SEMs for the simulation study in 7.1. The environments are indexed by e. γe, me 1, me 2, me are ﬁxed scalars in an environment.

Case 1 Case 2 Case 3

xe 2 N(me 2, (γe)2)

xe 1 N(me 1, (γe)2)

ye 3xe 1 + 2xe 2 + N(0, 1)

xe 3 γeye + N(0, (γe)2)

xe 2 N(1, (1

xe 1 xe 2 + Unif( 1, 1)

xe 3 sin(xe 1) + N(0, (1

ye 2xe 1 + 1.5xe 3 + N(0, 1)

xe 4 γeye + N(0, 1)

xe 2 N(1, (1

xe 1 xe 2 + Unif( 1, 1)

xe 3 sin(xe 1) + N(0, (1

ye 2xe 1 + xe 2 + 1.5xe 3 + N(0, (γe)2)

xe 4 γeye + N(0, 1)

Case 4 Case 5

xe 2 N(1, (1

xe 1 xe 2 + Unif(0, me)

xe 3 xe 1 + xe 2 + N(0, (1

ye xe 2 + 2xe 3 + N(0, 1)

xe 4 γeye + N(0, 1)

xe 1 N(1, 1

ye 2xe 1 + ϵe

xe 2 0.5γeϵe + (0.5 + γe)xe 1 + N(0, 1)

As an ablation study, we replace the elementwise produce in Co Co objective by the inner product and minimize P e E( Re(α), α )2. We call this method Naive-Co Co, which is an intermediate between the Co Co objective Eq. (14) and the IRMv1 objective in Eq. (20). Naive Co Co has high MAEs 1.17, 0.53, 0.51, 1.31, 0.03 across the ﬁve cases, respectively. It means the key element for the Co Co objective is the Hadamard product and the norm derived from optimizing in all the feasible directions ( 3.3), instead of knowing a variable is exogenous. To test the model checking method proposed in 5.3, we generate heterogeneous data with γe Unif(0, 5) for each environment. When the set of known non-descendant of outcome is C = {1}, Cases 1, 4, 5 pass the checking condition with the number of environments 3, 3, 2 respectively, while cases 2, 3 do not pass. We further check the nonidentiﬁable case in Appendix C.2. It cannot pass the checking step with any number of environments. In practice, Co Co can accurately estimate the causal coeﬃcients with two environments, as

Optimization-based Causal Estimation

shown in Table 2. It validates our analysis in 5.3 that the checking step is a suﬃcient condition for identiﬁcation but is not a necessary condition.

Model Mismatch. Using data from Case 5, we further study the performance of ERM and Co Co when the predictive model does not exactly match the data-generating model. The data in Case 5 is generated from a linear model. We compare two predictors, one is a linear model that matches the DGP, and the other is a nonlinear neural network. Both models are trained with ERM and Co Co. In Case 5, X1 is the cause, and X2 is a predictive but non-causal covariate. As shown in Fig. 3, when the model is speciﬁed correctly as linear, the predictor trained by ERM fail to generalize to new values of X2, but the model trained by Co Co can generalize to any input (X1, X2). When the predictor is nonlinear, diﬀering from the DGP, ERM prediction is only accurate for the data interpolating the training points. In contrast, Co Co can make accurate predictions for the inputs with new values of spurious covariate X2. From the robust prediction view, Co Co prediction can generalize to new environments by avoiding the spurious association for both linear and nonlinear predictors.

(a) ERM, linear

(b) Co Co, linear

(c) ERM, nonlinear

(d) Co Co, nonlinear

Figure 3: Prediction accuracy for Co Co and ERM with linear and nonlinear predictors. The heatmap is the prediction error (ˆy E[y|x])2, the x-axis, y-axis are the values of input x1 and x2. The orange points are training data from two environments. Co Co has better out-of-sample generalization with a wider region of low error (blue region) than ERM for both linear and nonlinear predictors.

7.2 Gaussian mixture example

We study a multi-class classiﬁcation problem with a categorical outcome when the inputs contain non-causal covariates. We modify a Gaussian mixture model (GMM) to simulate the data set. The observed covariates are (xe 1, xe 2) and the outcome is ye, where e is the environment index. For each environment e, the data are generated with SEM

xe 1 PK k=1 1 K N(µk, I)

ye Categorical(p1, , p K)

xe 2 (1 pe)δue ye + peδue k1, (31)

Yin, Wang, Blei

where pk = N(xe 1; µk, I)/ PK k =1 N(xe 1; µk , I), k1 Multinomial(1/K, , 1/K). Among the covariates, the mapping from the causes xe 1 to label ye is invariant across all e, whereas xe 2 is predictive to ye due to spurious associations.

0.00 0.25 0.50 0.75 1.00 1.25 1.50 Noise scale

(a) Noise scale

1 2 3 4 5 6 7 8 Number of Environments

(b) # of environments

0 20 40 60 80 1/ r

(c) Regularization strength

Figure 4: The change of test predictive error of Co Co with diﬀerent levels of invariance, number of environments, and the hyperparameter. The dashed line is the ERM error rate for reference. The error bar is the standard deviation over 5 trials.

0 1000 2000 3000 4000 5000 Iteration

0 5000 10000 15000 20000 25000 30000 Iteration

Accuracy(noised label)

Co Co_Training Co Co_Testing IRM_Training IRM_Testing ERM_Training ERM_Testing

(b) Colored MNIST

Figure 5: Trace plot of training and testing accuracy for Co Co, IRM and ERM on GMM and Colored MNIST data. In panel (b), the accuracy is measured on predicting the noised label ye. Co Co has the highest prediction accuracy in a new environment.

The observed data are generated from Eq. (31). The GMM component centers µk =

1.5Kek RK. To generate the non-causal covariates xe 2, we ﬁrst generate K random vectors {ue k}K k=1 with ue k Q[k/2] i=1 U(0, 1) for environment e. Then for a data point in the component ye, xe 2 equals ue ye with probability 1 pe and equals a random vector from {ue k}K k=1 otherwise. By doing so, xe 2 is associated with ye, but the association varies across environments when ue 1:K changes with environment e. The DGPs that generate the environments are characterized by the values of ue 1:K and pe. We set the training environments with K = 5 and pe {0.01, 0.02, , 0.05} in Eq. (31). For a validation/test environment f we generate a new set of {uf k}K k=1 and set pf = 0.

Optimization-based Causal Estimation

We evaluate the test performance by averaging the accuracy over ten testing environments. If the predictor learns to predict based on the causes xe 1 instead of the spurious covariates xe 2, it can accurately predict ye in training and testing environments. We use a fully connected neural network with two hidden layers as the predictor and train the model with Alg. 2. For both Co Co and IRM, the penalty weight is chosen from ten values equally spaced from 1 to 100 on a log-scale using the validation environments. The weight on the empirical risk term is reduced to 0 after 5k iterations. The results are shown in Figs. 4 to 6 and Table 4. Fig. 5 is the trace plot for the predictive accuracy. The testing accuracy increases for all the methods in the early stage of training but drops in the later stage for ERM and IRM. We hypothesize that ERM and IRM ﬁrst improve the prediction by utilizing all the covariates, including the causal ones. But in the later stage of training, it relies more heavily on the spurious associations to boost the training accuracy, which harms the accuracy at the test time. Fig. 6 shows the weight matrix that connects the input and the ﬁrst hidden layer. The model trained by Co Co zeros out the weights associated with the spurious covariates x2 (the right block), aligned with the analysis in Proposition 11. In comparison, the weights obtained by IRM and ERM are mostly nonzero, passing the spurious association from the input to the subsequent hidden layers and outputs.

Table 4 summarizes the numerical results. Co Co and V-REx have the highest prediction accuracy in the test environments. In this example, the strong invariance Eq. (6) is satisﬁed due to the DGP in Eq. (31), which meets V-REx assumption. Hence, V-REx has a high accuracy close to Co Co. When the data satisﬁes the strong invariance, it is potentially beneﬁcial to add the V-REx regularization of equal noise variance to the Co Co objective.

0 1 2 3 4 5 6 7 Input layer

First hidden layer

0 1 2 3 4 5 6 7 Input layer

First hidden layer

0 1 2 3 4 5 6 7 Input layer

First hidden layer

Figure 6: The heatmap for the ﬁrst layer weight matrix of the neural networks trained by Co Co, IRM and ERM on the GMM data. The matrix dimension is 10 8 where the input dimension is 8, and the ﬁrst hidden layer dimension is 10. In the input, the ﬁrst ﬁve elements are x1, and the last three elements are x2. Comparing to IRM and ERM, Co Co solution has the weights related to non-causal input x2 (the right block) close to 0.

Sensitivity to the assumption and hyper-parameter. In Fig. 4, we study how Co Co performs if the invariance assumption is violated and the hyperparameters change. In

Yin, Wang, Blei

panel (a), we construct the training environments by changing the cluster centers {µk}K k=1 in Eq. (31) to {µk + ϵe k}K k=1, ϵe k N(0, σ2IK). The noise ϵe k changes the mapping from the covariates xe 1 to the label ye across the environments, deviating from the invariance assumption. The noise scale σ2 reﬂects the magnitude of deviation. Panel (a) shows that the testing predictive accuracy increases as the invariance tends to hold. Under a moderate deviation from the invariance assumption, the test prediction by Co Co remains more accurate than that by ERM. In panel (b), we study how the test accuracy varies with the number of environments M in training. Training environments are constructed by setting the vectors ue k Q[k/2] i=1 U(0, 1) for all k and pe {0.01, 0.02, , 0.01M} in Eq. (31). We ﬁnd a growing number of environments reduces the testing error monotonically due to the increased environments heterogeneity. In panel (c), we study how the testing error changes with the penalty weight λr in Co Co objective Eq. (15). When λr is large, the objective is close to the empirical risk, and the test error is high; when λr is small, the parameters might collapse to the point 0. Between the two extremes, Co Co can learn a model that makes robust predictions in new environments.

7.3 Colored MNIST (CMNIST)

CMNIST is a semi-synthetic data set for binary classiﬁcation introduced in Arjovsky et al. (2019). Based on the MNIST data set, the image of hand-written digits 0-4 and 5-9 are labeled as y = 0 and y = 1, respectively. For each environment, the outcome ye is generated with probability 0.75 as y and with probability 0.25 as 1 y. We call y the clean labels and ye the noised labels. The digit is colored green with probability pe if ye = 1 and with probability 1 pe if ye = 0; if not colored green, it is colored red. The DGPs across the environments diﬀer in the value of pe. Environments are constructed for the training with pe {0.1, 0.2}, for the validation with pe = 0.5 and for the testing with pe = 0.9. The predictor takes the colored digit image as the input and the noised label ye as the target. The input x R14 14 2 where the image has 14 14 pixels and two color channels. The relationship between the digit shape and ye is genuinely causal, while the relationship between the color and ye is spurious. The goal is to learn a predictor that makes the predictions based on the digit shape rather than the color. A predictor using color information cannot accurately predict the noised label ye at the test time and the clean label y during training and testing. Empirical results. We optimize the predictor, a fully connected neural network with two hidden layers, by Alg. 2. For the algorithms in 7.1, we compare with ERM, IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2020) that can generalize to nonlinear models. More baseline results are reported in Table 7 in Appendix. The weight of the risk term for Co Co and V-REx objectives is chosen on the validation environment from 2 {10 1, , 10 5}, and is reduced by a factor of 10 after 15k iterations (half of the total iterations). We use Adam optimizer (Kingma and Ba, 2014) with learning rate 10 4. For IRM, we reduce the learning rate of the public code to 10 4 to ensure stability over long iterations, and we use all the other hyperparameters and annealing strategies provided in the author s code 2.

2. https://github.com/facebookresearch/Invariant Risk Minimization

Optimization-based Causal Estimation

Table 4: Predictive accuracy in training and testing environments on the GMM, CMNIST, and Wildlife data. For GMM, the Oracle results are the predictions with causes xe 1 instead of (xe 1, xe 2). For CMNIST, the prediction accuracy is reported for both clean label y and noised labels ye; the Oracle is the same function ﬁtted on grey-scale images by ERM.

GMM CMNIST Wildlife

Train Test Train ( y) Test ( y) Test (ye) Train Test

ERM 99.4 51.0 75.8 44.4 31.1 99.6 58.4

IRM 95.9 75.9 81.4 70.3 46.5 83.4 84.9

V-REx 92.6 91.4 75.2 49.5 31.8 96.2 67.3

Co Co (this paper) 91.9 91.6 93.0 92.9 74.7 86.1 85.2

Random Guess 20 20 50 50 50 50 50

Oracle 92.3 91.8 99.3 97.9 74.8 - -

Fig. 9 in Appendix D visualizes the weight matrix that connects the input x and the ﬁrst hidden layer. The weight learned by ERM connects the hidden layer to all the inputs, hence encoding non-causal information in the latent representation. In contrast, the weight learned by Co Co sets multiple columns close to zero and is symmetric over the two color channels, potentially removing the dependence of the hidden layer on some non-causal pixels. The numerical results are shown in Table 4. The trace plot for the noised label is in Fig. 5 (b) and the trace plot for the clean label is in Fig. 7 (a) of Appendix D. The results show that ERM predicts labels with accuracy close to 1 in the training but has the lowest accuracy at the testing. The reason might be that its prediction largely depends on the color information rather than the digit shape, whereas the association between color and label ye

changes from training to testing environments. The testing accuracy of IRM increases in the early stage of training but drops in the later stage. We hypothesize that the model at ﬁrst improves the prediction by utilizing all the information including that of digit shape, but later it relies more on the color information, which reduces the accuracy at the test time. Similar patterns appear in the prediction of clean label y, as shown in Fig. 7.

7.4 Natural image classiﬁcation

In this example, following Cloudera (2020), we adapt the i Wild Cam 2019 dataset (Beery et al., 2019) that contains wildlife images taken in the wild. The images are collected from diﬀerent cameras, each at one of 143 locations. The goal is to classify coyotes and raccoons in images. The data collected in diﬀerent locations and time usually follow diﬀerent distributions due to varying physical factors such as landscape, season, vegetation, illumination conditions, etc. Therefore, the images taken from diﬀerent cameras can be considered data from heterogeneous environments. The physical factors reﬂected in the image background might be predictive of the species but due to spurious association. Our goal is to learn a predictor

Yin, Wang, Blei

that can make accurate predictions in a new environment by training on the data from a limited number of environments. This goal can be achieved if the predictor manages to recognize coyotes and raccoons based on animal pixels instead of using spurious associations from the backgrounds. Based on the setting of Cloudera (2020), we use images from two locations as the training data and images from another location as the test data. We use images from an additional location as the validation data. The inputs are 512-dimensional features extracted from Res Net18 (He et al., 2016), a pre-trained model on the Image Net dataset (Deng et al., 2009). The predictor is a fully connected neural network with one hidden layer of size 10. Co Co is trained by Alg. 2. In this example, we ﬁnd that adding the weak condition (19) in Alg. 2 with weight λw improves convergence stability with random initialization3. The hyperparameters are selected on the validation environment. We set the weak condition weight λw = 10 4

and the risk regularizer weight λr = 1. λr is reduced to 10 5 after 100 epochs. The risk regularizer is an inductive bias to encourage nonzero solutions. After the optimizer is suﬃciently away from the zero point, annealing the risk regularizer prevents the algorithm from minimizing the objective by reducing the risk function, hence preventing it from using the spurious association. Co Co is compared with ERM, IRM (Arjovsky et al., 2019), and V-REx (Krueger et al., 2020). All methods are trained by ADAM with a learning rate 10 3. The result is summarized in Table 4 and Fig. 7 (b) in Appendix D. It demonstrates that ERM has high accuracy in the training but low accuracy during testing. Co Co accuracy is slightly higher than IRM and is much higher than V-REx and ERM. Compared to ERM, prediction by Co Co has a slight drop in training accuracy but has signiﬁcantly higher testing accuracy. Co Co has the smallest performance gap between training and testing, indicating that it largely avoids predicting animal labels via information from image backgrounds, i.e., information that varies across environments.

8. Conclusion

This paper formulates causal estimation as a constrained optimization problem. Applying directional derivative methods to this optimization problem, we propose the Co Co objective, a computationally tractable optimization method for estimating causal coeﬃcients with datasets from multiple environments. Theoretically, we discussed the necessary and suﬃcient conditions by which the causal coeﬃcients are identiﬁed by optimizing the Co Co objective. We discuss the mathematical connection between Co Co and IRM. In empirical studies, we ﬁnd that Co Co produces accurate causal estimation and distributionally robust predictions. Co Co is applicable to high-dimensional data and to linear and nonlinear models. Looking ahead, we think several problems are worthy of further exploration. One direction is to consider the situations when there is an unobserved confounder as a direct cause. In this case, a potential approach is to connect environments and instrumental variables. Another direction is to further understand the interplay between the type of interventions, the number of environments, and the identiﬁcation of causal coeﬃcients, especially for nonlinear models. Such understanding can enable causal estimation with a minimal number of environments by methods like active learning.

3. Adding a general condition in Eq. (18) to the strong condition of Eq. (11) does not change the optima set, but it may improve the smoothness of the optimization landscape.

Optimization-based Causal Estimation

Acknowledgments

The authors thank the Editor, Action Editors, and anonymous reviewers for helpful and constructive comments. Mingzhang Yin thanks the support from the Data Science Institute and the Irving Institute for Cancer Dynamics, Columbia University. The authors acknowledge the fundings from Warrington College of Business, Oﬃce of Naval Research under grant number N00014-23-1-2590, the National Science Foundation under Grant No. 2231174 and No. 2310831, NSF IIS-2127869, NSF DMS-2311108, ONR N000142412243, and Simons Foundation.

Appendix A. Proofs

In this section, we present proofs for the results in the main paper. First, we prove the causal optimality results of the proposed optimization problems. Proof [Lemma 1] Let the random vector x = (x1, , xp) denote the covariates. The expected mean square error is

=E[(α x β x ϵ)2]

=(α β) E[xx ](α β) 2E[(α β) xϵ] + E[ϵ2].

Since supp(α) supp(β), the (α β) x is a linear combination of the true causes as P j supp(β)(αj βj)xj which is independent of ϵ by the SEM, thus E[(α β) xϵ] = 0. Since E[xx ] is assumed to be positive deﬁnite, the unique optima of the square error is α = β.

Proof [Lemma 2] Recall that Xj is a known exogenous variable with Xj ϵ. We ﬁrst prove that the optima set of the modiﬁed Eq. (14) is a subset of that for the non-modiﬁed Eq. (13), but it still keeps the causal coeﬃcient. This is because Re(α)j = 0 implies Re(α)j αj = 0, hence a minimizer of Eq. (14) must be a minimizer of Eq. (13). Next we prove that the causal coeﬃcient β minimizes Eq. (14) to zero. It is suﬃcient to show that Re(β)j = 0 since we already know Re(β) β 2 = 0. It is true because for the DGP in Eq. (4), Re(β)j = E[xj ϵ] = 0. Finally, we show that the vector 0 is not an optimum of Eq. (14) almost surely when βj = 0. The zero vector minimizes Eq. (14) if and only if Re(0)j = 0. For the linear model, it requires P j S E[xe jxe j ]βj = 0 for all e E. By the independent causal mechanisms principle (Sch olkopf, 2019; Sch olkopf et al., 2021), the causal coeﬃcients β (the mechanism) are independent of the distribution of causes x S. It means for the zero vector to be an optimum, βS has to fall in the intersection of |E| hyperplanes in R|S| which has measure zero, hence the probability is zero.

Proof [Lemma 3] 3.3 shows the feasible directions U = span{ej : j S}. Therefore, the causal parameter β itself is a feasible direction with β U. The ﬁrst order condition implies the optima of the Eq. (7) sets the directional derivative to zero in the direction of β, i.e., R(α), β = 0. By Lemma 1, plugging β into α produces

Yin, Wang, Blei

R(β), β = 0, which means

β arg min α ( R(α), α )2. (32)

Similarly, any partition P of the set {1, 2, , p} gives a necessary condition that admits

β arg min α

A P ( R(α)A, αA )2. (33)

Proof [Proposition 4] The statement follows directly from the Cauchy Schwarz inequality X

e E p Re(α) α 2 2 X

e E ( Re(α), α )2 0

Proof [Lemma 5] By Eq. (37), the sub-vector of the gradient is

R(α)H = WH(α β) s H (34)

By Eq. (24),

αHRH(αH) = WHH(αH βH) WHHcβHc s H. (35)

By Eq. (34), suppose α is a Co Co objective optima with R(α)H = 0 and α\H = 0, α also satisﬁes αHRH(αH), so its nonzero elements are the ERM solution when regressing with input x H. On the other hand, suppose x H is ERM solution with input x H, α = (αH, 0) will be a Co Co objective optima by Eq. (35).

The following proofs are for the identiﬁcation results in main paper 5. Proof [Theorem 6] Let se j = E[Xe j ϵ] = cov(Xe j , ϵ), se = (se 1, , se p)T . By the data generating process, se j = 0 for j {1, , K}. Let

ge(α) = Re(α) α] 2 , f(α) = 1

e E ge(α). (36)

where f(α) is Co Co objective. Direct computation shows

Re(α) = W e(α β) se (37)

Notice f(α) 0 and by the structual equation model, due to independence of the exogenous noise ϵ and causes Pa(Y ), we have se β = 0. Hence for α = β, f(α ) = 0. This guarantees the existence of a solution as causal coeﬃcient β. To prove the identiﬁcation, it is suﬃcient to prove that for all α = α , f(α) > 0. We use proof by contradiction.

Optimization-based Causal Estimation

Let H = supp( α) and Hc as its compoment set in {1, 2, , p}. We assume f(α) = 0 and α = β and deduce a contradiction. Since f(α) = 0, for all e, ge(α) = 0. Since ge(α) = Re(α) α, it means Re(α)H = 0, for all e. However, by the characterization of the feasible set in Section 3.3, Assumption A2) implies that there does not exist α Rp, α = β, such that Re(α)H = 0, e E. Otherwise, the set H is an invariant set and both α and β are invariant estimations, which violates Assumption A2). Hence for α = β, there exists an environment e E with Re (α)H = 0. This yields a contradiction.

Proof [Corollary 7] The claim i) is trivial. To see why the claim ii) holds, we prove its equivalent contrapositive statement. If assumption (A2) does not hold for E2, it means there exists α, Re(α)H = 0 for all e E2, which also applies to all e E1 since E1 E2. Hence the assumption (A2) does not hold for E1.

Proof [Theorem 8] We prove the statement by contradiction. Using notations in Eq. (36), suppose for α , f(α ) = 0 and α C = βC. Then let set H = supp(α ) C. Since f(α ) = 0, we have Re(α )H = 0, for all e. By Eq. (24), this means

W e HH(α H βH) = W e HHcβHc + se H, e (38)

Denoting δ = α H βH, we have δ = 0. Then Eq. (38) contradicts with the assumption (A2 ).

Proof [Corollary 9] Since the do intervention satisﬁes validity assumption (A1), to prove the identiﬁcation of the treatment eﬀect for the variable of interest, say xj , by Theorem 8 it is suﬃcient to show that it satisﬁes the weak eﬀectiveness assumption (A2 ). We suppose the environments E violate assumption (A2 ), that is H {1, 2, , p}, δ R|H|, δ = 0 such that WE Hδ = θE H, and yield a contradiction. By For notation convenience, denote δσ(j) as the element of δ associated with the column of WE H that consists of the elements {W e ij; e E, i H}; here σ : H 7 {1, 2, , |H|} is a bijection by deﬁnition. Suppose E[Xj ] = 0. Consider the set of variables Sδ = {j : j H, xσ(j) = 0} {j : j Hc, βj = 0}. Since δ = 0, the set Sδ is non-empty. We consider the youngest node Xj, j Sδ that there is no direct path from Xj to any other node with index in Sδ. There are two possible cases. When j H Sδ, for e, e E with Ie = Ie = {j}, we have the linear equation W e j Hδ = θe j in the linear system WE Hδ = θE H as

k H δσ(k)E[Xe j Xe k] = X

t Hc βt E[Xe j Xe t ] + E[Xe j ϵe]. (39)

For the do intervention Xe j ae j, we have E[Xe j Xe k] = ae j E[Xe k] = ae jµe k, and since Xj is the youngest node, Eq. (39) becomes

δσ(j)ae j + X

k H,k =j δσ(k)µk = X

t Hc βtµt (40)

Yin, Wang, Blei

for environment e , we have

δσ(j)ae j + X

k H,k =j δσ(k)µk = X

t Hc βtµt (41)

Since ae j = ae j , Eqs. (40), (41) are inconsistent and yield a contradiction. When j Hc Sδ, analogously we have the linear equation W e j Hδ = θe j as

k H δσ(k)Wj k = X

t Hc,t =j βt Wj t + βjµj ae j (42)

and equation W e j Hδ = θe j as

k H δσ(k)Wj k = X

t Hc,t =j βt Wj t + βjµ jae j (43)

Since ae j = ae j , Eqs. (42), (43) yield a contradiction.

Proof [Corollary 10] Suppose Assumption 4 does not hold, then by Eq. (25) there exists H {1, 2, , p}, C H and δ R|H|, δ = 0, such that We HHδ + We HHc( βHc) = se H for all e E. Since xi ϵ, for i C, we know se i = 0. Letting v H = δ, v Hc = βHc, we have

WE CPv = WE CHv H + WE CHcv Hc = s E C = 0, (44)

which cannot pass the checking step since v = 0.

Proof [Proposition 11] By the construction of Λ, B = AΛ is a matrix where the j-th column B j = 0 if j / S. Similar to the proof of Lemma 1, we can compute the L2 risk as

=E[(fγ(Bx) fγ (B x) ϵ)2]

=E[(fγ(Bx) fγ (B x))2] 2E[((fγ(Bx) fγ (B x))ϵ] + E[ϵ2].

Due to the constraints, Bj = B j = 0, Bx ϵ, B x ϵ, therefore the second term is zero. Then the L2 risk reaches its minimum as E[ϵ2] when B = B , γ = γ .

Proof [Proposition 12] Since f E(α ) = 0, e E

|| Re(α ) α ||2 = 0.

Let π = supp(α ), we have ( Re(α ))π = 0. This means for j π,

0 = αj Re(α ) = Z pe(x, y)

αj L(y, ˆy(x; α))|α=α dxdy

Optimization-based Causal Estimation

αj Ex,y pγ(x,y)L(y, ˆy(x; α))

αj L(y, ˆy(x; α))dxdy.

Plug in α = α we have αj Rγ(α)|α=α = 0.

Appendix B. A Summary of algorithms on multiple environments

We summarize the properties of Co Co and several representative causal algorithms that leverage data from multiple environments in Table 5.

Table 5: Comparing causal algorithms with multiple environments. Gnr. interv.: allow a general type of intervention as long as the invariance in Eq. (5) or Eq. (6) is satisﬁed. nl. model: has been applied to the nonlinear predictive function. scalability: computational eﬃciency in scaling up to high dimensional problems. uneq. variance: allow the variance of exogenous noise of the outcome to vary across environments. unm. cf.: allow unmeasured confounding between the true causes and the outcome.

gnr. interv. nl. model scalability uneq. variance unm. cf.

Co Co (this paper) À

IRM (Arjovsky et al., 2019) À

V-REx (Krueger et al., 2020) À

RVP (Xie et al., 2020) À

group-DRO (Sagawa et al., 2019) À

ICP (Peters et al., 2016) Ä

Causal Dantzig (Rothenh ausler et al., 2019) À

LRE (Ghassami et al., 2017) Ä

MC (Ghassami et al., 2018) Ä

Appendix C. Analytic case studies

This section include two concrete cases. One analytically demonstrate how optimizationbased methods can estimate causal coeﬃcients, and compare Co Co, IRM and ERM. The other case is an instance for ineﬀective interventions.

Yin, Wang, Blei

C.1 An example of optimization-based estimation

To illustrative the discussion in 4, we study ERM, IRM and Co Co on a speciﬁc example. The DGP follows the Case 1 of Table 3,

xe 2 N(me 2, (γe)2)

xe 1 N(me 1, (γe)2)

ye 3xe 1 + 2xe 2 + N(0, 1)

xe 3 γeye + N(0, (γe)2).

We consider the DGP of two environments corresponding to parameters (m(1) 1 , m(1) 2 , γ(1)) = (2, 0.5, 2), (m(2) 1 , m(2) 2 , γ(2)) = (3, 1, 0.5). Consider the predictor ˆy = α xe and the risk function Re(α) = 1

2E[(ye α xe)2]. The gradient of the risk function is

αRe(α) = (α1 3)E[x2 1] + (α2 2)E[x1x2] + α3E[x1x3],

(α2 2)E[x2 2] + (α1 3)E[x1x2] + α3E[x2x3],

α3E[x2 3] + (α1 3)E[x1x3] + (α2 2)E[x2x3] γ .

And the moments are

E[x1x3] = 3γ(m2 1 + γ2) + 2γ(m1m2),

E[x2x3] = 3γ(m1m2) + 2γ(m2 2 + γ2),

E[x2 1] = m2 1 + γ2,

E[x2 2] = m2 2 + γ2,

E[x1x2] = m1m2,

E[y2] = 9E[x2 1] + 12E[x1x2] + 4E[x2 2] + 1,

E[x2 3] = γ2(E[y2] + 1).

The Co Co optima for each environment is given by solving α with the system of equations

α1(α1 3)E[x2 1] + α1(α2 2)E[x1x2] + α1α3E[x1x3] = 0,

α2(α2 2)E[x2 2] + α2(α1 3)E[x1x2] + α2α3E[x2x3] = 0,

α2 3E[x2 3] + α3(α1 3)E[x1x3] + α3(α2 2)E[x2x3] γα3 = 0.

The optima of the IRM regularization for each environment forms the quadric surface

α1(α1 3)E[x2 1] + α1(α2 2)E[x1x2] + α1α3E[x1x3] + α2(α2 2)E[x2 2] + α2(α1 3)E[x1x2]

+ α2α3E[x2x3] + α2 3E[x2 3] + α3(α1 3)E[x1x3] + α3(α2 2)E[x2x3] γα3 = 0.

Empirically, we sample 105 data points of each environment. The ERM solution for the two environments is αERM = (2.815, 1.778, 0.043). The empirical solutions of IRMv1 with various λ values are in Table 6, which is obtained with a ﬁxed random seed. The stepsize is

Optimization-based Causal Estimation

λ Random initialization Idealized initialization at β

0.1 (-0.372, -0.104, 0.547) (2.923, 1.842, 0.019)

1 (-0.842, -0.390, 0.621) (2.932, 1.847, 0.017)

10 (-0.877, -0.411, 0.626) (2.933, 1.848, 0.017)

1000 (-0.830, -0.300, 0.613) (2.954, 1.896, 0.012)

Table 6: The convergence points of IRMv1 for the case study.

0.01. We consider random initialization or an idealized initialization at the causal coeﬃcient β. For λ {0.1, 1, 10}, both initializations converges within 3 104 iterations. For a large λ = 1000, the convergence is slow with 8.8 105 iterations for the random initialization and 1.7 105 iterations for the initialization at β, likely due to the weak regularization of the empirical risk and the identiﬁcation issue as discussed in 4. The observation that the idealized initialization does not stay at β indicates β does not have the lowest empirical risk among the optima set of the IRM regularization. For λ = + , the optimization will stay at β for the idealized initialization since β belongs to the optima set of the IRM regularization, but such initialization is not feasible in practice. The empirical solution of Co Co is (3.001, 2.004, 0.000) which is close to the true causal coeﬃcient (3, 2, 0).

C.2 An example of the ineﬀective intervention

Consider environments indexed by γe {1, 2, 3}, and SEM as:

xe 2 N(0, (γe/2)2)

xe 1 xe 2 + U( γe, γe) + 1

ye 2x1 + 1.5x2 + N(0, 1)

xe 3 0.5 ye + N(0, 1).

The predictor is linear as:

ˆye(α) = α1xe 1 + α2xe 2 + α3xe 3. (46)

Ideally, we want to identify the causal coeﬃcients β = (2, 1.5, 0). However, in this example, a straightforward calculation shows the point ˆα = (1.6, 1.2, 0.4) minimizes the risk function E[(1/2)(ye ˆye)2] for each environment. This means Re(α)|α=ˆα = 0, and hence ˆα minimizes the objective Eq. (14). Both β and ˆα belong to the set of optima of objective (14), which cannot be distinguished under given interventions. Loosely speaking, β and ˆα are equally invariant over these three environments.

Appendix D. Additional simulation results

This section contains experimental results additional to 7 in the main paper. Fig. 8 contain the MAEs for the linear synthetic data ( 7.1). It shows the mean and minimal MAEs over ten random trials for two and four environments. Fig. 9 shows the weight parameters learned

Yin, Wang, Blei

Table 7: Test accuracy (percent) for noised label ye on the Colored MNIST. Results are reported from the original papers. The baseline methods compared with Co Co (this paper) are V-IRMG, F-IRMG (Ahuja et al., 2020), Re Bias (Bahng et al.,

2020), Dec Aug (Bai et al., 2021), MM-REx (Krueger et al., 2020), PAIR (Chen et al., 2022), Fishr (Rame et al., 2022).

V-IRMG F-IRMG Re Bias Dec Aug MM-REx PAIR Fishr Co Co

Test accu. 49.06(3.4) 59.91(2.7) 29.40(0.3) 69.60(2.0) 66.1(1.5) 68.4(1.1) 73.8(1.0) 74.7(0.3)

by Co Co and ERM on CMNIST ( 7.3), which provides explanations of why Co Co could learn a causal representation. Table 7 contains more baseline results for comparison on the Color MNIST data. Fig. 7 shows the trace plot of the predictive accuracy for the clean label y prediction on the CMNIST data ( 7.3) and for the prediction on the i Wild Cam data ( 7.4).

0 5000 10000 15000 20000 25000 30000 Iteration

Accuracy(clean label)

(a) Colored MNIST

250 500 750 1000 1250 1500 1750 2000 Iteration

Co Co-training Co Co-testing IRM-training IRM-testing ERM-training ERM-testing

(b) Wildlife

Figure 7: Trace plot of training and testing accuracy for Co Co, IRM and ERM on Color MNIST and Wildlife data. In panel (a), the accuracy is measured on predicting the clean label y. Co Co has high accuracy in both training and testing environments.

Optimization-based Causal Estimation

(a) Two environments

(b) Four environments

Figure 8: The boxplot for the mean absolute error (MAE) of the estimations for causal parameters β (lower the better) over two (left) and four environments (right). Co Co estimation is close to the true causal coeﬃcients across all settings. Co Co has a more accurate estimation comparing to RVP (Xie et al., 2020), V-REx (Krueger et al., 2020), IRM (Arjovsky et al., 2019) and ERM. Each case is run with ten independent trials.

(a) Weights by Co Co

(b) Weights by Co Co (stacked color channels)

(c) Weights by ERM

Figure 9: The weight matrix that connects the input and the ﬁrst hidden layer for the CMNIST data. (a) The weights learned by Co Co set multiple columns close to zero, removing the dependence of the hidden layer on a set of pixels. (b) Stacking the weights corresponding to the two image channels (the left and right half of (a)). Co Co selects the same pixels over two color channels hence removing the color information from the hidden layer. (c) The weights by ERM connect the hidden layer to all the inputs, so the prediction depends on the spurious association.

Yin, Wang, Blei

Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In International Conference on Machine Learning, pages 145 155. PMLR, 2020.

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identiﬁcation of causal eﬀects using instrumental variables. Journal of the American statistical Association, 91(434): 444 455, 1996.

Martin Arjovsky, L eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019.

Jun-Hyun Bae, Inchul Choi, and Minho Lee. Meta-learned invariant risk minimization. ar Xiv preprint ar Xiv:2103.12947, 2021.

Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, and Seong Joon Oh. Learning de-biased representations with biased representations. In International Conference on Machine Learning, pages 528 539. PMLR, 2020.

Haoyue Bai, Rui Sun, Lanqing Hong, Fengwei Zhou, Nanyang Ye, Han-Jia Ye, S-H Gary Chan, and Zhenguo Li. Decaug: Out-of-distribution generalization via decomposed feature representation and semantic augmentation. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 35, pages 6705 6713, 2021.

Sara Beery, Dan Morris, and Pietro Perona. The i Wild Cam 2019 challenge dataset. ar Xiv preprint ar Xiv:1907.07617, 2019.

Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813, 2016.

Philippe Brouillard, S ebastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Diﬀerentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33, 2020.

Peter B uhlmann et al. Invariance, causality and robustness. Statistical Science, 35(3): 404 426, 2020.

Yongqiang Chen, Kaiwen Zhou, Yatao Bian, Binghui Xie, Kaili Ma, Yonggang Zhang, Han Yang, Bo Han, and James Cheng. Pareto invariant risk minimization. ar Xiv preprint ar Xiv:2206.07766, 2022.

Rune Christiansen, Niklas Pﬁster, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. A causal framework for distribution generalization. ar Xiv e-prints, pages ar Xiv 2006, 2020.

Cloudera. Causality for machine learning, 2020. URL https://ff13.fastforwardlabs.com.

A Philip Dawid, Vanessa Didelez, et al. Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview. Statistics Surveys, 4:184 231, 2010.

Optimization-based Causal Estimation

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255. Ieee, 2009.

Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981 995, 2007.

Bradley Efron. Prediction, estimation, and attribution. Journal of the American Statistical Association, 115(530):636 655, 2020.

Felix Elwert and Christopher Winship. Endogenous selection bias: The problem of conditioning on a collider variable. Annual review of sociology, 40:31 53, 2014.

Amir Emad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In Neural Information Processing Systems, pages 3015 3025, 2017.

Amiremad Ghassami. Causal discovery beyond Markov equivalence. Ph D thesis, University of Illinois at Urbana-Champaign, 2020.

Amir Emad Ghassami, Negar Kiyavash, Biwei Huang, and Kun Zhang. Multi-domain causal structure learning in linear systems. In Neural Information Processing Systems, pages 6269 6279, 2018.

Gene H Golub, Per Christian Hansen, and Dianne P O Leary. Tikhonov regularization and total least squares. SIAM journal on matrix analysis and applications, 21(1):185 194, 1999.

Ruocheng Guo, Pengchuan Zhang, Hao Liu, and Emre Kiciman. Out-of-distribution prediction with invariant risk minimization: The limitation and an eﬀective ﬁx. ar Xiv preprint ar Xiv:2101.07732, 2021.

Trygve Haavelmo. The probability approach in econometrics. Econometrica: Journal of the Econometric Society, pages iii 115, 1944.

Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016.

Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. Machine Learning, 110(2):303 348, 2021.

Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018.

Yin, Wang, Blei

Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery and forecasting in nonstationary environments with state-space models. In International Conference on Machine Learning, pages 2901 2910. PMLR, 2019.

Biwei Huang, Kun Zhang, Mingming Gong, and Clark Glymour. Causal discovery from multiple data sets with non-identical variable sets. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 34, pages 10153 10161, 2020.

Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

Pritish Kamath, Akilesh Tangella, Danica J Sutherland, and Nathan Srebro. Does invariant risk minimization capture invariance? ar Xiv preprint ar Xiv:2101.01134, 2021.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). ar Xiv preprint ar Xiv:2003.00688, 2020.

Chaochao Lu, Yuhuai Wu, Jo se Miguel Hern andez-Lobato, and Bernhard Sch olkopf. Nonlinear invariant risk minimization: A causal approach. ar Xiv preprint ar Xiv:2102.12353, 2021.

Jorge A Marban. Directional derivatives in classical optimization. Ph D thesis, University of Florida, 1969.

Joris M Mooij, Sara Magliacane, and Tom Claassen. Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21(99):1 108, 2020.

Jens M uller, Robert Schmier, Lynton Ardizzone, Carsten Rother, and Ullrich K othe. Learning robust models using the principle of independent causal mechanisms. ar Xiv preprint ar Xiv:2010.07167, 2020.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.

Jonas Peters and Peter B uhlmann. Identiﬁability of gaussian structural equation models with equal error variances. Biometrika, 101(1):219 228, 2014.

Jonas Peters, Peter B uhlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: Identiﬁcation and conﬁdence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947 1012, 2016.

Niklas Pﬁster, Peter B uhlmann, and Jonas Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264 1276, 2019.

Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In International Conference on Machine Learning, pages 18347 18377. PMLR, 2022.

Optimization-based Causal Estimation

Mateo Rojas-Carulla, Bernhard Sch olkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1):1309 1342, 2018.

Elan Rosenfeld, Pradeep Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. ar Xiv preprint ar Xiv:2010.05761, 2020.

Dominik Rothenh ausler, Peter B uhlmann, and Nicolai Meinshausen. Causal dantzig: fast inference in linear structural equation models with hidden variables under additive interventions. The Annals of Statistics, 47(3):1688 1722, 2019.

Dominik Rothenh ausler, Nicolai Meinshausen, Peter B uhlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):215 246, 2021.

Walter Rudin et al. Principles of mathematical analysis, volume 3. Mc Graw-hill New York, 1964.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019.

Bernhard Sch olkopf. Causality for machine learning. ar Xiv preprint ar Xiv:1911.10500, 2019.

Bernhard Sch olkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. ar Xiv preprint ar Xiv:1206.6471, 2012.

Bernhard Sch olkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612 634, 2021.

Claudia Shi, Victor Veitch, and David Blei. Invariant representation learning for treatment eﬀect estimation. ar Xiv preprint ar Xiv:2011.12379, 2020.

G. Shmueli et al. To explain or to predict? Statistical Science, 25(3):289 310, 2010.

Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. Springer, 2000.

Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the Seventeenth conference on Uncertainty in artiﬁcial intelligence, pages 512 521, 2001.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996.

Julia K Winkler, Christine Fink, Ferdinand Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof, Luc Thomas, Aimilios Lallas, Andreas Blum, Wilhelm Stolz, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA dermatology, 155(10):1135 1141, 2019.

Yin, Wang, Blei

Sewall Wright. Correlation and causation. Journal of agricultural research, 20:557 580, 1921.

Chuanlong Xie, Fei Chen, Yue Liu, and Zhenguo Li. Risk variance penalization: From distributional robustness to causality. ar Xiv preprint ar Xiv:2006.07544, 2020.

Mingzhang Yin, Nhat Ho, Bowei Yan, Xiaoning Qian, and Mingyuan Zhou. Probabilistic Best Subset Selection by Gradient-Based Optimization. ar Xiv e-prints, 2020.

Kui Yu, Lin Liu, and Jiuyong Li. Learning markov blankets from multiple interventional data sets. IEEE transactions on neural networks and learning systems, 31(6):2005 2019, 2019a.

Kui Yu, Lin Liu, Jiuyong Li, Wei Ding, and Thuc Duy Le. Multi-source causal feature selection. IEEE transactions on pattern analysis and machine intelligence, 42(9):2240 2256, 2019b.

Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In International Conference on Machine Learning, pages 11214 11224. PMLR, 2020.

Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541 2563, 2006.