# PAC Generalization via Invariant Representations

Advait Parulekar¹, Karthikeyan Shanmugam², Sanjay Shakkottai¹

## Abstract

Invariant representations are transformations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: if a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen intervened SEMs? Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that hold probabilistically over a family of linear SEMs without faithfulness assumptions. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.

## 1. Introduction

A common failure of empirical risk minimization in machine learning is that it is beneficial to exploit spurious correlations to minimize training loss. A classic example comes from (Beery et al., 2018), where we have a dataset with pictures of cows and camels. In most cases, the cows are in pastures and the camels are in deserts, and a classifier that works well on such a dataset generalizes poorly to classifying cows in deserts. How do we prevent a classifier tasked with distinguishing pictures of cows from pictures of camels from using the color of the images?
Certainly, doing so might actually help with the classification. Take an extreme example: rather than just having an indicative background, the species of the animal is annotated in the image itself. In this case, simply learning to interpret the annotation will lead to an excellent classifier. However, such a classifier will perform poorly on a future dataset with incorrect or missing annotations. In asking for such out-of-distribution (OOD) generalization (generalization to different distributions rather than to different samples from the same distribution), we are asking for more than what is guaranteed by traditional PAC learning. For instance, we might have a covariate shift, meaning that the marginal distribution of the covariates differs between the joint distributions of our training and test data. We would like a classifier that is trained on training environments to generalize to the test environments.

Recently, in the OOD generalization literature, the data for the training and test environments have been modeled as arising from a set of causal models that are intervened versions of each other. The true causal relationship between a target variable and its causal parents remains invariant, while other relationships may change (Peters et al., 2016; Bareinboim et al., 2012). Building on this observation, the authors of (Arjovsky et al., 2019) proposed to learn a representation across the training environments such that the optimal classifier on top of it remains the same. We refer to this as the invariant representation in this work. Will this representation and the predictor generalize (remain invariant) in unseen environments?

*¹Department of Electrical and Computer Engineering, University of Texas at Austin. ²Google Research India. Correspondence to: Advait Parulekar. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*
Does such a formulation result in improvements over other methods such as empirical risk minimization or distributionally robust optimization? There has been some debate about this matter. Some theoretical works (Rosenfeld et al., 2020; Kamath et al., 2021) describe regimes and settings in which either the convex penalty suggested by (Arjovsky et al., 2019) or the invariance formulation in general is not sufficient to generalize to new environments. On the other hand, there are also settings described in (Arjovsky et al., 2019; Ahuja et al., 2020; Kamath et al., 2021) in which IRM does provably lead to OOD generalization guarantees that are not possible using other methods. However, these prior studies require strong conditions such as faithfulness (Yang et al., 2018; Pearl, 2009) that come from the structure learning community. Indeed, as noted in (Uhler et al., 2013), faithfulness (that observed invariances imply structural constraints on the causal graph) is a very strong assumption in the finite sample setting. This is because the volume of the set of linear SEMs that are close to ones with faithfulness violations is a large constant fraction of all linear SEMs. This makes it difficult to resolve, with finite samples, whether an observed conditional independence truly holds due to the underlying causal structure. Invariance-based results almost always study the infinite sample setting and assume general position conditions on the training environments; like the faithfulness assumption, these become much stronger assumptions in the finite sample setting. Thus, in the worst case, perhaps we cannot expect generalization guarantees without exponentially many interventions/samples.
Instead of studying the worst case, suppose we consider the PAC setting: can we now derive finite sample generalization guarantees without the need for strong faithfulness or general position assumptions, yet with a polynomial scaling in interventions/samples?

## 1.1. Contributions

We answer the above affirmatively, and provide the first generalization and associated finite-sample guarantees for invariant representations without faithfulness assumptions. We study a setting in which we have access to a family of linear SEMs related to each other by hard and soft interventions on arbitrary subsets of variables. We assume that the training environments arise by random sampling from a distribution over this family. We derive a PAC bound on the number of interventions needed to get probabilistic generalization on the family of SEMs. A central result is that we need only $O(n^4/\delta)$ training environments, such that approximately invariant representations on the training set generalize with probability $1-\delta$ on the family of unseen linear SEMs. Further, our theory also works for general linear representations beyond feature selection, for indirect and direct observation models (e.g., with latent variables), with a scaling of $O(n^8/\delta)$; please see Appendix 9.4 for details. We show a tighter interventional complexity, independent of $n$, for SEM families that have simultaneous atomic interventions on any fixed set of $k$ nodes, and for soft interventions on $k$ nodes in a degree-bounded setting. Secondly, we characterize the number of samples in each training environment that is needed for approximate invariance to hold with high probability. This gives an end-to-end sample complexity guarantee, along with the interventional complexity required in the above setting, without any faithfulness or general position assumptions. Furthermore, we show extensions of our results to a linear indirect observation model that incorporates latent variables.
Finally, we empirically demonstrate that, in the setting described above and using a notion of generalization that we describe, most approximately invariant representations generalize to most new distributions.

## 1.2. Related Work

**Causal Structure Learning:** One approach, in the presence of data sampled from structural models, is simply to try to learn the DAG structure explicitly. Algorithms that use conditional independence (CI) tests and/or penalized likelihood scoring are used to determine the DAG (Spirtes et al., 2000; Solus et al., 2021; Chickering, 2020; Brouillard et al., 2020). These methods can only learn DAGs up to what is called the Markov Equivalence Class (MEC) (Verma & Pearl, 1990), or I-MECs (Interventional MECs) when given access to general interventions (Hauser & Bühlmann, 2012; Squires et al., 2020; Brouillard et al., 2020; Yang et al., 2018), under some form of faithfulness assumption on the underlying causal model. It has been shown that some forms of the faithfulness assumption are actually stronger than one might expect: linear SEMs that are close to a faithfulness violation form a much larger set (Uhler et al., 2013) and are difficult to distinguish from faithful ones in finite samples. However, given that we only want generalization guarantees for invariance and not structure recovery, one could argue that learning the structure, or even feature selection, is stronger than what we actually need.

**Domain Adaptation Methods:** The domain adaptation literature (Ganin et al., 2016; Muandet et al., 2013; Ben-David et al., 2007) learns representations whose distribution does not change across training and an unlabeled test distribution. Another line of work searches for the best mixture of training distributions to train on, or mixture of pretrained models, for generalization to unlabeled test data or test data with limited test labels (Mansour et al., 2021; 2008).
Distributionally robust optimization approaches that optimize the worst-case risk over an uncertainty set of distributions have also been proposed (Duchi et al., 2021; Sagawa et al., 2019). Please see Appendix A of (Gulrajani & Lopez-Paz, 2020) for a more exhaustive list of works. Robust optimization and multi-source domain adaptation methods search in an uncertainty ball around the training distributions, while domain adaptation methods fail even under a shift in the marginal distribution of the labels (Zhao et al., 2019). Such methods fail when the test distribution is outside the convex hull of the training distributions. (Magliacane et al., 2017) use invariance tests, which require faithfulness assumptions. (Chen et al., 2023) applies a multi-objective optimization perspective to the OOD problem to resolve some of the optimization difficulties of OOD generalization. Finally, the interventional complexity of learning invariant representations is also considered in (Chen et al., 2022) and (Wang et al., 2022). Under the data model of (Rosenfeld et al., 2020), (Chen et al., 2022) show that a smoothed analysis (with distributional assumptions) can achieve logarithmic interventional complexity, while (Wang et al., 2022) use second-order moments to get just O(1) training environments in the infinite sample limit.

**Finite Samples and Invariant Risk Minimization:** One framework proposed to bypass structure learning difficulties and directly look for invariant representations is Invariant Risk Minimization (IRM) (Arjovsky et al., 2019). (Arjovsky et al., 2019) show that, under some general position assumptions on the population covariances of the training distributions, the exact invariant representation recovers the true causal parents. The work of (Ahuja et al., 2020) makes similar assumptions to derive sample complexity guarantees.
(Kamath et al., 2021) also consider the question of getting generalization guarantees from IRM for general distributions, and address the number of training interventions needed for IRM; however, they work in the infinite sample limit with exact invariance and require an analytic mapping between the intervention index set and the training distributions. All these works assume regularity conditions and provide deterministic guarantees for a test distribution.

**Invariant Predictors on DAGs:** Recent works (Subbaswamy et al., 2019) have assumed knowledge of the causal DAG behind the unseen test and/or train distributions, along with information about the nodes that are intervened on in the test. It is then possible to characterize all possible invariant representations as a function of the graph. In our study, we do not assume any such side information about the SEM underlying our distributions. Finally, in (Rothenhäusler et al., 2021), there is a known set of variables (anchors) whose effect is considered to be spurious (in our setting, we do not know these). They study a weighted projection that penalizes estimates that are aligned with the anchor observations. In contrast, we study the IRM objective: rather than being a min-max of residual error over some set of distributions, our objective encourages representations that lead to invariant least squares solutions.

## 2.1. Structural Equation Model

We consider a causal generative model specified by a linear Structural Equation Model (SEM): data generated by linear equations on a directed acyclic graph (DAG) of variables. Specifically, we work with a random $(n+1)$-dimensional vector $X = (X_1, X_2, \ldots, X_n, X_t)$. We assume that there is some unknown directed acyclic graph $G$ with nodes $(v_1, v_2, \ldots, v_n, v_t)$ which specifies the relationships between the variables in $X$, so that node $v_i$ corresponds to the random variable $X_i$.
The variables are related to each other by structural equations of the form:

$$X_i^\theta = (\beta_i^\theta)^\top \mathrm{Pa}_i^\theta + \eta_i^\theta = \sum_{j : v_j \in \mathrm{Pa}_i} \beta_{j,i}^\theta X_j^\theta + \eta_i^\theta, \quad (1)$$

$$X_t = \beta_t^\top \mathrm{Pa}_t + \eta_t = \sum_{j : v_j \in \mathrm{Pa}_t} \beta_{j,t} X_j + \eta_t. \quad (2)$$

Here the $\eta_i$ are the exogenous variables in our system, and the $X_i$ are the endogenous variables. $\mathrm{Pa}_i$ denotes the set of random variables (stacked as a column vector) corresponding to the parents of node $v_i$ in $G$. When there is a directed edge from node $v_j$ to $v_i$, the corresponding weight of the linear SEM is $\beta_{j,i}$. There is a target node $X_t$ that is special in that we would like to predict its value given the rest of the variables. We denote the vector consisting of the rest of the variables by $X_{-t}$. We denote by barred quantities the zero-padded versions of the above vectors; for example, $\bar\beta_t$ is such that $\beta_t^\top \mathrm{Pa}_t = \bar\beta_t^\top X_{-t}$.

## 2.2. Interventions

An important feature of Equation (1) is the superscript $\theta$, for $\theta \in \Theta$, an element of some intervention index set $\Theta$. $\Theta$ specifies a parameterized intervention family, i.e., a family of linear SEMs related to each other by interventions. Conceptually, the various environments correspond to different mechanisms by which causal parents influence children. The changes in mechanisms are reflected in the weights $\beta_{j,i}^\theta$ parameterized by $\theta$. Note that the conditional distribution of the target variable $X_t$ given its parents $\mathrm{Pa}_t$ is assumed to be the same for all $\theta \in \Theta$. This is a property that we would like to exploit so that our task of predicting $X_t$ is robust to different interventions in $\Theta$, i.e., there exists at least one robust predictor for all environments. We begin with an observational distribution whose weights we denote simply by $\beta$. There are two common types of interventions studied:

**Atomic or Hard Interventions:** In a hard intervention $\theta$ at some node $v_i$, we assign a value to $X_i$. This corresponds to setting $\beta_{j,i}^\theta = 0$ for all $X_j \in \mathrm{Pa}_i$, and setting $\eta_i^\theta = a_i^\theta$.
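The data-generating process in Equations (1)–(2) and a hard intervention can be sketched in a few lines. This is a minimal illustration, not the paper's code; the matrix convention `B[j, i]` for the weight on the edge $v_j \to v_i$ and the function names are our own:

```python
import numpy as np

def sample_sem(B, noise_std, n_samples, rng):
    """Draw samples from the linear SEM X_i = sum_j B[j, i] X_j + eta_i.

    B[j, i] holds the weight beta_{j,i} on the edge v_j -> v_i; with the
    nodes listed in topological order, B is strictly upper triangular,
    so (I - B) is invertible.  Rows of the returned array are samples.
    """
    n = B.shape[0]
    eta = rng.normal(scale=noise_std, size=(n_samples, n))
    # Row-wise, x = x B + eta, hence x = eta (I - B)^{-1}.
    return eta @ np.linalg.inv(np.eye(n) - B)

def hard_intervention(B, i, a):
    """Atomic intervention do(X_i = a): cut the incoming edges of v_i
    (beta_{j,i} = 0 for all j) and replace eta_i by the constant a."""
    B_new = B.copy()
    B_new[:, i] = 0.0
    return B_new, a
```

For a two-node chain $v_1 \to v_2$ with weight 2, `sample_sem` produces samples with $X_2 = 2X_1 + \eta_2$, and `hard_intervention(B, 1, a)` removes the edge and pins $X_2$ to the constant $a$.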
**Soft Interventions:** In a soft intervention at node $v_i$, the weights are altered, $\beta_i^\theta \neq \beta_i$, while the noise is kept the same, $\eta_i^\theta = \eta_i$.

The SEM parameters $\theta$ determine a joint distribution over $X$, which we denote by $\mathbb{P}^\theta$.

**Distributions over Interventions:** In our model, we consider a distribution $\mathcal{D}$ over $\Theta$ from which we assume that our training and test datasets are drawn, by sampling the intervention index $\theta$ independently and randomly from $\mathcal{D}$. We provide generalization bounds that hold with high probability over the randomness of the choice of test distribution, rather than deterministically as has been done in prior works (as we will explore in more detail, providing deterministic bounds is not possible without additional assumptions). Formally, $\Theta_{\mathrm{train}}$ is an index set of interventions drawn i.i.d. from $\mathcal{D}$ over $\Theta$. The test intervention index $\theta_{\mathrm{test}}$ is also drawn from $\mathcal{D}$ independently. The learner does not see any details about the intervention (which nodes are affected, etc.), but sees the distinct datasets from the distinct interventions.

**Data generation:** In this sense, the training data is generated by first sampling $\Theta_{\mathrm{train}}$ from $\mathcal{D}$, and then sampling $|\Theta_{\mathrm{train}}|$ training datasets $X^\theta \sim \mathbb{P}^\theta$ for $\theta \in \Theta_{\mathrm{train}}$.

## 3. PAC-Invariant Representations

Motivated by Invariant Risk Minimization (IRM), we consider the search for a model that performs well under a variety of distributions, in the hope of attaining generalization guarantees. Consider a prediction function $f$ with loss $R^\theta(f)$, where the superscript $\theta$ represents the intervention index. In this paper we consider least squares regression: $f \in \mathbb{R}^n$ is a vector, and the loss is given by

$$R^\theta(f) = \mathbb{E}^\theta\big[(f^\top X_{-t} - X_t)^2\big], \quad (3)$$

where the expectation is taken with respect to $\mathbb{P}^\theta$. We are interested in generalization guarantees for representations $\Phi$ satisfying the property that

$$f^* = \arg\min_f R^\theta(f \circ \Phi) \quad \forall\, \theta \in \Theta_{\mathrm{train}} \quad (4)$$

is invariant across environments.
That is, the least squares solution on top of the representation is the same for every intervention. We denote the set of all invariant representations over a class of SEMs $\Theta'$ by $I(\Theta')$:

$$I(\Theta') = \{\Phi : \exists f^* \text{ s.t. } f^* = \arg\min_f R^\theta(f \circ \Phi)\ \forall\, \theta \in \Theta'\}.$$

We refer to the full model $f \circ \Phi$, when $\Phi$ is invariant, as an invariant solution. We use $R^\theta_\Phi(f)$ to denote the loss for $f$ on top of the representation $\Phi$, so $R^\theta_\Phi(f) = R^\theta(f \circ \Phi)$. We refer to $f$ as the head of the model. Feature selection, a motivation for the study of invariant representations, corresponds to diagonal representations; henceforth, we focus on the class of representations that are feature selectors, given by diagonal matrices.

**Invariance via gradients:** Rather than working with the loss itself, we follow (Ahuja et al., 2020) and (Arjovsky et al., 2019) and use the gradient of the loss instead (see Lemma 8.5):

$$\Phi \in I(\Theta_{\mathrm{train}}) \iff \exists f \text{ s.t. } \nabla_f R^\theta_\Phi(f) = 0\ \forall\, \theta \in \Theta_{\mathrm{train}}.$$

Unlike previous work, we do not further minimize the weighted empirical loss over training environments. We will instead consider the generalizability of any approximately invariant representation.

**Finite samples and $\epsilon$-approximately invariant representations:** Since we are working with finite samples, we can no longer hope for exact invariance across environments. We slightly change the definition of invariance to require the gradient to be close to 0:

$$\Phi \in I^\epsilon(\Theta_{\mathrm{train}}) \iff \exists f \text{ s.t. } \|\nabla R^\theta_\Phi(f)\|_2 \le \epsilon\ \forall\, \theta \in \Theta_{\mathrm{train}}. \quad (5)$$

We refer to approximate versions of quantities using a superscript $\epsilon$, and, given specific datasets, we refer to finite sample versions of these quantities using hats. For instance, the set of representations that are invariant for some set $\Theta'$ is denoted $I(\Theta')$, the set of $\epsilon$-approximately invariant representations is $I^\epsilon(\Theta')$, and, given a particular dataset, the set of $\epsilon$-approximately invariant representations is denoted $\hat{I}^\epsilon(\Theta')$.
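For least squares, the gradient condition in Equation (5) can be checked directly from the data. The following is a minimal sketch for diagonal (feature-selector) representations; the helper names are ours, and the estimator is the plug-in empirical gradient, not necessarily the paper's exact procedure:

```python
import numpy as np

def grad_risk(X_minus_t, X_t, phi_diag, f):
    """Empirical gradient, w.r.t. the head f, of
    R_Phi(f) = E[(f^T Phi x_{-t} - x_t)^2] for a diagonal Phi."""
    Z = X_minus_t * phi_diag           # rows are Phi x_{-t}
    resid = Z @ f - X_t
    return 2.0 * Z.T @ resid / len(X_t)

def is_eps_invariant(datasets, phi_diag, f, eps):
    """Membership check for I^eps: the gradient at the head f must be
    eps-small in every training environment (Equation (5))."""
    return all(
        np.linalg.norm(grad_risk(Xm, Xt, phi_diag, f)) <= eps
        for Xm, Xt in datasets
    )
```

When the selected features reproduce $\mathbb{E}[X_t \mid \mathrm{Pa}_t]$, the residual is uncorrelated with each selected feature, so the gradient is near zero in every environment; a representation leaning on a spurious feature fails the check in at least one environment.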
Note that the finite sample quantities are random (from the randomness of the samples drawn from $\mathbb{P}^\theta$), while the $\epsilon$-approximate quantities are deterministic. The dataset of $N$ points is denoted in matrix form as $X$, with $X_{-t}$ and $X_t$ denoting the non-target matrix and target vector, respectively.

**Significance of linear representations:** A linear representation $\Phi$ affects invariance non-trivially by selecting an effective column space for the regression, that is, by determining the column space of $\Phi X_{-t}$. Because we are looking for a fixed (across environments) least-squares head on top of the representation, $\Phi$ can be further composed with any invertible linear map, only to be inverted at the head, yielding another invariant solution $(\Phi A, A^{-1} f)$ from an invariant solution $(\Phi, f)$. Because of this freedom, we can actually choose to work with some fixed head $f^*$. When the space of $\Phi$ is all matrices, $f^*$ is chosen to be $(1, 0, \ldots, 0)$. When the space of $\Phi$ is only diagonal matrices, $f^*$ is taken to be $(1, 1, \ldots, 1)$. These choices are formalized in Lemma 8.3 and Lemma 8.4. As we will see, a key construction is that of $I^\epsilon_{f^*}(\theta)$, defined as those $\Phi$ for which the gradient is near zero at $f^*$:

$$\Phi \in I^\epsilon_{f^*}(\theta) \iff \|\nabla R^\theta_\Phi(f^*)\| \le \epsilon,$$

for the specific $f^*$ above. Similarly, for a set of interventions $\Theta'$, we define $I^\epsilon_{f^*}(\Theta') = \bigcap_{\theta \in \Theta'} I^\epsilon_{f^*}(\theta)$. This allows us to view invariance as a property of a representation and a single intervention, rather than of a collection of interventions as in Equation (4). This is discussed further in Remark 3.6.

## 3.1. Motivation for Probabilistic Guarantees

Consider a distribution $\mathcal{D}$ over interventions. During training, we see samples from specific interventions $\Theta_{\mathrm{train}}$ drawn from this distribution, and compute invariant representations for these interventions. Ideally, we would like to be able to say that an invariant solution computed in this way will generalize to all future interventions on the same DAG.
However, this is not possible without further assumptions, as the following examples illustrate.

*Figure 1. Figure for Example 3.1, showing that there can be more independencies in the data than are apparent from the DAG alone.*

**Example 3.1 (Faithfulness).** Consider a linear SEM on four variables $X_1, X_2, X_3, X_t$ given by

$$X_1 = N(0, 1), \quad X_2 = X_1 + N(0, 1), \quad X_3 = a X_1 + b X_2 + N(0, 1), \quad X_t = X_2 + X_3 + N(0, 1),$$

depicted in Figure 1. If $a = -b$ in every intervention, this system has at least the following two invariant representations of $X = (X_1, X_2, X_3)$: $(X_2, X_3)$ and $(X_2)$. However, only the first continues to be invariant when $a \neq -b$. In other words, given an arbitrary number of training environments, each of which satisfies $a = -b$, we might decide that $(X_2)$ is an invariant representation. However, it fails to be invariant once we include a test intervention for which $a \neq -b$. This happens because the joint distribution is not faithful to the DAG provided. In short, the conditional independencies indicated by the DAG are not the only ones found in the data: in our case, the effects of $X_1$ on $X_3$ through the blue path and the red path exactly cancel in all training interventions.

**Example 3.2 (Degenerate Interventions).** Consider the degenerate situation in which each of the interventions is the same. The Empirical Risk Minimization (ERM) solution itself is invariant; however, the ERM solution comes with no generalization guarantees for other interventions.

It is clear that some diversity condition among the training environments is necessary to get generalization guarantees. The key idea is that in such situations one might say that these issues (faithfulness violations as in Example 3.1, or degeneracies as in Example 3.2) will likely continue to manifest in future distributions.
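The path cancellation in Example 3.1 is easy to reproduce numerically: the direct path $X_1 \to X_3$ (weight $a$) and the path through $X_2$ (weight $b$) give $\mathrm{Cov}(X_1, X_3) = (a+b)\,\mathrm{Var}(X_1)$, which vanishes exactly when $a = -b$. A small simulation of our own, not from the paper:

```python
import numpy as np

def cov_x1_x3(a, b, n=200_000, seed=0):
    """Simulate the SEM of Example 3.1 and return the empirical
    Cov(X1, X3); the two paths from X1 to X3 cancel when a = -b."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(size=n)
    x3 = a * x1 + b * x2 + rng.normal(size=n)
    return float(np.cov(x1, x3)[0, 1])
```

With $a = 1, b = -1$ the empirical covariance is statistically indistinguishable from zero even though the DAG contains the edge $X_1 \to X_3$, which is exactly why a finite-sample CI test cannot distinguish this unfaithful model from one with no such edge.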
In particular, if we assume that both the training and test interventions come from a single distribution $\mathcal{D}$ over interventions, can we at least say that, with high probability over a test intervention also drawn from $\mathcal{D}$, we will generalize to that intervention? More formally, we ask: how big should $\Theta_{\mathrm{train}}$ be so that, for $\theta \sim \mathcal{D}$, $I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \subseteq I^\epsilon_{f^*}(\theta)$ with high probability? In other words, we would like invariance on our training interventions to certify invariance on a future intervention with high probability. To further clarify the type of guarantee considered, we look at the following example.

**Example 3.3.** Consider the DAG given in Figure 2, a chain $X_1 \to X_2 \to \cdots \to X_{n-1} \to X_n$.

*Figure 2. DAG for the intervention in Example 3.3.*

Consider the following distribution over interventions: we set $X_1 = 1$. Independently, and with probability $1/2$ for each node that is not $X_1$ or $X_t$, we assign its value to be 0; otherwise, we set its value equal to its parent. The only invariant representations in this case are $(X_n)$ and $(X_1, X_n)$. Now consider any training set of $m$ interventions. If $m \ll 2^n$, we expect never to see the intervention in which none of the edges are disconnected (i.e., an environment with no intervention on any of $X_2, X_3, \ldots, X_{n-1}$), in which case $(X_1)$ also appears to be invariant over every observed intervention. However, it fails to be invariant on the disconnected intervention described above, which occurs with an extremely low probability of $1/2^{n-1}$. Our result instead gives a probabilistic guarantee: with a certain probability over the randomness from which we are drawing interventions, any representation that appears to be invariant over the training set will also be invariant when the set is expanded to include another i.i.d. intervention (the test intervention). As this example demonstrates, we may need $\Theta(2^n)$ interventions to get deterministic guarantees, while our $\mathrm{poly}(n)$ result highlights the benefit of a probabilistic guarantee in this setting.
## 3.2. PAC Formulation

In this section, we show that the question of generalization of $\epsilon$-invariant representations introduced in the previous section can be rewritten as a question of PAC generalization.

### 3.2.1. VC Dimension and PAC Generalization

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a subset of points. A function class $\mathcal{G}$ is said to shatter $\mathcal{X}$ if every binary assignment to $\mathcal{X}$ is realized by some element of $\mathcal{G}$; that is, for any $\sigma : \mathcal{X} \to \{\pm 1\}$, there is some $f \in \mathcal{G}$ such that $f(x) = \sigma(x)$ for all $x \in \mathcal{X}$. The VC dimension of a function class is the size of the largest set that it shatters. The following result from PAC learning theory allows us to determine how many interventions are needed to generalize.

**Lemma 3.1** ((Shalev-Shwartz & Ben-David, 2014)). Consider a class $\mathcal{G}$ of functions from $\mathcal{X}$ to $\{\pm 1\}$ of VC dimension $D$, and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{\pm 1\}$. In the realizable setting, when there exists $f \in \mathcal{G}$ such that $\mathbb{P}_{(X,Y) \sim \mathcal{D}}[f(X) \neq Y] = 0$, given $m = O\big(\frac{D + \log(1/\delta_1)}{\delta_2}\big)$ samples $(x_i, y_i)_{i=1}^m$ and any $\hat f$ such that $\hat f(x_i) = y_i$ for all $i$, with probability at least $1 - \delta_1$, $\hat f$ satisfies $\mathbb{P}_{(X,Y) \sim \mathcal{D}}[\hat f(X) \neq Y] < \delta_2$.

We will need the following known VC dimension.

**Lemma 3.2** ((Shalev-Shwartz & Ben-David, 2014)). The VC dimension of halfspaces in $\mathbb{R}^D$, that is, of the function class $\mathcal{G} = \{\mathbf{1}\{a^\top x < b\} : a \in \mathbb{R}^D, b \in \mathbb{R}\}$, is $D + 1$.

### 3.2.2. PAC Invariance

We will use PAC learning theory to study generalization for representations. For that, we rephrase the invariance problem as one about the generalization of binary classifiers. The proof is deferred to the Appendix, Lemma 8.7.

**Lemma 3.3.** There is a function class $\mathcal{G}$ mapping interventions $\theta \in \Theta$ to $\{0, 1\}$ such that $\Phi \in I^\epsilon_{f^*}(\Theta_{\mathrm{train}})$ iff there exists $g^\epsilon_\Phi \in \mathcal{G}$ such that $g^\epsilon_\Phi(\theta) = 1$ for all $\theta \in \Theta_{\mathrm{train}}$.

In summary, if we can bound the VC dimension of $\mathcal{G}$, we can give interventional complexity guarantees. The proof is deferred to the Appendix, Corollary 8.8.

**Corollary 3.4.**
If the VC dimension of $\mathcal{G}$ is bounded by $D$, then given at least $O\big(\frac{D + \log(1/\delta)}{\delta'}\big)$ training interventions, if $I^\epsilon(\Theta) \neq \{\}$, we have with probability $1 - \delta$ over the set of training interventions that $\mathbb{P}_{\theta \sim \mathcal{D}}\big(I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \not\subseteq I^\epsilon_{f^*}(\theta)\big) \le \delta'$.

Finally, there is at least one truly invariant representation, as the following lemma shows.

**Lemma 3.5.** There exists an invariant representation.

*Proof.* The representation $\Phi = \mathrm{diag}(\bar\beta_t)$ is invariant; that is, the diagonal matrix with the entries of $\beta_t$ on the diagonal at the indices corresponding to $\mathrm{Pa}_t$.

**Remark 3.6** (Fixing a head is important for PAC learning). Considering a fixed head $f^*$ allows us to certify invariance locally, that is, by looking at only a single intervention at a time. This is in contrast with the previous characterization, Equation (4), in which we simply ask for the best head on top of a representation to be the same for all interventions, a condition that we can only check by considering all interventions simultaneously.

## 3.3. The VC Dimension of Certain Classes of Interventional Distributions

Further, we show that for atomic interventions on some fixed set of $k$ nodes, $O(k^4)$ training interventions suffice. For soft interventions on $k$ nodes, assuming that each node has in-degree bounded by $d$, we show that $O(d^4 k)$ interventions suffice. This captures the intuition that if there is only some sparse, local change in the SEM, it should only take a number of interventions that scales with the sparsity to capture the invariances it leaves. In the next section, we show how to extend these results to the finite sample setting.

### 3.3.1. General Interventions

In this section, we approach the task of bounding the VC dimension from the perspective of the complexity of the representation space. The class of interventions we consider is any intervention that leads to a joint distribution over the variables that leaves fixed the conditional distribution of the target variable given its parents.

**Theorem 3.7.**
Suppose that we are given $O\big(\frac{n^4 + \log(1/\delta)}{\delta'}\big)$ interventions drawn independently from a distribution $\mathcal{D}$ over the intervention index set $\Theta$. Then with probability at least $1 - \delta$ over the randomness in $\Theta_{\mathrm{train}}$, the following holds: $\mathbb{P}_{\theta \sim \mathcal{D}}\big(I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \not\subseteq I^\epsilon_{f^*}(\theta)\big) \le \delta'$. In other words, an $\epsilon$-approximately invariant representation on the training environments generalizes with high probability to the test environment.

**Connections to Causal Bayesian Networks:** The invariance principle, also (less popularly) known as the modularity condition, can equivalently be used to define causal Bayesian networks (Bareinboim et al., 2012), the central object in Pearlian causal models. With respect to all possible interventions in which variable $y$ has not been intervened on, only the representation that involves the true causal parents of $y$ is invariant. In other words, the property of invariance can be taken to be the signature of causality. An open question in this setting has been the following: in how many interventional environments does this invariance property need to hold before it generalizes to most unseen environments? With respect to random sampling of interventional environments for linear SEMs, our result answers this via high probability generalization guarantees without any faithfulness assumptions.

**Structure Learning and Faithfulness:** Invariance testing between a given set of interventional environments has been used in the literature to constrain the space of causal models (Yang et al., 2018). However, the learning algorithm that synthesizes the various invariances does require some notion of interventional faithfulness, i.e., that observed invariances imply topological constraints on the true causal graph. In contrast, we do not make any such faithfulness assumptions.

**Comparison to Regularity Conditions in Invariant Prediction:** Recently, invariant prediction (Arjovsky et al., 2019) has been used for out-of-distribution generalization.
However, what was desired there was exact invariance holding deterministically in all unseen distributions from a family of linear SEMs. This required some general position conditions on the population covariances of the training interventional environments. Despite these additional conditions, the number of environments required was $O(n^2)$. Our result on the interventional complexity is weaker, but it requires no such assumptions on covariances and holds under random sampling for approximate invariances. Similar technical general position conditions were also required for generalization in (Ahuja et al., 2020). In another recent work (Kamath et al., 2021), if the mapping between the intervention index set $\Theta$ and the observed training distributions on $X$ is analytic, then only two training environments (almost surely) suffice for exact invariance in the population setting (i.e., infinite samples) for the entire index set. Here, we analyze approximate invariance without any restriction that the mapping be analytic. For instance, their result does not apply in Example 3.3, where we need two specific interventions (one in which each node is assigned the value of its parent, and any other) to certify invariance. In fact, any set of interventions needs to contain the specific intervention highlighted in the example to certify invariance, and this happens with very low probability.

### 3.3.2. Atomic Interventions

Here we consider the instance in which our family of interventions consists of atomic interventions on a total of $k$ nodes $A_t$. An atomic intervention at node $i$ replaces the generative equation $X_i = \beta_i^\top \mathrm{Pa}_i + \eta_i$ by the assignment $X_i = a_i$ for some scalar $a_i$. Note that we can interpret this as zeroing out the corresponding entries of $B$ and changing the noise variable to be the constant $a_i$.

**Theorem 3.8.**
Given $|\Theta_{\mathrm{train}}| = O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over all atomic interventions on some fixed set $At$ of $k$ nodes, we have that with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

3.3.3. SOFT INTERVENTIONS

Soft interventions are those in which the weights of the SEM are modified while the underlying causal structure remains the same. That is, a soft intervention at node $v_i$ replaces the equation $X_i = (\beta_i)^\top X + \eta_i$ with $X_i = (\beta^{\theta}_i)^\top X + \eta_i$. Note that this is also equivalent to changing the $i$th row of $B^{\theta}$.

Theorem 3.9. Given $O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over all soft interventions on some fixed set $At$ of $k$ nodes such that each intervened node has in-degree at most $d$, we have that any $\epsilon$-approximately invariant representation $\Phi$ satisfies, with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

3.3.4. INDIRECT OBSERVATIONS

Finally, we consider the setting in which there is an underlying SEM that we observe only after a linear transformation. In particular, we consider the setting in which we observe $(S_{-t}(X), S_t(X))$ for some linear transformations $S_{-t}$ and $S_t$. Here, $X$ is generated as before, according to Equation 1. Note that the target variable is now $S_t(X)$. For realizability (in particular, for Lemma 3.5), we assume the following:

Assumption 3.10. There is some linear representation $\Phi$ such that $\Phi S_{-t}(X) = \mathrm{Pa}_t$.

Theorem 3.11. Given $O\left(\frac{n^8 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$, we have that any $\epsilon$-approximately invariant representation $\Phi$ satisfies, with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

Remark 3.12.
Note that Theorem 3.11 encompasses models that have latent variables, so long as the causal parents are observed. The existence of latent variables typically precludes any guarantees for algorithms that rely on CI-testing; one could consider this a factor in favor of using IRM-based solutions for OOD generalization.

Remark 3.13. Note that we only need Assumption 3.10 to be satisfied, not that the functions $S_{-t}$ and $S_t$ themselves be linear. This is weaker than a linearity assumption, since we could have non-linearity in the spurious components.

4. Finite samples

The discussion above concerned population statistics. In practice, we only see finitely many samples from each interventional distribution. We use an estimator similar to that of (Ahuja et al., 2020) to obtain finite-sample guarantees. We need the following assumption for scale.

Assumption 4.1. The following bounds hold: $\|\Phi\|_2 \leq 1$ and $\|X\| < L$.

Lemma 4.2. Given $\frac{4nL^2}{\epsilon^2\delta} + n^2 \log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples from $\theta$, we have that, with probability $1-\delta$ over the samples drawn in each interventional distribution¹, $\hat{I}^{\epsilon}_{f}(\theta) \subseteq I^{2\epsilon}_{f}(\theta)$.

Proof Sketch. We show this using standard concentration arguments for a single fixed representation $\Phi$. We then take a union bound over an $\epsilon$-net (in $\Phi$) of the gradients $\nabla R^{\theta}_{\Phi}$ to get a uniform bound over all representations. See Lemma 10.3 for a proof.

¹In contrast to the claims over the randomness $D$ over $\Theta$ in the previous theorems, this is a statement about the randomness of $X$ under $\theta$.

We can now take a union bound over all $\theta \in \Theta_{\mathrm{train}}$. In conclusion, we have shown that, with high probability over the randomness in the samples we see in each of our interventional distributions, $\epsilon$-approximate invariance over the training data certifies $\epsilon$-approximate invariance over a $1-\delta$ fraction of our full intervention set. The proof is deferred to Appendix Theorem 10.4.

Theorem 4.3.
Given
$O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of hard interventions on $k$ nodes,
$O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of soft interventions on $k$ nodes,
$O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of arbitrary interventions,
and, in each dataset, at least $\frac{4nL^2}{\epsilon^2\delta} + n^2\log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples, we have that with probability $1-\delta$, with probability $1-\delta$ for $\theta \sim D$, $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{2\epsilon}_{f}(\theta)$.

5. Empirical Study

In this section we illustrate the above results empirically. The experiment is described in detail in Appendix 11.

Figure 3. Linear SEM for Section 5. Nodes $\{v_3, v_4, v_5\}$ (purple) are used as intervention sites for $D_{\mathrm{hard}}$ and $D_{\mathrm{soft}}$. Each edge represents a weight of 1. Node $X_t$ (green) is taken to be the target node.

We consider the 7-node linear SEM in Figure 3. The target variable is taken to be $X_t$. Each edge weight is set to 1 for the observational distribution. We consider an interventional distribution $D_{\mathrm{hard}}$ supported on the set of hard interventions on nodes $\{v_3, v_4, v_5\}$. Recall that a hard intervention consists of assigning a value to a node. We draw $m$ interventional distributions from $D_{\mathrm{hard}}$ as our training interventions and draw a sample of $N = 200{,}000$ datapoints from each distribution. The corresponding plots with $N \in \{10{,}000, 15{,}000, 30{,}000, 45{,}000\}$ are available in Appendix 11. We use a notion of inter-dataset variance of the least-squares solutions to construct a set $s_{\mathrm{hard}} = \{S_1, S_2, \ldots\}$, where each $S_i$ is an approximately invariant representation derived from the training datasets. Rather than using the variance, we can use the norm of the gradient; please see Appendix 11 for details. We then generate an additional $m_{\mathrm{test}} = 100$ test interventions from $D_{\mathrm{hard}}$ and generate datasets of size $N = 200{,}000$ from each of them. We use the same inter-dataset variance to count the percentage of subsets in $s_{\mathrm{hard}}$ that continue to have low variation between the least-squares solutions on the new test distributions and the average least-squares solution on the training distributions.
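The subset-selection procedure just described can be sketched end to end on a small hypothetical SEM. The 7-node graph of Figure 3 is not reproduced here; the 3-variable chain, the threshold 0.02, and all names below are our own illustration, with a simple variance ratio playing the role of the inter-dataset criterion:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def hard_dataset(n=50_000):
    """One dataset from a hypothetical 3-variable SEM p -> xt -> w, where a
    hard intervention do(v = a), a ~ N(0, 4), shifts the spurious child w."""
    a = 2.0 * rng.normal()
    p = rng.normal(size=n)              # causal parent of the target
    xt = p + rng.normal(size=n)         # target
    w = xt + a + rng.normal(size=n)     # spurious child, shifted by the intervention
    return np.column_stack([p, w]), xt

def lstsq_on_subset(X, y, S):
    """Least-squares fit of y on the columns in S, embedded in full dimension."""
    f = np.zeros(X.shape[1])
    sol, *_ = np.linalg.lstsq(X[:, list(S)], y, rcond=None)
    f[list(S)] = sol
    return f

def rho(fs):
    """Inter-dataset variance ratio used to call a subset approximately invariant."""
    m = len(fs)
    num = sum(np.sum((fa - fb) ** 2) for fa in fs for fb in fs)
    den = (m - 1) * sum(np.sum(fa ** 2) for fa in fs)
    return num / den

m = 10
datasets = [hard_dataset() for _ in range(m)]
subsets = [S for k in (1, 2) for S in combinations(range(2), k)]
invariant = [S for S in subsets
             if rho([lstsq_on_subset(X, y, S) for X, y in datasets]) < 0.02]
# Only the subset containing the causal parent should pass the threshold.
```

On this toy instance, the least-squares solution on the causal parent is essentially the same in every training dataset, while any subset touching the intervention-shifted child varies substantially across datasets and is rejected.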
We repeat the above for soft interventions drawn from $D_{\mathrm{soft}}$. We consider soft interventions that modify the weights of the edges into $v_3$, $v_4$, and $v_5$. Please see Appendix 11 for exact details of how the hard and soft interventional ensembles are defined.

Figure 4. Generalization of approximately invariant representations under different combinations of train and test interventional distributions (hard, soft, hard+rademacher, ERM+hard) for the experiment described in Section 5. The ERM solution almost always fails to generalize, and the corresponding box plot is at 0.

We also consider a different interventional distribution $D_{\mathrm{rad}}$ to test the generalization of representations that are approximately invariant given training distributions drawn from $D_{\mathrm{hard}}$. These interventions are generated by randomly flipping the signs of the edge weights in the original SEM. Finally, to confirm that there is indeed variation across the least-squares solutions in the different environments, we also plot the percentage of test distributions on which the ERM solution computed on the training distributions is similar to the ERM solution of the test distribution (this percentage is almost always 0, meaning that the ERM solution essentially always fails to generalize). We repeat this for $m$ varying between 3 and 20. The results are plotted in Figure 4. The code is available at https://github.com/advaitparulekar/PAC IRM.

Interpretation: While not every subset in $s_{\mathrm{hard}}$ and $s_{\mathrm{soft}}$ generalizes to every test dataset, we see that most subsets generalize to a large percentage of test distributions. The percentage that generalizes when the train and test interventions are identically distributed exceeds the percentage that generalizes to datasets that come from the (adversarial) Rademacher interventional family.
This phenomenon is captured in our bounds, which describe when most approximately invariant representations will continue to be approximately invariant on most test distributions.

Acknowledgements

This research is supported in part by NSF Grants 2019844, 2107037 and 2112471, ONR Grant N00014-19-1-2566, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References

Ahuja, K., Wang, J., Dhurandhar, A., Shanmugam, K., and Varshney, K. R. Empirical or invariant risk minimization? A sample complexity perspective. In International Conference on Learning Representations, 2020.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bareinboim, E., Brito, C., and Pearl, J. Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning, pp. 1–17. Springer, 2012.

Beery, S., Horn, G. V., and Perona, P. Recognition in terra incognita. In ECCV, 2018.

Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. arXiv preprint arXiv:2007.01754, 2020.

Chen, Y., Rosenfeld, E., Sellke, M., Ma, T., and Risteski, A. Iterative feature matching: Toward provable domain generalization with logarithmic environments. In Advances in Neural Information Processing Systems, 2022.

Chen, Y., Zhou, K., Bian, Y., Xie, B., Wu, B., Zhang, Y., Ma, K., Yang, H., Zhao, P., Han, B., et al. Pareto invariant risk minimization: Towards mitigating the optimization dilemma in out-of-distribution generalization. In The Eleventh International Conference on Learning Representations, 2023.

Chickering, M. Statistically efficient greedy equivalence search.
In Conference on Uncertainty in Artificial Intelligence, pp. 241–249. PMLR, 2020.

Duchi, J. C., Glynn, P. W., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2020.

Hauser, A. and Bühlmann, P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464, 2012.

Kamath, P., Tangella, A., Sutherland, D., and Srebro, N. Does invariant risk minimization capture invariance? In International Conference on Artificial Intelligence and Statistics, pp. 4069–4077. PMLR, 2021.

Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. arXiv preprint arXiv:1707.06422, 2017.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Mansour, Y., Mohri, M., Ro, J., Suresh, A. T., and Wu, K. A theory of multiple-source adaptation with limited target labeled data. In International Conference on Artificial Intelligence and Statistics, pp. 2332–2340. PMLR, 2021.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. PMLR, 2013.

Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009. doi: 10.1017/CBO9780511803161.
Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), pp. 947–1012, 2016.

Rosenfeld, E., Ravikumar, P. K., and Risteski, A. The risks of invariant risk minimization. In International Conference on Learning Representations, 2020.

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., and Peters, J. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society, Series B, 83(2):215–246, April 2021. doi: 10.1111/rssb.12398.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. doi: 10.1017/CBO9781107298019.

Solus, L., Wang, Y., and Uhler, C. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795–814, 2021.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2000.

Squires, C., Wang, Y., and Uhler, C. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pp. 1039–1048. PMLR, 2020.

Subbaswamy, A., Schulam, P., and Saria, S. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3118–3127. PMLR, 2019.

Sullivant, S., Talaska, K., and Draisma, J. Trek separation for Gaussian graphical models. The Annals of Statistics, 38(3):1665–1685, 2010.

Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. Geometry of the faithfulness assumption in causal inference.
The Annals of Statistics, pp. 436–463, 2013.

Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI '90, pp. 255–270, USA, 1990. Elsevier Science Inc. ISBN 0444892648.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices, 2011.

Wang, H., Si, H., Li, B., and Zhao, H. Provable domain generalization via invariant-feature subspace recovery. In International Conference on Machine Learning, pp. 23018–23033. PMLR, 2022.

Yang, K., Katcoff, A., and Uhler, C. Characterizing and learning equivalence classes of causal DAGs under interventions. In International Conference on Machine Learning, pp. 5541–5550. PMLR, 2018.

Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. PMLR, 2019.

$X_i$: random variable in the SEM, see Equation 1
$\beta_i$: linear weight vector, see Equation 1
$\mathrm{Pa}_i$: vector of random variables corresponding to the parents of $X_i$, see Equation 1
$\Theta$: set of possible interventions
$\Theta_{\mathrm{train}}$: set of training interventions
$D$: distribution from which training and test distributions are sampled
$\theta$: joint distribution over variables induced by intervention $\theta$
$I(\Theta')$: set of invariant representations with respect to $\Theta'$
$I^{\epsilon}(\Theta')$: set of approximately invariant representations with respect to $\Theta'$
$\hat{I}^{\epsilon}(\Theta')$: set of approximately invariant representations with respect to $\Theta'$ (finite samples)
$\Phi$: diagonal matrix encoding a feature selector
$d$: in-degree of nodes in Section 9.3
$k$: number of nodes that have interventions in Sections 9.2 and 9.3
$\delta$: randomness in the samples in the interventional datasets
$\delta$: randomness in the interventions drawn from $D$
Table 1. Notation table

6. Notation

For any quantity $A$ that depends on population statistics, we use $\hat{A}$ to denote the corresponding sample estimate.
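The population-versus-sample distinction that the hat notation tracks is what drives the finite-sample results of Section 4: empirical risk gradients concentrate around their population counterparts as the per-dataset sample size grows. A minimal numerical sketch of this concentration step (the toy model and all names are our own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear model y = beta . x + noise with x ~ N(0, I), so the population
# risk gradient at any f is available in closed form: E[x x^T] f - E[x y] = f - beta.
beta = np.array([1.0, -0.5, 2.0])
f = np.array([0.8, 0.0, 1.5])   # an arbitrary candidate predictor

def pop_gradient(f):
    """Population gradient of the squared-error risk 0.5 * E[(y - f.x)^2]."""
    return f - beta

def emp_gradient(f, n):
    """Sample estimate of the same gradient from n fresh draws."""
    X = rng.normal(size=(n, 3))
    y = X @ beta + rng.normal(size=n)
    return X.T @ (X @ f - y) / n

# The estimation error shrinks as the number of samples grows.
errs = [np.linalg.norm(emp_gradient(f, n) - pop_gradient(f)) for n in (100, 100_000)]
```

The union bound over an ε-net of representations in Lemma 10.3 then upgrades this pointwise concentration to a guarantee that holds for all representations simultaneously.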
For a vector $a \in \mathbb{R}^n$, we use $a^{(\leq k)}$ to denote the vector consisting of all monomials of degree less than or equal to $k$ in the variables of $a$. Thus, we have $a^{(\leq 2)} = (1, a_1, a_2, \ldots, a_n, a_1^2, a_1 a_2, \ldots, a_n^2)$. We use $a^{(k)}$ (that is, without the $\leq$) to denote the vector consisting of all monomials of degree exactly $k$ in the variables of $a$, so $a^{(2)} = (a_1^2, a_1 a_2, a_1 a_3, \ldots, a_n^2)$. For a matrix $A$, we use $r(A)$ to denote the vector formed by flattening the matrix, so $r(A) = (A_{11}, A_{12}, A_{13}, \ldots, A_{n-1,n}, A_{nn})$. Please see Table 1 for a summary of the notation used in the paper.

7. Auxiliary proofs

7.1. General Results

Lemma 7.1. Consider a sequence of i.i.d. random vectors $V_1, V_2, \ldots, V_m \in \mathbb{R}^d$ with mean $\mu$ such that $\|V_i\| < L$ always. Then for $m > \frac{dL^2}{\epsilon^2\delta}$, we have $\mathbb{P}\left[\left\|\frac{1}{m}\sum_{i=1}^m V_i - \mu\right\| < \epsilon\right] > 1-\delta$.

Proof. By a standard Hoeffding bound, with probability $1-\delta$ each entry of $\frac{1}{m}\sum_{i=1}^m V_i$ is within $\frac{\epsilon}{\sqrt{d}}$ of its mean. Then the squared norm of the error, $\left\|\frac{1}{m}\sum_{i=1}^m V_i - \mu\right\|^2$, is less than $\epsilon^2$.

Lemma 7.2 ((Vershynin, 2011)). The covering number at scale $\epsilon$ of the Euclidean ball of radius $R$ in $\mathbb{R}^d$ is at most $\left(1 + \frac{2R}{\epsilon}\right)^d$.

Lemma 7.3. Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $x \in \mathbb{R}^n$ be a column vector. Then we have $\|Ax\|_2^2 = r(A^\top A)^\top x^{(2)}$.

Proof. This follows by writing the squared norm as a quadratic form.

Lemma 7.4. Consider a diagonal matrix $\Phi \in \mathbb{R}^{n \times n}$. Then there exists a fixed matrix $V \in \mathbb{R}^{n^2 \times n}$ such that $r(\Phi) = V \mathrm{diag}(\Phi)$.

Proof. To construct $V$, we take the $n \times n$ identity matrix and insert $n-1$ rows of all zeros between every pair of consecutive rows, obtaining a matrix $V \in \mathbb{R}^{n^2 \times n}$ that places the $i$th diagonal entry of $\Phi$ at position $(i-1)n + i$ of the flattened vector.

Lemma 7.5 ((Sullivant et al., 2010); (Uhler et al., 2013)). For the class of linear SEMs defined in Equation 1, we have $\Sigma_\theta = J(I - B_\theta)^{-1}\Xi_\theta(I - B_\theta)^{-\top}J^\top$, where $\Xi_\theta = \mathbb{E}_\theta[\eta\eta^\top]$ and $J$ is the identity matrix concatenated with one column of zeros at the index corresponding to $v_t$. Since $B_\theta$ is nilpotent, $(I - B_\theta)^{-1} = \sum_{k=0}^{n-1} B_\theta^k$, so that $\Sigma_\theta = J\left(\sum_{k=0}^{2n-2}\ \sum_{\substack{r+s=k \\ r,s \geq 0}} B_\theta^r\, \Xi_\theta\, (B_\theta^s)^\top\right) J^\top$.

9.
Intervention Models

9.1. General Interventions

Lemma 9.1. For some matrix $U_\theta$, we have
$\Phi^\top \Sigma_\theta \Phi f^* - \Phi^\top\left(\Sigma_\theta \beta + J(I - B_\theta)^{-1} e_t\right) = U_\theta\, r(\Phi)^{(\leq 2)}.$

Proof. We will see that both of these terms can be written as $U_\theta r(\Phi)^{(\leq 2)}$ for some (different) choices of $U_\theta$. We enumerate the coordinates of $r(\Phi)^{(\leq 2)}$ by pairs $((ij),(kl))$, the corresponding entry being $\Phi_{ij}\Phi_{kl}$. Note that $r(\Phi)^{(\leq 2)}$ also contains the individual entries $\Phi_{ij}$, and we use the symbol $\star$ to indicate this extra alphabet; some coordinates are then of the form $r(\Phi)^{(\leq 2)}_{(\star,(kl))} = \Phi_{kl}$. The first term is quadratic in the entries of $\Phi$: collecting coefficients gives $\Phi^\top \Sigma_\theta \Phi f^* = U^1_\theta r(\Phi)^{(\leq 2)}$, where $(U^1_\theta)_{a,((ij),(kl))} = \Sigma_{jk}(f^*)_l$ when $a = i$ and neither index is $\star$, and $0$ otherwise. The second term is linear in the entries of $\Phi$, and hence almost in the desired form already: $\Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right) = U^2_\theta r(\Phi)^{(\leq 2)}$, where the only nonzero entries of $U^2_\theta$ sit in the $\star$-coordinates and are given by entries of $\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t$. Our result follows by setting $U_\theta = U^1_\theta + U^2_\theta$, with the sign of the second term absorbed into $U^2_\theta$.

Theorem 9.2. Suppose that we are given $O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over the intervention index set $\Theta$. Then with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, the following statement holds: $\mathbb{P}_{\theta \sim D}\left(I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \not\subseteq I^{\epsilon}_{f}(\theta)\right) \leq \delta$. In other words, an $\epsilon$-approximately invariant representation on the training environments generalizes with high probability to the test environment.

Proof. We again begin with our expression for the gradient of the loss given by Lemma 8.6:
$\nabla R^{\theta}_{\Phi}(f^*) = \Phi^\top\Sigma_\theta\Phi f^* - \Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right).$
We show in Lemma 9.1 that we can write this as $\nabla R^{\theta}_{\Phi}(f^*) = U_\theta r(\Phi)^{(\leq 2)}$ for some matrix $U_\theta$. From Lemma 7.3 we can write the squared norm as $\left\|\nabla R^{\theta}_{\Phi}(f^*)\right\|_2^2 = r(U_\theta^\top U_\theta)^\top \left(r(\Phi)^{(\leq 2)}\right)^{(2)}$. Finally, we can now write our function class as a class of halfspace classifiers:
$g^{\epsilon}_{\Phi}(\theta) = \mathbb{1}\left\{ r(U_\theta^\top U_\theta)^\top \left(r(\Phi)^{(\leq 2)}\right)^{(2)} \leq \epsilon \right\}.$
Now consider the map $\Psi : \mathrm{diag}(\Phi)^{(\leq 4)} \mapsto \left(r(\Phi)^{(\leq 2)}\right)^{(2)}$. Because $\Phi$ is diagonal, this is well defined: each term of $\left(r(\Phi)^{(\leq 2)}\right)^{(2)}$ contains at most 4 entries of $\Phi$, and each such product is an element of $\mathrm{diag}(\Phi)^{(\leq 4)}$.
Also, since the relation between the entries of $\left(r(\Phi)^{(\leq 2)}\right)^{(2)}$ and the entries of $\mathrm{diag}(\Phi)^{(\leq 4)}$ is fixed, $\Psi$ is actually a linear transformation (that is, it is some sparse 0/1 matrix). We can then write
$g^{\epsilon}_{\Phi}(\theta) = \mathbb{1}\left\{ r(U_\theta^\top U_\theta)^\top \Psi\, \mathrm{diag}(\Phi)^{(\leq 4)} \leq \epsilon \right\}.$
Finally, note that $\mathrm{diag}(\Phi)^{(\leq 4)} \in \mathbb{R}^{(n+1)^4}$; that is, we have written the function class $G$ as corresponding to some subset of all halfspace classifiers in $\mathbb{R}^{(n+1)^4}$. By Lemma 3.2, we know that the VC dimension of $G$ is bounded by $(n+1)^4$. The result follows from Corollary 3.4.

9.2. Atomic interventions

Here we consider the instance in which our family of interventions consists of atomic interventions on a total of $k$ nodes $At$. An atomic intervention at node $i$ replaces the generative equation $X_i = \gamma_i(e)^\top \mathrm{Pa}_i + \eta_i$ by the assignment $X_i = a_i$ for some scalar $a_i$. Note that we can interpret this as zeroing out the corresponding entries of $B$ and changing the noise variable to be the constant $a_i$.

Lemma 9.3. For some matrix $U$, we have
$\left\| \Phi^\top\Sigma_\theta\Phi f^* - \Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right) \right\|_2^2 = r(\Phi)^{(\leq 4)\top}\, U\, a^{(\leq 2)}.$

Proof. We will examine the dependence of $\mathbb{E}_\theta[X_{-t}X_{-t}^\top]$ and $\mathbb{E}_\theta[X_{-t}X_t]$ on $a^{\theta}_i$, starting from Lemma 7.5. For the former, note that $(\Sigma_\theta)_{ij}$ can be interpreted as having one term corresponding to every path from $i$ backwards to some node $k$ and then forwards to $j$. Because atomic interventions essentially cut off the connections between a node and its parents, any path that consists initially of backward edges and then of forward edges can have at most a quadratic dependence on the parameters of the intervention, regardless of how many nodes are intervened on. It is only such paths that contribute to the covariance matrix, and thus to the thresholded polynomials that determine approximate invariance. We will make this explicit below.
The nature of atomic interventions is that $B_\theta = B_\Theta = B I_\Theta$ for each of the interventions $\theta \in \Theta$, where $I_\Theta$ is an identity matrix with zeros in the diagonal entries corresponding to $At$ (we are disconnecting each of the sites of intervention from their parents). In addition, the noise covariance changes. Previously, with independent unit-variance noise, the covariance was $\Xi_\theta = \mathbb{E}_\theta[\eta\eta^\top] = I$. Since we are incorporating the assignments of the atomic interventions into the noise, it is now given by
$(\Xi_\theta)_{ij} = \begin{cases} 1, & i = j,\ i \notin At \\ a^{\theta}_i a^{\theta}_j, & i, j \in At \\ 0, & \text{otherwise.} \end{cases}$
That is, we have replaced the submatrix corresponding to the sites of the interventions with $a^{\theta}(a^{\theta})^\top$, where $a^{\theta}$ denotes the vector of assignments $a^{\theta} = (a^{\theta}_1, a^{\theta}_2, \ldots, a^{\theta}_{|At|})$. For the covering argument: if $\mathrm{dist}(\Phi, \Phi') \leq \frac{\epsilon}{2n^{3/2}}$, then the corresponding gradient norms differ by at most $\frac{\epsilon}{2n^{3/2}} \cdot n^{3/2} + \frac{\epsilon}{2n^{3/2}} \cdot n \leq \epsilon$.

Putting these together,

Lemma 10.3. Given $\frac{4dL^2}{\epsilon^2\delta} + \log N_{\epsilon/2}$ samples in a dataset, with probability $1-\delta$, for every representation $\Phi$ simultaneously we have $\hat{I}^{\epsilon/2}_{f}(\theta) \subseteq I^{\epsilon}_{f}(\theta)$.

Proof. Consider $\Phi \in \hat{I}^{\epsilon/2}_{f}(\theta)$. From the definition, $\left\|\hat\nabla R^{\theta}_{\Phi}(f^*)\right\| < \frac{\epsilon}{2}$. By Lemma 7.2 and Lemma 10.1, we know that $\left\|\nabla R^{\theta}_{\Phi}(f^*) - \hat\nabla R^{\theta}_{\Phi}(f^*)\right\| < \frac{\epsilon}{2}$. Putting these together, we see that $\left\|\nabla R^{\theta}_{\Phi}(f^*)\right\| \leq \epsilon$, i.e., $\Phi \in I^{\epsilon}_{f}(\theta)$.

Theorem 10.4. Given
$O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of hard interventions on $k$ nodes,
$O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of soft interventions on $k$ nodes,
$O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of arbitrary interventions,
and $\frac{4nL^2}{\epsilon^2\delta} + n^2\log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples in each dataset, we have that with probability $1-\delta$, with probability $1-\delta$ for $\theta \sim D$, $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{2\epsilon}_{f}(\theta)$.

Proof. From Lemma 4.2 we know that
$\hat{I}^{\epsilon}_{f}(\theta) \subseteq I^{2\epsilon}_{f}(\theta). \qquad (10)$
With a union bound over $\Theta_{\mathrm{train}}$, we can establish Equation (10) simultaneously over all interventions in the training set. Since $I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) = \bigcap_{\theta \in \Theta_{\mathrm{train}}} I^{\epsilon}_{f}(\theta)$ and $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) = \bigcap_{\theta \in \Theta_{\mathrm{train}}} \hat{I}^{\epsilon}_{f}(\theta)$, we get that $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}_{f}(\Theta_{\mathrm{train}})$. Finally, from Theorems 3.7, 3.8, and 3.9, we have $I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}_{f}(\theta)$ with high probability over $\theta \sim D$, for each of the cases listed.

11.
Empirical study

Construction of training and test interventional distributions: For the class of hard interventions, we consider assignments drawn from Gaussians on nodes $\{v_3, v_4, v_5\}$. Specifically, we take $X_i = a_i$ for $i \in \{3, 4, 5\}$ with $a_i \sim N(0, \sigma^2)$ independently for each node. For the class of soft interventions, we draw each nonzero entry of $\beta^{\theta}_i$ from $N(0, \sigma^2)$ for $i \in \{3, 4, 5\}$ independently. In our experiments, $\sigma^2 = 1$. In this way, we generate $m$ atomic-interventional datasets $\{N_{j,\mathrm{hard}}\}_{j \in [m]}$ and $m$ soft-interventional datasets $\{N_{j,\mathrm{soft}}\}_{j \in [m]}$ for $m$ varying from 3 to 20.

Subsets as Representations: We iterate over the power set of the nodes to make a list of approximately invariant representations. For every subset $S$ of the non-target nodes, we take $\Phi_S$ to be the corresponding diagonal representation, with a one at each index in $S$ (a projection onto $S$). We count the fraction of these representations that are approximately invariant, as defined below.

Approximately Invariant Representations: Here, these are defined as representations for which there is little variation across the least-squares solutions on top of them. For every such representation $\Phi_S$, we denote by $f^*_{S,j}$ the least-squares solution on top of the representation $\Phi_S$ for dataset $N_j$. We take as a measure of invariance the quantity
$\rho_S = \frac{\sum_{j_1, j_2 \in [m]} \left\| f^*_{S,j_1} - f^*_{S,j_2} \right\|^2}{(m-1) \sum_{j \in [m]} \left\| f^*_{S,j} \right\|^2}.$

Figure 5. Comparison of generalization for various combinations of training and test interventional distributions. Here hard refers to the i.i.d.-interventions setting with hard interventions from the same distribution in both training and test interventions; similarly, soft refers to i.i.d.
soft interventions, while hard+rad (respectively soft+rad) refers to hard interventions (respectively soft) in training with Rademacher interventions in test. We repeat the above for $N = 10{,}000$, $15{,}000$, $30{,}000$, and $45{,}000$.

Note that this is the ratio of the average distance between the various least-squares solutions and the norm of the average least-squares solution. We expect this to be low when the various least-squares solutions are close to one another. An approximately invariant subset $S$ is then taken to be one such that $\rho_S < 0.02$. Let $s_{\mathrm{hard}}, s_{\mathrm{soft}}$ denote the approximately invariant subsets given the training sets $\{N_{j,\mathrm{hard}}\}_{j \in [m]}, \{N_{j,\mathrm{soft}}\}_{j \in [m]}$.

Test distributions: We construct $m_{\mathrm{test}}$ test datasets induced by interventions drawn from each of the same families, that is, $\{N^{\mathrm{test}}_{j,\mathrm{hard}}\}_{j \in [m_{\mathrm{test}}]}$ for atomic interventions and $\{N^{\mathrm{test}}_{j,\mathrm{soft}}\}_{j \in [m_{\mathrm{test}}]}$ for soft interventions, as well as a set of datasets $\{N^{\mathrm{test}}_{\alpha,j,\mathrm{rad}}\}_{j \in [m_{\mathrm{test}}]}$ drawn from the linear SEM constructed by flipping each edge weight in the original SEM with probability $\alpha$ (except the ones connecting the parents of the target to the target). Note that this last set of datasets is not drawn from interventions distributed as the training interventions. For each set in $s_{\mathrm{hard}}$, we return the percentage of datasets in $\{N^{\mathrm{test}}_{j,\mathrm{hard}}\}$, $\{N^{\mathrm{test}}_{j,\mathrm{soft}}\}$, and $\{N^{\mathrm{test}}_{\alpha,j,\mathrm{rad}}\}$ on which the least-squares solution on top of the subset continues to be approximately the same. The labels of the plots list the training and test interventional distributions in order, so soft+rad(0.5) means we are evaluating $s_{\mathrm{soft}}$ on 100 datasets $\{N^{\mathrm{test}}_{0.5,j,\mathrm{rad}}\}$.

ERM generalization: To demonstrate that generalization in this sense is indeed non-trivial, we also plot the fraction of least-squares solutions on the observational datasets (where we pool together the training distributions) that exhibit low variance on the test interventional distributions.
This box plot is essentially trivial at 0.

Norm of gradient: Rather than using an RMS variance, we can use the norm of the gradient. The empirical results are similar. In fact, the two criteria are related to each other: since $\nabla R_1(f_1) = \nabla R_2(f_2) = 0$, if $\frac{\|f_1 - f_2\|}{2} \leq \epsilon$, then $\left\|\nabla R_1\left(\frac{f_1+f_2}{2}\right)\right\|, \left\|\nabla R_2\left(\frac{f_1+f_2}{2}\right)\right\| \leq \beta\epsilon$, where $\beta$ is the smoothness of the loss functions.
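The stated relation between the variance criterion and the gradient-norm criterion can be sanity-checked numerically for smooth quadratic risks; a sketch under our own toy construction (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two beta-smooth quadratic risks R_i(f) = 0.5 (f - fi*)^T A_i (f - fi*) with
# minimizers f1 and f2. If ||f1 - f2|| / 2 <= eps, the gradient of either risk
# at the midpoint is at most beta * eps, where beta bounds the curvature.
def make_quadratic(dim):
    M = rng.normal(size=(dim, dim))
    A = M @ M.T + np.eye(dim)        # positive-definite curvature matrix
    fstar = rng.normal(size=dim)     # minimizer of this risk
    return A, fstar

dim = 4
A1, f1 = make_quadratic(dim)
A2, _ = make_quadratic(dim)
f2 = f1 + 0.01 * rng.normal(size=dim)   # nearby minimizer for the second risk

g1 = lambda f: A1 @ (f - f1)            # gradients of the two risks
g2 = lambda f: A2 @ (f - f2)

beta = max(np.linalg.eigvalsh(A1).max(), np.linalg.eigvalsh(A2).max())
eps = np.linalg.norm(f1 - f2) / 2
mid = (f1 + f2) / 2
bound_ok = max(np.linalg.norm(g1(mid)), np.linalg.norm(g2(mid))) <= beta * eps
```

The inequality follows because $\|\nabla R_1(\mathrm{mid})\| = \|A_1(\mathrm{mid} - f_1)\| \leq \beta \|f_1 - f_2\|/2$, so close-together minimizers (low inter-dataset variance) force small gradients at the averaged solution, and vice versa.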