# PAC Generalization via Invariant Representations

Advait Parulekar¹, Karthikeyan Shanmugam², Sanjay Shakkottai¹

## Abstract

Invariant representations are transformations of the covariates such that the best model on top of the representation is invariant across training environments. In the context of linear Structural Equation Models (SEMs), invariant representations might allow us to learn models with out-of-distribution guarantees, i.e., models that are robust to interventions in the SEM. To address the invariant representation problem in a finite sample setting, we consider the notion of $\epsilon$-approximate invariance. We study the following question: if a representation is approximately invariant with respect to a given number of training interventions, will it continue to be approximately invariant on a larger collection of unseen intervened SEMs? Inspired by PAC learning, we obtain finite-sample out-of-distribution generalization guarantees for approximate invariance that hold probabilistically over a family of linear SEMs without faithfulness assumptions. We also show how to extend our results to a linear indirect observation model that incorporates latent variables.

## 1. Introduction

A common failure of empirical risk minimization in machine learning is that it is beneficial to exploit spurious correlations to minimize training loss. A classic example comes from (Beery et al., 2018), where we have a dataset with pictures of cows and camels. In most cases, the cows are in pastures and the camels are in deserts, and a classifier that works well on such a dataset generalizes poorly to classifying cows in deserts. How do we prevent a classifier tasked with distinguishing pictures of cows from pictures of camels from using the color of the images?
Certainly, doing so might actually help with the classification. Take an extreme example: rather than just having an indicative background, the species of the animal is annotated in the image itself. In this case, simply learning to interpret the annotation will lead to an excellent classifier. However, such a classifier will perform poorly on a future dataset with incorrect or missing annotations. In asking for such out-of-distribution (OOD) generalization (generalization to different distributions rather than to different samples from the same distribution), we are asking for more than what is guaranteed by traditional PAC learning. For instance, we might have a covariate shift, meaning that the marginal distribution of the covariates differs between the joint distributions of our training and test data. We would like a classifier that is trained on training environments to generalize to the test environments.

Recently, in the OOD generalization literature, the data for the training and test environments have been modeled as arising from a set of causal models that are intervened versions of each other. The true causal relationship between a target variable and its causal parents remains invariant, while other relationships may change (Peters et al., 2016; Bareinboim et al., 2012). Building on this observation, the authors of (Arjovsky et al., 2019) proposed to learn a representation across the training environments such that the optimal classifier on top of it remains the same. We refer to this as the invariant representation in this work. Will this representation and the predictor generalize (remain invariant) in unseen environments?

*¹Department of Electrical and Computer Engineering, University of Texas at Austin. ²Google Research India. Correspondence to: Advait Parulekar. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*
Does such a formulation result in improvements over other methods such as empirical risk minimization or distributionally robust optimization? There has been some debate about this matter. Some theoretical works (Rosenfeld et al., 2020; Kamath et al., 2021) describe regimes and settings in which either the convex penalty suggested by (Arjovsky et al., 2019) or the invariance formulation in general is not sufficient to generalize to new environments. On the other hand, there are also settings described in (Arjovsky et al., 2019; Ahuja et al., 2020; Kamath et al., 2021) in which IRM does provably lead to OOD generalization guarantees that are not possible using other methods. However, these prior studies require strong conditions such as faithfulness (Yang et al., 2018; Pearl, 2009) that come from the structure learning community. Indeed, as noted in (Uhler et al., 2013), faithfulness (that observed invariances imply structural constraints on the causal graph) is a very strong assumption in the finite sample setting. This is because the volume of the set of linear SEMs that are close to ones with faithfulness violations is a large constant fraction of all linear SEMs. This makes it difficult to resolve, with finite samples, whether an observed conditional independence truly holds due to the underlying causal structure. Invariance-based results almost always study the infinite sample setting and assume general position conditions on the training environments; like the faithfulness assumption, these become much stronger assumptions in the finite sample setting. Thus, in the worst case, perhaps we cannot expect generalization guarantees without exponentially many interventions/samples.
Instead of studying the worst case, suppose we consider the PAC setting: can we now derive finite sample generalization guarantees without the need for strong faithfulness or general position assumptions, yet with a polynomial scaling in interventions/samples?

## 1.1. Contributions

We answer the above affirmatively, and provide the first generalization and associated finite-sample guarantees for invariant representations without faithfulness assumptions. We study a setting in which we have access to a family of linear SEMs related to each other by hard and soft interventions on arbitrary subsets of variables. We assume that the training environments arise by random sampling from a distribution over this family. We derive a PAC bound on the number of interventions needed to get probabilistic generalization on the family of SEMs. A central result is that we need only $O(n^4/\delta)$ training environments, such that approximately invariant representations on the training set generalize with probability $1-\delta$ on the family of unseen linear SEMs. Further, our theory also works for general linear representations beyond feature selection, for indirect and direct observation models (e.g., with latent variables), with a scaling of $O(n^8/\delta)$; please see Appendix 9.4 for details. We show a tighter interventional complexity, independent of $n$, for SEM families that have simultaneous atomic interventions on any fixed set of $k$ nodes, and for soft interventions on $k$ nodes in a degree-bounded setting. Secondly, we characterize the number of samples in each training environment that is needed for approximate invariance to hold with high probability. This gives an end-to-end sample complexity guarantee, along with the interventional complexity required in the above setting, without any faithfulness or general position assumptions. Furthermore, we show extensions of our results to a linear indirect observation model that incorporates latent variables.
Finally, we empirically demonstrate that, in the setting described above and using a notion of generalization that we describe, most approximately invariant representations generalize to most new distributions.

## 1.2. Related Work

**Causal Structure Learning:** One approach, in the presence of data sampled from structural models, is simply to try to learn the DAG structure explicitly. Algorithms that use conditional independence (CI) tests and/or penalized likelihood scoring are used to determine the DAG (Spirtes et al., 2000; Solus et al., 2021; Chickering, 2020; Brouillard et al., 2020). These methods can only learn DAGs up to what is called the Markov Equivalence Class (MEC) (Verma & Pearl, 1990), or I-MECs (Interventional MECs) when given access to general interventions (Hauser & Bühlmann, 2012; Squires et al., 2020; Brouillard et al., 2020; Yang et al., 2018), under some form of faithfulness assumption on the underlying causal model. It has been shown that some forms of the faithfulness assumption are actually stronger than one might expect: linear SEMs that are close to a faithfulness violation form a much larger set (Uhler et al., 2013) and are difficult to distinguish from faithful ones in finite samples. However, given that we only want generalization guarantees for invariance and not structure recovery, one could argue that learning the structure, or even feature selection, is stronger than what we actually need.

**Domain Adaptation Methods:** The domain adaptation literature (Ganin et al., 2016; Muandet et al., 2013; Ben-David et al., 2007) learns representations whose distribution does not change across training and an unlabeled test distribution. Another line of work searches for the best mixture of training distributions to train on, or mixture of pretrained models, for generalization to unlabeled test data or test data with limited test labels (Mansour et al., 2021; 2008).
Distributionally robust optimization approaches that optimize the worst-case risk over an uncertainty set of distributions have also been proposed (Duchi et al., 2021; Sagawa et al., 2019). Please see Appendix A of (Gulrajani & Lopez-Paz, 2020) for a more exhaustive list of works. Robust optimization and multi-source domain adaptation methods search in an uncertainty ball around the training distributions, while domain adaptation methods fail even under a shift in the marginal distribution of the labels (Zhao et al., 2019). Such methods fail when the test distribution is outside the convex hull of the training distributions. (Magliacane et al., 2017) use invariance tests, which require faithfulness assumptions. (Chen et al., 2023) applies a multi-objective optimization perspective to the OOD problem to resolve some of the optimization difficulties of OOD generalization. Finally, the interventional complexity of learning invariant representations is also considered in (Chen et al., 2022) and (Wang et al., 2022). Under the data model of (Rosenfeld et al., 2020), (Chen et al., 2022) show that a smoothed analysis (with distributional assumptions) can achieve logarithmic interventional complexity, while (Wang et al., 2022) use second-order moments to get just O(1) training environments in the infinite sample limit.

**Finite Samples and Invariant Risk Minimization:** One framework proposed to bypass structure learning difficulties and directly look for invariant representations is Invariant Risk Minimization (IRM) (Arjovsky et al., 2019). (Arjovsky et al., 2019) show that, under some general position assumptions on the population covariances of the training distributions, the exact invariant representation recovers the true causal parents. The work of (Ahuja et al., 2020) makes similar assumptions to derive sample complexity guarantees.
(Kamath et al., 2021) also consider the question of getting generalization guarantees from IRM for general distributions, and address the number of training interventions needed for IRM; however, they work in the infinite sample limit with exact invariance and require an analytic mapping between the intervention index set and the training distributions. All these works assume regularity conditions and provide deterministic guarantees for a test distribution.

**Invariant Predictors on DAGs:** Recent works (Subbaswamy et al., 2019) have assumed knowledge of the causal DAG behind the unseen test and/or train distributions, along with information about the nodes that are intervened on in the test. It is then possible to characterize all possible invariant representations as a function of the graph. In our study, we do not assume any such side information about the SEM underlying our distributions. Finally, in (Rothenhäusler et al., 2021), there is a known set of variables (anchors) whose effect is considered to be spurious (in our setting, we do not know these). They study a weighted projection that penalizes estimates that are aligned with the anchor observations. In contrast, we study the IRM objective: rather than being a min-max of residual error over some set of distributions, our objective encourages representations that lead to invariant least squares solutions.

## 2.1. Structural Equation Model

We consider a causal generative model specified by a linear Structural Equation Model (SEM): data generated by linear equations on a directed acyclic graph (DAG) of variables. Specifically, we work with a random $(n+1)$-dimensional vector $X = (X_1, X_2, \ldots, X_n, X_t)$. We assume that there is some unknown directed acyclic graph $G$ with nodes $(v_1, v_2, \ldots, v_n, v_t)$ which specifies the relationships between the variables in $X$, so that node $v_i$ corresponds to the random variable $X_i$.
The variables are related to each other by structural equations of the form:

$$X_i^\theta = (\beta_i^\theta)^\top \mathrm{Pa}_i^\theta + \eta_i^\theta = \sum_{j : v_j \in \mathrm{Pa}_i} \beta_{j,i}^\theta X_j^\theta + \eta_i^\theta, \quad (1)$$

$$X_t = \beta_t^\top \mathrm{Pa}_t + \eta_t = \sum_{j : v_j \in \mathrm{Pa}_t} \beta_{j,t} X_j + \eta_t. \quad (2)$$

Here the $\eta_i$ are the exogenous variables in our system, and the $X_i$ are the endogenous variables. $\mathrm{Pa}_i$ denotes the set of random variables (stacked as a column vector) corresponding to the parents of node $v_i$ in $G$. When there is a directed edge from node $v_j$ to $v_i$, the corresponding weight of the linear SEM is $\beta_{j,i}$. There is a target node $X_t$ that is special in that we would like to predict its value given the rest of the variables. We denote the vector consisting of the rest of the variables by $X_{-t}$. We denote by barred quantities the zero-padded versions of the above vectors; for example, $\bar\beta_t$ is such that $\beta_t^\top \mathrm{Pa}_t = \bar\beta_t^\top X_{-t}$.

## 2.2. Interventions

An important feature of Equation (1) is the superscript $\theta$, for $\theta \in \Theta$, an element of some intervention index set $\Theta$. $\Theta$ specifies a parameterized intervention family, i.e., a family of linear SEMs related to each other by interventions. Conceptually, the various environments correspond to different mechanisms by which causal parents influence children. The changes in mechanisms are reflected in the weights $\beta_{j,i}^\theta$ parameterized by $\theta$. Note that the conditional distribution of the target variable $X_t$ given its parents $\mathrm{Pa}_t$ is assumed to be the same for all $\theta \in \Theta$. This is a property that we would like to exploit so that our task of predicting $X_t$ is robust to different interventions in $\Theta$, i.e., there exists at least one robust predictor for all environments. We begin with an observational distribution whose weights we denote simply by $\beta$. There are two common types of interventions studied:

**Atomic or Hard Interventions:** In a hard intervention $\theta$ at some node $v_i$, we assign a value to $X_i$. This corresponds to setting $\beta_{j,i}^\theta = 0$ for all $X_j \in \mathrm{Pa}_i$, and setting $\eta_i^\theta = a_i^\theta$.
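The data-generating process in Equations (1)–(2) and a hard intervention can be sketched in a few lines. This is a minimal illustration, not the paper's code; the matrix convention `B[j, i]` for the weight on the edge $v_j \to v_i$ and the function names are our own:

```python
import numpy as np

def sample_sem(B, noise_std, n_samples, rng):
    """Draw samples from the linear SEM X_i = sum_j B[j, i] X_j + eta_i.

    B[j, i] holds the weight beta_{j,i} on the edge v_j -> v_i; with the
    nodes listed in topological order, B is strictly upper triangular,
    so (I - B) is invertible.  Rows of the returned array are samples.
    """
    n = B.shape[0]
    eta = rng.normal(scale=noise_std, size=(n_samples, n))
    # Row-wise, x = x B + eta, hence x = eta (I - B)^{-1}.
    return eta @ np.linalg.inv(np.eye(n) - B)

def hard_intervention(B, i, a):
    """Atomic intervention do(X_i = a): cut the incoming edges of v_i
    (beta_{j,i} = 0 for all j) and replace eta_i by the constant a."""
    B_new = B.copy()
    B_new[:, i] = 0.0
    return B_new, a
```

For a two-node chain $v_1 \to v_2$ with weight 2, `sample_sem` produces samples with $X_2 = 2X_1 + \eta_2$, and `hard_intervention(B, 1, a)` removes the edge and pins $X_2$ to the constant $a$.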
**Soft Interventions:** In a soft intervention at node $v_i$, the weights are altered, $\beta_i^\theta \neq \beta_i$, while the noise is kept the same, $\eta_i^\theta = \eta_i$.

The SEM parameters $\theta$ determine a joint distribution over $X$, which we denote by $\mathbb{P}^\theta$.

**Distributions over Interventions:** In our model, we consider a distribution $\mathcal{D}$ over $\Theta$ from which we assume that our training and test datasets are drawn, by sampling the intervention index $\theta$ independently and randomly from $\mathcal{D}$. We provide generalization bounds that hold with high probability over the randomness of the choice of test distribution, rather than deterministically as has been done in prior works (as we will explore in more detail, providing deterministic bounds is not possible without additional assumptions). Formally, $\Theta_{\mathrm{train}}$ is an index set of interventions drawn i.i.d. from $\mathcal{D}$ over $\Theta$. The test intervention index $\theta_{\mathrm{test}}$ is also drawn from $\mathcal{D}$ independently. The learner does not see any details about the intervention (which nodes are affected, etc.), but sees the distinct datasets from the distinct interventions.

**Data generation:** In this sense, the training data is generated by first sampling $\Theta_{\mathrm{train}}$ from $\mathcal{D}$, and then sampling $|\Theta_{\mathrm{train}}|$ training datasets $X^\theta \sim \mathbb{P}^\theta$ for $\theta \in \Theta_{\mathrm{train}}$.

## 3. PAC-Invariant Representations

Motivated by Invariant Risk Minimization (IRM), we consider the search for a model that performs well under a variety of distributions, in the hope of attaining generalization guarantees. Consider a prediction function $f$ with loss $R^\theta(f)$, where the superscript $\theta$ represents the intervention index. In this paper we consider least squares regression: $f \in \mathbb{R}^n$ is a vector, and the loss is given by

$$R^\theta(f) = \mathbb{E}^\theta\big[(f^\top X_{-t} - X_t)^2\big], \quad (3)$$

where the expectation is taken with respect to $\mathbb{P}^\theta$. We are interested in generalization guarantees for representations $\Phi$ satisfying the property that

$$f^* = \arg\min_f R^\theta(f \circ \Phi) \quad \forall\, \theta \in \Theta_{\mathrm{train}} \quad (4)$$

is invariant across environments.
That is, the least squares solution on top of the representation is the same for every intervention. We denote the set of all invariant representations over a class of SEMs $\Theta'$ by $I(\Theta')$:

$$I(\Theta') = \{\Phi : \exists f^* \text{ s.t. } f^* = \arg\min_f R^\theta(f \circ \Phi)\ \forall\, \theta \in \Theta'\}.$$

We refer to the full model $f \circ \Phi$, when $\Phi$ is invariant, as an invariant solution. We use $R^\theta_\Phi(f)$ to denote the loss for $f$ on top of the representation $\Phi$, so $R^\theta_\Phi(f) = R^\theta(f \circ \Phi)$. We refer to $f$ as the head of the model. Feature selection, a motivation for the study of invariant representations, corresponds to diagonal representations; henceforth, we focus on the class of representations that are feature selectors, given by diagonal matrices.

**Invariance via gradients:** Rather than working with the loss itself, we follow (Ahuja et al., 2020) and (Arjovsky et al., 2019) and use the gradient of the loss instead (see Lemma 8.5):

$$\Phi \in I(\Theta_{\mathrm{train}}) \iff \exists f \text{ s.t. } \nabla_f R^\theta_\Phi(f) = 0\ \forall\, \theta \in \Theta_{\mathrm{train}}.$$

Unlike previous work, we do not further minimize the weighted empirical loss over training environments. We will instead consider the generalizability of any approximately invariant representation.

**Finite samples and $\epsilon$-approximately invariant representations:** Since we are working with finite samples, we can no longer hope for exact invariance across environments. We slightly change the definition of invariance to require the gradient to be close to 0:

$$\Phi \in I^\epsilon(\Theta_{\mathrm{train}}) \iff \exists f \text{ s.t. } \|\nabla R^\theta_\Phi(f)\|_2 \le \epsilon\ \forall\, \theta \in \Theta_{\mathrm{train}}. \quad (5)$$

We refer to approximate versions of quantities using a superscript $\epsilon$, and, given specific datasets, we refer to finite sample versions of these quantities using hats. For instance, the set of representations that are invariant for some set $\Theta'$ is denoted $I(\Theta')$, the set of $\epsilon$-approximately invariant representations is $I^\epsilon(\Theta')$, and, given a particular dataset, the set of $\epsilon$-approximately invariant representations is denoted $\hat{I}^\epsilon(\Theta')$.
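For least squares, the gradient condition in Equation (5) can be checked directly from the data. The following is a minimal sketch for diagonal (feature-selector) representations; the helper names are ours, and the estimator is the plug-in empirical gradient, not necessarily the paper's exact procedure:

```python
import numpy as np

def grad_risk(X_minus_t, X_t, phi_diag, f):
    """Empirical gradient, w.r.t. the head f, of
    R_Phi(f) = E[(f^T Phi x_{-t} - x_t)^2] for a diagonal Phi."""
    Z = X_minus_t * phi_diag           # rows are Phi x_{-t}
    resid = Z @ f - X_t
    return 2.0 * Z.T @ resid / len(X_t)

def is_eps_invariant(datasets, phi_diag, f, eps):
    """Membership check for I^eps: the gradient at the head f must be
    eps-small in every training environment (Equation (5))."""
    return all(
        np.linalg.norm(grad_risk(Xm, Xt, phi_diag, f)) <= eps
        for Xm, Xt in datasets
    )
```

When the selected features reproduce $\mathbb{E}[X_t \mid \mathrm{Pa}_t]$, the residual is uncorrelated with each selected feature, so the gradient is near zero in every environment; a representation leaning on a spurious feature fails the check in at least one environment.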
Note that the finite sample quantities are random (from the randomness of the samples drawn from $\mathbb{P}^\theta$), while the $\epsilon$-approximate quantities are deterministic. The dataset of $N$ points is denoted in matrix form as $X$, with $X_{-t}$ and $X_t$ denoting the non-target matrix and target vector, respectively.

**Significance of linear representations:** A linear representation $\Phi$ affects invariance non-trivially by selecting an effective column space for the regression, that is, by determining the column space of $\Phi X_{-t}$. Because we are looking for a fixed (across environments) least-squares head on top of the representation, $\Phi$ can be further composed with any invertible linear map, only to be inverted at the head, yielding another invariant solution $(\Phi A, A^{-1} f)$ from an invariant solution $(\Phi, f)$. Because of this freedom, we can actually choose to work with some fixed head $f^*$. When the space of $\Phi$ is all matrices, $f^*$ is chosen to be $(1, 0, \ldots, 0)$. When the space of $\Phi$ is only diagonal matrices, $f^*$ is taken to be $(1, 1, \ldots, 1)$. These choices are formalized in Lemma 8.3 and Lemma 8.4. As we will see, a key construction is that of $I^\epsilon_{f^*}(\theta)$, defined as those $\Phi$ for which the gradient is near zero at $f^*$:

$$\Phi \in I^\epsilon_{f^*}(\theta) \iff \|\nabla R^\theta_\Phi(f^*)\| \le \epsilon,$$

for the specific $f^*$ above. Similarly, for a set of interventions $\Theta'$, we define $I^\epsilon_{f^*}(\Theta') = \bigcap_{\theta \in \Theta'} I^\epsilon_{f^*}(\theta)$. This allows us to view invariance as a property of a representation and a single intervention, rather than of a collection of interventions as in Equation (4). This is discussed further in Remark 3.6.

## 3.1. Motivation for Probabilistic Guarantees

Consider a distribution $\mathcal{D}$ over interventions. During training, we see samples from specific interventions $\Theta_{\mathrm{train}}$ drawn from this distribution, and compute invariant representations for these interventions. Ideally, we would like to be able to say that an invariant solution computed in this way will generalize to all future interventions on the same DAG.
However, this is not possible without further assumptions, as the following examples illustrate.

*Figure 1. Figure for Example 3.1, showing that there can be more independencies in the data than are apparent from the DAG alone.*

**Example 3.1 (Faithfulness).** Consider a linear SEM on four variables $X_1, X_2, X_3, X_t$ given by

$$X_1 = N(0, 1), \quad X_2 = X_1 + N(0, 1), \quad X_3 = a X_1 + b X_2 + N(0, 1), \quad X_t = X_2 + X_3 + N(0, 1),$$

depicted in Figure 1. If $a = -b$ in every intervention, this system has at least the following two invariant representations of $X = (X_1, X_2, X_3)$: $(X_2, X_3)$ and $(X_2)$. However, only the first continues to be invariant when $a \neq -b$. In other words, given an arbitrary number of training environments, each of which satisfies $a = -b$, we might decide that $(X_2)$ is an invariant representation. However, it fails to be invariant once we include a test intervention for which $a \neq -b$. This happens because the joint distribution is not faithful to the DAG provided. In short, the conditional independencies indicated by the DAG are not the only ones found in the data: in our case, the effects of $X_1$ on $X_3$ through the blue path and the red path exactly cancel in all training interventions.

**Example 3.2 (Degenerate Interventions).** Consider the degenerate situation in which each of the interventions is the same. The Empirical Risk Minimization (ERM) solution itself is invariant; however, the ERM solution comes with no generalization guarantees for other interventions.

It is clear that some diversity condition among the training environments is necessary to get generalization guarantees. The key idea is that in such situations one might say that these issues (faithfulness violations as in Example 3.1, or degeneracies as in Example 3.2) will likely continue to manifest in future distributions.
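The path cancellation in Example 3.1 is easy to reproduce numerically: the direct path $X_1 \to X_3$ (weight $a$) and the path through $X_2$ (weight $b$) give $\mathrm{Cov}(X_1, X_3) = (a+b)\,\mathrm{Var}(X_1)$, which vanishes exactly when $a = -b$. A small simulation of our own, not from the paper:

```python
import numpy as np

def cov_x1_x3(a, b, n=200_000, seed=0):
    """Simulate the SEM of Example 3.1 and return the empirical
    Cov(X1, X3); the two paths from X1 to X3 cancel when a = -b."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(size=n)
    x3 = a * x1 + b * x2 + rng.normal(size=n)
    return float(np.cov(x1, x3)[0, 1])
```

With $a = 1, b = -1$ the empirical covariance is statistically indistinguishable from zero even though the DAG contains the edge $X_1 \to X_3$, which is exactly why a finite-sample CI test cannot distinguish this unfaithful model from one with no such edge.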
In particular, if we assume that both the training and test interventions come from a single distribution $\mathcal{D}$ over interventions, can we at least say that, with high probability over a test intervention also drawn from $\mathcal{D}$, we will generalize to that intervention? More formally, we ask: how big should $\Theta_{\mathrm{train}}$ be so that, for $\theta \sim \mathcal{D}$, $I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \subseteq I^\epsilon_{f^*}(\theta)$ with high probability? In other words, we would like invariance on our training interventions to certify invariance on a future intervention with high probability. To further clarify the type of guarantee considered, we look at the following example.

**Example 3.3.** Consider the DAG given in Figure 2, a chain $X_1 \to X_2 \to \cdots \to X_{n-1} \to X_n$.

*Figure 2. DAG for the intervention in Example 3.3.*

Consider the following distribution over interventions: we set $X_1 = 1$. Independently, and with probability $1/2$ for each node that is not $X_1$ or $X_t$, we assign its value to be 0; otherwise, we set its value equal to its parent. The only invariant representations in this case are $(X_n)$ and $(X_1, X_n)$. Now consider any training set of $m$ interventions. If $m \ll 2^n$, we expect never to see the intervention in which none of the edges are disconnected (i.e., an environment with no intervention on any of $X_2, X_3, \ldots, X_{n-1}$), in which case $(X_1)$ also appears to be invariant over every observed intervention. However, it fails to be invariant on the disconnected intervention described above, which occurs with an extremely low probability of $1/2^{n-1}$. Our result instead gives a probabilistic guarantee: with a certain probability over the randomness from which we are drawing interventions, any representation that appears to be invariant over the training set will also be invariant when the set is expanded to include another i.i.d. intervention (the test intervention). As this example demonstrates, we may need $\Theta(2^n)$ interventions to get deterministic guarantees, while our $\mathrm{poly}(n)$ result highlights the benefit of a probabilistic guarantee in this setting.
## 3.2. PAC Formulation

In this section, we show that the question of generalization of $\epsilon$-invariant representations introduced in the previous section can be rewritten as a question of PAC generalization.

### 3.2.1. VC Dimension and PAC Generalization

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a subset of points. A function class $\mathcal{G}$ is said to shatter $\mathcal{X}$ if every binary assignment to $\mathcal{X}$ is realized by some element of $\mathcal{G}$; that is, for any $\sigma : \mathcal{X} \to \{\pm 1\}$, there is some $f \in \mathcal{G}$ such that $f(x) = \sigma(x)$ for all $x \in \mathcal{X}$. The VC dimension of a function class is the size of the largest set that it shatters. The following result from PAC learning theory allows us to determine how many interventions are needed to generalize.

**Lemma 3.1** ((Shalev-Shwartz & Ben-David, 2014)). Consider a class $\mathcal{G}$ of functions from $\mathcal{X}$ to $\{\pm 1\}$ of VC dimension $D$, and a distribution $\mathcal{D}$ over $\mathcal{X} \times \{\pm 1\}$. In the realizable setting, when there exists $f \in \mathcal{G}$ such that $\mathbb{P}_{(X,Y) \sim \mathcal{D}}[f(X) \neq Y] = 0$, given $m = O\big(\frac{D + \log(1/\delta_1)}{\delta_2}\big)$ samples $(x_i, y_i)_{i=1}^m$ and any $\hat f$ such that $\hat f(x_i) = y_i$ for all $i$, with probability at least $1 - \delta_1$, $\hat f$ satisfies $\mathbb{P}_{(X,Y) \sim \mathcal{D}}[\hat f(X) \neq Y] < \delta_2$.

We will need the following known VC dimension.

**Lemma 3.2** ((Shalev-Shwartz & Ben-David, 2014)). The VC dimension of halfspaces in $\mathbb{R}^D$, that is, of the function class $\mathcal{G} = \{\mathbf{1}\{a^\top x < b\} : a \in \mathbb{R}^D, b \in \mathbb{R}\}$, is $D + 1$.

### 3.2.2. PAC Invariance

We will use PAC learning theory to study generalization for representations. For that, we rephrase the invariance problem as one about the generalization of binary classifiers. The proof is deferred to the Appendix, Lemma 8.7.

**Lemma 3.3.** There is a function class $\mathcal{G}$ mapping interventions $\theta \in \Theta$ to $\{0, 1\}$ such that $\Phi \in I^\epsilon_{f^*}(\Theta_{\mathrm{train}})$ iff there exists $g^\epsilon_\Phi \in \mathcal{G}$ such that $g^\epsilon_\Phi(\theta) = 1$ for all $\theta \in \Theta_{\mathrm{train}}$.

In summary, if we can bound the VC dimension of $\mathcal{G}$, we can give interventional complexity guarantees. The proof is deferred to the Appendix, Corollary 8.8.

**Corollary 3.4.**
If the VC dimension of $\mathcal{G}$ is bounded by $D$, then given at least $O\big(\frac{D + \log(1/\delta)}{\delta'}\big)$ training interventions, if $I^\epsilon(\Theta) \neq \{\}$, we have with probability $1 - \delta$ over the set of training interventions that $\mathbb{P}_{\theta \sim \mathcal{D}}\big(I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \not\subseteq I^\epsilon_{f^*}(\theta)\big) \le \delta'$.

Finally, there is at least one truly invariant representation, as the following lemma shows.

**Lemma 3.5.** There exists an invariant representation.

*Proof.* The representation $\Phi = \mathrm{diag}(\bar\beta_t)$ is invariant; that is, the diagonal matrix with the entries of $\beta_t$ on the diagonal at the indices corresponding to $\mathrm{Pa}_t$.

**Remark 3.6** (Fixing a head is important for PAC learning). Considering a fixed head $f^*$ allows us to certify invariance locally, that is, by looking at only a single intervention at a time. This is in contrast with the previous characterization, Equation (4), in which we simply ask for the best head on top of a representation to be the same for all interventions, a condition that we can only check by considering all interventions simultaneously.

## 3.3. The VC Dimension of Certain Classes of Interventional Distributions

Further, we show that for atomic interventions on some fixed set of $k$ nodes, $O(k^4)$ training interventions suffice. For soft interventions on $k$ nodes, assuming that each node has in-degree bounded by $d$, we show that $O(d^4 k)$ interventions suffice. This captures the intuition that if there is only some sparse, local change in the SEM, it should only take a number of interventions that scales with the sparsity to capture the invariances it leaves. In the next section, we show how to extend these results to the finite sample setting.

### 3.3.1. General Interventions

In this section, we approach the task of bounding the VC dimension from the perspective of the complexity of the representation space. The class of interventions we consider is any intervention that leads to a joint distribution over the variables that leaves fixed the conditional distribution of the target variable given its parents.

**Theorem 3.7.**
Suppose that we are given $O\big(\frac{n^4 + \log(1/\delta)}{\delta'}\big)$ interventions drawn independently from a distribution $\mathcal{D}$ over the intervention index set $\Theta$. Then with probability at least $1 - \delta$ over the randomness in $\Theta_{\mathrm{train}}$, the following holds: $\mathbb{P}_{\theta \sim \mathcal{D}}\big(I^\epsilon_{f^*}(\Theta_{\mathrm{train}}) \not\subseteq I^\epsilon_{f^*}(\theta)\big) \le \delta'$. In other words, an $\epsilon$-approximately invariant representation on the training environments generalizes with high probability to the test environment.

**Connections to Causal Bayesian Networks:** The invariance principle, also (less popularly) known as the modularity condition, can equivalently be used to define causal Bayesian networks (Bareinboim et al., 2012), the central object in Pearlian causal models. With respect to all possible interventions in which variable $y$ has not been intervened on, only the representation that involves the true causal parents of $y$ is invariant. In other words, the property of invariance can be taken to be the signature of causality. An open question in this setting has been the following: in how many interventional environments does this invariance property need to hold before it generalizes to most unseen environments? With respect to random sampling of interventional environments for linear SEMs, our result answers this via high probability generalization guarantees without any faithfulness assumptions.

**Structure Learning and Faithfulness:** Invariance testing between a given set of interventional environments has been used in the literature to constrain the space of causal models (Yang et al., 2018). However, the learning algorithm that synthesizes the various invariances does require some notion of interventional faithfulness, i.e., that observed invariances imply topological constraints on the true causal graph. In contrast, we do not make any such faithfulness assumptions.

**Comparison to Regularity Conditions in Invariant Prediction:** Recently, invariant prediction (Arjovsky et al., 2019) has been used for out-of-distribution generalization.
However, what was desired there was exact invariance holding deterministically in all unseen distributions from a family of linear SEMs. This required some general position conditions on the population covariances of the training interventional environments. Despite these additional conditions, the number of environments required was $O(n^2)$. Our result on the interventional complexity is weaker, but it requires no such assumptions on covariances and holds under random sampling for approximate invariances. Similar technical general position conditions were also required for generalization in (Ahuja et al., 2020). In another recent work (Kamath et al., 2021), if the mapping between the intervention index set $\Theta$ and the observed training distributions on $X$ is analytic, then only two training environments (almost surely) suffice for exact invariance in the population setting (i.e., infinite samples) for the entire index set. Here, we analyze approximate invariance without any restriction that the mapping be analytic. For instance, their result does not apply in Example 3.3, where we need two specific interventions (one in which each node is assigned the value of its parent, and any other) to certify invariance. In fact, any set of interventions needs to contain the specific intervention highlighted in the example to certify invariance, and this happens with very low probability.

### 3.3.2. Atomic Interventions

Here we consider the instance in which our family of interventions consists of atomic interventions on a total of $k$ nodes $A_t$. An atomic intervention at node $i$ replaces the generative equation $X_i = \beta_i^\top \mathrm{Pa}_i + \eta_i$ by the assignment $X_i = a_i$ for some scalar $a_i$. Note that we can interpret this as zeroing out the corresponding entries of $B$ and changing the noise variable to be the constant $a_i$.

**Theorem 3.8.**
Given $|\Theta_{\mathrm{train}}| = O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over all atomic interventions on some fixed set $At$ of $k$ nodes, we have that with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

3.3.3. SOFT INTERVENTIONS

Soft interventions are those in which the weights of the SEM are modified while the underlying causal structure remains the same. That is, a soft intervention at node $v_i$ replaces the equation $X_i = (\beta_i)^\top X + \eta_i$ with $X_i = (\beta^{\theta}_i)^\top X + \eta_i$. Note that this is also equivalent to changing the $i$th row of $B^{\theta}$.

Theorem 3.9. Given $O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over all soft interventions on some fixed set $At$ of $k$ nodes such that each intervened node has in-degree at most $d$, we have that any $\epsilon$-approximately invariant representation $\Phi$ satisfies, with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

3.3.4. INDIRECT OBSERVATIONS

Finally, we consider the setting in which there is an underlying SEM that we observe only after a linear transformation. In particular, we consider the setting in which we observe $(S_{-t}(X), S_t(X))$ for some linear transformations $S_{-t}$ and $S_t$. Here, $X$ is generated as before, according to Equation 1. Note that the target variable is now $S_t(X)$. For realizability (in particular, for Lemma 3.5), we assume the following:

Assumption 3.10. There is some linear representation $\Phi$ such that $\Phi S_{-t}(X) = \mathrm{Pa}_t$.

Theorem 3.11. Given $O\left(\frac{n^8 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$, we have that any $\epsilon$-approximately invariant representation $\Phi$ satisfies, with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, with probability at least $1-\delta$ over the randomness in $D$, for $\theta \sim D$, $I^{\epsilon}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}(\theta)$.

Remark 3.12.
Note that Theorem 3.11 encompasses models that have latent variables, so long as the causal parents are observed. The existence of latent variables typically precludes any guarantees for algorithms that rely on CI-testing; one could consider this a factor in favor of using IRM-based solutions for OOD generalization.

Remark 3.13. Note that we only need Assumption 3.10 to be satisfied, not that the functions $S_{-t}$ and $S_t$ themselves be linear. This is weaker than a linearity assumption, since we could have non-linearity in the spurious components.

4. Finite samples

The discussion above concerned population statistics. In practice, we only see finitely many samples from each interventional distribution. We use an estimator similar to that of (Ahuja et al., 2020) to obtain finite-sample guarantees. We need the following assumption for scale.

Assumption 4.1. The following bounds hold: $\|\Phi\|_2 \leq 1$ and $\|X\| < L$.

Lemma 4.2. Given $\frac{4nL^2}{\epsilon^2\delta} + n^2 \log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples from $\theta$, we have that, with probability $1-\delta$ over the samples drawn in each interventional distribution¹, $\hat{I}^{\epsilon}_{f}(\theta) \subseteq I^{2\epsilon}_{f}(\theta)$.

Proof Sketch. We show this using standard concentration arguments for a single fixed representation $\Phi$. We then take a union bound over an $\epsilon$-net (in $\Phi$) of the gradients $\nabla R^{\theta}_{\Phi}$ to get a uniform bound over all representations. See Lemma 10.3 for a proof.

¹In contrast to the claims over the randomness $D$ over $\Theta$ in the previous theorems, this is a statement about the randomness of $X$ under $\theta$.

We can now take a union bound over all $\theta \in \Theta_{\mathrm{train}}$. In conclusion, we have shown that, with high probability over the randomness in the samples we see in each of our interventional distributions, $\epsilon$-approximate invariance over the training data certifies $\epsilon$-approximate invariance over a $1-\delta$ fraction of our full intervention set. The proof is deferred to Appendix Theorem 10.4.

Theorem 4.3.
Given
$O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of hard interventions on $k$ nodes,
$O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of soft interventions on $k$ nodes,
$O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of arbitrary interventions,
and, in each dataset, at least $\frac{4nL^2}{\epsilon^2\delta} + n^2\log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples, we have that with probability $1-\delta$, with probability $1-\delta$ for $\theta \sim D$, $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{2\epsilon}_{f}(\theta)$.

5. Empirical Study

In this section we illustrate the above results empirically. The experiment is described in detail in Appendix 11.

Figure 3. Linear SEM for Section 5. Nodes $\{v_3, v_4, v_5\}$ (purple) are used as intervention sites for $D_{\mathrm{hard}}$ and $D_{\mathrm{soft}}$. Each edge represents a weight of 1. Node $X_t$ (green) is taken to be the target node.

We consider the 7-node linear SEM in Figure 3. The target variable is taken to be $X_t$. Each edge weight is set to 1 for the observational distribution. We consider an interventional distribution $D_{\mathrm{hard}}$ supported on the set of hard interventions on nodes $\{v_3, v_4, v_5\}$. Recall that a hard intervention consists of assigning a value to a node. We draw $m$ interventional distributions from $D_{\mathrm{hard}}$ as our training interventions and draw a sample of $N = 200{,}000$ datapoints from each distribution. The corresponding plots with $N \in \{10{,}000, 15{,}000, 30{,}000, 45{,}000\}$ are available in Appendix 11. We use a notion of inter-dataset variance of the least-squares solutions to construct a set $s_{\mathrm{hard}} = \{S_1, S_2, \ldots\}$, where each $S_i$ is an approximately invariant representation derived from the training datasets. Rather than using the variance, we can use the norm of the gradient; please see Appendix 11 for details. We then generate an additional $m_{\mathrm{test}} = 100$ test interventions from $D_{\mathrm{hard}}$ and generate datasets of size $N = 200{,}000$ from each of them. We use the same inter-dataset variance to count the percentage of subsets in $s_{\mathrm{hard}}$ that continue to have low variation between the least-squares solutions on the new test distributions and the average least-squares solution on the training distributions.
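The subset-selection procedure just described can be sketched end to end on a small hypothetical SEM. The 7-node graph of Figure 3 is not reproduced here; the 3-variable chain, the threshold 0.02, and all names below are our own illustration, with a simple variance ratio playing the role of the inter-dataset criterion:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def hard_dataset(n=50_000):
    """One dataset from a hypothetical 3-variable SEM p -> xt -> w, where a
    hard intervention do(v = a), a ~ N(0, 4), shifts the spurious child w."""
    a = 2.0 * rng.normal()
    p = rng.normal(size=n)              # causal parent of the target
    xt = p + rng.normal(size=n)         # target
    w = xt + a + rng.normal(size=n)     # spurious child, shifted by the intervention
    return np.column_stack([p, w]), xt

def lstsq_on_subset(X, y, S):
    """Least-squares fit of y on the columns in S, embedded in full dimension."""
    f = np.zeros(X.shape[1])
    sol, *_ = np.linalg.lstsq(X[:, list(S)], y, rcond=None)
    f[list(S)] = sol
    return f

def rho(fs):
    """Inter-dataset variance ratio used to call a subset approximately invariant."""
    m = len(fs)
    num = sum(np.sum((fa - fb) ** 2) for fa in fs for fb in fs)
    den = (m - 1) * sum(np.sum(fa ** 2) for fa in fs)
    return num / den

m = 10
datasets = [hard_dataset() for _ in range(m)]
subsets = [S for k in (1, 2) for S in combinations(range(2), k)]
invariant = [S for S in subsets
             if rho([lstsq_on_subset(X, y, S) for X, y in datasets]) < 0.02]
# Only the subset containing the causal parent should pass the threshold.
```

On this toy instance, the least-squares solution on the causal parent is essentially the same in every training dataset, while any subset touching the intervention-shifted child varies substantially across datasets and is rejected.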
We repeat the above for soft interventions drawn from $D_{\mathrm{soft}}$. We consider soft interventions that modify the weights of the edges into $v_3$, $v_4$, and $v_5$. Please see Appendix 11 for exact details of how the hard and soft interventional ensembles are defined.

Figure 4. Generalization of approximately invariant representations under different combinations of train and test interventional distributions (hard, soft, hard+rademacher, ERM+hard) for the experiment described in Section 5. The ERM solution almost always fails to generalize, and the corresponding box plot is at 0.

We also consider a different interventional distribution $D_{\mathrm{rad}}$ to test the generalization of representations that are approximately invariant given training distributions drawn from $D_{\mathrm{hard}}$. These interventions are generated by randomly flipping the signs of the edge weights in the original SEM. Finally, to confirm that there is indeed variation across the least-squares solutions in the different environments, we also plot the percentage of test distributions on which the ERM solution computed on the training distributions is similar to the ERM solution of the test distribution (this percentage is almost always 0, meaning that the ERM solution essentially always fails to generalize). We repeat this for $m$ varying between 3 and 20. The results are plotted in Figure 4. The code is available at https://github.com/advaitparulekar/PAC IRM.

Interpretation: While not every subset in $s_{\mathrm{hard}}$ and $s_{\mathrm{soft}}$ generalizes to every test dataset, we see that most subsets generalize to a large percentage of test distributions. The percentage that generalizes when the train and test interventions are identically distributed exceeds the percentage that generalizes to datasets that come from the (adversarial) Rademacher interventional family.
This phenomenon is captured in our bounds, which describe when most approximately invariant representations will continue to be approximately invariant on most test distributions.

Acknowledgements

This research is supported in part by NSF Grants 2019844, 2107037 and 2112471, ONR Grant N00014-19-1-2566, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.

References

Ahuja, K., Wang, J., Dhurandhar, A., Shanmugam, K., and Varshney, K. R. Empirical or invariant risk minimization? A sample complexity perspective. In International Conference on Learning Representations, 2020.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bareinboim, E., Brito, C., and Pearl, J. Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning, pp. 1–17. Springer, 2012.

Beery, S., Horn, G. V., and Perona, P. Recognition in terra incognita. In ECCV, 2018.

Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. arXiv preprint arXiv:2007.01754, 2020.

Chen, Y., Rosenfeld, E., Sellke, M., Ma, T., and Risteski, A. Iterative feature matching: Toward provable domain generalization with logarithmic environments. In Advances in Neural Information Processing Systems, 2022.

Chen, Y., Zhou, K., Bian, Y., Xie, B., Wu, B., Zhang, Y., Ma, K., Yang, H., Zhao, P., Han, B., et al. Pareto invariant risk minimization: Towards mitigating the optimization dilemma in out-of-distribution generalization. In The Eleventh International Conference on Learning Representations, 2023.

Chickering, M. Statistically efficient greedy equivalence search.
In Conference on Uncertainty in Artificial Intelligence, pp. 241–249. PMLR, 2020.

Duchi, J. C., Glynn, P. W., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2020.

Hauser, A. and Bühlmann, P. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464, 2012.

Kamath, P., Tangella, A., Sutherland, D., and Srebro, N. Does invariant risk minimization capture invariance? In International Conference on Artificial Intelligence and Statistics, pp. 4069–4077. PMLR, 2021.

Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. arXiv preprint arXiv:1707.06422, 2017.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Mansour, Y., Mohri, M., Ro, J., Suresh, A. T., and Wu, K. A theory of multiple-source adaptation with limited target labeled data. In International Conference on Artificial Intelligence and Statistics, pp. 2332–2340. PMLR, 2021.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. PMLR, 2013.

Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009. doi: 10.1017/CBO9780511803161.
Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), pp. 947–1012, 2016.

Rosenfeld, E., Ravikumar, P. K., and Risteski, A. The risks of invariant risk minimization. In International Conference on Learning Representations, 2020.

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., and Peters, J. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society, Series B, 83(2):215–246, April 2021. doi: 10.1111/rssb.12398.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. doi: 10.1017/CBO9781107298019.

Solus, L., Wang, Y., and Uhler, C. Consistency guarantees for greedy permutation-based causal inference algorithms. Biometrika, 108(4):795–814, 2021.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2000.

Squires, C., Wang, Y., and Uhler, C. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pp. 1039–1048. PMLR, 2020.

Subbaswamy, A., Schulam, P., and Saria, S. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3118–3127. PMLR, 2019.

Sullivant, S., Talaska, K., and Draisma, J. Trek separation for Gaussian graphical models. The Annals of Statistics, 38(3):1665–1685, 2010.

Uhler, C., Raskutti, G., Bühlmann, P., and Yu, B. Geometry of the faithfulness assumption in causal inference.
The Annals of Statistics, pp. 436–463, 2013.

Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI '90, pp. 255–270, USA, 1990. Elsevier Science Inc. ISBN 0444892648.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices, 2011.

Wang, H., Si, H., Li, B., and Zhao, H. Provable domain generalization via invariant-feature subspace recovery. In International Conference on Machine Learning, pp. 23018–23033. PMLR, 2022.

Yang, K., Katcoff, A., and Uhler, C. Characterizing and learning equivalence classes of causal DAGs under interventions. In International Conference on Machine Learning, pp. 5541–5550. PMLR, 2018.

Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. PMLR, 2019.

$X_i$: random variable in the SEM, see Equation 1
$\beta_i$: linear weight vector, see Equation 1
$\mathrm{Pa}_i$: vector of random variables corresponding to the parents of $X_i$, see Equation 1
$\Theta$: set of possible interventions
$\Theta_{\mathrm{train}}$: set of training interventions
$D$: distribution from which training and test distributions are sampled
$\theta$: joint distribution over variables induced by intervention $\theta$
$I(\Theta')$: set of invariant representations with respect to $\Theta'$
$I^{\epsilon}(\Theta')$: set of approximately invariant representations with respect to $\Theta'$
$\hat{I}^{\epsilon}(\Theta')$: set of approximately invariant representations with respect to $\Theta'$ (finite samples)
$\Phi$: diagonal matrix encoding a feature selector
$d$: in-degree of nodes in Section 9.3
$k$: number of nodes that have interventions in Sections 9.2 and 9.3
$\delta$: randomness in the samples in the interventional datasets
$\delta$: randomness in the interventions drawn from $D$
Table 1. Notation table

6. Notation

For any quantity $A$ that depends on population statistics, we use $\hat{A}$ to denote the corresponding sample estimate.
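The population-versus-sample distinction that the hat notation tracks is what drives the finite-sample results of Section 4: empirical risk gradients concentrate around their population counterparts as the per-dataset sample size grows. A minimal numerical sketch of this concentration step (the toy model and all names are our own, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear model y = beta . x + noise with x ~ N(0, I), so the population
# risk gradient at any f is available in closed form: E[x x^T] f - E[x y] = f - beta.
beta = np.array([1.0, -0.5, 2.0])
f = np.array([0.8, 0.0, 1.5])   # an arbitrary candidate predictor

def pop_gradient(f):
    """Population gradient of the squared-error risk 0.5 * E[(y - f.x)^2]."""
    return f - beta

def emp_gradient(f, n):
    """Sample estimate of the same gradient from n fresh draws."""
    X = rng.normal(size=(n, 3))
    y = X @ beta + rng.normal(size=n)
    return X.T @ (X @ f - y) / n

# The estimation error shrinks as the number of samples grows.
errs = [np.linalg.norm(emp_gradient(f, n) - pop_gradient(f)) for n in (100, 100_000)]
```

The union bound over an ε-net of representations in Lemma 10.3 then upgrades this pointwise concentration to a guarantee that holds for all representations simultaneously.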
For a vector $a \in \mathbb{R}^n$, we use $a^{(\leq k)}$ to denote the vector consisting of all monomials of degree less than or equal to $k$ in the variables of $a$. Thus, we have $a^{(\leq 2)} = (1, a_1, a_2, \ldots, a_n, a_1^2, a_1 a_2, \ldots, a_n^2)$. We use $a^{(k)}$ (that is, without the $\leq$) to denote the vector consisting of all monomials of degree exactly $k$ in the variables of $a$, so $a^{(2)} = (a_1^2, a_1 a_2, a_1 a_3, \ldots, a_n^2)$. For a matrix $A$, we use $r(A)$ to denote the vector formed by flattening the matrix, so $r(A) = (A_{11}, A_{12}, A_{13}, \ldots, A_{n-1,n}, A_{nn})$. Please see Table 1 for a summary of the notation used in the paper.

7. Auxiliary proofs

7.1. General Results

Lemma 7.1. Consider a sequence of i.i.d. random vectors $V_1, V_2, \ldots, V_m \in \mathbb{R}^d$ with mean $\mu$ such that $\|V_i\| < L$ always. Then for $m > \frac{dL^2}{\epsilon^2\delta}$, we have $\mathbb{P}\left[\left\|\frac{1}{m}\sum_{i=1}^m V_i - \mu\right\| < \epsilon\right] > 1-\delta$.

Proof. By a standard Hoeffding bound, with probability $1-\delta$ each entry of $\frac{1}{m}\sum_{i=1}^m V_i$ is within $\frac{\epsilon}{\sqrt{d}}$ of its mean. Then the squared norm of the error, $\left\|\frac{1}{m}\sum_{i=1}^m V_i - \mu\right\|^2$, is less than $\epsilon^2$.

Lemma 7.2 ((Vershynin, 2011)). The covering number at scale $\epsilon$ of the Euclidean ball of radius $R$ in $\mathbb{R}^d$ is at most $\left(1 + \frac{2R}{\epsilon}\right)^d$.

Lemma 7.3. Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $x \in \mathbb{R}^n$ be a column vector. Then we have $\|Ax\|_2^2 = r(A^\top A)^\top x^{(2)}$.

Proof. This follows by writing the squared norm as a quadratic form.

Lemma 7.4. Consider a diagonal matrix $\Phi \in \mathbb{R}^{n \times n}$. Then there exists a fixed matrix $V \in \mathbb{R}^{n^2 \times n}$ such that $r(\Phi) = V \mathrm{diag}(\Phi)$.

Proof. To construct $V$, we take the $n \times n$ identity matrix and insert $n-1$ rows of all zeros between every pair of consecutive rows, obtaining a matrix $V \in \mathbb{R}^{n^2 \times n}$ that places the $i$th diagonal entry of $\Phi$ at position $(i-1)n + i$ of the flattened vector.

Lemma 7.5 ((Sullivant et al., 2010); (Uhler et al., 2013)). For the class of linear SEMs defined in Equation 1, we have $\Sigma_\theta = J(I - B_\theta)^{-1}\Xi_\theta(I - B_\theta)^{-\top}J^\top$, where $\Xi_\theta = \mathbb{E}_\theta[\eta\eta^\top]$ and $J$ is the identity matrix concatenated with one column of zeros at the index corresponding to $v_t$. Since $B_\theta$ is nilpotent, $(I - B_\theta)^{-1} = \sum_{k=0}^{n-1} B_\theta^k$, so that $\Sigma_\theta = J\left(\sum_{k=0}^{2n-2}\ \sum_{\substack{r+s=k \\ r,s \geq 0}} B_\theta^r\, \Xi_\theta\, (B_\theta^s)^\top\right) J^\top$.

9.
Intervention Models

9.1. General Interventions

Lemma 9.1. For some matrix $U_\theta$, we have
$\Phi^\top \Sigma_\theta \Phi f^* - \Phi^\top\left(\Sigma_\theta \beta + J(I - B_\theta)^{-1} e_t\right) = U_\theta\, r(\Phi)^{(\leq 2)}.$

Proof. We will see that both of these terms can be written as $U_\theta r(\Phi)^{(\leq 2)}$ for some (different) choices of $U_\theta$. We enumerate the coordinates of $r(\Phi)^{(\leq 2)}$ by pairs $((ij),(kl))$, the corresponding entry being $\Phi_{ij}\Phi_{kl}$. Note that $r(\Phi)^{(\leq 2)}$ also contains the individual entries $\Phi_{ij}$, and we use the symbol $\star$ to indicate this extra alphabet; some coordinates are then of the form $r(\Phi)^{(\leq 2)}_{(\star,(kl))} = \Phi_{kl}$. The first term is quadratic in the entries of $\Phi$: collecting coefficients gives $\Phi^\top \Sigma_\theta \Phi f^* = U^1_\theta r(\Phi)^{(\leq 2)}$, where $(U^1_\theta)_{a,((ij),(kl))} = \Sigma_{jk}(f^*)_l$ when $a = i$ and neither index is $\star$, and $0$ otherwise. The second term is linear in the entries of $\Phi$, and hence almost in the desired form already: $\Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right) = U^2_\theta r(\Phi)^{(\leq 2)}$, where the only nonzero entries of $U^2_\theta$ sit in the $\star$-coordinates and are given by entries of $\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t$. Our result follows by setting $U_\theta = U^1_\theta + U^2_\theta$, with the sign of the second term absorbed into $U^2_\theta$.

Theorem 9.2. Suppose that we are given $O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventions drawn independently from a distribution $D$ over the intervention index set $\Theta$. Then with probability at least $1-\delta$ over the randomness in $\Theta_{\mathrm{train}}$, the following statement holds: $\mathbb{P}_{\theta \sim D}\left(I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \not\subseteq I^{\epsilon}_{f}(\theta)\right) \leq \delta$. In other words, an $\epsilon$-approximately invariant representation on the training environments generalizes with high probability to the test environment.

Proof. We again begin with our expression for the gradient of the loss given by Lemma 8.6:
$\nabla R^{\theta}_{\Phi}(f^*) = \Phi^\top\Sigma_\theta\Phi f^* - \Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right).$
We show in Lemma 9.1 that we can write this as $\nabla R^{\theta}_{\Phi}(f^*) = U_\theta r(\Phi)^{(\leq 2)}$ for some matrix $U_\theta$. From Lemma 7.3 we can write the squared norm as $\left\|\nabla R^{\theta}_{\Phi}(f^*)\right\|_2^2 = r(U_\theta^\top U_\theta)^\top \left(r(\Phi)^{(\leq 2)}\right)^{(2)}$. Finally, we can now write our function class as a class of halfspace classifiers:
$g^{\epsilon}_{\Phi}(\theta) = \mathbb{1}\left\{ r(U_\theta^\top U_\theta)^\top \left(r(\Phi)^{(\leq 2)}\right)^{(2)} \leq \epsilon \right\}.$
Now consider the map $\Psi : \mathrm{diag}(\Phi)^{(\leq 4)} \mapsto \left(r(\Phi)^{(\leq 2)}\right)^{(2)}$. Because $\Phi$ is diagonal, this is well defined: each term of $\left(r(\Phi)^{(\leq 2)}\right)^{(2)}$ contains at most 4 entries of $\Phi$, and each such product is an element of $\mathrm{diag}(\Phi)^{(\leq 4)}$.
Also, since the relation between the entries of $\left(r(\Phi)^{(\leq 2)}\right)^{(2)}$ and the entries of $\mathrm{diag}(\Phi)^{(\leq 4)}$ is fixed, $\Psi$ is actually a linear transformation (that is, it is some sparse 0/1 matrix). We can then write
$g^{\epsilon}_{\Phi}(\theta) = \mathbb{1}\left\{ r(U_\theta^\top U_\theta)^\top \Psi\, \mathrm{diag}(\Phi)^{(\leq 4)} \leq \epsilon \right\}.$
Finally, note that $\mathrm{diag}(\Phi)^{(\leq 4)} \in \mathbb{R}^{(n+1)^4}$; that is, we have written the function class $G$ as corresponding to some subset of all halfspace classifiers in $\mathbb{R}^{(n+1)^4}$. By Lemma 3.2, we know that the VC dimension of $G$ is bounded by $(n+1)^4$. The result follows from Corollary 3.4.

9.2. Atomic interventions

Here we consider the instance in which our family of interventions consists of atomic interventions on a total of $k$ nodes $At$. An atomic intervention at node $i$ replaces the generative equation $X_i = \gamma_i(e)^\top \mathrm{Pa}_i + \eta_i$ by the assignment $X_i = a_i$ for some scalar $a_i$. Note that we can interpret this as zeroing out the corresponding entries of $B$ and changing the noise variable to be the constant $a_i$.

Lemma 9.3. For some matrix $U$, we have
$\left\| \Phi^\top\Sigma_\theta\Phi f^* - \Phi^\top\left(\Sigma_\theta\beta + J(I - B_\theta)^{-1}e_t\right) \right\|_2^2 = r(\Phi)^{(\leq 4)\top}\, U\, a^{(\leq 2)}.$

Proof. We will examine the dependence of $\mathbb{E}_\theta[X_{-t}X_{-t}^\top]$ and $\mathbb{E}_\theta[X_{-t}X_t]$ on $a^{\theta}_i$, starting from Lemma 7.5. For the former, note that $(\Sigma_\theta)_{ij}$ can be interpreted as having one term corresponding to every path from $i$ backwards to some node $k$ and then forwards to $j$. Because atomic interventions essentially cut off the connections between a node and its parents, any path that consists initially of backward edges and then of forward edges can have at most a quadratic dependence on the parameters of the intervention, regardless of how many nodes are intervened on. It is only such paths that contribute to the covariance matrix, and thus to the thresholded polynomials that determine approximate invariance. We will make this explicit below.
The nature of atomic interventions is that $B_\theta = B_\Theta = B I_\Theta$ for each of the interventions $\theta \in \Theta$, where $I_\Theta$ is an identity matrix with zeros in the diagonal entries corresponding to $At$ (we are disconnecting each of the sites of intervention from their parents). In addition, the noise covariance changes. Previously, with independent unit-variance noise, the covariance was $\Xi_\theta = \mathbb{E}_\theta[\eta\eta^\top] = I$. Since we are incorporating the assignments of the atomic interventions into the noise, it is now given by
$(\Xi_\theta)_{ij} = \begin{cases} 1, & i = j,\ i \notin At \\ a^{\theta}_i a^{\theta}_j, & i, j \in At \\ 0, & \text{otherwise.} \end{cases}$
That is, we have replaced the submatrix corresponding to the sites of the interventions with $a^{\theta}(a^{\theta})^\top$, where $a^{\theta}$ denotes the vector of assignments $a^{\theta} = (a^{\theta}_1, a^{\theta}_2, \ldots, a^{\theta}_{|At|})$. For the covering argument: if $\mathrm{dist}(\Phi, \Phi') \leq \frac{\epsilon}{2n^{3/2}}$, then the corresponding gradient norms differ by at most $\frac{\epsilon}{2n^{3/2}} \cdot n^{3/2} + \frac{\epsilon}{2n^{3/2}} \cdot n \leq \epsilon$.

Putting these together,

Lemma 10.3. Given $\frac{4dL^2}{\epsilon^2\delta} + \log N_{\epsilon/2}$ samples in a dataset, with probability $1-\delta$, for every representation $\Phi$ simultaneously we have $\hat{I}^{\epsilon/2}_{f}(\theta) \subseteq I^{\epsilon}_{f}(\theta)$.

Proof. Consider $\Phi \in \hat{I}^{\epsilon/2}_{f}(\theta)$. From the definition, $\left\|\hat\nabla R^{\theta}_{\Phi}(f^*)\right\| < \frac{\epsilon}{2}$. By Lemma 7.2 and Lemma 10.1, we know that $\left\|\nabla R^{\theta}_{\Phi}(f^*) - \hat\nabla R^{\theta}_{\Phi}(f^*)\right\| < \frac{\epsilon}{2}$. Putting these together, we see that $\left\|\nabla R^{\theta}_{\Phi}(f^*)\right\| \leq \epsilon$, i.e., $\Phi \in I^{\epsilon}_{f}(\theta)$.

Theorem 10.4. Given
$O\left(\frac{k^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of hard interventions on $k$ nodes,
$O\left(\frac{d^4 k + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of soft interventions on $k$ nodes,
$O\left(\frac{n^4 + \log(1/\delta)}{\delta}\right)$ interventional datasets when $\Theta$ consists of arbitrary interventions,
and $\frac{4nL^2}{\epsilon^2\delta} + n^2\log\left(1 + \frac{8n^{3/2}}{\epsilon}\right)$ samples in each dataset, we have that with probability $1-\delta$, with probability $1-\delta$ for $\theta \sim D$, $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{2\epsilon}_{f}(\theta)$.

Proof. From Lemma 4.2 we know that
$\hat{I}^{\epsilon}_{f}(\theta) \subseteq I^{2\epsilon}_{f}(\theta). \qquad (10)$
With a union bound over $\Theta_{\mathrm{train}}$, we can establish Equation (10) simultaneously over all interventions in the training set. Since $I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) = \bigcap_{\theta \in \Theta_{\mathrm{train}}} I^{\epsilon}_{f}(\theta)$ and $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) = \bigcap_{\theta \in \Theta_{\mathrm{train}}} \hat{I}^{\epsilon}_{f}(\theta)$, we get that $\hat{I}^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}_{f}(\Theta_{\mathrm{train}})$. Finally, from Theorems 3.7, 3.8, and 3.9, we have $I^{\epsilon}_{f}(\Theta_{\mathrm{train}}) \subseteq I^{\epsilon}_{f}(\theta)$ with high probability over $\theta \sim D$, for each of the cases listed.

11.
Empirical study

Construction of training and test interventional distributions: For the class of hard interventions, we consider assignments drawn from Gaussians on nodes $\{v_3, v_4, v_5\}$. Specifically, we take $X_i = a_i$ for $i \in \{3, 4, 5\}$ with $a_i \sim N(0, \sigma^2)$ independently for each node. For the class of soft interventions, we draw each nonzero entry of $\beta^{\theta}_i$ from $N(0, \sigma^2)$ for $i \in \{3, 4, 5\}$ independently. In our experiments, $\sigma^2 = 1$. In this way, we generate $m$ atomic-interventional datasets $\{N_{j,\mathrm{hard}}\}_{j \in [m]}$ and $m$ soft-interventional datasets $\{N_{j,\mathrm{soft}}\}_{j \in [m]}$ for $m$ varying from 3 to 20.

Subsets as Representations: We iterate over the power set of the nodes to make a list of approximately invariant representations. For every subset $S$ of the non-target nodes, we take $\Phi_S$ to be the corresponding diagonal representation, with a one at each index in $S$ (a projection onto $S$). We count the fraction of these representations that are approximately invariant, as defined below.

Approximately Invariant Representations: Here, these are defined as representations for which there is little variation across the least-squares solutions on top of them. For every such representation $\Phi_S$, we denote by $f^*_{S,j}$ the least-squares solution on top of the representation $\Phi_S$ for dataset $N_j$. We take as a measure of invariance the quantity
$\rho_S = \frac{\sum_{j_1, j_2 \in [m]} \left\| f^*_{S,j_1} - f^*_{S,j_2} \right\|^2}{(m-1) \sum_{j \in [m]} \left\| f^*_{S,j} \right\|^2}.$

Figure 5. Comparison of generalization for various combinations of training and test interventional distributions. Here hard refers to the i.i.d.-interventions setting with hard interventions from the same distribution in both training and test interventions; similarly, soft refers to i.i.d.
soft interventions, while hard+rad (respectively soft+rad) refers to hard interventions (respectively soft) in training with Rademacher interventions in test. We repeat the above for $N = 10{,}000$, $15{,}000$, $30{,}000$, and $45{,}000$.

Note that this is the ratio of the average distance between the various least-squares solutions and the norm of the average least-squares solution. We expect this to be low when the various least-squares solutions are close to one another. An approximately invariant subset $S$ is then taken to be one such that $\rho_S < 0.02$. Let $s_{\mathrm{hard}}, s_{\mathrm{soft}}$ denote the approximately invariant subsets given the training sets $\{N_{j,\mathrm{hard}}\}_{j \in [m]}, \{N_{j,\mathrm{soft}}\}_{j \in [m]}$.

Test distributions: We construct $m_{\mathrm{test}}$ test datasets induced by interventions drawn from each of the same families, that is, $\{N^{\mathrm{test}}_{j,\mathrm{hard}}\}_{j \in [m_{\mathrm{test}}]}$ for atomic interventions and $\{N^{\mathrm{test}}_{j,\mathrm{soft}}\}_{j \in [m_{\mathrm{test}}]}$ for soft interventions, as well as a set of datasets $\{N^{\mathrm{test}}_{\alpha,j,\mathrm{rad}}\}_{j \in [m_{\mathrm{test}}]}$ drawn from the linear SEM constructed by flipping each edge weight in the original SEM with probability $\alpha$ (except the ones connecting the parents of the target to the target). Note that this last set of datasets is not drawn from interventions distributed as the training interventions. For each set in $s_{\mathrm{hard}}$, we return the percentage of datasets in $\{N^{\mathrm{test}}_{j,\mathrm{hard}}\}$, $\{N^{\mathrm{test}}_{j,\mathrm{soft}}\}$, and $\{N^{\mathrm{test}}_{\alpha,j,\mathrm{rad}}\}$ on which the least-squares solution on top of the subset continues to be approximately the same. The labels of the plots list the training and test interventional distributions in order, so soft+rad(0.5) means we are evaluating $s_{\mathrm{soft}}$ on 100 datasets $\{N^{\mathrm{test}}_{0.5,j,\mathrm{rad}}\}$.

ERM generalization: To demonstrate that generalization in this sense is indeed non-trivial, we also plot the fraction of least-squares solutions on the observational datasets (where we pool together the training distributions) that exhibit low variance on the test interventional distributions.
This box plot is essentially trivial at 0.

Norm of gradient: Rather than using an RMS variance, we can use the norm of the gradient. The empirical results are similar. In fact, the two criteria are related to each other: since $\nabla R_1(f_1) = \nabla R_2(f_2) = 0$, if $\frac{\|f_1 - f_2\|}{2} \leq \epsilon$, then $\left\|\nabla R_1\left(\frac{f_1+f_2}{2}\right)\right\|, \left\|\nabla R_2\left(\frac{f_1+f_2}{2}\right)\right\| \leq \beta\epsilon$, where $\beta$ is the smoothness of the loss functions.
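The stated relation between the variance criterion and the gradient-norm criterion can be sanity-checked numerically for smooth quadratic risks; a sketch under our own toy construction (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two beta-smooth quadratic risks R_i(f) = 0.5 (f - fi*)^T A_i (f - fi*) with
# minimizers f1 and f2. If ||f1 - f2|| / 2 <= eps, the gradient of either risk
# at the midpoint is at most beta * eps, where beta bounds the curvature.
def make_quadratic(dim):
    M = rng.normal(size=(dim, dim))
    A = M @ M.T + np.eye(dim)        # positive-definite curvature matrix
    fstar = rng.normal(size=dim)     # minimizer of this risk
    return A, fstar

dim = 4
A1, f1 = make_quadratic(dim)
A2, _ = make_quadratic(dim)
f2 = f1 + 0.01 * rng.normal(size=dim)   # nearby minimizer for the second risk

g1 = lambda f: A1 @ (f - f1)            # gradients of the two risks
g2 = lambda f: A2 @ (f - f2)

beta = max(np.linalg.eigvalsh(A1).max(), np.linalg.eigvalsh(A2).max())
eps = np.linalg.norm(f1 - f2) / 2
mid = (f1 + f2) / 2
bound_ok = max(np.linalg.norm(g1(mid)), np.linalg.norm(g2(mid))) <= beta * eps
```

The inequality follows because $\|\nabla R_1(\mathrm{mid})\| = \|A_1(\mathrm{mid} - f_1)\| \leq \beta \|f_1 - f_2\|/2$, so close-together minimizers (low inter-dataset variance) force small gradients at the averaged solution, and vice versa.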