Interpolation and Regularization for Causal Learning

Leena Chennuru Vankadara* (University of Tübingen, Tübingen AI Center)
Luca Rendsburg* (University of Tübingen, Tübingen AI Center)
Ulrike von Luxburg (University of Tübingen, Tübingen AI Center)
Debarghya Ghoshdastidar (Technical University of Munich)

* denotes equal contribution. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Recent work shows that in complex model classes, interpolators can achieve statistical generalization and can even be optimal for statistical learning. However, despite increasing interest in learning models with good causal properties, there is no understanding of whether such interpolators can also achieve causal generalization. To address this gap, we study causal learning from observational data through the lens of interpolation and its counterpart, regularization. Under a simple linear causal model, we derive precise asymptotics for the causal risk of the min-norm interpolator and ridge regressors in the high-dimensional regime. We find a large range of behavior that can be precisely characterized by a new measure of confounding strength. When the confounding strength is positive, which holds under independent causal mechanisms (a standard assumption in causal learning), we find that interpolators cannot be optimal. Indeed, causal learning requires stronger regularization than statistical learning. Beyond this assumption, when the confounding strength is negative, we observe a phenomenon of self-induced regularization due to positive alignment between the statistical and causal signals. Here, causal learning requires weaker regularization than statistical learning, interpolators can be optimal, and the optimal regularization can even be negative.

1 Introduction

We consider the problem of learning the causal influence of multivariate covariates $x \in \mathbb{R}^d$ on a scalar target variable $y \in \mathbb{R}$ purely from observational data and under the presence of hidden confounders. Formally, given finite samples $\{(x_i, y_i)\}_{i=1}^n$ drawn independently and identically distributed (i.i.d.) from the joint observational distribution $p(x, y) = p(x)\,p(y \mid x)$, the goal of causal learning is to predict the effects on the target variable $y$ under interventions on the covariates $x$. In other words, the goal is to learn a predictive model that minimizes the expected loss on a random draw from the interventional distribution $p_{\mathrm{do}}(x, y) = p(x)\,p(y \mid \mathrm{do}(x))$, which can be different from the observational distribution.

Recently, Janzing (2019) established a close analogy between statistical and causal learning (albeit under a highly constructed confounded model). As a consequence, Janzing (2019) suggested that standard statistical learning-theoretic techniques (such as norm-based regularization) may also help learn good causal models. However, the classical statistical principle of the bias-variance trade-off has been challenged in recent years by highly complex classes of models that are trained to interpolate the data and yet achieve remarkable generalization properties across a broad range of problem domains (Zhang et al., 2021). A large volume of recent work suggests that interpolation can be compatible with, and may even be necessary to achieve, optimal statistical generalization in the high-dimensional regime (Belkin et al., 2018; Belkin et al., 2019a; Liang et al., 2020; Feldman, 2020). Despite the surge in interest, the causal properties of such interpolators have not yet been explored.
In this work, we consider a simple linear causal model in the high-dimensional regime ($n, d \to \infty$, $d/n \in O(1)$) and ask: can interpolators achieve good causal generalization?

1.1 Motivation and Related Work

Resemblance between statistical and causal generalization
Causal learning can be regarded as an instance of the general problem of learning under distribution shifts, where the training (observational) distribution is shifted from the test (interventional) distribution. In the framework of out-of-distribution generalization, an interesting proposition for causal learning arises from the following high-level idea: observing small sample sizes may induce a similar bias as distribution shifts. Therefore, techniques for learning models with good out-of-sample generalization (such as regularization) may also help to learn models with good out-of-distribution generalization, and vice versa. The literature provides plentiful evidence to support this general principle for different classes of distribution shifts. For instance, under a broad class of distribution shifts, distributionally robust optimization is equivalent to norm-based regularization (Xu et al., 2009; Shafieezadeh Abadeh et al., 2015; Gao et al., 2017; Shafieezadeh-Abadeh et al., 2019; Blanchet et al., 2019; Kuhn et al., 2019). Analogously, distributionally robust optimization techniques are also employed for statistical learning under limited samples (Zhu et al., 2020). Particularly relevant to our work is Janzing (2019), which formally establishes a close analogy between generalizing from empirical to observational distributions and generalizing from observational to interventional distributions under a highly constructed confounding model. This analogy suggests that standard norm-based regularization such as lasso or ridge, typically used for statistical learning, may also help learn better causal models.

Interpolation can be compatible with statistical learning
Explicit norm-based regularization techniques were initially motivated by the classical learning-theoretic principle of the bias-variance trade-off, which is characterized by a U-shaped generalization curve. This principle recommends avoiding interpolation and instead balancing data fit against the complexity of the hypothesis class. Recently, however, these classical principles have been challenged by deep learning models. Despite being highly complex, with the ability to fit even random labels, and often trained to interpolate the training data, they achieve state-of-the-art out-of-sample generalization across many domains (Zhang et al., 2021). A partial explanation is provided by the double-descent phenomenon (Belkin et al., 2019b; Belkin, 2021): extending the generalization curve beyond the interpolation threshold reveals two regimes, the classical U-curve in the underparameterized regime and a decreasing curve in the overparameterized regime. This behavior is not limited to deep neural networks, but extends to other settings such as random feature models and random forests (Belkin et al., 2019b; Hastie et al., 2022; Mei et al., 2021). Follow-up work suggests that in the overparameterized regime, interpolators can indeed achieve low statistical risk (Belkin et al., 2019a; Liang et al., 2020; Bartlett et al., 2020; Tsigler et al., 2020; Muthukumar et al., 2020).

Is interpolation compatible with causal learning?
On account of the parallels between statistical (out-of-sample) and causal (out-of-distribution) learning, it is natural to ask: can interpolators also learn good causal models? One line of empirical work suggests that naively applying distributionally robust learning techniques such as importance reweighting or distributionally robust optimization (which are equivalent to certain forms of regularization) offers vanishing benefits over empirical risk minimization in overparameterized model classes (Byrd et al., 2019; Sagawa et al., 2020; Gulrajani et al., 2021). However, there is also empirical evidence suggesting that augmenting such techniques with additional explicit norm-based regularization may help to learn distributionally robust models in the overparameterized regime (Sagawa et al., 2020; Donhauser et al., 2021). In the context of causal learning, it has been suggested that explicit regularization can be beneficial and might even need to be stronger than for statistical learning (Janzing, 2019; Vankadara et al., 2021). Existing work therefore remains inconclusive about the role of explicit regularization in causal learning, or correspondingly, about whether interpolation is compatible with causal learning. In this work, we take a theoretical approach to systematically address these questions.

1.2 Our Contributions

We provide a first analysis of causal generalization from observational data in the modern, overparameterized and interpolating regime under a simple linear causal model. Specifically, we consider the interpolating minimum $\ell_2$-norm least-squares estimator and the family of regularized ridge regression estimators in the proportional asymptotic regime. We seek answers to the following questions: Is there a regime where the optimal causal regularization parameter is 0, that is, can we observe benign causal overfitting? Furthermore, if the optimal causal regularization parameter is positive, how strongly do we need to regularize? How does the optimal causal regularization compare to the optimal statistical regularization? While our analysis is exhaustive, we emphasize the results under the assumption of independent causal mechanisms (Janzing et al., 2010), a standard assumption in causal learning.

Precise asymptotics of the causal risk (Section 3). We provide precise asymptotics of the causal risk of the ridge regression estimator as well as the minimum $\ell_2$-norm interpolator in the high-dimensional setting $n, d \to \infty$, $d/n \to \gamma \in (0, \infty)$. Our results confirm that, similar to the statistical setting, the causal generalization curve of the min-norm estimator exhibits the double-descent phenomenon. This is because the variance term diverges at the interpolation threshold and is decreasing in the overparameterized regime ($\gamma > 1$).

A measure of confounding strength $\zeta$ (Section 2.1). We introduce a new measure of confounding strength $\zeta$ that quantifies the relative contribution of the confounding signal to the causal signal. It can be interpreted as the strength of the distribution shift between the observational and interventional distributions. While $\zeta$ can take any real value in general, it is restricted to $[0, 1]$ under the assumption of independent causal mechanisms. There, it induces a strict, model-independent ordering of all causal models that entail the same observational distribution.

Benign causal overfitting (Section 4). When the causal signal dominates the statistical signal ($\zeta < 0$), we observe a phenomenon of self-induced regularization due to the confounding signal.
As a consequence, the optimal causal regularization can be 0 or negative even if the optimal statistical regularization is strictly positive. Under the assumption of independent causal mechanisms, however, we show that there is no benign causal overfitting. This is in contrast to benign statistical overfitting, which can occur in the highly underparameterized regime ($\gamma \to 0$).

Optimal causal vs. statistical regularization (Section 5). We show that causal learning requires weaker regularization than statistical learning when the confounding strength $\zeta$ is negative. However, when $\zeta > 0$, and in particular under the principle of independent causal mechanisms, we show that causal learning requires stronger regularization than statistical learning. More specifically, the optimal causal regularization is strictly increasing in the confounding strength.

2 Problem Setup

Figure 1: (a) Graphical model of the causal model defined in (1). (b) The usual statistical model. In both figures, observed random variables are shaded and unobserved variables are white.

We consider a linear causal model with parameters $M \in \mathbb{R}^{d \times l}$, $\alpha \in \mathbb{R}^l$, $\beta \in \mathbb{R}^d$ with $l \geq d$ and $\sigma^2 > 0$, described via the structural equations

  $z \sim N(0, I_l)$, $\varepsilon \sim N(0, \sigma^2)$, $x = Mz$, $y = x^T \beta + z^T \alpha + \varepsilon$.  (1)

The covariates $x \in \mathbb{R}^d$ and the target $y \in \mathbb{R}$ are confounded through $z \in \mathbb{R}^l$, which follows a standard normal distribution. This structure implies that $\mathbb{E}[x] = 0$ and that the covariance of $x$ is $\Sigma := \mathrm{Cov}(x) = MM^T$. A graphical representation of this causal model is given in Figure 1a. The observational joint distribution of this causal model is given by $p(x, y) = p(x)\,p(y \mid x)$, where $x \sim N(0, \Sigma)$ and $y \mid x \sim N(x^T \tilde{\beta}, \tilde{\sigma}^2)$. Here, the statistical parameter $\tilde{\beta} := \beta + \Gamma$ consists of the causal parameter $\beta$ and a confounding parameter $\Gamma := M^{+T} \alpha$, where $M^{+T}$ is shorthand for $(M^+)^T$ and $M^+$ denotes the Moore-Penrose inverse of $M$. The statistical noise is given by $\tilde{\sigma}^2 := \sigma^2 + \|\alpha\|^2 - \|\Gamma\|^2_\Sigma$, where $\|x\|^2_\Sigma := x^T \Sigma x$ denotes the generalized norm.[1] Note that the observational distribution alone cannot distinguish the causal model from the one in Figure 1b.

[1] Note that $\|\alpha\|^2 - \|\Gamma\|^2_\Sigma = \|\alpha\|^2_{I - M^+M} \geq 0$, where $I - M^+M$ is the orthogonal projection onto $\ker M$.

The goal of statistical learning is to predict $y$ after observing $x$, which is captured by the conditional distribution $p(y \mid x)$. In contrast, the goal of causal learning is to predict $y$ after manipulating or intervening on $x$. This is formally captured by Pearl's do-calculus (Pearl, 2009), which describes interventions on random variables as a shift in the joint distribution. Intervening on $x$ with the value $x_0$, denoted as $\mathrm{do}(x = x_0)$, removes all arrows into $x$ and sets $x = x_0$. In our causal model (1), the intervention $\mathrm{do}(x = x_0)$ removes the arrow $z \to x$ and yields the updated structural equations

  $z \sim N(0, I_l)$, $\varepsilon \sim N(0, \sigma^2)$, $x = x_0$, $y = x_0^T \beta + z^T \alpha + \varepsilon$.

The corresponding distribution of $y$ after intervening on $x$ is therefore given by $y \mid \mathrm{do}(x = x_0) \sim N(x_0^T \beta, \tilde{\sigma}^2 + \|\Gamma\|^2_\Sigma)$. Since arbitrary interventions can introduce arbitrary distribution shifts, we consider the natural class of interventions drawn from the observational marginal distribution on $x$. This yields the interventional joint distribution $p_{\mathrm{do}}(x, y) = p(x)\,p(y \mid \mathrm{do}(x))$, with the slight abuse of notation $\mathrm{do}(x)$ in which the random variable $x$ and its value coincide.

Causal learning from observational data
Assume we are given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from the observational joint distribution $p(x, y)$, which we collect in $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^n$.
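To make the setup concrete, here is a minimal simulation sketch of the model (1) and the interventional distribution. It is not taken from the paper; the dimensions, the parameter draws, and all function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, sigma = 30, 35, 0.5            # illustrative sizes with l >= d

# Parameters of model (1); random draws purely for illustration.
M = rng.normal(size=(d, l)) / np.sqrt(l)
alpha = rng.normal(size=l)
beta = rng.normal(size=d)

Sigma = M @ M.T                                   # Cov(x) = M M^T
Gamma = np.linalg.pinv(M).T @ alpha               # Gamma = M^{+T} alpha
beta_tilde = beta + Gamma                         # statistical parameter
sigma_tilde2 = sigma**2 + alpha @ alpha - Gamma @ Sigma @ Gamma  # statistical noise

def sample_observational(n):
    """Draw (X, Y) i.i.d. from the observational distribution p(x, y)."""
    Z = rng.normal(size=(n, l))
    eps = rng.normal(scale=sigma, size=n)
    X = Z @ M.T                                   # x = M z
    Y = X @ beta + Z @ alpha + eps                # y = x^T beta + z^T alpha + eps
    return X, Y

def sample_interventional(n):
    """Draw (X, Y) from p_do(x, y): x keeps its marginal N(0, Sigma),
    but the arrow z -> x is cut, so x is independent of the confounder."""
    X = rng.normal(size=(n, l)) @ M.T
    Z = rng.normal(size=(n, l))                   # fresh confounder, independent of X
    eps = rng.normal(scale=sigma, size=n)
    Y = X @ beta + Z @ alpha + eps
    return X, Y
```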
The usual statistical learning aims for the observational conditional $p(y \mid x)$, which means that train and test distributions coincide. Causal learning aims for the interventional conditional $p(y \mid \mathrm{do}(x))$, a distribution shift problem for which train and test distributions differ. We define the corresponding causal risk $R_C$ and statistical risk $R_S$ of any linear regressor $\hat{\beta} \in \mathbb{R}^d$ under the squared loss as

  $R_C(\hat{\beta}) := \mathbb{E}_x \mathbb{E}_{y \mid \mathrm{do}(x)} (x^T \hat{\beta} - y)^2$ and $R_S(\hat{\beta}) := \mathbb{E}_x \mathbb{E}_{y \mid x} (x^T \hat{\beta} - y)^2$.  (2)

The following proposition (proven in Appendix A) characterizes the risks under the model (1).

Proposition 2.1 (Causal and Statistical Risk). For any $\hat{\beta} \in \mathbb{R}^d$, the risks defined in Eq. (2) satisfy

  $R_C(\hat{\beta}) = \|\hat{\beta} - \beta\|^2_\Sigma + \tilde{\sigma}^2 + \|\Gamma\|^2_\Sigma$ and $R_S(\hat{\beta}) = \|\hat{\beta} - \tilde{\beta}\|^2_\Sigma + \tilde{\sigma}^2$.

Therefore, $\beta$ is the optimal causal parameter and $\tilde{\beta}$ is the optimal statistical parameter. In the following, we simply refer to them as the causal and statistical parameters.

2.1 A New Measure of Confounding Strength

Since the interventional distribution generally differs from the observational distribution, we require a measure that quantifies how this shift influences causal learning from observational data.

Signal-to-noise ratios (SNRs)
Before we define our measure of confounding strength, we first define the statistical and causal signal-to-noise ratios, which help to intuitively understand our confounding strength measure. Recall that every causal model entails a statistical model, since the causal parameter $\beta$ and the confounding parameter $\Gamma$ jointly specify the statistical parameter $\tilde{\beta} = \beta + \Gamma$. The statistical SNR is defined as usual by $\mathrm{SNR}_S := \|\tilde{\beta}\|^2 / \tilde{\sigma}^2$. For the causal SNR, a natural notion would be $\|\beta\|^2 / (\tilde{\sigma}^2 + \|\Gamma\|^2_\Sigma)$ if the learning algorithm had access to data from the interventional distribution $y \mid \mathrm{do}(x) \sim N(x^T \beta, \tilde{\sigma}^2 + \|\Gamma\|^2_\Sigma)$; but since we are constrained to data from the observational conditional $y \mid x \sim N(x^T \tilde{\beta}, \tilde{\sigma}^2)$, the corresponding causal SNR, which quantifies the hardness of the learning problem, needs to take this into consideration. Accordingly, we consider the causal SNR as the ratio of the alignment between the statistical and causal parameters to the variance of the observational conditional. Formally, we define it as $\mathrm{SNR}_C := \langle \beta, \tilde{\beta} \rangle / \tilde{\sigma}^2$. In what follows, we therefore often refer to $\langle \beta, \tilde{\beta} \rangle$ as the causal signal and to $\|\tilde{\beta}\|^2$ as the statistical signal. Correspondingly, we refer to $\langle \tilde{\beta} - \beta, \tilde{\beta} \rangle = \langle \Gamma, \tilde{\beta} \rangle$ as the confounding signal, which is the alignment between the confounding parameter $\Gamma$ and the statistical parameter $\tilde{\beta}$.

Confounding strength
Regression on observational data implicitly assumes that the interventional distribution coincides with the observational distribution, while it can be shifted in general. To quantify the impact of this distribution shift on the corresponding causal risk, we introduce a new confounding strength measure $\zeta$. It measures the relative contribution of the confounding signal to the statistical signal and is defined by

  $\zeta := \langle \Gamma, \tilde{\beta} \rangle / (\langle \Gamma, \tilde{\beta} \rangle + \langle \beta, \tilde{\beta} \rangle) = \langle \Gamma, \tilde{\beta} \rangle / \|\tilde{\beta}\|^2$.

Other notions of confounding strength are possible, but we will see later that this definition is well suited to capture the shift strength for causal learning from observational data. Without further restrictions, $\zeta$ can take any value in $\mathbb{R}$.
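Continuing the illustrative snippet from Section 2 (and reusing its hypothetical names), the closed-form risks of Proposition 2.1 and the signal quantities read as follows.

```python
import numpy as np

def causal_risk(beta_hat):
    """R_C from Proposition 2.1: ||b - beta||^2_Sigma + sigma_tilde^2 + ||Gamma||^2_Sigma."""
    diff = beta_hat - beta
    return diff @ Sigma @ diff + sigma_tilde2 + Gamma @ Sigma @ Gamma

def statistical_risk(beta_hat):
    """R_S from Proposition 2.1: ||b - beta_tilde||^2_Sigma + sigma_tilde^2."""
    diff = beta_hat - beta_tilde
    return diff @ Sigma @ diff + sigma_tilde2

snr_stat = (beta_tilde @ beta_tilde) / sigma_tilde2      # SNR_S
snr_causal = (beta @ beta_tilde) / sigma_tilde2          # SNR_C
zeta = (Gamma @ beta_tilde) / (beta_tilde @ beta_tilde)  # confounding strength
assert np.isclose(snr_causal, (1.0 - zeta) * snr_stat)   # SNR_C = (1 - zeta) SNR_S
```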
This measure divides the causal models into the following three regimes, depending on the relationship between the causal and statistical signals:

$\zeta \geq 1$: the causal signal $\langle \beta, \tilde{\beta} \rangle$ is non-positive, which implies that the causal and statistical parameters are orthogonal or negatively aligned. Statistical learning is adversarial to causal learning.

$0 < \zeta < 1$: the causal and statistical parameters are positively aligned, but the causal signal is weaker than the statistical signal $\|\tilde{\beta}\|^2$; for example, $\beta = \tilde{\beta}/2$.

$\zeta \leq 0$: the causal signal dominates the statistical signal; for example, $\beta = 2\tilde{\beta}$.

The SNRs are related to the confounding strength measure via $\mathrm{SNR}_C = (1 - \zeta)\,\mathrm{SNR}_S$. In particular, the causal signal decreases as the confounding strength increases.

The regime $0 \leq \zeta \leq 1$ is practically most relevant
Causal learning often requires strong assumptions because causal models cannot be uniquely identified by their observational distribution. A standard assumption is the principle of independent causal mechanisms (ICM) (Janzing et al., 2010; Lemeire et al., 2013; Peters et al., 2017), which informally asserts that causal mechanisms share no information. In our causal model (1), a corresponding assumption could be that the causal mechanisms $\alpha$ and $\beta$ are drawn from rotationally invariant distributions. This implies that $\langle \beta, \Gamma \rangle \approx 0$ as $d \to \infty$, which in turn falls in the regime $0 \leq \zeta \leq 1$. While our following analysis covers all possible causal models, we pay special attention to this regime because it might be of highest practical relevance. Note that for $\langle \beta, \Gamma \rangle = 0$, our measure of confounding strength coincides with the measure $\zeta = \|\Gamma\|^2 / (\|\Gamma\|^2 + \|\beta\|^2)$ introduced by Janzing et al. (2017), which measures the relative contribution of the causal and confounding signals in terms of lengths rather than inner products.

3 Causal and Statistical Risk of High-Dimensional Regression Models

Causal learning is extremely challenging because it requires scarcely available interventional data or has to rely on other information such as exogenous (Rothenhäusler et al., 2021) or instrumental variables (Angrist et al., 1991). In our setting, where only observational data are available, causal learning requires additional model assumptions. One such approach is followed by the ConCorr method (Janzing, 2019), which leverages the ICM assumption to make an improved choice of the regularization parameter under a linear regression model.

To fully characterize the effect of regularization on causal generalization, we consider two estimators for learning causal models from observational data $(X, Y) \in \mathbb{R}^{n \times d} \times \mathbb{R}^n$: the min-norm interpolator and ridge regressors. The min-norm interpolator is the minimum $\ell_2$-norm solution to the least-squares regression problem

  $\hat{\beta}_0(X, Y) := \arg\min \{ \|\hat{\beta}\|^2 : \hat{\beta} \in \arg\min_{\hat{\beta} \in \mathbb{R}^d} \|Y - X\hat{\beta}\|^2 \}$.  (4)

A closed form is given by $\hat{\beta}_0(X, Y) = (X^T X)^+ X^T Y$. For $\lambda > 0$, the ridge regressor solves

  $\hat{\beta}_\lambda(X, Y) := \arg\min_{\hat{\beta} \in \mathbb{R}^d} \frac{1}{n} \|Y - X\hat{\beta}\|^2 + \lambda \|\hat{\beta}\|^2$,  (5)

which has the explicit solution $\hat{\beta}_\lambda(X, Y) = (X^T X + n\lambda I_d)^{-1} X^T Y$. The min-norm interpolator can be obtained as a limiting case of the ridge regression solution via $\hat{\beta}_0(X, Y) = \lim_{\lambda \to 0^+} \hat{\beta}_\lambda(X, Y)$. Whenever it is clear from the context, we drop the dependence of the predictors on $X$ and $Y$.
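Both estimators have one-line implementations. The sketch below again continues the hypothetical snippets above (in particular, it reuses sample_observational, causal_risk, and statistical_risk).

```python
import numpy as np

def min_norm_interpolator(X, Y):
    """Minimum l2-norm least-squares solution (4): (X^T X)^+ X^T Y = X^+ Y."""
    return np.linalg.pinv(X) @ Y

def ridge_regressor(X, Y, lam):
    """Ridge solution (5): (X^T X + n * lam * I_d)^{-1} X^T Y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

# Causal vs. statistical risk of ridge along a small regularization path.
X, Y = sample_observational(100)
for lam in [1e-6, 0.1, 1.0]:
    b = ridge_regressor(X, Y, lam)
    print(f"lam={lam:g}: R_C={causal_risk(b):.3f}, R_S={statistical_risk(b):.3f}")
```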
Before proceeding with the analysis, we motivate the idea that appropriate regularization can help to learn causal models from purely observational data. To this end, Figure 2 compares regularization chosen by statistical cross-validation to regularization chosen on an interventional validation set. Since cross-validation implicitly assumes that there is no confounding, it is close to Bayes optimal for $\zeta = 0$ when $n \gg d$. However, as the confounding increases, it falls behind regularization based on the interventional validation set. The latter even yields the Bayes-optimal risk again in the purely confounded setting $\zeta = 1$, where the lack of causal signal ($\beta = 0$) is encoded by infinite regularization. While we might not have access to an interventional validation set in practice, our theory will show that knowledge of the confounding strength is sufficient for choosing appropriate regularization. Finally, we want to caution that even though regularization can help, it does not remove the hardness of causal learning. Reliable causal inference still requires stronger assumptions or additional data.

Figure 2: Causal excess risk $\|\hat{\beta} - \beta\|^2_\Sigma$ as a function of the confounding strength $\zeta$ for ridge predictors based on n = 30,000 samples from the observational distribution. Regularization is chosen either by cross-validation or based on a validation set of the same size from the interventional distribution; the null predictor serves as a baseline. Each model has fixed dimensions d = 300, l = 350 and $\mathrm{SNR}_S = 5$, but a different underlying confounding strength under the constraint $\langle \beta, \Gamma \rangle = 0$. The benefits of optimal regularization over cross-validation increase with the confounding strength.

3.1 Precise Asymptotics of the Causal and Statistical Risks

In this section, we provide precise asymptotics for the causal and statistical risks of the min-norm interpolator and the ridge regression solutions in the high-dimensional regime. This regime is characterized by $n, d \to \infty$ such that $d/n \to \gamma \in (0, \infty)$, where $\gamma$ is called the overparameterization ratio. We distinguish between the underparameterized regime ($\gamma < 1$) and the overparameterized regime ($\gamma > 1$). All proofs for this section are deferred to Appendix B.

Since the predictors $\hat{\beta} = \hat{\beta}(X, Y)$ are random variables in the training data $X$ and $Y$, so is their corresponding causal risk. We consider the expectation of this risk under $Y$ conditioned on $X$. According to Proposition 2.1, it is given by

  $R^C_X(\hat{\beta}) := \mathbb{E}_{Y|X} R_C(\hat{\beta}) = \mathbb{E}_{Y|X} \|\hat{\beta} - \beta\|^2_\Sigma + \tilde{\sigma}^2 + \|\Gamma\|^2_\Sigma$.

Due to its simple form, similar to the usual statistical risk, the causal excess risk can be decomposed into a bias and a variance term:

  $\mathbb{E}_{Y|X} \|\hat{\beta}_\lambda - \beta\|^2_\Sigma = \underbrace{\|\mathbb{E}_{Y|X}\hat{\beta}_\lambda - \beta\|^2_\Sigma}_{=:\, B^C_X(\hat{\beta}_\lambda)} + \underbrace{\mathbb{E}_{Y|X}\|\hat{\beta}_\lambda - \mathbb{E}_{Y|X}\hat{\beta}_\lambda\|^2_\Sigma}_{=:\, V^C_X(\hat{\beta}_\lambda)}$.

The next theorem is one of our main results. It gives a closed-form expression for the limiting causal bias and variance of the min-norm interpolator and the ridge regression estimators. We make the simplifying assumption of isotropic covariance $\Sigma = I_d$. The proof relies on recent techniques from random matrix theory. It employs arguments similar to Dicker (2016), Dobriban et al. (2018), and Hastie et al. (2022) and can correspondingly be extended to arbitrary covariances under boundedness assumptions on the spectrum. We leave such extensions for future work and focus on thoroughly understanding the isotropic causal model, because it already exhibits rather rich behavior.

Theorem 3.1 (Limiting Causal Bias-Variance Decomposition for the Ridge Estimator). Let $\|\beta\|^2 = r^2$, $\|\Gamma\|^2 = \omega^2$, $\langle \Gamma, \beta \rangle = \eta$, and fix $\tilde{\sigma}^2$. Then as $n, d \to \infty$ such that $d/n \to \gamma \in (0, \infty)$, it holds almost surely in $X$ for every $\lambda > 0$ that

  $B^C_X(\hat{\beta}_\lambda) \to B^C_\lambda = \omega^2 + \tilde{r}^2 \lambda^2 m'(-\lambda) - 2(\omega^2 + \eta)\,\lambda\, m(-\lambda)$  (7)

and

  $V^C_X(\hat{\beta}_\lambda) \to V^C_\lambda = \tilde{\sigma}^2 \gamma\,\bigl(m(-\lambda) - \lambda m'(-\lambda)\bigr)$,  (8)

where $m(z) = \bigl((1 - \gamma - z) - \sqrt{(1 - \gamma - z)^2 - 4\gamma z}\bigr)/(2\gamma z)$ and $\tilde{r}^2 = r^2 + \omega^2 + 2\eta$. Therefore,

  $R^C_X(\hat{\beta}_\lambda) \to R^C_\lambda = B^C_\lambda + V^C_\lambda + \tilde{\sigma}^2 + \omega^2$.

The corresponding limiting quantities for the min-norm interpolator can be obtained by taking the limit $\lambda \to 0^+$ in (7) and (8).
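The limiting expressions of Theorem 3.1 are straightforward to evaluate numerically. The sketch below is our own illustration (continuing the Python snippets above); in particular, the derivative $m'$ is approximated by a central finite difference rather than computed in closed form.

```python
import numpy as np

def m_mp(z, gamma):
    """Stieltjes transform m(z) of the Marchenko-Pastur law with ratio gamma (z < 0)."""
    b = 1.0 - gamma - z
    return (b - np.sqrt(b**2 - 4.0 * gamma * z)) / (2.0 * gamma * z)

def m_mp_prime(z, gamma, eps=1e-7):
    """Derivative m'(z), approximated by a central finite difference."""
    return (m_mp(z + eps, gamma) - m_mp(z - eps, gamma)) / (2.0 * eps)

def limiting_causal_risk(lam, gamma, r2, w2, eta, sigma_tilde2):
    """Limiting causal risk R^C_lam from Theorem 3.1 (isotropic Sigma = I_d).
    r2 = ||beta||^2, w2 = ||Gamma||^2, eta = <Gamma, beta>, sigma_tilde2 = statistical noise."""
    m, mp = m_mp(-lam, gamma), m_mp_prime(-lam, gamma)
    r2_tilde = r2 + w2 + 2.0 * eta                                     # ||beta_tilde||^2
    bias = w2 + r2_tilde * lam**2 * mp - 2.0 * (w2 + eta) * lam * m    # (7)
    var = sigma_tilde2 * gamma * (m - lam * mp)                        # (8)
    return bias + var + sigma_tilde2 + w2

# A small lam approximates the min-norm interpolator; the risk blows up near gamma = 1.
for gamma in [0.5, 0.9, 1.1, 2.0, 5.0]:
    print(gamma, limiting_causal_risk(1e-4, gamma, r2=2.0, w2=1.0, eta=0.0, sigma_tilde2=1.0))
```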
From these limiting expressions we can see that the causal risk curve of the min-norm interpolator exhibits the double-descent phenomenon: it diverges at the interpolation threshold $\gamma = 1$ due to the variance term and decreases again for $\gamma > 1$. A corresponding visualization is given in Figure 4. Explicit regularization dampens the divergence of the variance term.

While we are primarily interested in the causal risk, the corresponding statistical risk serves as a natural baseline. An analogous set of results for the statistical risk is given in Appendix C. These results have already been derived by Hastie et al. (2022) and can also be recovered as a special case of our causal results: for fixed statistical parameters $\tilde{\beta}$ and $\tilde{\sigma}^2$, the statistical risk coincides with the causal risk of an unconfounded causal model defined with $\beta = \tilde{\beta}$, $\sigma^2 = \tilde{\sigma}^2$, and $\alpha = 0$. In particular, the corresponding statistical limiting expressions are the same as in Theorem 3.1 after setting $\eta = \omega^2 = 0$.

Optimal statistical and causal regularization
By directly optimizing the closed-form expressions for the limiting causal and statistical risks, we can find the optimal causal and statistical regularization. For any $\gamma \in (0, \infty)$, the optimal statistical regularization $\lambda^*_S(\gamma) := \arg\inf_{\lambda \in (0, \infty)} R^S_\lambda$ can be expressed in closed form as $\lambda^*_S(\gamma) = \mathrm{SNR}_S^{-1}\,\gamma$. The closed-form expression for the optimal causal regularization $\lambda^*_C(\gamma) := \arg\inf_{\lambda \in (0, \infty)} R^C_\lambda$ is a root of a fourth-order polynomial and as such considerably intricate; for readability, we do not include it here. We investigate the behavior of the optimal causal and statistical regularization in Sections 4 and 5.

Figure 3: Limiting causal excess risk $R^C_0$ (without the constant $\tilde{\sigma}^2 + \omega^2$) of the min-norm interpolator as a function of the overparameterization ratio $\gamma$, for different causal signal strengths $S$ (curves for $S = 3$, $S = 0.5$, and $S = -1$). Dashed lines are the corresponding null risks $\omega^2$, which are outperformed more often as $S$ increases. For $\gamma < 1$, all three curves coincide.

Figure 4: Limiting bias-variance decomposition and causal excess risk of the min-norm interpolator (black) and the optimally regularized ridge regression (red) as functions of the overparameterization ratio $\gamma$. Crosses indicate finite-sample risks with $n = d/\gamma$ samples and $d = 300$. The finite-sample risks are well predicted by their theoretical limits.

3.2 Basic Behavior of the Limiting Risk

We start to analyze the results by assessing the basic behavior of the limiting causal risk. The causal risk of the null estimator $\hat{\beta} = 0$ serves as a natural baseline against which to evaluate the performance of the min-norm interpolator and the ridge regression estimators.

Regimes of the min-norm interpolator
Theorem 3.1 characterizes the limiting causal risk of the min-norm interpolator. Its behavior is controlled by the causal signal-to-noise ratio, which we defined as $\mathrm{SNR}_C = (1 - \zeta)\,\mathrm{SNR}_S$. However, as we will see later, the causal risk of the min-norm interpolator can be lower than the null risk only when $\zeta < 0.5$. To distinguish the regimes of the min-norm interpolator, it is therefore convenient to consider the closely related quantity $S = (1 - 2\zeta)\,\mathrm{SNR}_S$, which distinguishes between three different regimes (visualized in Figure 3). For $S > 1$, the causal signal dominates the noise and the min-norm interpolator can perform better than the null risk in both the under- and the overparameterized regime. For $0 \leq S \leq 1$, the causal signal is weaker than the noise; only the underparameterized regime can beat the null risk, whereas the overparameterized regime is always worse.
The previous two cases resemble the behavior of the statistical risk in the corresponding regimes of the statistical SNR. Contrary to the statistical risk, however, the causal risk admits a third regime, $S < 0$. In this case, the min-norm interpolator always performs worse than the null risk. Here, the causal signal $\langle \beta, \tilde{\beta} \rangle$ is dominated by the confounding signal $\langle \Gamma, \tilde{\beta} \rangle$, and interpolating the observational data overfits to the confounding.

Bias and variance
The bias-variance decomposition of the causal risk given in Theorem 3.1 is visualized in Figure 4 for the min-norm interpolator and the optimally ridge-regularized regressor. The figure also shows the causal risk based on finite samples from the model, which is in high agreement with our asymptotic results. We compare the causal risk to the corresponding statistical risk. First note that the causal and statistical variance terms coincide exactly, for both the min-norm interpolator and the ridge regressors. This is because the variance term of the squared loss depends only on the variance in the training data, but not on the target parameter $\beta$ or $\tilde{\beta}$; since the training data are the same for both causal and statistical learning, the variance terms trivially coincide. For the min-norm interpolator, as in the statistical case, the variance term is responsible for the double-descent behavior of the causal risk curve, because it explodes at the interpolation threshold $\gamma = 1$ and decreases in the overparameterized regime $\gamma > 1$.

In the statistical setting, the bias strictly increases in the overparameterized regime and, as a consequence, the best risk is always achieved in the underparameterized setting. In contrast, the causal bias of the min-norm interpolator can be decreasing in the overparameterized regime, and therefore the optimal causal risk can be achieved in the highly overparameterized regime $\gamma \to \infty$. However, this only happens in the regime $S < 0$, where the risk of the min-norm interpolator is always worse than the null risk. Figure 4 also shows the causal risk of the optimally regularized ridge regression estimator, which is trivially always below that of the min-norm interpolator. Similar to the statistical setting, the corresponding generalization curve does not exhibit the double-descent phenomenon.

There are qualitatively different reasons why regularization helps in statistical and in causal learning. For both statistical and causal learning, regularization decreases the shared variance, which corresponds to the finite-sample error. However, while the statistical bias always increases with regularization, the causal bias can actually decrease. This implies that regularization not only helps with the finite-sample error, but can also reduce the error due to confounding.

Higher confounding implies higher causal risk for all $\lambda$
So far, we have investigated the causal risk under a single causal model. Now we compare different causal models using the confounding strength measure $\zeta$ introduced in Section 2.1. The next proposition shows that $\zeta$ governs the hardness of causal learning from observational data: specifically, the causal risk of the ridge regressor for any $\lambda \in (0, \infty)$ increases as the causal model becomes more confounded. A proof is given in Appendix D.

Proposition 3.2 (Causal Risk Increases with Confounding Strength). Consider the family of causal models parameterized as in (1) that entail the same observational distribution. Let $C_1$ and $C_2$ be two such causal models with confounding strengths $\zeta_1$ and $\zeta_2$ and alignments $\eta_1$ and $\eta_2$ (defined in Theorem 3.1), respectively.
Then, for all $\lambda, \gamma \in (0, \infty)$,

  $\zeta_1 > \zeta_2$ and $\eta_1 \geq \eta_2$ $\implies$ $R^{C_1}_\lambda > R^{C_2}_\lambda$.

In particular, for any fixed $\eta$, the measure of confounding strength $\zeta$ establishes a strict ordering of causal models. This includes the ICM, under which $\eta = 0$.

4 Benign Causal Overfitting

A large number of recent works suggest that minimum-norm interpolators can be optimal for statistical generalization (Belkin et al., 2018; Belkin et al., 2019a; Muthukumar et al., 2020). This phenomenon is often referred to as benign overfitting. Moreover, the optimal statistical generalization may even be achieved for negative regularization $\lambda < 0$ (Kobak et al., 2020; Bartlett et al., 2020; Tsigler et al., 2020). It is unclear, however, whether such interpolators, which have implicit small-norm biases, can also be optimal when there is a shift between the training and test distributions. In particular, we ask: can the optimal causal regularization be 0 or even negative, that is, do we observe benign causal overfitting? To show that the optimal regularization can be negative, we simply show that the derivative of the causal risk at 0 is positive. We summarize our key findings in Theorem 4.1.

Theorem 4.1 (Optimal Regularization can be Negative). For any causal model parameterized as in (1), the following cases distinguish whether the min-norm interpolator is optimal or not.

1. For negative confounding strength $\zeta < 0$, the optimal causal regularization $\lambda^*_C$ can be 0 or even negative. A necessary and sufficient condition for $\lambda^*_C \leq 0$ depends on the difference between the causal and statistical signal-to-noise ratios and is given by

  $\mathrm{SNR}_C - \mathrm{SNR}_S \geq \gamma \max\{1, \gamma\} / (1 - \gamma)^2$.

2. For positive confounding strength $\zeta > 0$, the optimal causal regularization is positive, $\lambda^*_C > 0$, and $R^C_0 > R^C_{\lambda^*_C}$; hence regularization is beneficial. This includes the ICM.

In the highly overparameterized regime ($\gamma \to \infty$), the benefit of explicit regularization vanishes and both the causal and statistical risks of the ridge regression estimator converge to their corresponding null risks, independently of the regularization. We do not refer to this as benign overfitting. However, we can observe benign causal overfitting when the causal SNR is larger than the statistical SNR ($\zeta < 0$), which happens when the causal and statistical parameters are strongly aligned. This implies that the norm of the statistical parameter is smaller than the norm of the causal parameter. Consequently, statistical regressors are implicitly biased towards solutions of smaller norm, and causal learning exhibits self-induced regularization. Compare this to benign statistical overfitting, which happens for certain alignments between the regression parameter $\tilde{\beta}$ and the covariance matrix $\Sigma$. In our isotropic setting $\Sigma = I_d$, we can therefore never observe benign statistical overfitting, but we can observe benign causal overfitting. This phenomenon occurs in both the underparameterized and the overparameterized regime. The range of $\gamma$ for which the optimal causal regularization is negative increases with the dominance of the causal signal over the statistical signal. As $\gamma$ approaches the interpolation threshold, it becomes harder for the optimal causal regularization to be negative.

When the causal SNR is smaller than the statistical SNR ($\zeta > 0$), and in particular under the ICM ($0 < \zeta \leq 1$), the optimal causal regularization is strictly positive and the benefit of explicit regularization does not vanish. This can be the case even when the optimal statistical regularization vanishes.
To see this, consider the statistical risk in the highly underparameterized regime $\gamma \to 0$. In this regime, the benefit of explicit regularization vanishes and the min-norm interpolator indeed achieves the optimal statistical risk. The optimal causal regularization, in contrast, is given explicitly by $\lambda^*_C = \zeta/(1 - \zeta)$ for $0 \leq \zeta \leq 1$ and by $\lambda^*_C = \infty$ for $\zeta > 1$. This is strictly positive and increasing in the confounding strength $\zeta$, and in fact diverges as $\zeta$ approaches 1 (see Theorem 5.2).

5 On Optimal Regularization

In this section, we investigate two key questions that arise naturally in the context of our work. How does the optimal causal regularization $\lambda^*_C$ compare to the optimal statistical regularization $\lambda^*_S$? And how does the optimal causal regularization $\lambda^*_C$ depend on the confounding strength $\zeta$?

Optimal statistical vs. causal regularization
When the training and test distributions coincide, approaches such as cross-validation or information criteria (for example, AIC or BIC) can be used to estimate the regularization parameter for optimal out-of-sample generalization. However, choosing the correct regularization parameter for causal learning can be challenging without interventional data. To understand the optimal causal regularization, it is natural to compare it to the optimal statistical regularization, which can usually be estimated from data. Interestingly, our analysis reveals that when the confounding strength is positive ($\zeta > 0$), and in particular under the ICM, one needs to regularize more strongly for causal generalization than for statistical generalization. However, when the confounding strength is negative, that is, when the causal signal dominates the statistical signal, the optimal causal regularization $\lambda^*_C$ can actually be smaller than the optimal statistical regularization $\lambda^*_S$. We formally present this result in Theorem 5.1.

Theorem 5.1 (Optimal Statistical vs. Causal Regularization). For any causal model parameterized as in (1), the condition $\zeta = 0$ defines a phase transition for the optimal regularization via

  $\zeta < 0 \iff \lambda^*_C < \lambda^*_S$, $\zeta = 0 \iff \lambda^*_C = \lambda^*_S$, and $\zeta > 0 \iff \lambda^*_C > \lambda^*_S$.

In particular, under the ICM, the optimal causal regularization $\lambda^*_C$ is always strictly larger than the optimal statistical regularization $\lambda^*_S$, unless $\zeta = 0$, in which case they coincide.

Dependence on confounding strength $\zeta$
The problem of causal learning from observational data is a distribution shift problem, where the distribution of the training data is shifted from that of the test data. As discussed earlier in Proposition 3.2, the confounding strength measure $\zeta$ quantifies the strength of this distribution shift. Therefore, we expect the optimal causal regularization to increase with the confounding strength. Theorem 5.2 indeed confirms this intuition.

Theorem 5.2 (Increasing Confounding Strength Requires Stronger Regularization). Consider the family of causal models parameterized as in (1) that entail the same observational distribution. The optimal causal regularization $\lambda^*_C$ depends only on the confounding strength $\zeta$, and $\lambda^*_C$ is an increasing function of $\zeta$. More specifically, using $\varrho = -\mathrm{SNR}_S^{-1}\,\gamma \max\{1, \gamma\}/(1 - \gamma)^2$:

  $\varrho < \zeta < 1 \implies \lambda^*_C \in (0, \infty)$ with $\partial \lambda^*_C / \partial \zeta > 0$,

while $\lambda^*_C = 0$ if $\zeta \leq \varrho$ and $\lambda^*_C = \infty$ for $\zeta \geq 1$.
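As a numerical sanity check (our own sketch, reusing the hypothetical limiting_causal_risk from Section 3.1), one can minimize the limiting causal risk over a grid of $\lambda$ and observe both the phase transition of Theorem 5.1 and the small-$\gamma$ formula $\lambda^*_C = \zeta/(1 - \zeta)$.

```python
import numpy as np

# Fix the observational distribution (r2_tilde, sigma_tilde2) and sweep the
# confounding strength zeta with eta = 0, so that w2 = zeta * r2_tilde and
# r2 = (1 - zeta) * r2_tilde. A small gamma mimics the regime gamma -> 0.
gamma, r2_tilde, sigma_tilde2 = 0.05, 5.0, 1.0     # SNR_S = 5
lam_grid = np.linspace(1e-4, 4.0, 40000)
lam_stat = gamma * sigma_tilde2 / r2_tilde         # closed form: gamma / SNR_S

for zeta in [0.0, 0.25, 0.5, 0.75]:
    w2, r2 = zeta * r2_tilde, (1.0 - zeta) * r2_tilde
    risks = [limiting_causal_risk(l, gamma, r2, w2, 0.0, sigma_tilde2) for l in lam_grid]
    lam_causal = lam_grid[int(np.argmin(risks))]
    # Theorem 5.1: lam_causal > lam_stat for zeta > 0 and lam_causal = lam_stat
    # at zeta = 0; for gamma -> 0, lam_causal approaches zeta / (1 - zeta).
    print(f"zeta={zeta:.2f}: lam_C={lam_causal:.3f}, "
          f"zeta/(1-zeta)={zeta / (1.0 - zeta):.3f}, lam_S={lam_stat:.4f}")
```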
6 Summary and Extensions

We characterize the role of explicit regularization for causal learning from observational data by computing the asymptotic risk of ridge-regularized regressors and the min-norm interpolator (Theorem 3.1). Under the principle of independent causal mechanisms (ICM), we find that causal learning requires stronger regularization than statistical learning (Theorem 5.1). A practical implication is that the regularization parameter for causal learning should be chosen larger than what is suggested by cross-validation; we can precisely state how much larger based on an estimate of the confounding strength (Janzing et al., 2017; Janzing et al., 2018). Beyond the ICM, we show that strong alignments between the causal and statistical parameters can cause self-induced regularization and lead to benign causal overfitting (Theorem 4.1).

One could consider generalizing our assumptions: arbitrary covariances, shifts in the marginal distributions of the covariates, soft interventions, more complex hypothesis classes, or non-linear causal relationships. Since the linear model already exhibits rich behavior, we focus in this paper on understanding the simple setting. Below, we briefly discuss extensions of our analysis to causal learning under soft interventions, non-linearity, and non-Gaussianity.

Soft interventions
It is not always appropriate to consider causal learning under hard interventions; instead, it is often of interest to consider soft interventions. In these settings, the qualitative statements derived from our analysis still hold. To illustrate this, we consider the class of shift interventions, where the structural dependence of the covariates $x$ is not destroyed as in the case of hard interventions, but the observed covariates are merely perturbed (i.e., interventions of the form $\mathrm{do}(x := x + \nu)$). Then it turns out that

  Causal risk (soft) = Causal risk (hard) + Statistical risk

(a derivation sketch is given at the end of this section). From our results, it then follows that under the ICM,

  $\lambda^*_{\text{statistical}} \leq \lambda^*_{\text{causal, soft}} \leq \lambda^*_{\text{causal, hard}}$.

This also supports our intuition, since under soft interventions we typically aim to achieve a trade-off between statistical and causal predictability. We include a complete analysis under shift interventions in Appendix F.

Extensions to non-linear models
It is feasible to extend the analysis to structural causal models that arise in a reproducing kernel Hilbert space (RKHS) corresponding to a positive definite kernel (i.e., where the best statistical model $\tilde{f}$ and the best causal model $f$ are functions in some RKHS). There are two major technical challenges in deriving the theoretical analysis in such non-linear settings. Both are beyond what can be done in this paper and are left for future work, but we briefly outline them below.

1. Extend the definition of the confounding strength $\zeta$ beyond the linear setting. Since such a definition is non-trivial already in the linear setting, it is challenging to meaningfully generalize it to the non-linear setting. However, under non-linear causal models in the RKHS, we can naturally extend this definition by replacing the Euclidean norms with functional norms in the RKHS. Generalizing the analysis beyond this setting would require further careful consideration.

2. Derive limiting expressions for the causal risk of regularized regressors in a non-linear hypothesis class. In the case of kernel regression, this would still be feasible via recent random matrix theory results [27]. By optimizing the limiting expressions with respect to the regularization parameter, one can obtain the parameter that achieves the optimal causal risk and subsequently identify the relationship between the optimal causal regularization and the confounding strength.
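For completeness, here is a short derivation sketch of the shift-intervention identity stated under "Soft interventions" above. It is our own sketch, not the paper's Appendix F, and it assumes that the shift $\nu \sim N(0, \Sigma)$ is drawn independently of $(z, \varepsilon)$; the identity then holds up to additive constants that do not depend on $\hat{\beta}$.

```latex
% Under do(x := x + nu), the structural equations give x = Mz + nu and
% y = x^T beta + z^T alpha + eps, so for any linear predictor beta_hat:
\begin{align*}
R_{\mathrm{soft}}(\hat\beta)
 &= \mathbb{E}\bigl(x^{\top}\hat\beta - y\bigr)^2
  = \mathbb{E}\bigl(z^{\top}(M^{\top}(\hat\beta-\beta)-\alpha)
    + \nu^{\top}(\hat\beta-\beta) - \varepsilon\bigr)^2 \\
 &= \underbrace{\|\hat\beta-\tilde\beta\|_{\Sigma}^2
    + \|\alpha\|^2 - \|\Gamma\|_{\Sigma}^2}_{\text{from } z}
  + \underbrace{\|\hat\beta-\beta\|_{\Sigma}^2}_{\text{from } \nu} + \sigma^2 \\
 &= \underbrace{\|\hat\beta-\tilde\beta\|_{\Sigma}^2 + \tilde\sigma^2}_{R_S(\hat\beta)}
  + \underbrace{\|\hat\beta-\beta\|_{\Sigma}^2}_{\text{hard-interventional causal excess risk}},
\end{align*}
% where the z-term is split using alpha = M^T Gamma + (I - M^+ M) alpha and the
% orthogonality of range(M^T) and ker(M).
```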
Beyond Gaussianity
The analysis can be extended beyond the Gaussian setting by considering random variables generated by finite mixtures of Gaussians. Due to the universality phenomenon in the high-dimensional limit, we believe that our limiting expressions (and the qualitative messages derived from them) would be rather robust to shifts in the marginal distribution, as long as moments of order $4 + \delta$ for some $\delta > 0$ are bounded. We conducted experiments to verify this claim; the corresponding results can be found in Appendix G. They show that for distributions with finite fourth moments, the finite-sample risks of the min-norm interpolator and the causally optimally regularized ridge regressor closely match the theoretically derived asymptotic risks.

Acknowledgments and Disclosure of Funding

This work has been supported by the German Research Foundation through the Cluster of Excellence "Machine Learning - New Perspectives for Science" (EXC 2064/1: 390727645) and the BMBF Tübingen AI Center (FKZ: 01IS18039A). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Leena Chennuru Vankadara and Luca Rendsburg.

References

Angrist, Joshua D and Alan B Krueger (1991). "Does compulsory school attendance affect schooling and earnings?" The Quarterly Journal of Economics.
Bai, Zhidong and Jack W Silverstein (2010). Spectral Analysis of Large Dimensional Random Matrices. Springer.
Bartlett, Peter L et al. (2020). "Benign overfitting in linear regression". Proceedings of the National Academy of Sciences.
Belkin, Mikhail (2021). "Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Acta Numerica.
Belkin, Mikhail, Siyuan Ma, and Soumik Mandal (2018). "To understand deep learning we need to understand kernel learning". International Conference on Machine Learning (ICML).
Belkin, Mikhail, Alexander Rakhlin, and Alexandre B Tsybakov (2019a). "Does data interpolation contradict statistical optimality?" International Conference on Artificial Intelligence and Statistics (AISTATS).
Belkin, Mikhail et al. (2019b). "Reconciling modern machine-learning practice and the classical bias-variance trade-off". Proceedings of the National Academy of Sciences.
Blanchet, Jose, Yang Kang, and Karthyek Murthy (2019). "Robust Wasserstein profile inference and applications to machine learning". Journal of Applied Probability.
Byrd, Jonathon and Zachary Lipton (2019). "What is the effect of importance weighting in deep learning?" International Conference on Machine Learning (ICML).
Dicker, Lee H (2016). "Ridge regression and asymptotic minimax estimation over spheres of growing dimension". Bernoulli.
Dobriban, Edgar and Stefan Wager (2018). "High-dimensional asymptotics of prediction: Ridge regression and classification". The Annals of Statistics.
Donhauser, Konstantin et al. (2021). "Interpolation can hurt robust generalization even when there is no noise". Advances in Neural Information Processing Systems (NeurIPS).
Feldman, Vitaly (2020). "Does learning require memorization? A short tale about a long tail". Symposium on Theory of Computing (STOC).
Gao, Rui, Xi Chen, and Anton J Kleywegt (2017). "Distributional robustness and regularization in statistical learning". arXiv preprint arXiv:1712.06050.
Gulrajani, Ishaan and David Lopez-Paz (2021). "In Search of Lost Domain Generalization". International Conference on Learning Representations (ICLR).
Hachem, Walid, Philippe Loubaton, and Jamal Najim (2007). "Deterministic equivalents for certain functionals of large random matrices". The Annals of Applied Probability.
Hastie, Trevor et al. (2022). "Surprises in high-dimensional ridgeless least squares interpolation". The Annals of Statistics.
Janzing, Dominik (2019). "Causal Regularization". Advances in Neural Information Processing Systems (NeurIPS).
Janzing, Dominik and Bernhard Schölkopf (2010). "Causal Inference Using the Algorithmic Markov Condition". IEEE Transactions on Information Theory.
Janzing, Dominik and Bernhard Schölkopf (2017). "Detecting Confounding in Multivariate Linear Models via Spectral Analysis". Journal of Causal Inference.
Janzing, Dominik and Bernhard Schölkopf (2018). "Detecting non-causal artifacts in multivariate linear regression models". International Conference on Machine Learning (ICML).
Kobak, Dmitry, Jonathan Lomond, and Benoit Sanchez (2020). "The Optimal Ridge Penalty for Real-world High-dimensional Data Can Be Zero or Negative due to the Implicit Ridge Regularization". Journal of Machine Learning Research (JMLR).
Kuhn, Daniel et al. (2019). "Wasserstein distributionally robust optimization: Theory and applications in machine learning". Operations Research & Management Science in the Age of Analytics.
Lemeire, Jan and Dominik Janzing (2013). "Replacing Causal Faithfulness with Algorithmic Independence of Conditionals". Minds and Machines.
Liang, Tengyuan and Alexander Rakhlin (2020). "Just interpolate: Kernel ridgeless regression can generalize". The Annals of Statistics.
Marčenko, Vladimir A and Leonid Andreevich Pastur (1967). "Distribution of eigenvalues for some sets of random matrices". Mathematics of the USSR-Sbornik.
Mei, Song and Andrea Montanari (2021). "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve". Communications on Pure and Applied Mathematics.
Muthukumar, Vidya et al. (2020). "Harmless interpolation of noisy data in regression". IEEE Journal on Selected Areas in Information Theory.
Pearl, Judea (2009). "Causal inference in statistics: An overview". Statistics Surveys.
Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf (2017). Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.
Rothenhäusler, Dominik et al. (2021). "Anchor regression: Heterogeneous data meet causality". Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Rubio, Francisco and Xavier Mestre (2011). "Spectral convergence for a general class of random matrices". Statistics & Probability Letters.
Sagawa, Shiori et al. (2020). "An investigation of why overparameterization exacerbates spurious correlations". International Conference on Machine Learning (ICML).
Shafieezadeh Abadeh, Soroosh, Peyman M Mohajerin Esfahani, and Daniel Kuhn (2015). "Distributionally robust logistic regression". Advances in Neural Information Processing Systems (NeurIPS).
Shafieezadeh-Abadeh, Soroosh, Daniel Kuhn, and Peyman Mohajerin Esfahani (2019). "Regularization via mass transportation". Journal of Machine Learning Research (JMLR).
Silverstein, Jack W (1995). "Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices". Journal of Multivariate Analysis.
Tsigler, Alexander and Peter L Bartlett (2020). "Benign overfitting in ridge regression". arXiv preprint arXiv:2009.14286.
Vankadara, Leena Chennuru et al. (2021). "Causal Forecasting: Generalization Bounds for Autoregressive Models". arXiv preprint arXiv:2111.09831.
Xu, Huan, Constantine Caramanis, and Shie Mannor (2009). "Robustness and Regularization of Support Vector Machines". Journal of Machine Learning Research (JMLR).
Zhang, Chiyuan et al. (2021). "Understanding deep learning (still) requires rethinking generalization". Communications of the ACM.
Zhu, Shixiang et al. (2020). "Distributionally Robust Weighted k-Nearest Neighbors". arXiv preprint arXiv:2006.04004.