# Causal Regularization

Dominik Janzing
Amazon Research, Tübingen, Germany
janzind@amazon.com

**Abstract.** We argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also help in getting better causal models. We first consider a multi-dimensional variable linearly influencing a target variable, with some multi-dimensional unobserved common cause, where the confounding effect can be decreased by keeping the penalizing term in Ridge and Lasso regression even in the population limit. The reason is a close analogy between overfitting and confounding observed for our toy model. In the case of overfitting, we can choose regularization constants via cross-validation; here, instead, we choose the regularization constant by first estimating the strength of confounding, which yielded reasonable results for simulated and real data. Further, we show a causal generalization bound which states (subject to our particular model of confounding) that the error made by interpreting any non-linear regression as a causal model can be bounded from above whenever functions are taken from a not too rich class.

## 1 Introduction

Predicting a scalar target variable $Y$ from a $d$-dimensional predictor $X := (X_1, \dots, X_d)$ via appropriate regression models is among the classical problems of machine learning [1]. In the standard supervised learning scenario, some finite number of observations, drawn independently from an unknown but fixed joint distribution $P_{Y,X}$, is used to infer the $Y$-values corresponding to unlabelled $X$-values. To solve this task, regularization is known to be crucial for obtaining regression models that generalize well from training to test data [2].

Deciding whether such a regression model admits a causal interpretation is, however, challenging. Even if a causal influence from $Y$ to $X$ can be excluded (e.g. by time order), the statistical relation between $X$ and $Y$ cannot necessarily be attributed to the influence of $X$ on $Y$. Instead, it could be due to possible common causes, also called confounders. For the case where common causes are known and observed, there is a huge number of techniques to infer the causal influence,[^1] e.g. [3], addressing different challenges, for instance high-dimensional confounders [4] or settings where some variables other than the common causes are observed [5], to mention just a few. If common causes are not known, the task of inferring the influence of $X$ on $Y$ gets incredibly hard. Given observations from further variables other than $X$ and $Y$, conditional independences may help to detect or disprove the existence of common causes [5], and so-called instrumental variables may admit the identification of causal influence [6].

[^1]: Often for $d = 1$ and with a binary treatment variable $X$.

Here we consider the case where only observations from $X$ and $Y$ are given. In this case, naively interpreting the regression model as a causal model is a natural baseline. We show that strong regularization increases the chances that the regression model contains some causal truth. We are aware of the risk that this result could be mistaken as a justification to ignore the hardness of the problem and blindly infer causal models by strong regularization. Our goal is, instead, to inspire a
discussion on to what extent causal modelling should regularize even in the infinite sample limit, due to some analogies between generalizing across samples from the same distribution and generalizing from observational to interventional distributions. These analogies appear in our models of confounding, while they need not apply to other confounding scenarios.

The idea is not entirely novel, since it is tightly linked to several ideas that have been floating around in the machine learning community for a while. It is believed (and can be proven subject to appropriate model assumptions) that finding statistical models which generalize well across different background conditions is closely linked to finding causal models [7, 8, 9, 10].[^2] It is then natural to also believe that generalizing across different environments is related to generalizing across different samples. Accordingly, [12] describes regularization techniques for linear regression that help generalizing across certain shift perturbations. Here we describe a scenario for which the analogy between regularizing against overfitting and regularizing against confounding gets as tight as possible, in the sense that the same regularization helps for both purposes. Due to this theoretical focus, we prefer to work with the simplest non-trivial scenario rather than looking for the most relevant or most realistic case.

[^2]: After submission I became aware of a preprint with the same title as mine, [11], where regularizers are constructed that are tailored particularly for causal features.

**Scenario 1: inferring a linear statistical model.** To explain the idea, we consider a linear statistical relation between $X$ and $Y$:

$$Y = Xa + E, \qquad (1)$$

where $a$ is a column vector in $\mathbb{R}^d$ and $E$ is an uncorrelated unobserved noise variable, i.e., $\Sigma_{XE} = 0$. Let $\hat{Y}$ denote the column vector of centred, renormalized observations $y_i$ of $Y$, i.e., with entries $(y_i - \frac{1}{n}\sum_{i=1}^n y_i)/\sqrt{n-1}$, and similarly let $\hat{E}$ denote the centred, renormalized values of $E$. Likewise, let $\hat{X}$ denote the $n \times d$ matrix whose $j$-th column contains the centred, renormalized observations of $X_j$, and let $\hat{X}^{-1}$ denote its (Moore–Penrose) pseudoinverse. To avoid overfitting, the ordinary least squares estimator[^3]

$$\hat{a} := \operatorname{argmin}_a \|\hat{Y} - \hat{X}a\|^2 = \hat{X}^{-1}\hat{Y} = a + \hat{X}^{-1}\hat{E} \qquad (2)$$

is replaced with the Ridge and Lasso estimators

$$\hat{a}^{\mathrm{ridge}}_\lambda := \operatorname{argmin}_a \left\{\lambda \|a\|_2^2 + \|\hat{Y} - \hat{X}a\|^2\right\} \qquad (3)$$

$$\hat{a}^{\mathrm{lasso}}_\lambda := \operatorname{argmin}_a \left\{\lambda \|a\|_1 + \|\hat{Y} - \hat{X}a\|^2\right\}, \qquad (4)$$

where $\lambda$ is a regularization parameter [13]. So far, we have only described the standard scenario of inferring properties of the conditional $P_{Y|X}$ from finite observations $\hat{X}, \hat{Y}$, without any causal semantics.

[^3]: Here we have, for simplicity, assumed $n > d$.

**Scenario 2: inferring a linear causal model.** We now modify the scenario in three respects. First, we assume that $E$ and $X$ in (1) correlate due to some unobserved common cause. Second, we interpret (1) in a causal way, in the sense that setting $X$ to $x$ lets $Y$ be distributed according to $xa + E$. Using Pearl's do-notation (a crucial concept for formalizing causality) [5], this can be phrased as

$$Y|_{do(X=x)} = xa + E \;\neq\; Y|_{X=x}, \qquad (5)$$

where we do not have equality because $E$ needs to be replaced with $E|_{X=x}$ for the observational conditional. Third, we assume the infinite sample limit, where $P_{X,Y}$ is known. We still want to infer the vector $a$ because we are interested in causal statements, but regressing $Y$ on $X$ yields a different vector $a'$ instead (the population least squares solution, see (7) below), which describes the observational conditional on the right-hand side of (5).

Conceptually, scenarios 1 and 2 deal with two entirely different problems: inferring $P_{Y|X=x}$ from finite samples $(\hat{X}, \hat{Y})$ versus inferring the interventional conditional $P_{Y|do(X=x)}$ from the observational distribution $P_{Y,X}$.
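To see the confounding error of scenario 2 concretely, the following minimal numpy sketch anticipates the generating model of Section 2 (Figure 1, right): sources $Z \sim N(0, I)$, $X = ZM$, $E = Zc$. All dimensions and distributions are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = ell = 30                        # illustrative dimensions

M = rng.standard_normal((ell, d))   # mixing matrix: X = Z M
a = rng.standard_normal(d)          # true causal coefficient vector
c = rng.standard_normal(ell)        # confounding vector: E = Z c

# With Z ~ N(0, I), the population covariances follow directly:
Sigma_XX = M.T @ M                  # cov(X)
Sigma_XE = M.T @ c                  # cov(X, E), non-zero due to the common cause Z
Sigma_XY = Sigma_XX @ a + Sigma_XE  # cov(X, Y) for Y = Xa + E

# Population least squares returns a + Sigma_XX^{-1} Sigma_XE rather than a:
a_prime = np.linalg.solve(Sigma_XX, Sigma_XY)
print(np.linalg.norm(a_prime - a))  # confounding error, non-zero without any sampling
```

The deviation of `a_prime` from `a` persists in the population limit and plays exactly the role that the overfitting term $\hat{X}^{-1}\hat{E}$ plays in scenario 1, which is the analogy made precise next.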
Nevertheless, both problems amount to inferring the vector $a$, and in both scenarios the error term $\hat{X}^{-1}\hat{E}$ causes failure of ordinary least squares regression. Only the reason why this term is non-zero differs: in the first scenario it is a finite sample effect, while it results from confounding in the second one. The idea of the present paper is simply that standard regularization techniques do not care about the origin of this error term. Therefore, they can temper the impact of confounding in the same way as they help to avoid overfitting finite data.

*Figure 1: Left: In scenario 1 ($Y = Xa + E$), the empirical correlations between $X$ and $E$ are only finite sample effects. Right: In scenario 2 ($X = ZM$, $E = Zc$ with $Z \sim N(0, I)$, and $Y = Xa + E$), $X$ and $E$ are correlated due to their common cause $Z$. We sample the structural parameters $M$ and $c$ from distributions in a way that entails a simple analogy between scenarios 1 and 2.*

The paper is structured as follows. Section 2 fleshes out scenarios 1 and 2 in a way that entails that the regression error follows the same distribution in both. Section 3 proposes a way to determine the regularization parameter in scenario 2 by estimating the strength of confounding via a method proposed by [14]. Section 4 describes some empirical results. Section 5 develops a modified statistical learning theory stating that regression models from not too rich function classes generalize from statistical to causal statements.

## 2 Analogy between overfitting and confounding

The reason why our scenario 2 only considers the infinite sample limit of confounding is that mixing finite sample effects and confounding significantly complicates the theoretical results; the supplement sketches the complications of this case. For a concise description of the population case, we consider the Hilbert space $\mathcal{H}$ of centred random variables (on some probability space without further specification) with finite variance. The inner product is given by the covariance, e.g.,

$$\langle X_i, X_j \rangle := \mathrm{cov}(X_i, X_j). \qquad (6)$$

Accordingly, we can interpret $X$ as an operator[^4] $\mathbb{R}^d \to \mathcal{H}$ via $(b_1, \dots, b_d) \mapsto \sum_j b_j X_j$. Then the population version of (2) reads

$$a' := \operatorname{argmin}_a \|Y - Xa\|^2 = X^{-1}Y = a + X^{-1}E, \qquad (7)$$

where the squared length is induced by the inner product (6), i.e., it is simply the variance. Extending the previous notation, $X^{-1}$ now denotes the pseudoinverse of the operator $X$ [15]. To see that $X^{-1}E$ is only non-zero when $X$ and $E$ are correlated, it is helpful to rewrite it as

$$X^{-1}E = \Sigma_{XX}^{-1}\Sigma_{XE}, \qquad (8)$$

where we have assumed $\Sigma_{XX}$ to be invertible (see the supplement for a proof). One can easily show that the empirical covariance matrix $\widehat{\Sigma}_{XE}$ causing the overfitting error is distributed according to $N(0, \widehat{\Sigma}_{XX}\sigma_E^2/n)$.[^5] To get the desired analogy between scenarios 1 and 2, we just need a generating model for confounders for which $\Sigma_{XE}$ is distributed according to $N(0, \gamma\Sigma_{XX})$ for some parameter $\gamma$. The independent source model for confounding described in [14] turned out to satisfy this requirement after some further specification.

[^4]: Readers not familiar with operator theory may read all our operators as matrices with huge $n$, without losing any essential insights, except for the cost of having to interpret all equalities as approximate equalities. To facilitate this way of reading, we will use $(\cdot)^T$ also for the adjoint of operators in $\mathcal{H}$, although $(\cdot)^*$ or $(\cdot)^\dagger$ is common.

[^5]: $E \sim N(0, \sigma_E^2)$ and thus $\hat{e}_j \sim N(0, \sigma_E^2/n)$, which implies $\hat{E} \sim N(0, \sigma_E^2 I/n)$ and thus $\widehat{\Sigma}_{XE} = \hat{X}^T\hat{E} \sim N(0, \hat{X}^T\hat{X}\sigma_E^2/n) = N(0, \widehat{\Sigma}_{XX}\sigma_E^2/n)$.

**Generating model for scenario 1.** The following procedure generates samples according to the DAG in Figure 1, left:
1. Draw $n$ observations of $(X_1, \dots, X_d)$ independently from $P_X$.
2. Draw samples of $E$ independently from $P_E$.
3. Draw the vector $a$ of structure coefficients from some distribution $P_a$.
4. Set $\hat{Y} := \hat{X}a + \hat{E}$.

**Generating model for scenario 2.** To generate random variables according to the DAG in Figure 1, right, we assume that both variables $X$ and $E$ are generated from the same set of independent sources by applying a random mixing matrix or a random mixing vector, respectively. Given an $\ell$-dimensional random vector $Z$ of sources with distribution $N(0, I)$:

1. Choose an $\ell \times d$ mixing matrix $M$ and set $X := ZM$.
2. Draw $c \in \mathbb{R}^\ell$ from some distribution $P_c$ and set $E := Zc$.
3. Draw the vector $a$ of structure coefficients from some distribution $P_a$.
4. Set $Y := Xa + E$.

We then obtain:

**Theorem 1 (population and empirical covariances).** Let the number $\ell$ of sources in scenario 2 be equal to the number $n$ of samples in scenario 1, and let $P_M$ coincide with the distribution of sample matrices $\hat{X}$ induced by $P_X$. Let, moreover, $P_c$ in scenario 2 coincide with the distribution of $\hat{E}$ induced by $P_E$ in scenario 1, and let $P_a$ be the same in both scenarios. Then the joint distribution of $a, \Sigma_{XX}, \Sigma_{XY}, \Sigma_{XE}$ in scenario 2 coincides with the joint distribution of $a, \widehat{\Sigma}_{XX}, \widehat{\Sigma}_{XY}, \widehat{\Sigma}_{XE}$ in scenario 1.

*Proof.* We have $\widehat{\Sigma}_{XX} = \hat{X}^T\hat{X}$ and $\Sigma_{XX} = X^TX = M^TZ^TZM = M^TM$, where we have used that $Z$ has full rank due to the uncorrelatedness of its components. Likewise, $\widehat{\Sigma}_{XE} = \hat{X}^T\hat{E}$ and $\Sigma_{XE} = (ZM)^TZc = M^Tc$. Further, $\widehat{\Sigma}_{XY} = \hat{X}^T\hat{X}a + \widehat{\Sigma}_{XE}$ and $\Sigma_{XY} = X^TXa + \Sigma_{XE}$. The statement then follows from the correspondences $M \leftrightarrow \hat{X}$, $c \leftrightarrow \hat{E}$, $a \leftrightarrow a$. □

Theorem 1 provides a canonical way to transfer any Bayesian approach for inferring $a$ from $\widehat{\Sigma}_{XX}, \widehat{\Sigma}_{XY}$ in scenario 1 to inferring $a$ from $\Sigma_{XX}, \Sigma_{XY}$ in scenario 2. It is known [16], for instance, that (3) and (4) maximize the posterior $p(a|\hat{X}, \hat{Y})$ for the priors

$$p_{\mathrm{ridge}}(a) \propto \exp\left(-\frac{1}{2\tau^2}\|a\|_2^2\right), \qquad p_{\mathrm{lasso}}(a) \propto \exp\left(-\frac{1}{2\tau^2}\|a\|_1\right), \qquad (9)$$

respectively, if $E \sim N(0, \sigma_E^2)$ and $\lambda = \sigma_E^2/\tau^2$. Some algebra shows that the only information from $\hat{X}$ and $\hat{Y}$ that matters is given by $\widehat{\Sigma}_{XX}$ and $\widehat{\Sigma}_{XY}$ (see the supplement). Therefore, (3) and (4) also maximize the posterior $p(a|\widehat{\Sigma}_{XX}, \widehat{\Sigma}_{XY})$ and, employing Theorem 1, the population versions of Ridge and Lasso,

$$a^{\mathrm{ridge}}_\lambda := \operatorname{argmin}_a \left\{\lambda\|a\|_2^2 + \|Y - Xa\|^2\right\} \qquad (10)$$

$$a^{\mathrm{lasso}}_\lambda := \operatorname{argmin}_a \left\{\lambda\|a\|_1 + \|Y - Xa\|^2\right\}, \qquad (11)$$

maximize $p(a|\Sigma_{XX}, \Sigma_{XY})$ after substituting all the priors accordingly. These population versions, however, make it apparent that we now face the problem that selecting $\lambda$ by cross-validation would be pointless, since $\lambda = 0$ would have the best cross-sample performance. Instead, we would need to know the strength of confounding to choose the optimal $\lambda$.
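For Ridge, the population version (10) has a closed form in terms of the covariance matrices, which makes the role of $\lambda$ explicit. A minimal sketch (the Lasso analogue (11) has no closed form and needs an iterative solver):

```python
import numpy as np

def population_ridge(Sigma_XX, Sigma_XY, lam):
    """Minimizer of lam * ||a||_2^2 + ||Y - Xa||^2, i.e. eq. (10):
    a_lam = (Sigma_XX + lam * I)^{-1} Sigma_XY."""
    d = Sigma_XX.shape[0]
    return np.linalg.solve(Sigma_XX + lam * np.eye(d), Sigma_XY)
```

Setting $\lambda = 0$ recovers the confounded vector $a'$ from (7), while $\|a_\lambda\|$ shrinks monotonically towards $0$ as $\lambda$ grows; choosing $\lambda$ therefore amounts to deciding how much of $\|a'\|$ one attributes to confounding, which is exactly what the next section estimates.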
## 3 Choosing the regularization constant by estimating confounding

The only approaches we are aware of that directly estimate the strength of confounding[^6] from $P_{X,Y}$ alone are given by [19, 14]. The first paper considers only one-dimensional confounders, which is complementary to our confounding scenario; we will use the approach from the second paper because it perfectly matches our scenario 2 in Section 2 with fixed $M$.

[^6]: [17] constructs confounders for linear non-Gaussian models and [18] infers confounders of univariate $X, Y$ subject to the additive noise assumption.

[14] uses the slightly stronger assumption that $a$ and $c$ are drawn from $N(0, \sigma_a^2 I)$ and $N(0, \sigma_c^2 I)$, respectively. We briefly rephrase the method. Using $a'$ from (7) (i.e. the population version of the ordinary least squares solution), they define the confounding strength by

$$\beta := \frac{\|a' - a\|^2}{\|a' - a\|^2 + \|a\|^2} \in [0, 1]. \qquad (12)$$

It attains $0$ iff $a'$ coincides with $a$, and $1$ iff $a = 0$, i.e., when the correlations between $X$ and $Y$ are entirely caused by confounding. The idea behind estimating $\beta$ is that the unregularized regression vector follows the distribution $N(0, \sigma_a^2 I + \sigma_c^2 (M^TM)^{-1})$, which results from $a' = a + X^{-1}E = a + M^{-1}c$ (see the proof of Theorem 1 in [14]). The quotient $\sigma_c^2/\sigma_a^2$ can then be inferred from the direction of $a'$ (intuitively: the more $a'$ concentrates in small-eigenvalue eigenspaces of $\Sigma_{XX} = M^TM$, the larger this quotient). Using some approximations that hold for large $d$, $\beta$ can be estimated from $(\Sigma_{XX}, a')$. Further, the approximation $\|a' - a\|^2 + \|a\|^2 \approx \|a'\|^2$ from [19] yields $\|a\|^2 \approx (1 - \beta)\|a'\|^2$. Hence, the length of the true causal regression vector $a$ can be estimated from the length of $a'$, and we can adjust $\lambda$ such that the length of $\hat{a}_\lambda$ coincides with this estimate.

Since the estimation is based on a Gaussian (and not a Laplacian) prior for $a$, it seems more appropriate to combine it with Ridge regression than with Lasso. However, due to known advantages of Lasso[^7] (e.g. that sparse solutions yield more interpretable results), we also use Lasso. After all, the qualitative statement that strong confounding amounts to vectors $a'$ that tend to concentrate in low-eigenvalue subspaces of $\Sigma_{XX}$ still holds true as long as $c$ is chosen from an isotropic prior.

[^7]: [20] claims, for instance: "If ℓ2 was the norm of the 20th century, then ℓ1 is the norm of the 21st century ... OK, maybe that statement is a bit dramatic, but at least so far, there's been a frenzy of research involving the ℓ1 norm and its sparsity-inducing properties."

Confounding estimation via the algorithm of [14] requires the problematic decision of whether the variables $X_j$ should be rescaled to variance $1$. If different $X_j$ refer to different units, there is no other straightforward choice of scale. It is not recommended, however, to always normalize $X_j$: if $\Sigma_{XX}$ is diagonal, for instance, the method would be entirely spoiled by normalization. The difficulty of deciding whether data should be renormalized beforehand is inherited by our algorithm, which reads as follows (a compact implementation sketch is given below the listing):

**Algorithm ConCorr**
1. Input: i.i.d. samples from $P(X, Y)$.
2. Rescale each $X_j$ to variance $1$ if desired.
3. Compute the empirical covariance matrices $\widehat{\Sigma}_{XX}$ and $\widehat{\Sigma}_{XY}$.
4. Compute the ordinary least squares regression vector $\hat{a} := \widehat{\Sigma}_{XX}^{-1}\widehat{\Sigma}_{XY}$.
5. Compute an estimator $\hat{\beta}$ of the confounding strength $\beta$ via the algorithm in [14] from $\widehat{\Sigma}_{XX}$ and $\hat{a}$, and estimate the squared length of $a$ via
$$\|a\|^2 \approx (1 - \hat{\beta})\|\hat{a}\|^2. \qquad (13)$$
6. Find $\lambda$ such that the squared length of $\hat{a}^{\mathrm{ridge/lasso}}_\lambda$ coincides with the right-hand side of (13).
7. Compute the Ridge or Lasso regression model using this value of $\lambda$.
8. Output: regularized regression vector $\hat{a}^{\mathrm{ridge/lasso}}_\lambda$.
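The following sketch implements these steps with Ridge; `estimate_beta` is a placeholder for the spectral confounding-strength estimator of [14], which we do not reproduce here.

```python
import numpy as np

def con_corr_ridge(X, y, estimate_beta, lambdas=np.logspace(-4, 6, 400)):
    """Sketch of ConCorr with Ridge; `estimate_beta` stands in for the
    confounding-strength estimator of [14]."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                         # centre the data
    yc = y - y.mean()
    Sigma_XX = Xc.T @ Xc / n                        # step 3: empirical covariances
    Sigma_XY = Xc.T @ yc / n
    a_hat = np.linalg.solve(Sigma_XX, Sigma_XY)     # step 4: OLS regression vector
    beta_hat = estimate_beta(Sigma_XX, a_hat)       # step 5: confounding strength
    target = (1.0 - beta_hat) * np.sum(a_hat ** 2)  # step 5: eq. (13)
    # step 6: ||a_lam||^2 decreases monotonically in lam, so take the first
    # lambda on an increasing grid whose solution reaches the target length
    a_lam = a_hat
    for lam in lambdas:
        a_lam = np.linalg.solve(Sigma_XX + lam * np.eye(d), Sigma_XY)
        if np.sum(a_lam ** 2) <= target:
            break
    return a_lam                                    # steps 7-8: regularized vector
```

For the Lasso variant, step 6 can be carried out by scanning $\lambda$ with an iterative solver such as `sklearn.linear_model.Lasso` in place of the closed-form Ridge solution.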
## 4 Experiments

### 4.1 Simulated data

For fixed values $d = \ell = 30$, we generate one mixing matrix $M$ in each run by drawing its entries from the standard normal distribution. In each run we generate $n = 1000$ instances of the $\ell$-dimensional standard normal random vector $Z$ and compute the $X$-values by $X = ZM$. Afterwards, we draw the entries of $c$ and $a$ from $N(0, \sigma_c^2)$ and $N(0, \sigma_a^2)$, respectively, after choosing $\sigma_a$ and $\sigma_c$ from the uniform distribution on $[0, 1]$. Finally, we compute the values of $Y$ via $Y = Xa + Zc + E$, where $E$ is random noise drawn from $N(0, \sigma_E^2)$ (the parameter $\sigma_E$ has previously been chosen uniformly at random from $[0, 5]$, which yields quite noisy data). While such a noise term did not exist in our description of scenario 2, we add it here to also study finite sample effects (without noise, $Y$ would depend deterministically on $X$ for $\ell \leq d$).

To assess whether the output $\hat{a}_\lambda$ is close to $a$, we define the relative squared error (RSE) of any regression vector $\tilde{a}$ by

$$\epsilon_{\tilde{a}} := \frac{\|\tilde{a} - a\|^2}{\|\tilde{a} - a\|^2 + \|a\|^2} \in [0, 1].$$

This definition is convenient because it yields the confounding strength $\beta$ for the special case where $\tilde{a}$ is the ordinary least squares regression vector $a'$.

*Figure 2: RSE versus unregularized RSE (that is, ordinary least squares regression) for ConCorr with Ridge and standard cross-validated Ridge (top left and top right, respectively), and for ConCorr with Lasso and standard cross-validated Lasso (bottom left and bottom right, respectively), for 100 runs (each point representing one run).*

Figure 2 shows the results. The red and green lines show two different baselines: first, the unregularized error, and second, the error $1/2$ obtained by the trivial regression vector $0$. The goal is to stay below both baselines. Apart from these two trivial baselines, another natural baseline is regularized regression with $\lambda$ chosen by cross-validation, because this would be the default approach in the unconfounded case. We used leave-one-out CV from the Python package scikit-learn for Ridge and Lasso, respectively.

ConCorr clearly outperforms cross-validation (for both Ridge and Lasso), which shows that cross-validation regularizes too weakly for causal modelling, as expected. One should add, however, that we increased the number of iterations in the $\lambda$-optimization to get closer to optimal leave-one-out performance, since the default parameters already resulted in regularizing more strongly than that. (Note that the goal of this paper is not to show that ConCorr outperforms other methods. Instead, we want to argue that for causal models it is often recommended to regularize more strongly than criteria of statistical predictability suggest. If early stopping in common CV algorithms also yields stronger regularization,[^8] this can be equally helpful for causal inference, although the way ConCorr chooses $\lambda$ is less arbitrary than just bounding the number of iterations.)

[^8]: See also [21] for regularization by early stopping in a different context.

Results for other dimensions were qualitatively comparable when $d$ and $\ell$ were above 10, with slow improvement for larger dimensions. Note, however, that the relevance of simulations should not be overestimated, since inferring confounding critically depends on the distribution of eigenvalues of $\Sigma_{XX}$, which is domain dependent in practical applications.
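For concreteness, one run of this simulation, together with the RSE metric, can be sketched as follows; `con_corr_ridge` and `estimate_beta` refer to the (partly hypothetical) helpers sketched in Section 3.

```python
import numpy as np

def rse(a_tilde, a):
    """Relative squared error; equals the confounding strength beta for OLS."""
    err = np.sum((a_tilde - a) ** 2)
    return err / (err + np.sum(a ** 2))

rng = np.random.default_rng(1)
d = ell = 30
n = 1000
M = rng.standard_normal((ell, d))              # one mixing matrix per run
sigma_a, sigma_c = rng.uniform(0.0, 1.0, size=2)
sigma_E = rng.uniform(0.0, 5.0)                # quite noisy data
a = rng.normal(0.0, sigma_a, d)
c = rng.normal(0.0, sigma_c, ell)
Z = rng.standard_normal((n, ell))
X = Z @ M
y = X @ a + Z @ c + rng.normal(0.0, sigma_E, n)

# a_cc = con_corr_ridge(X, y, estimate_beta)   # ConCorr estimate for this run
# print(rse(a_cc, a))                          # to be compared with CV baselines
```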
### 4.2 Real data

In the absence of better data sets with known ground truth, we considered two data sets used in [14], where ground truth was assumed to be known up to some uncertainty discussed there.

**Optical device.** Here, a laptop shows an image with extremely low resolution (in their case $3 \times 3$ pixels[^9]) captured from a webcam. In front of the screen, they mounted a photodiode measuring the light intensity $Y$, which is mainly influenced by the pixel vector $X$ of the image. The confounder $W$ is a random voltage controlling two LEDs, one in front of the webcam (and thus influencing $X$) and the second in front of the photodiode (thus influencing $Y$). Since $W$ is also measured, the vector $a_{X,W}$ obtained by regressing $Y$ on $(X, W)$ is causal (no confounders, by construction) if one accepts the linearity assumption. Dropping $W$ yielded significant confounding, with $\beta$ ranging from $0$ to $1$. We applied ConCorr to $X, Y$ and compared the output with the ground truth.

[^9]: In order to avoid overfitting issues, we decided in Ref. [14] to only generate low-dimensional data with $d$ around 10.

*Figure 3: Results for Ridge (left) and Lasso (right) regression for the data from the optical device.*

Figure 3 shows the results for Ridge (left) and Lasso (right). The y-axis is the relative squared error achieved by ConCorr, while the x-axis is the cross-validated baseline. The point $(0, 0)$ happened to be met by three cases, where no improvement was possible. One can see that in 3 out of the remaining nine cases (note that the point $(1, 1)$ is also met by two cases), ConCorr significantly improved the causal prediction. Fortunately, there is no case where ConCorr is worse than the baseline.

**Taste of wine.** This data set has been extracted from the UCI machine learning repository [22] for the experiments in [14]. The cause $X$ contains 11 ingredients of different sorts of red wine, and $Y$ is the taste assigned by human subjects. Regressing $Y$ on $X$ yields a regression vector dominated by the ingredient alcohol. Since alcohol strongly correlates with some of the other ingredients, dropping it amounts to significant confounding (assuming that the correlations between alcohol and the other ingredients are due to common causes and not due to the influence of alcohol on the others). After normalizing the ingredients[^10], ConCorr with Ridge and Lasso yielded relative errors of 0.45 and 0.35, respectively, while [14] computed the confounding strength $\beta \approx 0.8$, which means that ConCorr significantly corrects for confounding (we confirmed that CV also yielded errors close to 0.8, which suggests that finite sample effects did not matter for the error). Although one-dimensional confounding heavily violates our model assumptions, the results of both real data experiments look somewhat positive.

[^10]: Note that [14] also used normalization to achieve reasonable estimates of confounding for this case.

## 5 Causal learning theory

So far, we have supported causal regularization mainly by transferring Bayesian arguments for regularization from scenario 1 to scenario 2. An alternative perspective on regularization is provided by statistical learning theory [2]. Generalization bounds guarantee that the expected error is unlikely to significantly exceed the empirical error for any regression function $f$ from a not too rich class $\mathcal{F}$. If $L(Y, f(X))$ denotes some loss function, they guarantee, for instance, that the following inequality holds with a certain probability, uniformly for all $f \in \mathcal{F}$:

$$E[L(Y, f(X))] \leq \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i)) + C(\mathcal{F}),$$

where $C(\mathcal{F})$ is some "capacity term". In the same way as these bounds relate the empirical loss to the expected loss, we will relate the expected (statistical) loss above to the interventional loss

$$E_{do(X)}[L(Y, f(X))] := \int L(y, f(x))\, p(y|do(x))\, p(x)\, dy\, dx \qquad (14)$$

(which quantifies how well $f$ describes the change of $Y$ under interventions on $X$) via a causal generalization bound of the form

$$E_{do(X)}[L(Y, f(X))] \leq E[L(Y, f(X))] + C(\mathcal{F}),$$

for some capacity term $C(\mathcal{F})$.
Note that the type of causal learning theory developed here should not be confused with [23], which considers the generalization error of classifiers that infer cause-effect directions after being trained with multiple data sets of cause-effect pairs.

*Figure 4: Confounding where $Z$ influences $Y$ in a linear additive way ($Y = \tilde{Y} + Zc$), while the influence of $Z$ on $X$ is arbitrary.*

Figure 4 shows our confounding model, which significantly generalizes our previous models. $Z$ and $X$ are arbitrary random variables of dimensions $\ell$ and $d$, respectively. Apart from the graphical structure, we only add the parametric assumption that the influence of $Z$ on $Y$ is linear additive:

$$Y = \tilde{Y} + Zc, \qquad (15)$$

where $c \in \mathbb{R}^\ell$ and $\tilde{Y}$ denotes the unconfounded target. The change of $Y$ caused by setting $X$ to $x$ via interventions is given by Pearl's backdoor criterion [5] as

$$p(y|do(x)) = \int p(y|x, z)\, p(z)\, dz. \qquad (16)$$

Note that the observational conditional $p(y|x)$ would be obtained by replacing $p(z)$ with $p(z|x)$ in (16): interventional conditionals destroy the dependences between the confounder $Z$ and the treatment variable $X$, by the very definition of an intervention.

The supplement shows that the difference between the interventional and the observational loss can be concisely phrased in terms of covariances if we choose the loss $L(Y, f(X)) = (Y - f(X))^2$:

**Lemma 1 (interventional minus observational loss).** Let $g(x) := E[Y|x]$. Then

$$E_{do(X)}[(Y - f(X))^2] - E[(Y - f(X))^2] = \left(\Sigma_{(f-g)(X)Z}\right)c.$$

For every single $f$, the vector $\Sigma_{(f-g)(X)Z}$ is likely to be almost orthogonal to $c$ if $c$ is randomly drawn from a rotation-invariant distribution in $\mathbb{R}^\ell$. In order to derive statements of this kind that hold uniformly for all functions from a function class $\mathcal{F}$, we introduce the following concept quantifying the capacity of $\mathcal{F}$:

**Definition 1 (correlation dimension).** Let $\mathcal{F}$ be some class of functions $f : \mathbb{R}^d \to \mathbb{R}$. Given the distribution $P_{X,Z}$, the correlation dimension $d_{corr}$ of $\mathcal{F}$ is the dimension of the span of $\{\Sigma_{f(X)Z} \mid f \in \mathcal{F}\}$.

To understand this concept intuitively, it is instructive to consider the following immediate bounds:

**Lemma 2 (bounds on correlation dimension).** The correlation dimension of $\mathcal{F}$ is bounded from above by the dimension of the span of $\mathcal{F}$. Moreover, if $\mathcal{F}$ consists of linear functions, another upper bound is given by the rank of $\Sigma_{XZ}$.

In the supplement we show:

**Theorem 2 (causal generalization bound).** Assume the causal structure in Figure 4, where $Z$ is $\ell$-dimensional with covariance matrix $\Sigma_{ZZ} = I$, influencing $X$ in an arbitrary way. Let the influence of $Z$ on $Y$ be given by a random linear combination of $Z$ with variance $V$; explicitly, $\tilde{Y} \mapsto Y = \tilde{Y} + Zc$, where $c \in \mathbb{R}^\ell$ is randomly drawn from the sphere of radius $\sqrt{V}$ according to the Haar measure of $O(\ell)$. Let $\mathcal{F}$ have correlation dimension $d_{corr}$ and satisfy the bound $\|(f - g)(X)\|_{\mathcal{H}} \leq b$ for all $f \in \mathcal{F}$ (where $g(x) := E[Y|x]$). Then, for any $\beta > 1$,

$$E_{do(X)}[(Y - f(X))^2] \leq E[(Y - f(X))^2] + b\sqrt{V}\sqrt{\frac{\beta\,(d_{corr} + 1)}{\ell}}$$

holds uniformly for all $f \in \mathcal{F}$ with probability $1 - e^{\ell(1 - \beta + \ln\beta)/2}$.

Note that $\Sigma_{ZZ} = I$ can always be achieved by the whitening transformation $Z \mapsto \Sigma_{ZZ}^{-1/2}Z$. The normalization is convenient just because it enables a simple way of defining a random linear combination of $Z$ with variance $V$, which would be cumbersome to define otherwise. Theorem 2 basically says that the interventional loss is, with high probability, close to the expected observational loss whenever the number of sources significantly exceeds the correlation dimension.

Note that the confounding effect can nevertheless be large, that is, it can heavily spoil ordinary least squares (i.e. unregularized) regression. Consider, for instance, the case where $\ell = d$ and $X$ and $Z$ are related by $X = Z$. Let, moreover, $\tilde{Y} = Xa$ for some $a \in \mathbb{R}^d$, so that $Y = X(a + c)$. Then the confounding can have significant impact on the correlations between $Y$ and $X$ whenever $c$ is large compared to $a$. However, whenever $\mathcal{F}$ has low correlation dimension, the selection of the function $f$ that optimally fits the observational data is not significantly perturbed by the term $Xc$: the term looks like random noise, since $\mathcal{F}$ contains no function able to account for such a "complex" correlation.
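A Monte Carlo sketch makes this extreme case concrete (all numbers are our illustrative choices): fitting over the class of all linear functions, whose correlation dimension here is $d$, picks up $a + c$ and fails interventionally, while the causal vector $a$ incurs the same loss $\approx \|c\|^2$ under both distributions, i.e., it treats $Zc$ as noise.

```python
import numpy as np

rng = np.random.default_rng(2)
d = ell = 200
n = 20_000

a = 0.1 * rng.standard_normal(d)   # weak causal effect (illustrative)
c = rng.standard_normal(ell)       # strong confounding: ||c|| >> ||a||

def losses(b):
    """Observational vs interventional squared loss of f(x) = x @ b."""
    Z = rng.standard_normal((n, ell))
    X = Z                                    # the extreme case X = Z
    y_obs = X @ a + Z @ c                    # observationally, Y = X(a + c)
    Z_fresh = rng.standard_normal((n, ell))  # do(X) cuts the Z-X link, cf. (16)
    y_do = X @ a + Z_fresh @ c
    f = X @ b
    return np.mean((y_obs - f) ** 2), np.mean((y_do - f) ** 2)

print(losses(a + c))  # observational loss ~ 0, interventional loss ~ 2 ||c||^2
print(losses(a))      # both ~ ||c||^2: the causal model treats Zc as plain noise
```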
For the simple case where $\Sigma_{XZ}$ has low rank, for instance, the term $Zc$ almost behaves like noise for typical $c$ (with respect to any class $\mathcal{F}$ of linear functions), because the majority of components of $Z$ are uncorrelated with $X$ after an appropriate basis change. Since $\ell$, $d_{corr}$, and $b$ in Theorem 2 are unobserved, the value of the theorem will mostly consist in qualitative insights rather than in quantitative bounds of practical use.

## 6 What do we learn for the general case?

Despite all concerns against our hand-tuned confounder model, we want to stimulate a general discussion about recommending stronger regularization than criteria of statistical predictability suggest whenever one is actually interested in causal models. Our theoretical results suggest that this helps in particular when the expected type of confounding, if present, generates "complex" dependences, which strongly regularized regression would treat as noise. The advice of limiting the complexity of models in order to capture some causal truth could also be relevant for modern deep learning, since the goal of interpretability of algorithms for classification or other standard tasks could possibly be advanced by having causal rather than purely predictive features. It is, however, by no means intended to suggest that this simple recommendation would solve any of the hard problems of causal inference.

## References

[1] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[2] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[3] D. Rubin. Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31:161–170, 2004.

[4] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

[5] J. Pearl. Causality. Cambridge University Press, 2000.

[6] G. Imbens and J. Angrist. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.

[7] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78(5):947–1012, 2016.

[8] C. Heinze-Deml, J. Peters, and N. Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6:20170016, 2017.

[9] C. Heinze-Deml and N. Meinshausen. Conditional variance penalties and domain shift robustness. arXiv:1710.11469, 2017.

[10] Z. Shen, P. Cui, K. Kuang, and B. Li. On image classification: Correlation v.s. causality. arXiv:1708.06656, 2017.

[11] M. Bahadori, K. Chalupka, E. Choi, R. Chen, W. Stewart, and J. Sun. Causal regularization. arXiv:1702.02604, 2017.

[12] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: heterogeneous data meets causality. arXiv:1801.06229, 2018.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Springer-Verlag, New York, NY, 2001.

[14] D. Janzing and B. Schölkopf. Detecting non-causal artifacts in multivariate linear regression models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

[15] F. Beutler. The operator theory of the pseudo-inverse I. Bounded operators. Journal of Mathematical Analysis and Applications, 10(3):451–470, 1965.

[16] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, 2000.

[17] P. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362–378, 2008.

[18] D. Janzing, J. Peters, J. Mooij, and B. Schölkopf. Identifying latent confounders using additive noise models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 249–257. (Eds.) A. Ng and J. Bilmes. AUAI Press, Corvallis, OR, USA, 2009.

[19] D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1), 2017.

[20] R. Tibshirani and L. Wasserman. Course on Statistical Machine Learning, chapter "Sparsity and the Lasso", 2015. http://www.stat.cmu.edu/~ryantibs/statml/.

[21] G. Raskutti, M. Wainwright, and B. Yu. Early stopping for non-parametric regression: An optimal data-dependent stopping rule. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1318–1325, 2011.

[22] D. Dua and C. Graff. UCI machine learning repository, 2017. http://archive.ics.uci.edu/ml.

[23] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), volume 37 of JMLR Workshop and Conference Proceedings, pages 1452–1461. JMLR, 2015.