Journal of Machine Learning Research (2025) 1-49 Submitted 1/24; Revised 6/25; Published 6/25

Boosting Causal Additive Models

Maximilian Kertel (maximiliankertel@gmail.com), Technology Development Battery Cell, BMW Group, Munich, Germany
Nadja Klein (nadja.klein@kit.edu), Scientific Computing Center, Karlsruhe Institute of Technology, Karlsruhe, Germany

Editor: Pradeep Ravikumar

We present a boosting-based method to learn additive Structural Equation Models (SEMs) from observational data, with a focus on the theoretical aspects of determining the causal order among variables. We introduce a family of score functions based on arbitrary regression techniques, for which we establish sufficient conditions that guarantee consistent identification of the true causal ordering. Our analysis reveals that boosting with early stopping meets these criteria and thus offers a consistent score function for causal orderings. To address the challenges posed by high-dimensional data sets, we adapt our approach through a component-wise gradient descent in the space of additive SEMs. Our simulation study supports the theoretical findings in low-dimensional settings and demonstrates that our high-dimensional adaptation is competitive with state-of-the-art methods. In addition, it exhibits robustness with respect to the choice of hyperparameters, thereby simplifying the tuning process.

Keywords: causal discovery; directed acyclic graph; boosting; high-dimensional data; reproducing kernel Hilbert space

1 Introduction

Causal discovery is the process of deriving causal relationships between variables in a system. These relationships help to improve decisions, predictions, and interventions (Kyono et al., 2020). With the rapid growth of large-scale data sets in fields such as healthcare, genetics (Aibar et al., 2017), or manufacturing (Kertel et al., 2023), causal discovery has become increasingly important.
Traditionally, the researcher designs an experiment and intervenes on certain variables, for example by assigning a drug or a placebo, which allows one to estimate the causal effect of the manipulated variables. However, it is often more practical, less time- and resource-intensive, or ethically preferable to collect data from the system in a steady state, where no variables are manipulated. For instance, in complex manufacturing domains, the number of input variables might be too large to conduct a design of experiments, and additionally those experiments would lead to production downtime, resulting in high costs. Thus, although observational data is often complex, noisy, or affected by confounding variables (Bhattacharya et al., 2021), it is still preferable to derive causal relationships from observational data rather than having to rely on impractical experimental studies. In this work, we follow the assumption that the causal relationships between variables can be modeled as a Directed Acyclic Graph (DAG). This assumption implies that the impact between two variables flows in at most one direction, and there are no cyclic or self-reinforcing pathways. It is the goal of the present work to identify the DAG from observational data. Existing algorithms for this task can be broadly classified into two categories. First, constraint-based methods rely on statistical tests for conditional independence, but these approaches become computationally expensive and difficult outside of multivariate normal or multinomial distributions (Zhang et al., 2011; Shah and Peters, 2018). Additionally, many of them assume some kind of faithfulness (Peters et al., 2017; Raskutti and Uhler, 2018), and the identified graph is typically not unique (Spirtes et al., 1993; Kalisch and Bühlmann, 2007).

©2025 Maximilian Kertel and Nadja Klein. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v/24-0052.html.
Instead, we focus on the second category for identifying DAGs from observational data, namely on Structural Equation Models (SEMs). SEMs rely on the assumption that each variable is a function of other variables in the system and a perturbing noise term. Peters et al. (2014) show that if one restricts the functional relationships and the noise, yielding the so-called Additive Noise Model (ANM), then the implied distribution determines the model uniquely. Specifically, this is the case when the functional relationships are non-linear and the noise terms are Gaussian, which we assume throughout this work. Under these conditions, the underlying SEM can be uniquely identified from observational data. Causal discovery using ANMs is an active field of research (Vowels et al., 2022). Many recent works leverage continuous acyclicity characterisations (Lachapelle et al., 2019; Yu et al., 2019; Zheng et al., 2020; Kalainathan et al., 2022; Ng et al., 2022a,b) and find the DAG using gradient descent. However, in contrast to earlier works (as for example, Shimizu et al., 2011; Bühlmann et al., 2014), the statistical properties of most machine learning-based methods remain poorly understood (Kaiser and Sipos, 2022). In this context, a notable contribution is that of Aibar et al. (2017), which successfully derives the cyclic graphical representation of a gene regulatory network using boosting. Our work is an extension of this approach towards acyclic graphs. In this paper, we leverage the success of gradient-based methods towards statistical boosting for causal discovery in ANMs and investigate the underlying statistical properties. In particular, we show consistency and robustness of our proposed method. Our main contributions are the following.
- Many existing consistency results are derived for specific regression techniques such as maximum likelihood or penalized regression, where consistency of the causal discovery algorithm follows from the statistical properties of the chosen regression method (Bühlmann et al., 2014; Nowzohour and Bühlmann, 2016; van de Geer, 2014). In contrast, we take a more general approach by providing sufficient conditions on generic regression techniques under which the causal discovery algorithm is consistent; see Proposition 5.
- Theorem 15 establishes that L2 boosting with early stopping satisfies the conditions of Proposition 5. Consequently, L2 boosting can be employed for consistent causal discovery.
- We show in Theorem 18 that L2 boosting with early stopping avoids overfitting even in the presence of misspecification.
- We propose a variant of the boosting procedure for high-dimensional settings, where the number of variables p is large.
- We conduct a simulation study demonstrating that our approach is competitive with state-of-the-art methods. We furthermore show that it is robust to the choice of the hyperparameters, making it easy to tune in practice.

The remainder of this paper is organized as follows. Section 2 introduces SEMs and outlines the necessary assumptions. Section 3 reviews L2 boosting and Reproducing Kernel Hilbert Spaces (RKHSs). In Section 4, we present the proposed novel causal discovery method and show its consistency. Section 5 provides an empirical evaluation of our method and benchmarks its performance against various state-of-the-art algorithms. Finally, Section 6 concludes the paper. Technical details and proofs can be found in the Appendix.

2 Causal Discovery

In this section we present a class of SEMs called Causal Additive Models (CAMs). We summarize the key results from Bühlmann et al. (2014) and Peters et al. (2014), which show how regression estimators can be utilized for causal discovery.
Finally, while existing contributions focus on specific regression estimators such as maximum likelihood in Bühlmann et al. (2014) or van de Geer (2014), we adopt a more general perspective: Proposition 5 provides sufficient conditions on a generic regression estimator so that the derived causal discovery algorithm is consistent.

Let X = (X1, . . . , Xp) be a p-dimensional random vector. For any S = {s1, . . . , sT} ⊆ {1, . . . , p} and a vector x ∈ R^p we define xS := (xs1, . . . , xsT)ᵀ. Analogously, for the random vector X we set XS := (Xs1, . . . , XsT)ᵀ. We assume that the elements of S are ordered to make the representations unique, such that s1 < s2 < . . . < sT. Let G be a Directed Acyclic Graph (DAG) over the variables X1, . . . , Xp. For each k ∈ {1, . . . , p} we denote by paG(k) the set of parents of node k in G, i.e., the set of indices j ∈ {1, . . . , p} for which the edge Xj → Xk exists in G. We assume that there exists a DAG G such that X follows a SEM with additive noise, that is

Xk = fk(XpaG(k)) + εk. (1)

Here, εk, k = 1, . . . , p, are independent noise terms. Every DAG has at least one topological ordering π (that is, a permutation) on {1, . . . , p}. We denote the nonempty set of topological orderings for G by Π(G). For π ∈ Π(G), there can only be a directed path in G from Xj to Xk if π(j) < π(k), but not vice versa; see Figure 1 for an illustrating example.

Figure 1: An example of a SEM (left-hand side) and its corresponding DAG G (right-hand side) for p = 3, with X1 = f1(X2) + ε1, X2 = ε2, and X3 = f3(X2) + ε3. The set of possible topological orderings is {(2, 1, 3), (2, 3, 1)}. For π0 = (2, 3, 1) the structural equation for X1 is given by X1 = f1(X2) + ε1 = f12(X2) + f13(X3) + ε1 with f12 = f1 and f13 = 0.

2.1 Identifiability

The goal of our analysis is to identify the DAG G from the distribution of X. In general there is no one-to-one correspondence between the distribution P(X) and the underlying SEM or G.
However, if we impose appropriate restrictions on the noise terms {εk : k = 1, . . . , p} and the functions {fk : k = 1, . . . , p}, then the desired one-to-one correspondence between a distribution and a SEM exists (see Peters et al., 2014, Corollary 31). In this case, we call the SEM identifiable. We consider SEMs that satisfy the following assumptions, which ensure identifiability.

Assumption 1 For the SEM of Equation (1) assume that fk has the additive decomposition fk(xpaG(k)) = Σ_{j∈paG(k)} fkj(xj), k = 1, . . . , p, where the fkj are three times differentiable, non-linear and non-constant for any k = 1, . . . , p and j ∈ paG(k). Further, let (ε1, . . . , εp) be a random vector of independent components, which are normally distributed with mean zero and standard deviations σ1, . . . , σp > 0.

We call SEMs of this form Causal Additive Models (CAMs; Bühlmann et al., 2014). Define ϖπ(k) = {j : π(j) < π(k)} as the predecessors of k with respect to a permutation π. For example, when p = 3 and π = (2, 1, 3), we have ϖπ(1) = {2}, ϖπ(2) = ∅, ϖπ(3) = {1, 2}. Consider a CAM, which is characterized by functions f1, . . . , fp, standard deviations σ1, . . . , σp, and a DAG G with topological ordering π. The implied density is

p(x) = ∏_{k=1}^{p} p(xk | xpaG(k)) = ∏_{k=1}^{p} p(xk | xϖπ(k)). (2)

Here, p(xk | xS) is the density of the conditional distribution of Xk given XS = xS, which we assume to exist throughout this work. For S = ∅, we define p(xk | xS) = p(xk). The second equality in Equation (2) holds because Xk is independent of its predecessors in π given its parents (see Peters et al., 2017, Proposition 6.31), that is

Xk ⟂ Xϖπ(k)\paG(k) | XpaG(k). (3)

For any combination of f1, . . . , fp, G and any topological ordering π of G, it trivially holds that fk(XpaG(k)) = Σ_{j∈paG(k)} fkj(Xj) = Σ_{j∈ϖπ(k)} f̃kj(Xj), where f̃kj = fkj if j ∈ paG(k) and f̃kj = 0 if j ∉ paG(k). Consequently, any CAM characterized by f1, . . . , fp, G, σ1, . . . , σp can be reparameterized by f̃1, . . . , f̃p, π, σ1, . . .
, σp, where π is a topological ordering of G. Note that the latter parametrization is not unique, since π can be chosen arbitrarily from the set of topological orderings of G. However, once the topological ordering π is known, G can be found by identifying the parents of Xk (those j for which fkj ≠ 0) within ϖπ(k) for any k = 1, . . . , p. This is straightforward using pruning or feature selection methods (Teyssier and Koller, 2005; Shojaie and Michailidis, 2010; Bühlmann et al., 2014). Thus, we simplify our objective and instead of searching for G, we aim to identify its topological ordering π. Consider a CAM characterized by f1, . . . , fp, G, σ1, . . . , σp. It can be reparameterized by the parameter tuple θ = (f1, . . . , fp, π, σ1, . . . , σp). Using Equation (3), it follows that the implied conditional distribution of Xk | Xϖπ(k) = xϖπ(k) is Gaussian with mean fk(xϖπ(k)) and standard deviation σk. The implied log-density log(pθ) is given by

log(pθ(x)) = Σ_{k=1}^{p} log( (1/σk) ϕ( (xk − fk(xϖπ(k))) / σk ) ),

where ϕ denotes the density function of a univariate standard normal distribution. From now on let X follow a CAM characterized by θ0 = (f⁰1, . . . , f⁰p, π0, σ⁰1, . . . , σ⁰p). To identify θ0, we define the population score function

θ ↦ E_{pθ0}[ −log(pθ(X)) ].

It holds that E_{pθ0}[−log(pθ0(X))] ≤ E_{pθ0}[−log(pθ(X))], with equality if and only if pθ0 = pθ, by the properties of the Kullback-Leibler divergence. For such minimizing θ, their ordering π must lie in Π(G0) by identifiability. Let us consider the problem of minimizing the score function with respect to θ. Fixing π and f1, . . . , fp in θ, and minimizing with respect to σ1, . . . , σp leads to the minimizers

σ²_{k,pθ0,fk,π} := E_{pθ0}[ (Xk − fk(Xϖπ(k)))² ] = E_{pθ0}[ (Xk − Σ_{j∈ϖπ(k)} fkj(Xj))² ]

for k = 1, . . . , p. Hence, when minimizing the score function we only need to consider the subset of the parameter space (f1, . . . , fp, π, σ1, . . . , σp) where σk = σ_{k,pθ0,fk,π}, k = 1, . . . , p, that is, the relevant parameter space reduces to (f1, . . .
, fp, π). Thus, it holds that

argmin_θ E_{pθ0}[−log(pθ(X))] = argmin_{θ=(f1,...,fp,π,σ1=σ1,pθ0,f1,π,...,σp=σp,pθ0,fp,π)} E_{pθ0}[−log(pθ(X))] = argmin_{(f1,...,fp,π)} Σ_{k=1}^{p} log(σ²_{k,pθ0,fk,π}) + C = argmin_{(f1,...,fp,π)} Σ_{k=1}^{p} log(σ²_{k,pθ0,fk,π}),

with C only depending on p. Denote the set of function tuples aligning with π by

ϑ(π) = { (f1, . . . , fp) : fk = Σ_{π(j)<π(k)} fkj, fkj : R → R, fkj is three times differentiable, non-linear, and non-constant }.

We fix π and optimize with respect to f1, . . . , fp to define a population score on the orderings

S(π) = min_{(f1,...,fp)∈ϑ(π)} Σ_{k=1}^{p} log(σ²_{k,pθ0,fk,π}). (4)

By identifiability, S(π) is minimal if and only if π ∈ Π(G0), that is,

Σ_{k=1}^{p} log((σ⁰k)²) = S(π0) < S(π) for all π0 ∈ Π(G0), π ∉ Π(G0). (5)

Intuitively, the score S(π) measures how much variance remains when any Xk is regressed on its predecessors Xϖπ(k). An example for p = 2 is depicted in Figure 2.

2.2 Estimation of the Ordering

In practice the true parameter tuple θ0 is unknown. Instead, we observe N realizations xN := (x1, . . . , xN) of X with density pθ0, where xℓ ∈ R^p, ℓ = 1, . . . , N and xℓ = (xℓ1, . . . , xℓp)ᵀ. It is natural to consider the empirical version of the population score function (4),

Ŝ(π) = Σ_{k=1}^{p} log(σ̂²_{k,f̂k,π}), (6)

where σ̂²_{k,f̂k,π}, k = 1, . . . , p, is defined as

σ̂²_{k,f̂k,π} = (1/N) Σ_{ℓ=1}^{N} (xℓk − f̂k,π(xℓϖπ(k)))² = (1/N) Σ_{ℓ=1}^{N} (xℓk − Σ_{j∈ϖπ(k)} f̂kj,π(xℓj))².

Here f̂k,π = Σ_{j∈ϖπ(k)} f̂kj,π is a regression function estimate using data (xℓϖπ(k), xℓk), ℓ = 1, . . . , N, with the convention xℓS := (xℓs1, . . . , xℓsT)ᵀ for S = {s1, . . . , sT} introduced before. Although the regression estimates depend on the data and thus N, we omit the additional index for better readability.

Figure 2: The blue dots represent 500 realizations of a distribution following a SEM with p = 2 and X1 = ε1 ∼ N(0, 1) and X2 = 3 cos(X1) + ε2 with ε2 ∼ N(0, 1). In the left-hand plot, X2 is shown on the y-axis and X1 on the x-axis, while in the right-hand plot, the axes are reversed. The red lines give the conditional mean functions.
We observe that argmin_{(f1,f2)∈ϑ((1,2))} Σ_{k=1}^{2} log(σ²_{k,pθ0,fk,(1,2)}) = (0, 3 cos(x1)) and argmin_{(f1,f2)∈ϑ((2,1))} Σ_{k=1}^{2} log(σ²_{k,pθ0,fk,(2,1)}) = (0, 0). The conditional distribution of X1 given X2 = x2 becomes bi-modal for larger values of x2. The unexplained noise (distance of the blue dots to the red line) is smaller in the left plot, which corresponds to the correct ordering, thus S((1, 2)) < S((2, 1)).

Remark 2 In contrast to the population version (4), it is unclear whether (6) is also minimized at π0 ∈ Π(G0), even for N → ∞. This is due to the fact that the regression functions f̂k,π and the prediction errors σ̂²_{k,f̂k,π} are estimated from a finite sample.

Since any regression function estimator induces a score function Ŝ, the following remark gives some intuition on required properties through two extreme examples.

Remark 3
1. Overfitting: Let the regression estimator interpolate the data, that is, f̂k,π(xℓ,ϖπ(k)) = xℓk for all ℓ = 1, . . . , N, k = 1, . . . , p and all π. Then the regression estimator is overfitting, and the score diverges to −∞ for any permutation π.
2. Underfitting: Let f̂k,π(xℓ,ϖπ(k)) = 0 for all ℓ = 1, . . . , N, k = 1, . . . , p and all π. Then the regression estimator is underfitting, and all permutations have the same score Σ_{k=1}^{p} log( (1/N) Σ_{ℓ=1}^{N} x²ℓk ).

In both cases, Ŝ cannot identify a correct ordering π0 ∈ Π(G0) even with infinite data. Intuitively, a suitable regression estimator should: 1. avoid overfitting and preserve the unexplained noise, and 2. produce estimates that are close to the true regression functions f⁰1, . . . , f⁰p. In this work, we apply L2 boosting regression in conjunction with early stopping (Bühlmann and Yu, 2003; Raskutti et al., 2014) and show that the resulting score in Equation (6) consistently favors a permutation π0 ∈ Π(G0). Proposition 5 below formalizes the intuition of Remark 3 and states sufficient conditions on the regression function estimator.
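The comparison in Figure 2 can be reproduced numerically. The sketch below is our illustration, not the paper's implementation: a plain kernel ridge regressor stands in for the boosting estimator introduced later, and is plugged into the empirical score (6) for both orderings of the two-variable SEM.

```python
# Illustrative only: empirical score of Equation (6) for both orderings of the
# SEM X1 = eps1, X2 = 3*cos(X1) + eps2 from Figure 2, with kernel ridge
# regression as a stand-in regression estimator.
import numpy as np

rng = np.random.default_rng(0)
N = 500
x1 = rng.normal(size=N)
x2 = 3.0 * np.cos(x1) + rng.normal(size=N)

def residual_variance(x, y, lam=1e-2, bandwidth=1.0):
    """Variance of y left unexplained after kernel ridge regression on x."""
    G = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * bandwidth**2))
    alpha = np.linalg.solve(G + lam * len(y) * np.eye(len(y)), y)
    return np.mean((y - G @ alpha) ** 2)

def score(first, second):
    """Score of the ordering (first, second): the first variable is regressed
    on nothing, the second on the first."""
    return np.log(np.var(first)) + np.log(residual_variance(first, second))

s_causal = score(x1, x2)        # ordering (1, 2)
s_anticausal = score(x2, x1)    # ordering (2, 1)
assert s_causal < s_anticausal  # the causal ordering attains the smaller score
```

As in Figure 2, the regression in the causal direction leaves roughly the noise variance unexplained, while the reverse regression cannot remove the variance contributed by 3 cos(X1), so the causal ordering attains the smaller score.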
In the following Definition 4, Y takes the role of Xk, while X̃ takes the role of Xϖπ(k).

Definition 4 (Non-overfitting) Let f̂ be a regression estimate based on N i.i.d. samples (x̃1, y1), . . . , (x̃N, yN) from (X̃, Y). We say the estimator is not overfitting w.r.t. (X̃, Y) if

(1/N) Σ_{ℓ=1}^{N} (yℓ − f̂(x̃ℓ))² − E_{X̃,Y}[ (Y − f̂(X̃))² ]

converges to 0 in probability for N → ∞.

Proposition 5 Let Assumption 1 hold. Then, if the regression estimator is such that
1. f̂k,S is not overfitting with respect to (XS, Xk) for any combination of k = 1, . . . , p and S ⊆ {1, . . . , p} \ {k} according to Definition 4, and
2. σ̂²_{k,f̂k,π0} →P σ²_{k,pθ0,f⁰k,π0} = (σ⁰k)² for all π0 ∈ Π(G0) and k = 1, . . . , p, that is,

(1/N) Σ_{ℓ=1}^{N} (xℓk − f̂k,ϖπ0(k)(xℓϖπ0(k)))² →P (σ⁰k)² = E[ (Xk − Σ_{j∈paG0(k)} f⁰kj(Xj))² ],

then it holds for the derived score function Ŝ that Ŝ(π0) < Ŝ(π) for any π0 ∈ Π(G0) and π ∉ Π(G0) with probability going to 1 for N → ∞.

Remark 6 Proposition 5 is more general than existing results such as Theorem 1 in Bühlmann et al. (2014), as it does not specify a particular regression estimator. The challenge in applying Proposition 5 is to find a combination of a regression estimator and corresponding distributional assumptions on X so that both conditions are fulfilled. For nonparametric maximum likelihood regression, Bühlmann et al. (2014) give these conditions in (A1)-(A4). In contrast, for L2 boosting we present the required conditions via Assumptions 10, 11, and 13.

Sketch of the proof of Proposition 5 Our goal is to show that for any π ∉ Π(G0) and π0 ∈ Π(G0) it holds asymptotically that

Σ_{k=1}^{p} log(σ̂²_{k,f̂k,π0}) = Ŝ(π0) < Ŝ(π) = Σ_{k=1}^{p} log(σ̂²_{k,f̂k,π}).

By inequality (5) this is fulfilled if for N → ∞ and π ∉ Π(G0)

lim inf_{N→∞} Ŝ(π) ≥ S(π) (7)

and if at the same time for π0 ∈ Π(G0) it holds that

lim_{N→∞} Ŝ(π0) = S(π0). (8)

Applying the continuous mapping theorem, (8) is ensured by Condition 2. Conversely, for relation (7) when π ∉ Π(G0), the non-overfitting Condition 1.
of Proposition 5 ensures that for any k = 1, . . . , p and for N → ∞

σ̂²_{k,f̂k,π} = (1/N) Σ_{ℓ=1}^{N} (xℓk − Σ_{j∈ϖπ(k)} f̂kj,π(xℓj))² ≈ E[ (Xk − Σ_{j∈ϖπ(k)} f̂kj,π(Xj))² ].

It thus follows that for N → ∞

Σ_{k=1}^{p} log(σ̂²_{k,f̂k,π}) ≈ Σ_{k=1}^{p} log E[ (Xk − Σ_{j∈ϖπ(k)} f̂kj,π(Xj))² ] ≥ min_{(f1,...,fp)∈ϑ(π)} Σ_{k=1}^{p} log(σ²_{k,pθ0,fk,π}) = S(π).

A detailed proof can be found in Appendix A.

3 Boosting and Reproducing Kernel Hilbert Spaces

As our main result in Theorem 15 relies on boosting with Reproducing Kernel Hilbert Space regression, we briefly review relevant known concepts and results on boosting (Section 3.1) and Reproducing Kernel Hilbert Spaces (RKHSs) (Section 3.2) next. Further details on the two concepts can be found in Bühlmann and Yu (2003); Schapire and Freund (2012) for boosting, and in Wahba (1990); Schölkopf and Smola (2001); Wainwright (2019) for RKHSs.

3.1 Boosting

L2 boosting addresses the problem of finding a function f in some function space H that minimizes the expected L2 loss

(1/2) E_{X,Y}[ (Y − f(X))² ]. (9)

In practice, (9) is replaced by the empirical minimizer of

(1/(2N)) Σ_{ℓ=1}^{N} (yℓ − f(xℓ))² (10)

based on N i.i.d. samples (x1, y1), . . . , (xN, yN). L2 boosting employs a functional gradient descent approach using a base learner S that maps yN := (y1, . . . , yN)ᵀ to the estimates f̂ and ŷN = f̂(xN), where f̂(xN) := (f̂(x1), . . . , f̂(xN))ᵀ. More precisely, after initializing f̂(0), in each boosting step m = 1, . . . , mstop, the residuals at the current f̂(m),

uℓ = −∂/∂f(xℓ) (1/(2N)) Σ_{ℓ'=1}^{N} (yℓ' − f(xℓ'))² |_{f=f̂(m)} = (1/N) (yℓ − f̂(m)(xℓ)),

are computed and f̂ = S(u) = S(u1, . . . , uN) is determined. The solution is then used to update the estimate of the regression function f̂(m+1) = f̂(m) + υ f̂, where 0 < υ ≤ 1 is the step size, commonly fixed at a small value (Bühlmann and Yu, 2003). For many base learners S this leads to overfitting for fixed N as mstop → ∞. Hence, stopping earlier is desirable, and early stopping is often applied (Schapire and Freund, 2012). Following Bühlmann and Yu (2003); Raskutti et al.
(2014), we consider linear and symmetric base learners S, that is, S : yN ↦ ŷN is a linear and symmetric mapping. Spline regression and linear regression, including generalized ridge regression, fall under this definition. The same holds for (additive) kernel ridge regression (Raskutti et al., 2014; Kandasamy and Yu, 2016), which we will consider in the following. In contrast, the popular choice of decision trees is not linear. Proposition 1 of Bühlmann and Yu (2003) shows that the estimate ŷN after m boosting steps is given by

f̂(m)(xN) = ŷN = B(m) yN = (I − (I − S)^m) yN.

Since we assume S to be symmetric, there exists an orthogonal U ∈ R^{N×N} containing the eigenvectors of S, such that for a diagonal matrix D with the eigenvalues of S on the diagonal, it holds that S = U D Uᵀ. It follows that B(m) = U (I − (I − D)^m) Uᵀ.

3.2 Reproducing Kernel Hilbert Spaces

We choose the function estimates f̂ from a Reproducing Kernel Hilbert Space (RKHS) H, while S is a kernel regression estimator. We start by introducing kernel functions.

Definition 7 We call a symmetric function K : R^p × R^p → R a positive definite (p.d.) kernel on R^p if

Σ_{k=1}^{N} Σ_{ℓ=1}^{N} αk αℓ K(xk, xℓ) ≥ 0

for any {α1, . . . , αN} ⊂ R and any {x1, . . . , xN} ⊂ R^p.

For any p.d. kernel K, there exists a unique Hilbert space H with K(·, x) ∈ H for all x ∈ R^p, where further it holds for any f ∈ H and x ∈ R^p that

f(x) = ⟨f, K(·, x)⟩_H. (11)

Equation (11) is called the reproducing property; consequently, H is called a RKHS. By the reproducing property it holds for f = (1/N) Σ_{k=1}^{N} αk K(·, xk) and g = (1/N) Σ_{k=1}^{N} βk K(·, xk) that

⟨f, g⟩_H = αᵀ G β, (12)

where G ∈ R^{N×N} is symmetric with Gjk = K(xj, xk)/N. We call G the Gram matrix. By the representer theorem (Wainwright, 2019, Proposition 12.33) the minimizer

f̂ = argmin_{f∈H} (1/N) Σ_{ℓ=1}^{N} (yℓ − f(xℓ))² + γ ||f||²_H (13)

can be expressed as f̂ = (1/N) Σ_{ℓ=1}^{N} βℓ K(·, xℓ), where β = (G + γN I)^{-1} yN. By Equation (12), ||f̂||²_H = βᵀ G β holds.
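The identity f̂(m)(xN) = (I − (I − S)^m) yN from Proposition 1 of Bühlmann and Yu (2003) can be checked numerically. The following sketch is our illustration (unnormalised Gram matrix, step size υ = 1, arbitrary synthetic data): it boosts a kernel ridge base learner iteratively and compares the fitted values with the closed form.

```python
# Verify numerically that m steps of L2 boosting with a linear base learner S
# produce the fitted values (I - (I - S)^m) y. Kernel ridge regression serves
# as the linear, symmetric smoother.
import numpy as np

rng = np.random.default_rng(1)
N, lam, m_stop = 100, 1.0, 25
x = rng.uniform(-2.0, 2.0, N)
y = np.sin(2.0 * x) + 0.3 * rng.normal(size=N)

G = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0)  # Gram matrix
S = G @ np.linalg.inv(G + lam * np.eye(N))           # base learner as a matrix

# Iterative boosting: refit the base learner to the current residuals.
fit = np.zeros(N)
for _ in range(m_stop):
    fit = fit + S @ (y - fit)

# Closed form via the boosting operator B(m) = I - (I - S)^m.
B = np.eye(N) - np.linalg.matrix_power(np.eye(N) - S, m_stop)
assert np.allclose(fit, B @ y)
```

Early stopping corresponds to the choice of m_stop here: as m_stop grows, B(m) approaches the identity on the span of S and the fit moves towards interpolating the data.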
Clearly, the mapping S : yN ↦ G(G + λI)^{-1} yN = f̂(xN) is linear and symmetric. It can be derived that S has the eigenvalues dℓ = µ̂ℓ / (µ̂ℓ + γN), where µ̂1, . . . , µ̂N are the eigenvalues of G. The regularization parameter λ = γN shall be constant in N in this work. The boosting estimate f̂(m) is sequentially built by adding small amounts of the current estimates. These current estimates are of the form (1/N) Σ_{ℓ=1}^{N} α̂ℓ K(·, xℓ). Thus, if S is the base learner used for boosting, it holds by the construction of the boosting estimator that there exists a β̂ ∈ R^N with

f̂(m) = (1/N) Σ_{ℓ=1}^{N} β̂ℓ K(·, xℓ). (14)

Here, f̂(m) is the boosting regression estimate after m boosting steps. In this context we define

FN = { f ∈ H : f = (1/N) Σ_{ℓ=1}^{N} βℓ K(·, xℓ), ||f||_H ≤ 1, {x1, . . . , xN} ⊂ R^p }

and note that f̂(m) ∈ h·FN for some h > 0. For a continuous kernel K ∈ L2(R × R), we can define an integral operator K : L2(R) → L2(R) by

f(·) ↦ (K(f))(y) = ∫_R K(x, y) f(x) dP(x).

Under assumptions on P and K, Sun (2005) shows that the operator K has eigenvalues µk ≥ 0 and eigenfunctions ϕk ∈ L2(R), so that K(ϕk) = µk ϕk. If the eigenvalues are ordered non-increasingly, then µk goes to 0. The decay rate of the eigenvalues will be important in our analysis. We close this subsection with two examples.

Example 1 (Kernel functions)
1. Gaussian kernels on R: K : R × R → R is defined for some ς > 0 by K(x, x′) = exp(−|x − x′|²/ς²). Its eigenvalues exist under mild assumptions on P(X) (Sun, 2005, Section 4) and follow an exponential decay of the form µk ≲ exp(−Ck) for some C > 0. For further details, see Bach and Jordan (2002, Section C) and Example 3 of Cucker and Smale (2002).
2. Additive kernel: Let H1, . . . , Hp be RKHSs with kernels Kk on Xk, k = 1, . . . , p. The space H1 ⊕ . . . ⊕ Hp := {f : R^p → R : f(x1, . . . , xp) = Σ_{k=1}^{p} fk(xk), fk ∈ Hk} is a RKHS with kernel K = Σ_{k=1}^{p} Kk. Its norm is defined by ||f||²_H = Σ_{k=1}^{p} ||fk||²_{Hk} (see Wainwright, 2019, Proposition 12.27). For Gaussian kernels K1, . . .
, Kp, we assume that the eigenvalues of K can be upper bounded by

µ_{kp} ≤ p exp(−Ck) (15)

for some C > 0. Note that for p fixed, this is of the type µk ≲ exp(−C′k) for some C′ > 0. The solution of (13) can then be written as f̂(x1, . . . , xp) = Σ_{k=1}^{p} f̂k(xk), where f̂k = (1/N) Σ_{ℓ=1}^{N} β̂ℓ Kk(·, xℓk) and β̂ is shared among the different components, that is, does not depend on k (see Kandasamy and Yu, 2016, for further details). The idea behind inequality (15) is the following. Each Kk is a self-adjoint and compact operator for any k = 1, . . . , p. Let A, B be linear and self-adjoint operators with non-increasing eigenvalues λ1, λ2, . . . and µ1, µ2, . . ., respectively. It holds by Zwahlen (1965/66) for the non-increasing eigenvalues γ1, γ2, . . . of the self-adjoint and linear operator A + B that for any r, s ≥ 1

γ_{r+s} ≤ γ_{r+s−1} ≤ λr + µs.

Now consider the non-increasing eigenvalues µ^k_ℓ of the operator A1 + . . . + Ak. Let λ^j_ℓ, ℓ = 1, 2, . . . be the eigenvalues of Aj. Then it holds that

µ^p_{ℓp} ≤ µ^{p−1}_{(p−1)ℓ} + λ^p_ℓ ≤ . . . ≤ λ¹_ℓ + . . . + λ^p_ℓ.

Inequality (15) follows under the assumption that λ^j_ℓ ≲ exp(−Cℓ) for j = 1, . . . , p for some C > 0. The Gaussian kernel (and its additive counterpart) is bounded, which allows one to uniformly upper-bound the supremum norm of the unit ball of H.

Remark 8 If H is a RKHS with kernel K such that K(x, x) ≤ B for some B > 0, then it holds that

sup_{||f||_H ≤ 1} ||f||∞ ≤ √B < ∞.

4 Boosting DAGs

In this section, we prove Theorem 15, which is our main result. It states that if the regression procedure in Section 2 is chosen as L2 boosting with early stopping, then the estimator for the topological ordering is consistent. This holds for a uniform and asymptotically increasing number of boosting iterations. We provide the assumptions in Section 4.1 and prove the statement in Section 4.2. In Section 4.3 we propose an adaptation of the procedure that is effective in high dimensions.
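Before turning to the assumptions, the exponential spectral decay from Example 1, which the later Assumption 13 transfers to the empirical Gram matrix, can be observed numerically. The sketch below is our illustration (sample size, bandwidth, and thresholds chosen ad hoc, not taken from the paper):

```python
# Empirical eigenvalues of the normalised Gram matrix G_jk = K(x_j, x_k)/N of
# a Gaussian kernel with Gaussian inputs decay roughly geometrically.
import numpy as np

rng = np.random.default_rng(2)
N = 400
x = rng.normal(size=N)
G = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0) / N
mu = np.sort(np.linalg.eigvalsh(G))[::-1]  # non-increasing eigenvalues

# Successive ratios below 1 with a rapidly shrinking tail indicate
# near-exponential decay of the leading eigenvalues.
ratios = mu[1:15] / mu[:14]
assert np.all(ratios < 1.0)
assert mu[14] < 1e-3 * mu[0]  # the spectrum is concentrated on few directions
```

This concentration of the spectrum on a few leading directions is exactly the low-dimensional structure that the kernel boosting analysis below exploits.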
4.1 Assumptions

Proposition 5 shows that in order to consistently estimate the causal ordering, we must avoid overfitting whenever we regress Xk onto XS for any k = 1, . . . , p and S ⊆ {1, . . . , p} \ {k}. This poses the main challenge in applying Proposition 5. We assume that Xk − E[Xk | XS = xS] is sub-Gaussian, which later allows one to control the regression estimates.

Definition 9 We call a random variable ε sub-Gaussian if its Orlicz norm, defined by

s(ε) = inf{ r ∈ (0, ∞) : E[exp(ε²/r²)] ≤ 2 },

is finite.

Assumption 10 For any k and S ⊆ {1, . . . , p} \ {k} consider the decomposition Xk = µk,S(XS) + εk,S, where µk,S(XS) = E[Xk | XS] and εk,S = Xk − E[Xk | XS]. We assume that
1. εk,S | XS = xS is sub-Gaussian with Orlicz norm sk,S(xS), 0 < sk,S(xS) ≤ smax for all xS ∈ R^{|S|}, and
2. ||µk,S||∞ ≤ µmax < ∞.
The constants are assumed to hold uniformly for all k ∈ {1, . . . , p} and all S ⊆ {1, . . . , p} \ {k}.

We will prove that the regression estimate lies in the function class h·FN for some radius h > 0. We denote the ball of radius h in H by Bh. In Theorem 15 we use the radius h > 0 and the function complexity measures Rademacher complexity and covering number to derive Condition 1. of Proposition 5. Both measures quantify the richness of a function class. While the Rademacher complexity of FN can be upper-bounded by Theorem 39 given in the Appendix, we also need to upper bound the covering number of FN. The complexity measures for h·FN can then be upper-bounded using scaling arguments. We thus make the following assumption:

Assumption 11 For any k = 1, . . . , p, let Hk be a RKHS on Xk and B^k_1 := {f ∈ Hk : ||f||_{Hk} ≤ 1}. Then it shall hold for any z > 0 and k = 1, . . . , p that

∫₀¹ √( log N(uz/2, B^k_1) ) du < ∞.

Here, N(·, B^k_1) is the covering number with respect to ||·||∞, which is defined in Section B.2.

Remark 12 Let HS = Hs1 ⊕ . . . ⊕ Hsk and denote the unit ball of Hsk by B^{sk}_1 and the unit ball of HS by B^S_1. Assume that for any j = 1, . . .
, k it holds that

∫₀¹ √( log N(uz/2, B^j_1) ) du ≤ C(z) < ∞

for some 0 < C(z) < ∞. Furthermore, one can derive that N(u, B^S_1) ≤ ∏_{j=1}^{k} N(u/k, B^{sj}_1). Thus, using Jensen's inequality and the fact that N(·, B^j_1) is non-increasing, it holds that

∫₀¹ √( log N(uz/2, B^S_1) ) du ≤ ∫₀¹ √( Σ_{j=1}^{k} log N(uz/(2k), B^{sj}_1) ) du ≤ √p C(z/p).

We assume that the eigenvalues of the random Gram matrix G decay at the same rate as the eigenvalues of the operator K. For details on the connection between the eigenvalues of G and K, consult Section C of Bach and Jordan (2002).

Assumption 13 For S = {s1, . . . , sk} ⊆ {1, . . . , p} let H = Hs1 ⊕ . . . ⊕ Hsk be the RKHS on XS generated by the additive kernel Ks1 + . . . + Ksk. Assume there exist events BN with lim_{N→∞} P(BN) = 1, such that on BN, and for K0 = ln(N)/(2Cd + 1), the empirical eigenvalues of the Gram matrix G satisfy µ̂k ≤ exp(−Cu k) for k ≥ K0, and additionally µ̂_{K0} ≥ exp(−Cd K0), for some Cd > Cu > 0. The constants hold uniformly for any S ⊆ {1, . . . , p}.

Remark 14 Bach and Jordan (2002, Section C) investigate the eigenvalue decay of the Gram matrix for different tails of the density on XS. They show that for densities on XS with tails of the order exp(−xSᵀxS/2), the eigenvalues of the Gram matrix of a Gaussian kernel decay as in Assumption 13. Furthermore, they show that for the first k ≤ K0(N) eigenvalues, µ̂k is close to exp(−Ak), where A depends on the density of XS and K0(N) is of the order log(N). This shows that Assumption 13 is fulfilled for densities of XS with appropriately decaying tails. An example of such a density decay can be observed when XS is Gaussian distributed and the function f : R → R approximately satisfies f(x) ≈ cx for |x| → ∞, for some c ∈ R.

To give some intuition, recall that for the kernel regression estimator S : yN ↦ ŷN, it holds that S = U D Uᵀ, where U contains the eigenvectors of G, and D is a diagonal matrix with entries dℓ = µ̂ℓ/(λ + µ̂ℓ). Clearly, dℓ decays at a rate similar to µ̂ℓ. For simplicity, consider wN = Uᵀ yN.
Assumption 13 then ensures that only a few entries of wN largely influence ŷN = U D wN, while most entries of wN contribute little. Thus, ŷN is mainly influenced by a low-dimensional subspace of R^N.

4.2 Main Theorem

We now state the main theorem.

Theorem 15 Let Hk be a RKHS on Xk with Gaussian kernel Kk for each k = 1, . . . , p. Assume that X = (X1, . . . , Xp) follows a CAM as described in Assumption 1 for functions f⁰1 ∈ H1, . . . , f⁰p ∈ Hp, where Hk = Hk1 ⊕ . . . ⊕ Hkq and {k1, . . . , kq} = pa(k). Given Assumptions 10, 11, and 13, and assuming we estimate f̂(mstop)_{k,π} = f̂k,π = Σ_{j∈ϖπ(k)} f̂kj,π using L2 boosting with Xk as response, Xϖπ(k) as predictors, and the number of boosting steps chosen as

mstop = N^{(Cu + Cd + 1/2)/(4(Cd + 1))},

it holds that Ŝ(π0) < Ŝ(π) for N → ∞ with probability going to 1 for any π0 ∈ Π0 and π ∉ Π0.

Remark 16 Theorem 2 of Minh (2010) ensures that f⁰1, . . . , f⁰p are smooth, non-constant, and non-linear, and therefore satisfy Assumption 1.

Remark 17 If the eigenvalues of the Gram matrix decay rapidly, that is, if Cu ≈ Cd and Cd is large, then a low-dimensional functional subspace contains a large fraction of the variation of f. This reduces the risk of overfitting. Hence, the number of boosting iterations m(N) can be chosen of the order N^{1/2}. On the other hand, if the decay of the eigenvalues is slow, then the risk of overfitting is larger. Thus, m(N) should be chosen of the order N^{1/4}.

Next, we compare the assumptions required for Theorem 15 with the assumptions of Theorem 1 in Bühlmann et al. (2014). Most importantly, Bühlmann et al. (2014) assume that the number of basis functions grows sufficiently slowly and that the choice of basis functions is deterministic and does not depend on the data. In addition, Bühlmann et al.
(2014) require that this choice of functions can approximate the true functions $f_k^0$ on any compact set in terms of the $L_2$-norm, see their Assumption (A4), and that the search space of additive functions is closed in $L_2$ (Lemma 2 of Bühlmann et al., 2014). In contrast, our method is considerably more flexible, since the regression estimates depend directly on the data, eliminating the need to manually restrict the dimensionality of the function space. However, our Assumptions 10 and 11 are stricter than Assumption (A2) in Bühlmann et al. (2014). Specifically, we impose a stronger moment condition than (A2)(ii) of Bühlmann et al. (2014) and require conditions on the conditional distributions in Assumption 10. Furthermore, Assumption 11 in this work bounds the covering number entropy integral in supremum norm, while Assumption (A2)(i) of Bühlmann et al. (2014) does so in the $L_2$-norm, which is a weaker condition. Finally, Assumption 13 depends on the distribution of $X_S$ and is therefore related to Assumption (A3) of Bühlmann et al. (2014), although the two are difficult to compare.

Proof We apply Proposition 5 and show the following two conditions.

1. Boosting under misspecification: $\hat f^{(m_{\mathrm{stop}})}_{k,S}$ is not overfitting with respect to $(X_S, X_k)$ for any $k \in \{1, \ldots, p\}$ and $S \subseteq \{1, \ldots, p\} \setminus \{k\}$, and

2. Consistency of variance estimation: For any $k = 1, \ldots, p$ and $\pi^0 \in \Pi^0$ it holds that
$$\frac{1}{N}\sum_{\ell=1}^{N} \Big(x_{\ell k} - \hat f^{(m_{\mathrm{stop}})}\big(x_{\ell \varpi_{\pi^0}(k)}\big)\Big)^2 - E\Big[\big(X_k - f_k^0(X_{\mathrm{pa}_{G^0}(k)})\big)^2\Big] \to 0 \quad \text{in probability.}$$

We will see that 1. follows from Theorem 18 in Section 4.2.1 and 2. is shown in Theorem 19 in Section 4.2.2.

4.2.1 Boosting under Misspecification

In Theorem 18 below, we show Condition 1. of Theorem 15. This is a novel result that specifies conditions under which $L_2$ boosting does not overfit even in a misspecified scenario. We fix $S = \{s_1, \ldots, s_d\}$ and $k$ and define $(\tilde X_1, \ldots, \tilde X_d) = \tilde X = X_S$ and $Y = X_k$. Let $\tilde X^N$ be the random element consisting of $N$ i.i.d. observations of $\tilde X$ and denote the realizations of $\tilde X^N$ by $\tilde x^N = (\tilde x_1, \ldots
, \tilde x_N)$. Analogously, we define $Y^N$ and $y^N$. Our goal is to prove that
$$\frac{1}{N}\sum_{\ell=1}^{N}\big(\hat f^{(m_{\mathrm{stop}})}(\tilde x_\ell) - y_\ell\big)^2 - E_{\tilde X, Y}\Big[\big(\hat f^{(m_{\mathrm{stop}})}(\tilde X) - Y\big)^2\Big] \to 0 \quad (16)$$
for $N \to \infty$ for the boosting estimate $\hat f^{(m_{\mathrm{stop}})}$. The left term is an expectation with respect to the empirical distribution $P_N$ (and thus depends on the realizations), whereas the right term is under the population distribution induced by $(\tilde X, Y)$, denoted by $P$. Thus, the l.h.s. of Equation (16) can be written as
$$(P_N - P)\big(\hat f^{(m_{\mathrm{stop}})}(\tilde X) - Y\big)^2. \quad (17)$$
Assumptions 10′, 11′, and 13′ are reformulated versions of Assumptions 10, 11, and 13, respectively, using the notation introduced above.

Assumption 10′ Let $Y = \mu(\tilde X) + \varepsilon$, for which we define the random variables $\mu(\tilde X) = E[Y \mid \tilde X]$ and $\varepsilon = Y - E[Y \mid \tilde X]$. We assume that $\varepsilon \mid \tilde X = \tilde x$ is sub-Gaussian with Orlicz norm $s(\tilde x)$ and $0 < s(\tilde x) \le s_{\max}$ for all $\tilde x \in \mathbb{R}^d$, and $\|\mu\|_\infty \le \mu_{\max} < \infty$. Let $\sigma^2_{\max} := \max_{\tilde x \in \mathbb{R}^d} E[\varepsilon^2 \mid \tilde X = \tilde x]$.

Assumption 11′ Let $B^1 := \{f \in F : \|f\|_H \le 1\}$, where $H$ is the additive RKHS on $\tilde X = (\tilde X_1, \ldots, \tilde X_d)$. Then, for any $z > 0$ it holds that $\int_0^1 \sqrt{\log N(uz/2, B^1)}\,du < \infty$.

Assumption 13′ There exist events $B_N$ on $\tilde X^N$ with $\lim_{N\to\infty} P(B_N) = 1$, so that on $B_N$ and for $K_0 = \frac{1}{2C_d+1}\ln(N)$ the empirical eigenvalues of the Gram matrix $G$ computed from $\tilde x^N \in B_N$ satisfy $\sum_{k=K_0}^{N} \hat\mu_k \le \sum_{k=K_0}^{\infty} \exp(-C_u k)$, and additionally $\hat\mu_{K_0} \ge \exp(-C_d K_0)$ for some $C_d > C_u > 0$.

Note that all constants in Assumptions 10′, 11′, 13′ are independent of the choice of $k, S$. It is the purpose of this subsection to prove the following theorem (and thus Condition 1).

Theorem 18 Under Assumptions 10′, 11′, and 13′ it holds for $m_{\mathrm{stop}}(N) = N^{\frac14 \frac{C_u + C_d + 1/2}{C_d + 1}}$ that
$$(P - P_N)\big(Y - \hat f^{(m_{\mathrm{stop}})}(\tilde X)\big)^2 \stackrel{P}{\to} 0. \quad (18)$$
As $\hat f^{(m_{\mathrm{stop}})}$ depends on the realizations $\tilde x^N$ and $y^N$, the convergence above is not trivial. For simplicity, we drop the dependency of $f, \hat f^{(m_{\mathrm{stop}})}$ on $\tilde X$ in the proof below.
Proof To prove (18), we decompose
$$\big|(P_N - P)\big(Y - \hat f^{(m_{\mathrm{stop}})}\big)^2\big| \le \underbrace{|(P_N - P)Y^2|}_{I} + \underbrace{2\big|(P_N - P)\hat f^{(m_{\mathrm{stop}})}Y\big|}_{II} + \underbrace{\big|(P_N - P)\big(\hat f^{(m_{\mathrm{stop}})}\big)^2\big|}_{III}$$
and show the convergence in probability for the terms $I$, $II$, and $III$ separately. Term $I$ goes to $0$ in probability since $Y$ has a finite fourth moment. For term $II$ it holds for any $\xi > 0$ that
$$\begin{aligned}
P\Big(\big|(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}\big| \ge \xi\Big)
&= P\Big(\big|(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}\big| \ge \xi,\ \hat f^{(m_{\mathrm{stop}})} \in h(N)\mathcal F_N\Big) \\
&\quad + P\Big(\big|(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}\big| \ge \xi,\ \hat f^{(m_{\mathrm{stop}})} \notin h(N)\mathcal F_N,\ \tilde X^N \in B_N\Big) \\
&\quad + P\Big(\big|(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}\big| \ge \xi,\ \hat f^{(m_{\mathrm{stop}})} \notin h(N)\mathcal F_N,\ \tilde X^N \notin B_N\Big) \\
&\le P\Big(\sup_{f \in h(N)\mathcal F_N} |(P_N - P)Yf| \ge \xi\Big) + P\Big(\hat f^{(m_{\mathrm{stop}})} \notin h(N)\mathcal F_N,\ \tilde X^N \in B_N\Big) + P\big(\tilde X^N \notin B_N\big).
\end{aligned}$$
For $h(N) \in o(N^{1/4})$, the first term on the r.h.s. goes to $0$ by Corollary 34 and the second term vanishes by Lemma 27. The third term converges to $0$ due to Assumption 13′. This shows the convergence in probability of term $II$. Similarly, for term $III$ it holds for any $\xi > 0$, splitting along the same three events as for term $II$, that
$$P\Big(\big|(P_N - P)\big(\hat f^{(m_{\mathrm{stop}})}\big)^2\big| \ge \xi\Big) \le P\Big(\sup_{f \in h(N)\mathcal F_N} |(P_N - P)f^2| \ge \xi\Big) + P\Big(\hat f^{(m_{\mathrm{stop}})} \notin h(N)\mathcal F_N,\ \tilde X^N \in B_N\Big) + P\big(\tilde X^N \notin B_N\big).$$
For $h(N) \in o(N^{1/4})$, the first term on the r.h.s. converges to $0$ due to Corollary 40 and the other terms behave as described for term $II$. This shows the convergence in probability for term $III$. Overall, this proves Condition 1.

4.2.2 Consistency of Variance Estimation

In Theorem 19 below, we prove Condition 2. of Theorem 15 using the techniques in Bühlmann and Yu (2003) and Raskutti et al. (2014). We fix again $k$ and assume that $\varpi_{\pi^0}(k)$ has size $d$, that is, $\pi^0(k) = d + 1$. We set $\tilde X = (\tilde X_1, \ldots, \tilde X_d) = X_{\varpi_{\pi^0}(k)}$ and $Y = X_k$.

Theorem 19 Let $Y = f^0(\tilde X) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, where $f^0$ lies in a RKHS $H$ for which Assumption 13′ holds, with $\|f^0\|_H = R$.
Then, with the number of boosting steps chosen as $m_{\mathrm{stop}} = m(N) = N^{\frac14 \frac{C_u + C_d + 1/2}{C_d + 1}}$, it holds that
$$\frac{1}{N}\sum_{\ell=1}^{N}\big(y_\ell - \hat f^{(m_{\mathrm{stop}})}(\tilde x_\ell)\big)^2 - E\Big[\big(Y - f^0(\tilde X)\big)^2\Big] = \hat\sigma^2 - \sigma^2 \to 0 \quad \text{in probability.}$$

Proof We define the semi-norm $\|g\|_{2,N}^2 := \frac{1}{N}\sum_{\ell=1}^{N} g(y_\ell, \tilde x_\ell)^2$. By the triangle inequality it holds that
$$\|Y - \hat f^{(m_{\mathrm{stop}})}\|_{2,N} \le \|Y - f^0\|_{2,N} + \|f^0 - \hat f^{(m_{\mathrm{stop}})}\|_{2,N}. \quad (19)$$
Next, we show how to asymptotically lower and upper bound $\|Y - \hat f^{(m_{\mathrm{stop}})}\|_{2,N}$ by $\sigma$.

Lower bound: As $\|f^0\|_H = R$, $\|f^0\|_\infty$ is bounded by Remark 8. Further, $f^0(\tilde x) = E[Y \mid \tilde X = \tilde x]$, and the noise $Y - f^0(\tilde x) = \varepsilon$ is Gaussian and thus sub-Gaussian with a uniformly bounded Orlicz norm. Hence, by Theorem 18 and the continuous mapping theorem the l.h.s. of (19) converges and it holds that
$$\|Y - \hat f^{(m_{\mathrm{stop}})}\|_{2,N} = \sqrt{\|Y - \hat f^{(m_{\mathrm{stop}})}\|_{2,N}^2} \to \sqrt{E\big[\|Y - \hat f^{(m_{\mathrm{stop}})}\|_2^2\big]} \ge \sqrt{E\big[\|Y - f^0\|_2^2\big]} = \sigma \quad \text{in probability.}$$

Upper bound: The term $\|Y - f^0\|_{2,N} = \sqrt{\frac{1}{N}\sum_{\ell=1}^{N} \varepsilon_\ell^2}$ converges to $\sigma$ in probability. Thus, it remains to show that $\|f^0 - \hat f^{(m_{\mathrm{stop}})}\|_{2,N} \stackrel{P}{\to} 0$ for $N \to \infty$, which follows from Lemma 20 below.

Lemma 20 For $Y = f^0(\tilde X) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, $f^0 \in H$, and if $\hat f$ is the boosting estimate with $m_{\mathrm{stop}} = m(N) = N^{\frac14 \frac{C_u + C_d + 1/2}{C_d + 1}}$ boosting steps, then $\hat f^{(m_{\mathrm{stop}})}$ converges to $f^0$ in a fixed design, that is,
$$\|f^0 - \hat f^{(m_{\mathrm{stop}})}\|_{2,N} = \bigg(\frac{1}{N}\sum_{\ell=1}^{N}\Big(f^0(\tilde x_\ell) - \hat f^{(m_{\mathrm{stop}})}(\tilde x_\ell)\Big)^2\bigg)^{1/2} \stackrel{P}{\to} 0.$$

Proof The proof is in Appendix C.

Remark 21 Lemma 20 also holds for heteroscedastic and sub-Gaussian noise.

Remark 22 The combination of Theorems 18 and 19 is insightful. If we are unsure whether $f^0 \in H$ and whether the noise is independent, then appropriately limiting the number of boosting iterations leads to: 1. a consistent estimator for $f^0$ if the assumptions hold, and 2. an estimator such that the prediction error on the samples, $\frac{1}{N}\sum_{\ell=1}^{N}\big(y_\ell - \hat f^{(m_{\mathrm{stop}})}(\tilde x_\ell)\big)^2$, is asymptotically close to the $L_2$-error $E\big[(Y - \hat f^{(m_{\mathrm{stop}})}(\tilde X))^2\big]$ for yet unobserved realizations of $(\tilde X, Y)$. Here, $\frac{1}{N}\sum_{\ell=1}^{N}\big(y_\ell - \hat f^{(m_{\mathrm{stop}})}(\tilde x_\ell)\big)^2$ depends only on the training data used to learn $\hat f^{(m_{\mathrm{stop}})}$ and thus no hold-out set is necessary.
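Remark 22 can be illustrated numerically: in a deliberately misspecified model (a step function, which is not in the Gaussian RKHS), an early-stopped kernel boosting fit keeps the in-sample error close to the error on fresh data. The sketch below is our own illustration under these assumptions, not the paper's code; the toy data, kernel, and stopping rule are chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_K(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

# data from a model that is NOT in the Gaussian RKHS (misspecified): step function
N = 200
x_tr, x_te = rng.uniform(-2, 2, N), rng.uniform(-2, 2, N)
f0 = lambda x: np.sign(x)
y_tr = f0(x_tr) + 0.3 * rng.normal(size=N)
y_te = f0(x_te) + 0.3 * rng.normal(size=N)

K_tr, K_te = gauss_K(x_tr, x_tr), gauss_K(x_te, x_tr)
lam = 1.0
ridge = np.linalg.inv(K_tr + lam * np.eye(N))  # base learner: kernel ridge

alpha = np.zeros(N)  # dual coefficients of the boosting estimate
for m in range(int(N ** 0.25)):  # early stopping: few iterations, cf. m_stop ~ N^(1/4)
    resid = y_tr - K_tr @ alpha
    alpha += ridge @ resid  # one L2-boosting step with the kernel ridge learner

mse_in = np.mean((y_tr - K_tr @ alpha) ** 2)
mse_out = np.mean((y_te - K_te @ alpha) ** 2)
# with early stopping the in-sample error stays close to the out-of-sample error
print(mse_in, mse_out)
```

Letting the loop run for many more iterations makes the gap between `mse_in` and `mse_out` grow, which is exactly the overfitting that the choice of $m_{\mathrm{stop}}$ prevents.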
We emphasize that although the results are stated for the Gaussian kernel, they can be extended to kernels with other eigenvalue decay rates.

4.3 Boosting DAGs for Large Dimensions

Theorem 15 shows that we can asymptotically identify the true causal ordering using boosting regressions. However, there are $p!$ possible permutations of $\{1, \ldots, p\}$ that constitute the search space. Thus, beyond a very small $p$, or without extensive prior knowledge on the topological order, the computational costs are prohibitive. We address this issue through component-wise boosting in an additive noise model.

Additive Noise Model The true graph $G^0$ and the true additive structural equations $f_1^0, \ldots, f_p^0$ with $f_k^0(x_{\mathrm{pa}(k)}) = \sum_{j \in \mathrm{pa}(k)} f_{kj}^0(x_j)$ can be represented by a function $F^0 : \mathbb{R}^p \to \mathbb{R}^p$, where
$$F^0(x) = \big(f_1^0(x), \ldots, f_p^0(x)\big) = \big(f_1^0(x_{\mathrm{pa}(1)}), \ldots, f_p^0(x_{\mathrm{pa}(p)})\big).$$
Thus, the graph $G^0$ has an edge from $X_j$ to $X_k$ if and only if the function $f_k^0$ is not constant in its $j$-th component. Note that the set of functions $F : \mathbb{R}^p \to \mathbb{R}^p$ corresponding to a DAG is non-convex (Zheng et al., 2020).

Component-wise Boosting Instead of applying $L_2$ boosting to estimate $f_k^0$, $k = 1, \ldots, p$ one-by-one as in Theorem 15, we employ component-wise boosting to estimate $F^0$. This means we define the loss function on the functions $F : \mathbb{R}^p \to \mathbb{R}^p$ as the log-likelihood function given $x^N$,
$$L(F, x^N) = L\big((f_1, \ldots, f_p), x^N\big) = \sum_{k=1}^{p} \log\bigg(\sum_{\ell=1}^{N} \big(x_{\ell k} - f_k(x_\ell)\big)^2\bigg),$$
and proceed as follows. Choose a step size $0 < \nu \le 1$ and let $F^{(1)} = 0$ be the starting value. Then, for $m = 1, 2, \ldots$ we set
$$f_k^{(m+1)}(x) = f_k^{(m)}(x) + \nu \hat f_{kj}(x_j),$$
where $(j, k) \in \{1, \ldots, p\} \times \{1, \ldots, p\}$ is the solution of $\arg\min_{(j,k) \notin N_m} S(j, k; F^{(m)})$, and $S(j, k; F^{(m)})$ is the score on the edges defined by
$$S(j, k; F^{(m)}) := \log\bigg(\sum_{\ell=1}^{N} \Big(\hat f_{kj}(x_{\ell j}) - \big(x_{\ell k} - f_k^{(m)}(x_\ell)\big)\Big)^2\bigg).$$
Here, the candidate functions $\hat f_{kj}$ are determined by solving the kernel ridge regression
$$\hat f_{kj} = \arg\min_{g_{kj} \in H_j} \sum_{\ell=1}^{N} \Big(g_{kj}(x_{\ell j}) - \big(x_{\ell k} - f_k^{(m)}(x_\ell)\big)\Big)^2 + \lambda \|g_{kj}\|_{H_j}^2.
(21)

That is, we run a RKHS regression on the marginal residuals. Hence, the resulting functions $\hat f_{kj}$, $j, k = 1, \ldots, p$ belong to an RKHS. In the set $N_m$ we track the edges $(j, k)$ that would cause a cycle when added to $F^{(m)}$. Hence, $F^{(m)}$ corresponds to a DAG for any $m = 1, 2, \ldots$. Note that if the edge $(j, k)$ is chosen, then we only need to update $S(j', k; F^{(m+1)})$ for $(j', k) \notin N_{m+1}$, while the score remains unchanged for all $k' \ne k$, that is, $S(j, k'; F^{(m+1)}) = S(j, k'; F^{(m)})$. This reduces the computational burden. We stop the procedure after $m_{\mathrm{stop}}$ steps, which is the crucial tuning parameter of (component-wise) boosting. As the RKHS regression scales poorly with $N$, it could be approximated by some spline regression instead.

Choosing $m_{\mathrm{stop}}$ Inspired by results from boosting for regression (Tutz and Binder, 2006; Bühlmann and Hothorn, 2007), we use the Akaike Information Criterion (AIC) to select $m_{\mathrm{stop}}$. For any $f_k^{(m)}$ we calculate the trace of the mapping $B_k^{(m)} : (x_{1k}, \ldots, x_{Nk}) \mapsto \big(f_k^{(m)}(x_1), \ldots, f_k^{(m)}(x_N)\big)$. Then we define the AIC score by
$$\mathrm{AIC}\big(F^{(m)}, x^N\big) = \sum_{k=1}^{p} \mathrm{AIC}_k\big(f_k^{(m)}, x^N\big) = \sum_{k=1}^{p} \Bigg(\log\bigg(\sum_{\ell=1}^{N}\big(x_{\ell k} - f_k^{(m)}(x_\ell)\big)^2\bigg) + \mathrm{tr}\,B_k^{(m)}\Bigg). \quad (22)$$
We stop the procedure and set $m_{\mathrm{stop}} = m$ if $\mathrm{AIC}(F^{(m)}, x^N)$ increases with $m$. We emphasize that this corresponds only to a local minimum w.r.t. the AIC score; the global optimum is hard to find due to the non-convexity of the search space. The algorithm using the AIC is outlined in Algorithm 1 in Appendix D. It can be understood as a component-wise functional gradient descent in the space of those additive functions $\mathbb{R}^p \to \mathbb{R}^p$ that imply a DAG.

Pruning We prune the estimated graph $\hat G$ by performing an additive model regression of each node on its parents. Finally, we keep only those parents whose p-value is below 0.001. For more details on pruning, see Bühlmann et al. (2014).

5 Simulation Study

In this section we empirically investigate the proposed algorithms.
To this end, we first describe the data-generating processes in Section 5.1. In Section 5.2 we verify Theorem 15 for data sets of small dimensions. In Section 5.3 we benchmark our algorithm of Section 4.3, denoted by DAGBoost, on high-dimensional data sets against state-of-the-art methods, and conduct a sensitivity analysis on the effect of the step size $\nu$, the penalty parameter $\lambda$, and the stopping criterion $m_{\mathrm{stop}}$. Every result presented is based on 100 randomly generated data sets unless otherwise stated. Throughout, we set the step size $\nu = 0.3$ and the penalty parameter $\lambda = 0.01$. While it is known that boosting is commonly robust with respect to the step size (as long as it is small enough), we find in Section 5.3.2 that our method DAGBoost is also robust against the specific choice of $\lambda$. We therefore refrain from further tuning. The code and the data-generation procedure are publicly available at https://github.com/mkrtl/Boosting DAGs.

5.1 Sampling SEMs and Data Generation

The generation of synthetic data for causal discovery can lead to data sets leaking information on the underlying graph. For example, Reisach et al. (2021) show that for many simulated data sets, the causal ordering can be obtained from the marginal variance of the nodes. Further, the signal-to-noise ratio for downstream nodes typically increases (Agrawal et al., 2023). To address these issues, we adapt the generation of synthetic data sets of Bühlmann et al. (2014); Lachapelle et al. (2019); Zheng et al. (2020); Ng et al. (2022b) to our aims. To this end, we apply a normalization step as described in Agrawal et al. (2023) and generate data sets as follows.

1. Generate the underlying graph $G^0$ with one of the following two methods.

(a) Generate a DAG according to the Erdős-Rényi (ER) model (Erdős and Rényi, 1959) as follows. We first generate a random DAG with the maximal number of edges and keep every edge with a constant probability.
Every node has the same distribution for the number of its neighbors.

(b) Generate a scale-free (SF) graph using the model of Barabási and Albert (1999). There exist hub nodes with large degree, while other nodes have a smaller degree. This graph structure is observed in various applications (Jeong et al., 2000; Wille et al., 2004; Kertel et al., 2023).

2. Generate additive or non-additive structural equations (SEs).

(a) Additive: For any edge $(j, k)$ in the graph, sample $f_{kj}^0$ from a Gaussian process with zero mean and covariance function $\mathrm{cov}\big(f^0(x_{\ell j}), f^0(x_{\ell' j})\big) = \exp\big(-(x_{\ell j} - x_{\ell' j})^2\big)$ and set $f_k^0(x_{\ell\, \mathrm{pa}_{G^0}(k)}) = \sum_{j \in \mathrm{pa}_{G^0}(k)} f_{kj}^0(x_{\ell j})$ for $\ell, \ell' = 1, \ldots, N$.

(b) Non-additive: For every node $k$ consider its parents $\mathrm{pa}_{G^0}(k)$, and sample $f_k^0$ from a Gaussian process with zero mean and covariance function $\mathrm{cov}\big(f^0(x_{\ell\, \mathrm{pa}_{G^0}(k)}), f^0(x_{\ell'\, \mathrm{pa}_{G^0}(k)})\big) = \exp\big(-\|x_{\ell\, \mathrm{pa}_{G^0}(k)} - x_{\ell'\, \mathrm{pa}_{G^0}(k)}\|_2^2 / 2\big)$ for $\ell, \ell' = 1, \ldots, N$.

3. Normalization: Scale $f_k^0(X_{\ell\, \mathrm{pa}_{G^0}(k)})$ such that $E\big[f_k^0(X_{\mathrm{pa}_{G^0}(k)})^2\big] = 1$.

4. Generate the standard deviations $\sigma_k^0$ for all $k = 1, \ldots, p$ from a uniform distribution.

Finally, we generate the variables with no incoming edges from a centered Gaussian distribution with standard deviation as chosen in Step 4. Following the topological order of $G^0$, we generate the data sets recursively according to the SEM. Since the scaling constant for the normalization step is difficult to obtain, we generate 1000 observations that are used for the normalization step only. For a more detailed description, see Agrawal et al. (2023, Appendix E). We emphasize that non-additive SEs conflict with Assumption 1. Moreover, neither the signal-to-noise ratio nor the variance of the nodes reveals the causal order.

5.2 Low-Dimensional Data

In this low-dimensional simulation study we generate data with $p = 5$ from an ER graph with on average five edges. Then, we calculate the score for any of the $p!$ permutations $\pi$ as described in Section 2.
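For intuition, this exhaustive search over orderings can be sketched as follows; to keep the sketch short we score each permutation with plain kernel ridge regressions instead of the boosted regressions of Theorem 15, and the toy three-variable SEM, all names, and the bandwidth choices are our own illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, lam = 200, 0.1

# synthetic chain X1 -> X2 -> X3 with nonlinear effects
X = np.zeros((N, 3))
X[:, 0] = rng.normal(size=N)
X[:, 1] = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=N)
X[:, 2] = np.tanh(2 * X[:, 1]) + 0.3 * rng.normal(size=N)

def residual_var(y, preds):
    """In-sample residual variance of a Gaussian-kernel ridge regression."""
    if preds.shape[1] == 0:          # no predecessors: variance around the mean
        return np.var(y)
    sq = ((preds[:, None, :] - preds[None, :, :]) ** 2).sum(-1)
    G = np.exp(-0.5 * sq)
    y_hat = G @ np.linalg.solve(G + lam * np.eye(N), y)
    return np.mean((y - y_hat) ** 2)

def score(perm):
    """Sum of log residual variances when each node is regressed on its predecessors."""
    s = 0.0
    for pos, k in enumerate(perm):
        preds = X[:, list(perm[:pos])]
        s += np.log(residual_var(X[:, k], preds))
    return s

best = min(itertools.permutations(range(3)), key=score)
print(best)
```

The $p!$ enumeration in the last line is exactly what becomes infeasible for large $p$ and what Section 4.3 replaces with component-wise boosting.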
We choose the number of boosting iterations by applying the regression AIC score of Bühlmann and Hothorn (2007). While the AIC score of Section 4.3 penalizes the complexity of the full SEM, this score only penalizes the complexity of one regression. Recall that $f_k^{(m)}$ is a function on $x_{\varpi_\pi(k)}$ and we use regular (not component-wise) boosting. We generate data sets with $N = 10, 20, 50, 100, 200$ observations with either additive or non-additive SEs. To evaluate the estimated permutations, we use the transposition distance
$$d_{\mathrm{trans}}(\pi_1, \pi_2) := \min\big\{J : \exists\ \text{transpositions } \sigma_1, \ldots, \sigma_J \text{ with } \sigma_1 \circ \cdots \circ \sigma_J \circ \pi_1 = \pi_2\big\}.$$
For an estimated permutation $\hat\pi$ we then set $d_{\mathrm{trans}}(\hat\pi, \Pi^0) := \min_{\pi^0 \in \Pi^0} d_{\mathrm{trans}}(\hat\pi, \pi^0)$.

         Additive SEs                    Non-additive SEs
N        mean(dtrans)   SD(dtrans)      mean(dtrans)   SD(dtrans)
10       1.59           1.69            1.57           1.82
20       0.80           1.39            0.73           1.21
50       0.07           0.29            0.36           0.76
100      0.01           0.10            0.31           0.74
200      0.01           0.10            0.30           0.66

Table 1: Mean and standard deviation (SD) of the transposition distances $d_{\mathrm{trans}}(\hat\pi, \Pi^0)$ between the estimated permutation $\hat\pi$ and the set of true permutations $\Pi^0$ for SEMs with additive and non-additive SEs. The underlying graphs are of ER type with on average five edges.

Thus, we calculate the minimal number of transpositions needed so that the estimated permutation aligns with the underlying topological order. The results are shown in Table 1 and, as expected, increasing the sample size decreases the transposition distances. This supports Theorem 15. Further, the results indicate that convergence holds even under non-additive SEs and thus that the algorithm is robust towards misspecification via non-additive SEs.

5.3 High-Dimensional Data

5.3.1 Comparison with Existing Methods

In the following we compare DAGBoost with Bühlmann et al. (2014) (denoted by CAM) and the non-parametric extension of NOTEARS by Zheng et al. (2020) (denoted by DAGSNOTEARS). For CAM we employ the default configuration and for DAGSNOTEARS we set $\lambda = 0.03$ and the cutoff to 0.3.
For a comparison with further benchmark methods, we refer the reader to Bühlmann et al. (2014). As performance measures, we calculate the Structural Hamming Distance (SHD) and the Structural Interventional Distance (SID; Peters and Bühlmann, 2015) between the true and estimated graphs for data sets containing $N = 200$ observations of $p = 100$ variables. Their means and standard deviations (SDs) are given in Tables 2 and 3. The SHD counts the number of edge additions, deletions, and reversals needed to transform one graph into the other. Conversely, the SID describes the quality of the causal inferential statements derived from the estimated graph. For both, a lower value indicates a better result. Finally, Table 4 summarizes the means of precision and recall. Precision is the ratio between the correctly identified edges and all identified edges. Recall is the share of the correctly identified edges among all true edges.

                     CAM                  DAGBoost             DAGSNOTEARS
Additive   Graph     SHD      SD(SHD)    SHD      SD(SHD)     SHD      SD(SHD)
True       SF        28.82    9.10       25.57    8.98        87.47    3.61
True       ER        9.62     12.18      3.33     7.26        93.63    11.02
False      SF        74.04    7.77       63.37    5.88        87.79    4.10
False      ER        34.84    7.28       28.30    7.12        87.24    9.95

Table 2: Mean (SHD) and standard deviation (SD) of SHDs between the true graph and the graphs estimated by the three presented algorithms.

                     CAM                  DAGBoost             DAGSNOTEARS
Additive   Graph     SID      SD(SID)    SID      SD(SID)     SID      SD(SID)
True       SF        97.24    67.45      106.23   46.43       442.32   118.69
True       ER        8.96     12.18      16.71    52.08       791.91   257.79
False      SF        236.13   66.71      224.02   49.38       294.78   58.73
False      ER        191.97   98.67      239.31   100.40      655.74   227.55

Table 3: Mean (SID) and standard deviation (SD) of SIDs between the true graph and the graphs estimated by the three presented algorithms.

                     CAM                   DAGBoost              DAGSNOTEARS
Additive   Graph     Precision   Recall   Precision   Recall    Precision   Recall
True       SF        0.872       0.825    0.984       0.751     0.424       0.135
True       ER        0.919       0.990    0.985       0.980     0.252       0.395
False      SF        0.689       0.421    0.925       0.373     0.587       0.120
False      ER        0.814       0.812    0.932       0.748     0.506       0.242

Table 4: Mean of precision and recall for the three presented algorithms. Precision is the ratio between the correctly identified edges and all identified edges. Recall is the share of the correctly identified edges among all true edges.

From Table 2 we make the following observations. In terms of the SHD, DAGSNOTEARS does not provide satisfying results, while CAM and DAGBoost perform notably better in all simulation scenarios. Generally, all methods suffer from an uneven edge distribution (SF graphs). Further, DAGBoost achieves better results across all simulation scenarios. The advantage is largest in the most complex scenario of non-additive SEs and a non-even distribution of the edges among the nodes (SF graphs). In this case, DAGBoost outperforms CAM by more than one standard deviation. In contrast to the SHD, Table 3 shows that the SID values come with high SDs. Ignoring these, the table provides some evidence that CAM slightly outperforms DAGBoost in terms of the SID. A notable exception is the most complex scenario of non-additive functions and an SF graph, where DAGBoost is the best method. Together with the results for the SHD, this is an important insight for real-world applications, which often follow SF graphs, and a strong argument in favor of DAGBoost over CAM for scenarios with locally non-sparse graphs. Table 4 reveals that precision, that is, the ratio of the correctly identified edges to all identified edges, is higher for DAGBoost. However, recall, the share of the correctly identified edges among all true edges, is larger for CAM. Hence, the edges identified by DAGBoost are more reliable, while CAM misses fewer of the underlying relationships. This observation also explains the different results for the SID and SHD.
As the SID does not penalize superfluous edges, the ability of DAGBoost to identify edges more reliably is not rewarded. In summary, DAGBoost is a strong competitor to CAM, and both outperform DAGSNOTEARS. In particular, DAGBoost tends to estimate more reliable edges and outperforms CAM in complex non-additive and SF settings, which are commonly observed in real scenarios. Additionally, DAGBoost requires fewer tuning parameters. While CAM has additional tuning parameters for the preliminary neighborhood selection (PNS, see Bühlmann et al., 2014), DAGBoost only requires tuning the step size $\nu$ and the penalty parameter $\lambda$, which we found to be rather robust (see the sensitivity analysis below) and thus relatively easy to tune. Pruning the resulting graph improves the SHD between the estimated and the true graph for DAGBoost. However, this effect is much larger for CAM, as its graph before pruning contains many more edges. Thus, DAGBoost is less reliant on the hyperparameters of the pruning step than CAM.

5.3.2 Sensitivity Analysis

We investigate the performance of DAGBoost with respect to a variation of the step size $\nu$ and the penalty parameter $\lambda$. When varying $\lambda$, we set $\nu = 0.3$. When varying $\nu$, we fix $\lambda = 0.01$. We conduct our analysis with ER graphs with $p = 100$ nodes and additive SEs. The means and SDs of the SHD between the estimated and true graph are reported in Tables 5 and 6.

ν      mean(SHD)   SD(SHD)   mean(runtime in s)   SD(runtime in s)
0.3    3.33        7.26      89.34                13.20
0.5    2.40        2.20      44.74                10.36
0.7    2.37        2.29      27.61                3.94
0.9    2.34        2.09      22.41                2.97

Table 5: Mean and standard deviation (SD) of SHDs between estimated and true graph for a varying step size ν. The penalty parameter is fixed at λ = 0.01. The graphs are of ER type and the SEs are additive. The runtime statistics are based on 10 experiments.
λ       mean(SHD)   SD(SHD)   mean(runtime in s)   SD(runtime in s)
0.001   2.59        2.51      64.05                13.61
0.01    3.33        7.26      89.34                13.20
0.1     3.87        2.77      255.81               49.20

Table 6: Mean and standard deviation (SD) of SHDs between estimated and true graph for a varying penalty parameter λ. The step size is fixed at ν = 0.3. The graphs are of ER type and the SEs are additive. The runtime statistics are based on 10 experiments.

From these tables, one can see that the influence of the hyperparameters on the SHDs is minor. The AIC score controls the number of boosting steps $m_{\mathrm{stop}}$ very efficiently. Thus, it accounts well for the different base learners, which depend on the step size $\nu$ and the penalty parameter $\lambda$. An increase in the step size or a reduction in the penalty parameter leads to a smaller number of boosting iterations, which, in turn, reduces the runtime. At the same time, the quality of the estimated graph is largely unaffected. We thus recommend using DAGBoost with a large step size or a low penalty parameter if computational resources or time are limited. Although the impact of the hyperparameters is shown for one specific setting, based on our observations, these findings hold similarly across a wide range of data-generating processes.

5.3.3 Early Stopping Based on a Hold-Out Set

Our default choice for determining $m_{\mathrm{stop}}$ is the AIC. However, motivated by a referee's suggestion, we also consider an alternative early stopping strategy using a hold-out set for determining $m_{\mathrm{stop}}$. Hereby, we split the data into a training and a hold-out set of the same size and monitor the mean squared error (MSE) on the hold-out set. The algorithm stops if the MSE calculated on the hold-out set increases. All numbers presented below are based on ER graphs with additive SEs.

Small p The results of this approach are worse than early stopping with the AIC score.
For example, for $N = 50$, the mean permutation distance with early stopping via the MSE on the hold-out set is 0.73 with a standard deviation of 1.11. With the AIC score, the mean permutation distance is 0.07 with a standard deviation of 0.29. The permutation distance for the hold-out approach uses a training set of size $N = 25$, and its mean and standard deviation are close to those of early stopping with the AIC score at $N = 20$. Hence, stopping based on a hold-out set cannot compensate for the reduced training set size.

Large p In the example of ER graphs with additive SEs, we see a very high mean precision of 0.996. However, the mean recall is low at 0.261. We conclude that the hold-out set usually stops too early and leads to inferior results. Although both results could potentially be improved by tuning the stopping criterion or adapting the train-to-test-set ratio, we observe that the AIC score requires no such tuning and provides stronger results.

6 Conclusion

In this work we investigated boosting for causal discovery and for the estimation of the causal ordering. We proposed a generic score function on the orderings that depends on a regression estimator. We presented two sufficient conditions on the regression estimator such that the score function can consistently distinguish between compatible and incompatible orderings:

1. The regression estimator must consistently find the true regression function in a correctly specified scenario with homoscedastic noise.

2. The mean squared prediction error on the samples must converge to the expected $L_2$ prediction error for yet unseen observations, even under model misspecification.

Together, these conditions provide a theoretical safety net for the regression estimator, which is interesting in its own right. On the one hand, in a misspecified setting, the fit of the regression function to the observed samples gives a good estimate of the expected squared prediction error for yet unobserved realizations.
In a correctly specified setting, on the other hand, the regression estimator still identifies the underlying functional relationship. We showed that boosting with appropriate early stopping satisfies both conditions. This provides new insights into the generalization ability of boosting methods, especially in real-world scenarios where model assumptions may not hold. To use a score function on the orderings for the identification of the topological order, one needs to score every possible permutation. This is infeasible for large $p$ and insufficient prior knowledge on the causal structure. Thus, we proposed a greedy boosting algorithm in the space of functions $\mathbb{R}^p \to \mathbb{R}^p$ which correspond to a DAG. The algorithm can be understood as a functional gradient descent in the space of additive SEMs, also known as component-wise boosting. Our simulation study underlined that the score function on the permutations consistently prefers the correct causal order and that the convergence manifests already for small $N$. These findings were robust even to non-additive SEs. For small $p$, or in the case of extensive prior knowledge that drastically reduces the search space of permutations, the combination of the score and a feature selection procedure can be used to derive the causal graph. The second part of our simulation study showed that the gradient descent is highly competitive with state-of-the-art algorithms. In particular, for complex data-generating processes, the algorithm provides a noticeable benefit. Moreover, the effects of the exact choices of the hyperparameters are efficiently and automatically balanced by the number of boosting iterations and the AIC score. Thus, the procedure is easy to tune and ready to be applied to a variety of data sets. In the future, a generalization of Theorem 15 to high dimensions would be highly desirable, and the approach of Section 4 could be enhanced by a Preliminary Neighborhood Selection (PNS) step as suggested in Bühlmann et al. (2014).
However, PNS would require an additional screening algorithm with additional hyperparameters. Without PNS we would need to show, among others, Condition 1 of Proposition 5, that is, the consistency of the nonparametric $L_2$ boosting regression in high dimensions. To the best of our knowledge, such a result does not yet exist and would be of great interest in its own right. Recently, Stankewitz (2024) considered early stopping of $L_2$ boosting in high-dimensional linear models, which could be a promising starting point. Finally, many aspects of our analysis are generic. The RKHS regression could easily be replaced by other regression estimators, such as spline regression or neural networks. Thus, one could further investigate which regression estimators lead to consistent estimators for the causal order, and their empirical performance and theoretical properties could be explored. A combination of continuous optimization methods as in Zheng et al. (2018) and component-wise boosting may offer further insights into scalable and robust causal discovery.

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research within the project "Causal Net", grant no. 01IS24082, and the BMW AG.

Appendix A. Proof of Proposition 5

Proof Recall that $\varpi_\pi(k) := \{j : \pi(j) < \pi(k)\}$. Applying the definitions of $\hat S$ and $S$, and Lemma 23 below, we obtain the chain of inequalities
$$\hat S(\pi) = \sum_{k=1}^{p} \log\big(\hat\sigma^2_{k,\hat f_{k,\pi}}\big) \ge \min_{(f_1,\ldots,f_p)\in\vartheta(\pi)} \sum_{k=1}^{p} \log\big(\sigma^2_{k,p_{\theta^0},f_{k,\pi}}\big) + \sum_{k=1}^{p} \Delta_{N,k} \ge \hat S(\pi^0) + \sum_{k=1}^{p} \big(\gamma_{N,k} + \Delta_{N,k}\big) + \xi_{\pi,\pi^0},$$
where $\Delta_{N,k}$ and $\gamma_{N,k}$ are the remainder terms obtained from applying Lemma 23 to $(P_N - P)\big(X_k - \hat f_{k,\pi}(X_{\varpi_\pi(k)})\big)^2$ and to $\hat\sigma^2_{k,\hat f_{k,\pi^0}} - (\sigma^0_k)^2$, respectively, and $\xi_{\pi,\pi^0} := S(\pi) - S(\pi^0) > 0$. In the first inequality, we used that for any $k = 1, \ldots, p$,
$$P\big(X_k - \hat f_{k,\pi}(X_{\varpi_\pi(k)})\big)^2 = E_{p_{\theta^0}}\Big[\big(X_k - \hat f_{k,\pi}(X_{\varpi_\pi(k)})\big)^2\Big] \ge \min_{(f_1,\ldots,f_p)\in\vartheta(\pi)} E_{p_{\theta^0}}\Big[\big(X_k - f_k(X_{\varpi_\pi(k)})\big)^2\Big] = \min_{(f_1,\ldots,f_p)\in\vartheta(\pi)} \sigma^2_{k,p_{\theta^0},f_{k,\pi}}.$$
In the second and last inequality Lemma 23 below was used. Further, $\xi_{\pi,\pi^0} > 0$, which does not depend on $N$, by the identifiability of the model. By the assumptions and the continuous mapping theorem it holds that $P\big(|\Delta_{N,k}| \ge \xi_{\pi,\pi^0}/(2p)\big) \to 0$ and $P\big(|\gamma_{N,k}| \ge \xi_{\pi,\pi^0}/(2p)\big) \to 0$ for all $k = 1, \ldots, p$ and $N \to \infty$, from which we derive that
$$\Big|\sum_{k=1}^{p} \Delta_{N,k} + \gamma_{N,k}\Big| \le \sum_{k=1}^{p} |\Delta_{N,k}| + \sum_{k=1}^{p} |\gamma_{N,k}| < \xi_{\pi,\pi^0}$$
with probability going to 1 for $N \to \infty$.

Lemma 23 For $x > 0$ and $x + \delta > 0$ it holds that
$$\log(x + \delta) \ge \log(x) - \max\Big\{0, \frac{-\delta}{x + \delta}\Big\}.$$
Proof The statement is true for $\delta \ge 0$. For $\delta < 0$ it holds that
$$\log(x) = \log(x - |\delta| + |\delta|) \le \log(x - |\delta|) + \frac{|\delta|}{x - |\delta|} = \log(x + \delta) - \frac{\delta}{x + \delta} = \log(x + \delta) + \max\Big\{0, \frac{-\delta}{x + \delta}\Big\},$$
from which the result follows.

Appendix B. Proof of Theorem 18

The proof of Theorem 18 decomposes $(P_N - P)(Y - \hat f^{(m_{\mathrm{stop}})})^2$ into $(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}$ and $(P_N - P)(\hat f^{(m_{\mathrm{stop}})})^2$. We show both convergences using the theory of empirical processes. For this purpose, we show in Section B.1 that $\|\hat f^{(m_{\mathrm{stop}})}\|_H$ can be upper-bounded. Section B.2 then derives the convergence of $(P_N - P)Y\hat f^{(m_{\mathrm{stop}})}$ using covering numbers. The convergence of $(P_N - P)(\hat f^{(m_{\mathrm{stop}})})^2$ is then shown using Rademacher complexities.

B.1 Upper Bound on the RKHS Norm of the Regression Function Estimate

The main result of this section is Lemma 27, which upper-bounds the Hilbert space norm of the boosting estimate $\|\hat f^{(m)}\|_H$ with probability going to 1 for $N \to \infty$ and a suitably chosen number of boosting iterations $m$. The analysis is based on the fact that $\|\hat f^{(m)}\|_H^2$ can be expressed as a quadratic form.

Lemma 24 Let $S$ be the kernel regression learner with penalty parameter $\lambda$. Assume that the Gram matrix $G$ is invertible. Then it holds for the boosting estimate after $m$ boosting steps that
$$\|\hat f^{(m)}\|_H^2 = \frac{1}{N} (y^N)^T U\big(I - (I - D)^m\big)^2 \Lambda^{-1} U^T y^N,$$
where $U, D, \Lambda \in \mathbb{R}^{N \times N}$ and $D, \Lambda$ are diagonal matrices. $\Lambda$ has the eigenvalues $\hat\mu_1, \ldots
$, \hat\mu_N$ of G on its diagonal, and D has $\frac{\hat\mu_1}{\hat\mu_1+\lambda}, \ldots, \frac{\hat\mu_N}{\hat\mu_N+\lambda}$ on its diagonal. U contains the corresponding eigenvectors of G.

Proof  For the linear base learner S mapping $y^N$ to $\hat f(x^N)$ it holds that $S = G(G+\lambda I)^{-1}$. The matrix S is symmetric and has the eigenvalues $d_1 = \frac{\hat\mu_1}{\hat\mu_1+\lambda}, \ldots, d_N = \frac{\hat\mu_N}{\hat\mu_N+\lambda}$. Thus, for the orthogonal matrix U containing the eigenvectors of S and the diagonal matrix D containing the eigenvalues $d_1,\ldots,d_N$ of S it holds that $S = UDU^T$. By Equation (14), the estimate $\hat f^{(m)}$ with fitted values $B^{(m)}y^N = (I-(I-S)^m)y^N$ can be expressed as $\hat f^{(m)}(\tilde x) = \frac{1}{N}\sum_{k=1}^N \hat\beta_k K(\tilde x, \tilde x_k)$ for some $\hat\beta\in\mathbb{R}^N$. Using the results of Section 3.2 we obtain
$$
\|\hat f^{(m)}\|_H^2 = \hat\beta^T G\hat\beta = \frac{1}{N}\,\hat f^{(m)}(\tilde x^N)^T G^{-1}\hat f^{(m)}(\tilde x^N) = \frac{1}{N}\,(y^N)^T B^{(m)} G^{-1} B^{(m)} y^N.
$$
Note that G, S and $B^{(m)}$ have the same eigenvectors. Besides, $\Lambda^{-1}$ and $I-(I-D)^m$ are diagonal matrices, so they commute. Hence,
$$
B^{(m)} G^{-1} B^{(m)} = U\big(I-(I-D)^m\big)U^T\,U\Lambda^{-1}U^T\,U\big(I-(I-D)^m\big)U^T = U\big(I-(I-D)^m\big)^2\Lambda^{-1}U^T.
$$
Thus,
$$
\|\hat f^{(m)}\|_H^2 = \frac{1}{N}\,(y^N)^T B^{(m)} G^{-1} B^{(m)} y^N = \frac{1}{N}\,(y^N)^T U\big(I-(I-D)^m\big)^2\Lambda^{-1}U^T y^N.
$$

The following Hanson–Wright inequality gives probabilistic upper bounds for quadratic forms as derived in Lemma 24.

Theorem 25 (Hanson–Wright inequality; Rudelson and Vershynin, 2013, Theorem 1.1)  Consider a random vector $Z = (Z_1,\ldots,Z_N)\in\mathbb{R}^N$ with independent components, for which $E(Z_\ell) = 0$, $\ell = 1,\ldots,N$, and for which the Orlicz norm of $Z_1,\ldots,Z_N$ is uniformly bounded by $s_{max}$. For any $M\in\mathbb{R}^{N\times N}$ it holds for every $t > 0$ that
$$
P\Big(\big|\,\|MZ\|^2 - E\|MZ\|^2\,\big| > t\Big) \leq 2\exp\Big(-c\min\Big\{\frac{t^2}{s_{max}^4\|M^TM\|_F^2},\ \frac{t}{s_{max}^2\|M^TM\|}\Big\}\Big).
$$

Lemma 24 shows that the RKHS norm can be expressed as a quadratic form. In the next steps, we upper-bound this quadratic form, that is, the RKHS norm of $\hat f^{(m)}$, using Theorem 25. We can choose M in Theorem 25 as the matrix square root of $\frac{1}{N}U(I-(I-D)^m)^2\Lambda^{-1}U^T$, so that $\|MY^N\|^2 = \|\hat f^{(m)}\|_H^2$. Thus, these bounds depend on the number of boosting steps m and on D and $\Lambda$ (which are functions of G), where the latter depend probabilistically on $\tilde X^N$.
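The identity of Lemma 24 is purely linear-algebraic and can be checked numerically. The sketch below uses an arbitrary synthetic Gram matrix (not taken from the paper's experiments) and compares the two expressions for the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam, m = 8, 0.5, 25
A = rng.normal(size=(N, N))
G = A @ A.T / N + 0.1 * np.eye(N)   # synthetic symmetric, invertible Gram matrix
y = rng.normal(size=N)

S = G @ np.linalg.inv(G + lam * np.eye(N))                # base learner S = G (G + lam I)^{-1}
B = np.eye(N) - np.linalg.matrix_power(np.eye(N) - S, m)  # boosting operator after m steps
lhs = y @ B @ np.linalg.inv(G) @ B @ y / N                # (1/N) y^T B G^{-1} B y

mu, U = np.linalg.eigh(G)                                 # eigenvalues (Lambda) and eigenvectors (U) of G
d = mu / (mu + lam)
rhs = y @ (U * ((1 - (1 - d) ** m) ** 2 / mu)) @ U.T @ y / N
assert abs(lhs - rhs) < 1e-10
```

Both sides agree to numerical precision, since G, S and $B^{(m)}$ share the eigenvectors U.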
By controlling D, this allows us to vary m with N such that the growth of the upper bound for $\|\hat f^{(m)}\|_H^2$ can be controlled with high probability. Observe that Theorem 25 requires centered random variables $Z_1,\ldots,Z_N$.

Lemma 26  Decompose $Y = \mu(\tilde X) + \varepsilon(\tilde X)$, where $\mu(\tilde X) = E[Y\mid\tilde X]$ and $\varepsilon(\tilde X) = Y - E[Y\mid\tilde X]$. Assume that $\|\mu\|_\infty < \mu_{max}$ and that the Orlicz norm and the variance of the conditional distribution of $\varepsilon(\tilde X)$ given any realization $\tilde x$ of $\tilde X$ are uniformly bounded by $s_{max}$ and $\sigma^2_{max}$, respectively. Let $\hat f^{(m)}$ be the boosting estimate after m boosting steps and $\Lambda, D, U\in\mathbb{R}^{N\times N}$ as in Lemma 24. With
$$
M_m(\tilde x^N) := M_m := \frac{1}{N}\,U\big(I-(I-D)^m\big)^2\Lambda^{-1}U^T,
$$
we can upper bound
$$
P\Big(\|\hat f^{(m)}\|_H^2 > 1 + 2N(\mu_{max}^2+\sigma_{max}^2)\|M_m^{1/2}(\tilde x^N)\|_F^2 \,\Big|\, \tilde X^N = \tilde x^N\Big)
\leq 2\exp\Big(-C\min\Big\{\frac{1}{4 s_{max}^4\|M_m\|_F^2},\ \frac{1}{2 s_{max}^2\|M_m\|}\Big\}\Big).
$$

Proof  By Lemma 24 it holds that $\|\hat f^{(m)}\|_H^2 = \frac{1}{N}(y^N)^T U(I-(I-D)^m)^2\Lambda^{-1}U^T y^N = (y^N)^T M_m(\tilde x^N)\,y^N$. We emphasize that G, and thus D, U and S, are functions of $\tilde X^N$. We calculate for $\mu^N = (\mu(\tilde x_1),\ldots,\mu(\tilde x_N))^T\in\mathbb{R}^N$ and $\varepsilon^N = (\varepsilon_1(\tilde x_1),\ldots,\varepsilon_N(\tilde x_N))^T\in\mathbb{R}^N$:
$$
P\big(\|\hat f^{(m)}\|_H^2 - 2N(\mu_{max}^2+\sigma_{max}^2)\|M_m^{1/2}(\tilde X^N)\|_F^2 > 1 \mid \tilde X^N = \tilde x^N\big)
$$
$$
= P\big(\|M_m^{1/2}(\tilde X^N)Y^N\|^2 - 2N(\mu_{max}^2+\sigma_{max}^2)\|M_m^{1/2}(\tilde X^N)\|_F^2 > 1 \mid \tilde X^N = \tilde x^N\big)
$$
$$
\leq P\big(2\|M_m^{1/2}(\tilde X^N)\mu^N\|^2 + 2\|M_m^{1/2}(\tilde X^N)\varepsilon^N\|^2 - 2N(\mu_{max}^2+\sigma_{max}^2)\|M_m^{1/2}(\tilde X^N)\|_F^2 > 1 \mid \tilde X^N = \tilde x^N\big).
$$
In the third line we used the decomposition $y^N = \mu^N + \varepsilon^N$ and the inequality $\|a+b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$. In the following, we upper-bound the terms $\|M_m^{1/2}(\tilde X^N)\mu^N\|^2$ by $N\|M_m^{1/2}(\tilde X^N)\|_F^2\,\mu_{max}^2$ and $E[\|M_m^{1/2}(\tilde X^N)\varepsilon^N\|^2\mid\tilde X^N]$ by $N\|M_m^{1/2}(\tilde X^N)\|_F^2\,\sigma_{max}^2$. For the first term, observe that
$$
\|M_m^{1/2}(\tilde X^N)\mu^N\|^2 = \mathrm{tr}\big((\mu^N)^T M_m(\tilde X^N)\mu^N\big) = \mathrm{tr}\big(M_m(\tilde X^N)\mu^N(\mu^N)^T\big)
\leq \mathrm{tr}\big(M_m(\tilde X^N)\big)\,\mathrm{tr}\big(\mu^N(\mu^N)^T\big) \leq N\|M_m^{1/2}(\tilde X^N)\|_F^2\,\mu_{max}^2.
$$
For the latter term, using that $E[\varepsilon_k^2\mid\tilde X^N] \leq \sigma_{max}^2$, $k = 1,\ldots$
$, N$, it holds analogously that
$$
E\big[\|M_m^{1/2}(\tilde X^N)\varepsilon^N\|^2 \mid \tilde X^N\big] = E\big[\mathrm{tr}\big((\varepsilon^N)^T M_m(\tilde X^N)\varepsilon^N\big) \mid \tilde X^N\big]
\leq \mathrm{tr}\big(M_m(\tilde X^N)\big)\,\mathrm{tr}\big(E[\varepsilon^N(\varepsilon^N)^T \mid \tilde X^N]\big)
\leq N\|M_m^{1/2}(\tilde X^N)\|_F^2\,\sigma_{max}^2,
$$
and hence $N\|M_m^{1/2}(\tilde X^N)\|_F^2\,\sigma_{max}^2 \geq E[\|M_m^{1/2}(\tilde X^N)\varepsilon^N\|^2\mid\tilde X^N]$. Using these results and plugging in the condition $\tilde X^N = \tilde x^N$, we can upper-bound
$$
P\big(2\|M_m^{1/2}\mu^N\|^2 + 2\|M_m^{1/2}\varepsilon^N\|^2 - 2N(\mu_{max}^2+\sigma_{max}^2)\|M_m^{1/2}(\tilde X^N)\|_F^2 > 1 \mid \tilde X^N = \tilde x^N\big)
$$
$$
\leq P\Big(\|M_m^{1/2}\varepsilon^N\|^2 - E\big[\|M_m^{1/2}\varepsilon^N\|^2 \mid \tilde X^N\big] > \tfrac{1}{2} \,\Big|\, \tilde X^N = \tilde x^N\Big)
\leq 2\exp\Big(-C\min\Big\{\frac{1}{4 s_{max}^4\|M_m\|_F^2},\ \frac{1}{2 s_{max}^2\|M_m\|}\Big\}\Big)
$$
by Theorem 25 (Hanson–Wright inequality) and the fact that $\varepsilon^N \mid \tilde X^N = \tilde x^N$ is a centered, sub-Gaussian random vector with independent components and Orlicz norm bounded by $s_{max}$.

We now show that if we choose $m = m(N) = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$, then the growth of $\|\hat f^{(m)}\|_H$ with N is of lower order than $N^{1/4}$ with probability going to 1 if $\tilde x^N\in B_N$.

Lemma 27 (Upper bound on $\|\hat f^{(m_{stop})}\|_H$)  Under Assumption 13 and for $m_{stop} = m(N) = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$ there exist $\delta > 0$ and $h(N)\in o(N^{1/4-\delta})$ so that
$$
P\Big(\big\{\hat f^{(m_{stop})} \notin h(N)\mathcal{F}_N\big\} \cap \big\{\tilde X^N\in B_N\big\}\Big) \to 0. \tag{23}
$$

Proof  As outlined in relation (14), by the representer theorem it holds that $\hat f^{(m)}\in h(N)\mathcal{F}_N$ for some $h(N)\in\mathbb{R}$. Thus we need to show that $\|\hat f^{(m_{stop})}\|_H \leq h(N)$. We prove the statement by showing the convergence $P(\|\hat f^{(m_{stop})}\|_H^2 > h(N)^2 \mid \tilde X^N = \tilde x^N) \to 0$ uniformly over $\tilde x^N\in B_N$ for some $h(N)\in o(N^{1/4-\delta})$. The statement then follows by integrating out with respect to $\tilde x^N$. Applying Lemma 26, we need to prove the following two statements:
1. $N\|M_m^{1/2}\|_F^2 \in o(N^{1/2-2\delta})$. This implies $\|M_m\|_F^2 \leq \|M_m^{1/2}\|_F^4 \in o(1)$.
2. $\|M_m\| \in o(1)$.
It is emphasized that $M_m$ is a function of the random sample $\tilde x^N$. The statement is proven in Lemma 28.
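The Hanson–Wright concentration that drives Lemma 26 can be illustrated by simulation. The minimal sketch below uses standard normal components (a special case of centered sub-Gaussian noise) and an arbitrary fixed matrix M of our choosing, and shows $\|MZ\|^2$ concentrating around $E\|MZ\|^2 = \|M\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
M = rng.normal(size=(N, N)) / N          # fixed matrix with small Frobenius norm
Z = rng.normal(size=(20_000, N))         # draws of a centered vector with independent components
q = ((Z @ M.T) ** 2).sum(axis=1)         # ||M Z||^2 for each draw
expected = np.linalg.norm(M, "fro") ** 2  # E||MZ||^2 = ||M||_F^2 for standard normal Z
assert abs(q.mean() - expected) < 0.05 * expected
assert (np.abs(q - expected) > 1.0).mean() < 0.01  # large deviations are very rare
```

The deviation probabilities shrink as $\|M^TM\|_F$ and $\|M^TM\|$ shrink, which is exactly how Lemmas 27 and 28 control the RKHS norm via $M_m$.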
Lemma 28  Under Assumption 13 it holds for $M_m(\tilde x^N) := M_m := \frac{1}{N}U(I-(I-D)^m)^2\Lambda^{-1}U^T$ that there exists a $\delta > 0$ so that the following statements hold for $m = m(N) = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$, uniformly in $\tilde x^N\in B_N$, where $B_N$ is defined in Assumption 13:
1. $N^{1/2+2\delta}\|M_m^{1/2}\|_F^2 = \frac{1}{N^{1/2-2\delta}}\,\mathrm{tr}\big((I-(I-D)^m)^2\Lambda^{-1}\big) \to 0$,
2. $\|M_m\|_F^2 \to 0$, and
3. $\|M_m\| \to 0$.

Proof
1.: Uniformly in $\tilde x^N\in B_N$ it holds that
$$
N^{1/2+2\delta}\|M_m^{1/2}\|_F^2 = N^{1/2+2\delta}\,\mathrm{tr}(M_m)
= \frac{1}{N^{1/2-2\delta}}\sum_{k=1}^N \Big(1-\Big(1-\frac{\hat\mu_k}{\hat\mu_k+\lambda}\Big)^m\Big)^2\hat\mu_k^{-1}
$$
$$
\overset{(i)}{\leq} \frac{1}{N^{1/2-2\delta}}\sum_{k=1}^N \min\Big\{1,\ \frac{m^2\hat\mu_k^2}{(\hat\mu_k+\lambda)^2}\Big\}\hat\mu_k^{-1}
\leq \frac{1}{\lambda^2 N^{1/2-2\delta}}\sum_{k=1}^N \min\big\{\lambda^2\hat\mu_k^{-1},\ m^2\hat\mu_k\big\}
$$
$$
= \frac{1}{\lambda^2 N^{1/2-2\delta}}\Big(\sum_{k=1}^{K_0-1}\min\big\{\lambda^2\hat\mu_k^{-1}, m^2\hat\mu_k\big\} + \sum_{k=K_0}^N \min\big\{\lambda^2\hat\mu_k^{-1}, m^2\hat\mu_k\big\}\Big)
\leq \frac{1}{\lambda^2 N^{1/2-2\delta}}\Big(K_0\,\hat\mu_{K_0}^{-1}\lambda^2 + \sum_{k=K_0}^N m^2\hat\mu_k\Big),
$$
which holds for any $K_0\in\mathbb{N}$. Inequality (i) is shown in Lemma 30. The last inequality is due to the fact that $\hat\mu_k^{-1}$ is monotonically increasing in k. We choose $K_0 = K_0(N) = \lceil\frac{1}{2C_d+1}\ln(N)\rceil$. Uniformly in $\tilde x^N\in B_N$ it holds by Assumption 13 for a small $\delta > 0$ that
$$
\frac{K_0\,\hat\mu_{K_0}^{-1}}{N^{1/2-2\delta}} \leq \frac{K_0\exp(C_d K_0)}{N^{1/2-2\delta}} \leq \exp(C_d)\,K_0\,N^{\frac{C_d}{2C_d+1}-\frac{1}{2}+2\delta} \to 0,
$$
since $\frac{C_d}{2C_d+1} < \frac{1}{2}$. For the latter part, observe that for any $\tilde x^N\in B_N$ and by Assumption 13,
$$
\frac{1}{N^{1/2-2\delta}}\sum_{k=K_0}^N m^2\hat\mu_k
\leq \frac{m^2}{N^{1/2-2\delta}}\sum_{k=K_0}^\infty \exp(-C_u k)
\leq \frac{m^2}{N^{1/2-2\delta}}\int_{K_0}^\infty \exp(-C_u(z-1))\,dz
= \frac{m^2\exp(C_u)}{N^{1/2-2\delta}\,C_u}\exp(-C_u K_0)
$$
$$
\leq \frac{m^2\exp(2C_u)}{N^{1/2-2\delta}\,C_u}\,N^{-\frac{C_u}{2C_d+1}}
= \frac{m^2\exp(2C_u)}{C_u}\,N^{-\frac{C_u+C_d+1/2}{2C_d+1}+2\delta}.
$$
Observe that for $m = m(N) = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$, where the constants are chosen independently of $\tilde x^N$, it holds that
$$
m^2\,N^{-\frac{C_u+C_d+1/2}{2C_d+1}} = N^{\frac{C_u+C_d+1/2}{2C_d+2}-\frac{C_u+C_d+1/2}{2C_d+1}} = N^{-\xi},
\qquad \xi := \frac{C_u+C_d+1/2}{2C_d+1} - \frac{C_u+C_d+1/2}{2C_d+2} > 0.
$$
For $\delta < \frac{\xi}{2}$ it holds that $\frac{1}{N^{1/2-2\delta}}\sum_{k=K_0}^N m^2\hat\mu_k \to 0$.

2. follows by $\|M_m\|_F^2 \leq \|M_m^{1/2}\|_F^4 \to 0$.

3. follows by
$$
\|M_m\| = \frac{1}{N}\max_{k=1,\ldots,N}\Big(1-\Big(1-\frac{\hat\mu_k}{\hat\mu_k+\lambda}\Big)^m\Big)^2\hat\mu_k^{-1}
\overset{(i)}{\leq} \frac{1}{N}\max_{k=1,\ldots,N}\min\Big\{1,\ \frac{m^2\hat\mu_k^2}{(\hat\mu_k+\lambda)^2}\Big\}\hat\mu_k^{-1}
\leq \frac{1}{\lambda^2 N}\max_{k=1,\ldots,N}\min\big\{\lambda^2\hat\mu_k^{-1}, m^2\hat\mu_k\big\}
\leq \frac{m^2\hat\mu_1}{\lambda^2 N} \to 0,
$$
where inequality (i) follows again from Lemma 30 and the last inequality uses $\hat\mu_1 \leq 1$.

The following Lemma immediately follows from the proof of Lemma 28.
Lemma 29  Under Assumption 13 it holds for $m = m(N) = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$ on $B_N$ that
$$
\frac{1}{N}\sum_{\ell=1}^N \Big(1-\Big(1-\frac{\hat\mu_\ell}{\hat\mu_\ell+\lambda}\Big)^m\Big)^2 \to 0.
$$

Lemma 30  It holds for $0 \leq \hat\mu_k \leq 1$ that
$$
1-(1-\hat\mu_k)^m \leq 1 - \max\{0,\ 1-m\hat\mu_k\} = \min\{1,\ m\hat\mu_k\}.
$$
Proof  It is equivalent to show that $(1-\hat\mu_k)^m \geq \max\{0, 1-m\hat\mu_k\}$. The left- and right-hand sides are equal for $\hat\mu_k = 0$. On the interval $[0,\frac{1}{m})$ it holds that
$$
\frac{\partial}{\partial\hat\mu_k}(1-\hat\mu_k)^m = -m(1-\hat\mu_k)^{m-1} \geq -m = \frac{\partial}{\partial\hat\mu_k}\max\{0,\ 1-m\hat\mu_k\},
$$
so that $(1-\hat\mu_k)^m \geq \max\{0, 1-m\hat\mu_k\}$ for $\hat\mu_k\in[0,\frac{1}{m})$. Clearly, for $\hat\mu_k\in[\frac{1}{m},1]$, $(1-\hat\mu_k)^m \geq 0 = \max\{0, 1-m\hat\mu_k\}$.

B.2 Results on Covering Numbers

Lemma 27 has shown that for $m_{stop} = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$ it holds that $\hat f^{(m_{stop})}\in h(N)\mathcal{F}_N$ for some $h(N)\in o(N^{1/4})$ with probability going to 1 for $N\to\infty$. In this section we use covering numbers from empirical process theory to show the convergence of the inner product
$$
|(P-P_N)\,Y\hat f^{(m_{stop})}|. \tag{24}
$$
Covering numbers measure the complexity of a function class.

Definition 31  For a function class $\mathcal{F}$ and a semi-metric d on $\mathcal{F}$, the covering number $N(\varepsilon,\mathcal{F},d)$ is the minimal size of a subset $S\subseteq\mathcal{F}$ such that for every $f\in\mathcal{F}$ there is an $s\in S$ with $d(f,s) < \varepsilon$. More precisely,
$$
N(\varepsilon,\mathcal{F},d) = \min_{\{S\subseteq\mathcal{F}\,\mid\,\forall f\in\mathcal{F}\,\exists s\in S:\,d(f,s)<\varepsilon\}} |S|.
$$
In this work, we choose $d(f,g) = \|f-g\|_\infty$ and thus write $N(u,\mathcal{F},\|\cdot\|_\infty) = N(u,\mathcal{F})$. For a suitable constant $C_0$ (chosen as in Dudley's theorem, see Theorem 8.4 of van de Geer, 2014) not depending on $\mathcal{F}$, we define the entropy integral by
$$
\mathcal{J}(z,\mathcal{F}) := C_0\,z\int_0^1 \sqrt{\log N\Big(\frac{uz}{2},\mathcal{F}\Big)}\,du.
$$
For a constant $C > 0$, let $\mathcal{H} = \{Ch \mid h\in\mathcal{G}\}$ be a scaled version of some function class $\mathcal{G}$. The following remark shows that the entropy integral $\mathcal{J}(z,\mathcal{H})$ of $\mathcal{H}$ can be upper-bounded by an expression depending on C and the entropy integral $\mathcal{J}(z,\mathcal{G})$ of $\mathcal{G}$.

Remark 32  For $C > 0$ the identity $N(Cz, C\mathcal{G}, \|\cdot\|_\infty) = N(z,\mathcal{G},\|\cdot\|_\infty)$ holds, and thus $\mathcal{J}(Cz, C\mathcal{G}) \leq C\,\mathcal{J}(z,\mathcal{G})$, where $C\mathcal{G} = \{Cg \mid g\in\mathcal{G}\}$. This upper bound is not optimal (see Cucker and Smale, 2002) but sufficient for our purposes.
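A one-dimensional analogue of the scaling in Remark 32: for intervals under the absolute-value metric, scaling the set and the radius by the same constant leaves the covering number unchanged. The toy sketch below (the helper `covering_number` is ours, not from the paper) illustrates the identity $N(Cz, C\mathcal{G}) = N(z,\mathcal{G})$:

```python
import math

def covering_number(radius, eps):
    # minimal number of eps-balls (intervals of length 2*eps) covering [0, radius]
    return max(1, math.ceil(radius / (2 * eps)))

C, z = 3.0, 0.1
# scaling the set [0, 1] to [0, C] and the radius z to C*z preserves the covering number
assert covering_number(C * 1.0, C * z) == covering_number(1.0, z)
```

The same cancellation of the scale factor is what yields $\mathcal{J}(Cz, C\mathcal{G}) \leq C\,\mathcal{J}(z,\mathcal{G})$, since only the prefactor z of the entropy integral is scaled.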
Eventually, we will show in Corollary 34 that if $\hat f^{(m_{stop})}$ lies in the ball of radius $h(N)\in o(N^{1/4})$, then this growth rate $h(N)$ is slow enough to ensure the convergence of (24). We rely on the following theorem.

Theorem 33 (van de Geer, 2014, Theorem 3.2)  Let $K = \sup_{f\in\mathcal{F}}\|f\|_\infty$ and assume that Y is sub-Gaussian with Orlicz norm smaller than s. Then for t, N such that $\sqrt{t/N} \leq 1$ it holds with probability at least $1 - 8\exp(-t)$ that
$$
\sup_{f\in\mathcal{F}} |(P_N-P)\,Yf| \leq C\Big(\frac{\mathcal{J}(Ks,\mathcal{F})}{\sqrt{N}} + Ks\sqrt{\frac{t}{N}}\Big).
$$
(There is a typo in van de Geer (2014), where it says $\mathcal{J}_0(K\sigma,\mathcal{F})$ instead of $\mathcal{J}(K\sigma,\mathcal{F})$.)

Corollary 34  Let $f_1, f_2, \ldots$ be a sequence in H such that $f_N\in h(N)\mathcal{F}_N$, where $h(N)\in o(N^{1/4})$. If $\mathcal{J}(z,\mathcal{F}_N) \leq \mathcal{J}(z, B_1) < \infty$ for all $z > 0$, as in Assumption 11, then $|(P_N-P)\,Yf_N|$ converges to 0 in probability.

Proof  For $h(N)\mathcal{F}_N$ it holds that
$$
K := \sup_{f\in h(N)\mathcal{F}_N}\|f\|_\infty \leq \sup_{f\in h(N)B_1}\|f\|_\infty \leq Bh(N).
$$
Recalling the definition of $\mathcal{J}(u,\mathcal{F})$ and applying Remark 32, we obtain
$$
\mathcal{J}(Ku, h(N)\mathcal{F}_N) \leq \mathcal{J}(Bh(N)u, h(N)B_1) \leq h(N)\,\mathcal{J}(Bu, B_1) \quad \forall u\in\mathbb{R}^+.
$$
As the Orlicz norm fulfills the triangle inequality and as the Orlicz norm of the bounded random variable $\mu(\tilde X)$ is finite, the Orlicz norm of $Y = \mu(\tilde X) + \varepsilon$, denoted by s, is bounded, and Y is thus sub-Gaussian. We now apply Theorem 33 and set $t = N^{1/2}$ (the condition $\sqrt{t/N}\leq 1$ is then fulfilled for $N > 1$). It holds with probability $1 - 8\exp(-N^{1/2})$ that
$$
|(P_N-P)\,Yf_N| \leq \sup_{f\in h(N)\mathcal{F}_N}|(P_N-P)\,Yf|
\leq C\Big(\frac{\mathcal{J}(Bh(N)s, h(N)\mathcal{F}_N)}{\sqrt{N}} + \frac{Bh(N)s_{max}}{N^{1/4}}\Big)
\leq C\Big(\frac{h(N)\,\mathcal{J}(Bs, B_1)}{\sqrt{N}} + \frac{Bh(N)s_{max}}{N^{1/4}}\Big).
$$
$\mathcal{J}(Bs, B_1)$ is finite and constant in N by Assumption 11. Let $\xi > 0$ be arbitrary. As $h(N)\in o(N^{1/4})$, there exists $N_0$ so that $C\frac{h(N)\mathcal{J}(Bs,B_1)}{\sqrt{N}} < \frac{\xi}{2}$ for all $N > N_0$. For the second term, observe that as $h(N)\in o(N^{1/4})$, also $C\frac{Bh(N)s_{max}}{N^{1/4}} < \frac{\xi}{2}$ for all $N > N_1$ for some $N_1\in\mathbb{N}$. Thus for $N > \max\{N_0, N_1\}$, $|(P_N-P)\,Yf_N| \leq \xi$ with probability $1 - 8\exp(-N^{1/2}) \to 1$. This proves the convergence in probability.
B.3 Results on Rademacher Complexities

In this section we use the concept of the Rademacher complexity to show the convergence of $(P-P_N)\big(\hat f^{(m_{stop})}\big)^2$. It is again based on Lemma 27, which ensures that for $m_{stop} = N^{\frac{1}{4}\frac{C_u+C_d+1/2}{C_d+1}}$ it holds that $\hat f^{(m_{stop})}\in h(N)\mathcal{F}_N$ for some $h(N)\in o(N^{1/4-\delta})$, $\delta > 0$, with probability going to 1 for $N\to\infty$.

Definition 35 (Rademacher complexity)  Let $\mathcal{F}$ be a function class on X. Let $\sigma_1,\ldots,\sigma_N$ be i.i.d. realizations of Rademacher random variables, which are independent of X. Further, let $x_1,\ldots,x_N$ be i.i.d. realizations of X. We define the Rademacher complexity $R_N(\mathcal{F})$ by
$$
R_N(\mathcal{F}) = E\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{\ell=1}^N \sigma_\ell f(x_\ell)\Big|\Big].
$$
Note that $R_N(\mathcal{F})$ is deterministic. Given fixed observations $x_1,\ldots,x_N$, the empirical Rademacher complexity is given by
$$
\hat R_N(\mathcal{F}\mid x_1,\ldots,x_N) = \hat R_N(\mathcal{F}) = E_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\frac{1}{N}\sum_{\ell=1}^N \sigma_\ell f(x_\ell)\Big|\Big].
$$

The Rademacher complexity is a tool to upper bound $\sup_{f\in\mathcal{F}}|(P_N-P)f|$.

Theorem 36 (Wainwright, 2019, Theorem 4.10)  Let $\mathcal{F}$ be uniformly bounded with constant B. Then it holds for any $N\in\mathbb{N}$, with probability at least $1-\exp(-\frac{\varepsilon^2 N}{2B^2})$, that
$$
\sup_{f\in\mathcal{F}}|(P_N-P)f| \leq 2R_N(\mathcal{F}) + \varepsilon.
$$

The Rademacher complexity is linked to its empirical counterpart with high probability.

Theorem 37 (Bartlett and Mendelson, 2002, Theorem 11)  Let $\mathcal{F}$ be uniformly bounded by B. Then it holds with probability at least $1-2\exp(-\frac{\varepsilon^2 N}{2B^2})$ that
$$
\big|\hat R_N(\mathcal{F}) - R_N(\mathcal{F})\big| < \varepsilon.
$$

We collect some helpful relationships for the Rademacher complexity.

Theorem 38 (Bartlett and Mendelson, 2002, Theorem 12)  Let $\mathcal{F}, \mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_p$ be function classes and let $\mathcal{F}$ be uniformly bounded by B. Then
1. for $c\in\mathbb{R}$: $R_N(c\mathcal{F}) = |c|\,R_N(\mathcal{F})$,
2. $R_N\big(\sum_{j=1}^p \mathcal{F}_j\big) \leq \sum_{j=1}^p R_N(\mathcal{F}_j)$,
3. for $\mathcal{F}^2 := \{|f|^2 : f\in\mathcal{F}\}$, it follows that $R_N(\mathcal{F}^2) \leq 4B\,R_N(\mathcal{F})$.

The Rademacher complexity can be upper bounded for kernel functions.

Theorem 39 (Bartlett and Mendelson, 2002, Lemma 22)  The empirical Rademacher complexity of $\mathcal{F}_N$ is upper bounded by
$$
\hat R_N(\mathcal{F}_N) \leq \frac{2}{N}\sqrt{\sum_{\ell=1}^N K(x_\ell, x_\ell)},
$$
while the population Rademacher complexity can be upper bounded by
$$
R_N(\mathcal{F}_N) \leq \frac{2}{\sqrt{N}}\sqrt{\sum_{\ell=1}^\infty \mu_\ell}.
$$
Note that if the sequence $\mu_1, \mu_2, \ldots$ is summable, then $R_N(\mathcal{F}_N)\in O(N^{-1/2})$.
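The kernel bound of Theorem 39 can be checked by Monte Carlo. For the unit ball of an RKHS, the supremum inside the empirical Rademacher complexity has the closed form $\frac{1}{N}\sqrt{\sigma^T K\sigma}$; the sketch below (Gaussian kernel and synthetic inputs, assumptions of ours) averages this over random sign vectors and compares it with $\frac{2}{N}\sqrt{\operatorname{tr}K}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-2, 2, size=N)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # Gaussian kernel Gram matrix

# For ||f||_H <= 1: sup (1/N)|sum_l sigma_l f(x_l)| = (1/N) sqrt(sigma^T K sigma)
sig = rng.choice([-1.0, 1.0], size=(20_000, N))
R_hat = np.mean(np.sqrt(np.einsum("ij,jk,ik->i", sig, K, sig))) / N

bound = 2.0 / N * np.sqrt(np.trace(K))  # trace bound of Theorem 39
assert R_hat <= bound
```

By Jensen's inequality, $E\sqrt{\sigma^T K\sigma} \leq \sqrt{\operatorname{tr}K}$, so the Monte Carlo estimate always sits below the bound; the gap reflects the slack of the trace bound.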
For this case we connect the results above in a corollary.

Corollary 40  For any sequence $f_1, f_2, \ldots$ in H for which $f_N\in h(N)\mathcal{F}_N$, where $h(N)\in o(N^{1/4-\delta})$ for some $\delta > 0$, and $\hat R_N(\mathcal{F}_N)\in O(N^{-1/2})$, it holds that
$$
|(P_N-P)f_N^2| \overset{P}{\to} 0 \quad \text{for } N\to\infty.
$$

Proof  Let $\xi > 0$ be arbitrary and fixed. By Theorem 38 it holds that $R_N(h(N)\mathcal{F}_N) = h(N)\,R_N(\mathcal{F}_N)$, and as $h(N)\mathcal{F}_N$ is uniformly bounded by $Bh(N)$, it holds by Theorem 38 that
$$
R_N\big((h(N)\mathcal{F}_N)^2\big) \leq 4Bh(N)^2\,R_N(\mathcal{F}_N).
$$
From Theorem 36 we obtain that
$$
|(P_N-P)f_N^2| \leq \sup_{f\in h(N)\mathcal{F}_N}\big|(P_N-P)f^2\big| \leq 2R_N\big((h(N)\mathcal{F}_N)^2\big) + \frac{\xi}{3} \leq 8Bh(N)^2\,R_N(\mathcal{F}_N) + \frac{\xi}{3}
$$
with probability at least $1-\exp\big(-\frac{\xi^2 N}{18(Bh(N))^4}\big)$, which converges to 1 as $h(N)\in o(N^{1/4})$ for $N\to\infty$. Similarly, as $\mathcal{F}_N$ is uniformly bounded by B, setting $\varepsilon = N^{-1/2+2\delta}$ in Theorem 37, we observe that $|\hat R_N(\mathcal{F}_N) - R_N(\mathcal{F}_N)| \leq N^{-1/2+2\delta}$ with probability going to 1 for any $\delta > 0$. We conclude that
$$
|(P_N-P)f_N^2| \leq 8Bh(N)^2\big|R_N(\mathcal{F}_N)-\hat R_N(\mathcal{F}_N)\big| + 8Bh(N)^2\hat R_N(\mathcal{F}_N) + \frac{\xi}{3}
\leq 8Bh(N)^2 N^{-1/2+2\delta} + 8Bh(N)^2\hat R_N(\mathcal{F}_N) + \frac{\xi}{3}
$$
with probability going to 1 for $N\to\infty$. As $h(N)\in o(N^{1/4})$, it holds that $8Bh(N)^2\hat R_N(\mathcal{F}_N) < \frac{\xi}{3}$ by Assumption 13 and Theorem 39 for N chosen large enough. Similarly, $8Bh(N)^2 N^{-1/2+2\delta} < \frac{\xi}{3}$ for N chosen large enough, as $h(N)\in o(N^{1/4-\delta})$. This proves the statement.

Appendix C. Proof of Theorem 19

Proof [Proof of Lemma 20]  Using
$$
\|f_0-\hat f^{(m)}\|_{2,N}^2 = \frac{1}{N}\big\|f_0(\tilde x^N) - B^{(m)}y^N\big\|_2^2
= \frac{1}{N}\big\|f_0(\tilde x^N) - U\big(I-(I-D)^m\big)U^T y^N\big\|_2^2
$$
$$
= \frac{1}{N}\big\|f_0(\tilde x^N) - U\big(I-(I-D)^m\big)U^T\big(f_0(\tilde x^N)+\varepsilon^N\big)\big\|_2^2
\leq \underbrace{\frac{2}{N}\big\|U(I-D)^m U^T f_0(\tilde x^N)\big\|_2^2}_{=:I} + \underbrace{\frac{2}{N}\big\|U\big(I-(I-D)^m\big)U^T\varepsilon^N\big\|_2^2}_{=:II},
$$
where the inequality holds due to $\|a+b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$, we show the convergence of I and II to 0 in probability for $N\to\infty$. Recall that $S = UDU^T$, with U containing the orthonormal eigenvectors of S and D being a diagonal matrix with diagonal entries $D_{\ell\ell} = d_\ell = \frac{\hat\mu_\ell}{\hat\mu_\ell+\lambda}$, where $\hat\mu_\ell$, $\ell = 1,\ldots,N$, are the eigenvalues of G.
Convergence of I: It holds that
$$
\frac{2}{N}\big\|U(I-D)^m U^T f_0(\tilde x^N)\big\|_2^2 = \frac{2}{N}\,f_0(\tilde x^N)^T U(I-D)^m U^T U(I-D)^m U^T f_0(\tilde x^N)
= \frac{2}{N}\sum_{\ell=1}^N (1-d_\ell)^{2m}\big(U^T f_0(\tilde x^N)\big)_\ell^2.
$$
As G has full rank, there exists $\beta\in\mathbb{R}^N$ such that $f_0(\tilde x^N) = NG\beta$. Define $\tilde f := \frac{1}{N}\sum_{\ell=1}^N \beta_\ell K(\cdot,\tilde x_\ell)$, for which $\tilde f(\tilde x^N) = f_0(\tilde x^N)$ holds. By the representer theorem it further holds that $\beta^T G\beta = \|\tilde f\|_H^2 \leq \|f_0\|_H^2 = R^2$. Let $D_{\hat\mu}\in\mathbb{R}^{N\times N}$ be the diagonal matrix with diagonal entries $\hat\mu_1,\ldots,\hat\mu_N$. Using $(1-x)^{2m} \leq \exp(-2mx) \leq \frac{1}{2emx}$ for all $x > 0$, where e is Euler's number (first inequality), and $\frac{1}{d_\ell} = \frac{\hat\mu_\ell+\lambda}{\hat\mu_\ell} \leq \frac{1+\lambda}{\hat\mu_\ell}$, as $0 < \hat\mu_\ell \leq 1$ (second inequality), we obtain
$$
\sum_{\ell=1}^N (1-d_\ell)^{2m}\big(U^T G\beta\big)_\ell^2
\leq \sum_{\ell=1}^N \frac{1}{2em\,d_\ell}\big(U^T G\beta\big)_\ell^2
\leq \frac{1+\lambda}{2em}\sum_{\ell=1}^N \hat\mu_\ell^{-1}\big(U^T G\beta\big)_\ell^2
= \frac{1+\lambda}{2em}\,\beta^T G\underbrace{U D_{\hat\mu}^{-1} U^T}_{=G^{-1}}G\beta
= \frac{1+\lambda}{2em}\,\|\tilde f\|_H^2 \leq \frac{(1+\lambda)R^2}{2em},
$$
which goes to 0 as $m(N)\to\infty$ for $N\to\infty$; hence $I\to 0$.

Convergence of II: It holds that
$$
P\Big(\frac{2}{N}\big\|U\big(I-(I-D)^m\big)U^T\varepsilon^N\big\|_2^2 > \xi\Big)
\leq P\Big(\Big\{\frac{2}{N}\big\|U\big(I-(I-D)^m\big)U^T\varepsilon^N\big\|_2^2 > \xi\Big\}\cap\big\{\tilde X^N\in B_N\big\}\Big) + P\big(\tilde X^N\notin B_N\big).
$$
The latter term goes again to 0 by Assumption 13. For any $\tilde x^N\in B_N$ we show that
$$
P\Big(\frac{2}{N}\big\|U\big(I-(I-D)^m\big)U^T\varepsilon^N\big\|_2^2 > \xi \,\Big|\, \tilde X^N = \tilde x^N\Big) \to 0
$$
for any $\xi > 0$ and uniformly in $\tilde x^N$. Recall that $\varepsilon^N \mid \tilde X^N = \tilde x^N$ is a sub-Gaussian vector with mean 0 and independent components. Thus, we can apply the Hanson–Wright inequality of Theorem 25. Remember that D is a function of $\tilde x^N$. For
$$
A_m = \frac{1}{\sqrt{N}}\,U\big(I-(I-D)^m\big)U^T
$$
we need to show, for any $\tilde x^N\in B_N$, that $E[\|A_m\varepsilon^N\|_2^2 \mid \tilde X^N = \tilde x^N] \to 0$, $\|A_m^T A_m\|_F \to 0$ and $\|A_m^T A_m\| \to 0$. It holds, uniformly for $\tilde x^N\in B_N$,
$$
E\big[\|A_m\varepsilon^N\|_2^2 \mid \tilde X^N = \tilde x^N\big]
= \frac{1}{N}\,E\big[\mathrm{tr}\big((\varepsilon^N)^T U(I-(I-D)^m)^2 U^T\varepsilon^N\big) \mid \tilde X^N = \tilde x^N\big]
\leq \frac{\sigma_{max}^2}{N}\,\mathrm{tr}\big((I-(I-D)^m)^2\big)
= \frac{\sigma_{max}^2}{N}\sum_{\ell=1}^N \big(1-(1-d_\ell)^m\big)^2 \to 0
$$
for $N\to\infty$ by Lemma 29. Further,
$$
\|A_m^T A_m\|_F^2 = \frac{1}{N^2}\,\mathrm{tr}\big((I-(I-D)^m)^4\big) \leq \Big(\frac{1}{N}\,\mathrm{tr}\big((I-(I-D)^m)^2\big)\Big)^2,
$$
which goes to 0 with the same calculation as above. Finally, $\|A_m^T A_m\| = \frac{1}{N}\max_\ell\big(1-(1-d_\ell)^m\big)^2 \leq \frac{1}{N} \to 0$. The statement follows by integrating out over $B_N$ with respect to $\tilde x^N$.

Appendix D.
Algorithm

Data: $x^N = (x_1,\ldots,x_N)\in\mathbb{R}^{N\times p}$
Input: kernels $K_1,\ldots,K_p$ implying RKHSs $H_1,\ldots,H_p$, penalty $\lambda$, step size $\nu$
Output: $\hat G$, $\hat F$

$F^{(1)} \leftarrow 0$; $\hat G \leftarrow \emptyset$; $\mathcal{N} \leftarrow \emptyset$
$\hat f_{kj} \leftarrow \arg\min_{g_{kj}\in H_j}\sum_{\ell=1}^N \big(g_{kj}(x_{\ell j}) - x_{\ell k}\big)^2 + \lambda\|g_{kj}\|_{H_j}^2$, for $j,k = 1,\ldots,p$, $j\neq k$
$S(j,k) \leftarrow \log\Big(\sum_{\ell=1}^N \big(\hat f_{kj}(x_{\ell j}) - x_{\ell k}\big)^2\Big)$
for $m \leftarrow 2$ to $m_{stop}$ do
    // Find the next edge and update the graph
    $(j_0,k_0) \leftarrow \arg\min_{(j,k)\notin\mathcal{N}} S(j,k,F^{(m-1)})$; $\hat G \leftarrow \hat G + (j_0,k_0)$
    $f^{(m)}_{k_0} \leftarrow f^{(m-1)}_{k_0} + \nu\,\hat f_{k_0 j_0}$
    $f^{(m)}_k \leftarrow f^{(m-1)}_k$, for $k = 1,\ldots,k_0-1,k_0+1,\ldots,p$
    $F^{(m)} \leftarrow (f^{(m)}_1,\ldots,f^{(m)}_p)$
    // Check if the AIC score has increased
    if AIC($F^{(m)}, x^N$) > AIC($F^{(m-1)}, x^N$) then break
    // Determine edges that would cause a cycle and ensure they are not chosen anymore
    $\mathcal{N} \leftarrow$ getForbiddenEdges($\hat G$)
    // Update S
    $\hat f_{k_0 j} \leftarrow$ Equation (21), $j = 1,\ldots,p$
    $S(j,k_0,F^{(m)}) \leftarrow$ Equation (20), $j = 1,\ldots,p$
end
return $\hat G$, $F^{(m)}$

References

Raj Agrawal, Chandler Squires, Neha Prasad, and Caroline Uhler. The DeCAMFounder: Nonlinear causal discovery in the presence of hidden variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(5):1639–1658, 2023. ISSN 1369-7412. doi: 10.1093/jrsssb/qkad071. URL https://doi.org/10.1093/jrsssb/qkad071.

Sara Aibar, Carmen Bravo González-Blas, Thomas Moerman, Vân Anh Huynh-Thu, Hana Imrichova, Gert Hulselmans, Florian Rambow, Jean-Christophe Marine, Pierre Geurts, Jan Aerts, et al. SCENIC: Single-cell regulatory network inference and clustering. Nature Methods, 14(11):1083–1086, 2017.

Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. The Journal of Machine Learning Research, 3(Jul):1–48, 2002.

Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999. doi: 10.1126/science.286.5439.509.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Rohit Bhattacharya, Tushar Nagarajan, Daniel Malinsky, and Ilya Shpitser. Differentiable causal discovery under unmeasured confounding. In Arindam Banerjee and Kenji Fukumizu, editors, International Conference on Artificial Intelligence and Statistics, volume 130, pages 2314–2322. Proceedings of Machine Learning Research, 2021.

Peter Bühlmann and Torsten Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505, 2007. doi: 10.1214/07-STS242.

Peter Bühlmann and Bin Yu. Boosting with the L2 loss. Journal of the American Statistical Association, 98(462):324–339, 2003. doi: 10.1198/016214503000125. URL https://doi.org/10.1198/016214503000125.

Peter Bühlmann, Jonas Peters, and Jan Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014. doi: 10.1214/14-AOS1260.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae Debrecen, 6:290, 1959.

Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltán N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407(6804):651–654, 2000.

Marcus Kaiser and Maksim Sipos. Unsuitability of NOTEARS for causal graph discovery when dealing with dimensional quantities. Neural Processing Letters, 54(3):1587–1595, 2022.

Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, and Michèle Sebag. Structural agnostic modeling: Adversarial learning of causal graphs. The Journal of Machine Learning Research, 23(219):1–62, 2022.

Markus Kalisch and Peter Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. The Journal of Machine Learning Research, 8(3):613–636, 2007.

Kirthevasan Kandasamy and Yaoliang Yu.
Additive approximations in high dimensional nonparametric regression via the SALSA. In International Conference on Machine Learning, pages 69–78. Proceedings of Machine Learning Research, 2016.

Maximilian Kertel, Stefan Harmeling, Markus Pauly, and Nadja Klein. Learning causal graphs in manufacturing domains using structural equation models. International Journal of Semantic Computing, 17(04):511–528, 2023. doi: 10.1142/S1793351X23630023.

Trent Kyono, Yao Zhang, and Mihaela van der Schaar. CASTLE: Regularization via auxiliary causal graph discovery. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1501–1512. Curran Associates, Inc., 2020.

Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural DAG learning. In International Conference on Learning Representations, 2019.

Ha Quang Minh. Some properties of Gaussian reproducing kernel Hilbert spaces and their implications for function approximation and learning theory. Constructive Approximation, 32:307–338, 2010.

Ignavier Ng, Sébastien Lachapelle, Nan Rosemary Ke, Simon Lacoste-Julien, and Kun Zhang. On the convergence of continuous constrained optimization for structure learning. In International Conference on Artificial Intelligence and Statistics, pages 8176–8198. Proceedings of Machine Learning Research, 2022a.

Ignavier Ng, Shengyu Zhu, Zhuangyan Fang, Haoyang Li, Zhitang Chen, and Jun Wang. Masked gradient-based causal structure learning. In International Conference on Data Mining (SDM), pages 424–432. SIAM, 2022b.

Christopher Nowzohour and Peter Bühlmann. Score-based causal learning in additive noise models. Statistics, 50(3):471–485, 2016. doi: 10.1080/02331888.2015.1060237. URL https://doi.org/10.1080/02331888.2015.1060237.

Jonas Peters and Peter Bühlmann. Structural intervention distance for evaluating causal graphs.
Neural Computation, 27(3):771–799, March 2015. ISSN 0899-7667. doi: 10.1162/NECO_a_00708. URL https://doi.org/10.1162/NECO_a_00708.

Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014. ISSN 1532-4435.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.

Garvesh Raskutti and Caroline Uhler. Learning directed acyclic graph models based on sparsest permutations. Stat, 7(1):e183, 2018. doi: 10.1002/sta4.183. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.183.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: An optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.

Alexander G. Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34, 2021.

Mark Rudelson and Roman Vershynin. Hanson–Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18:1–9, 2013. doi: 10.1214/ECP.v18-2865.

Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012. ISBN 0262017180.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759.

Rajen Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2018.

Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, and Kenneth Bollen.
DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. The Journal of Machine Learning Research, 12:1225–1248, 2011.

Ali Shojaie and George Michailidis. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97(3):519–538, 2010.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search, volume 81. MIT Press, 1993. ISBN 978-1-4612-7650-0. doi: 10.7551/mitpress/1754.001.0001.

Bernhard Stankewitz. Early stopping for L2-boosting in high-dimensional linear models. Annals of Statistics, 52(2):491–518, 2024. doi: 10.1214/24-AOS2356.

Hongwei Sun. Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337–349, 2005. ISSN 0885-064X.

Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Conference on Uncertainty in Artificial Intelligence, pages 584–590, Arlington, Virginia, USA, 2005. AUAI Press. ISBN 0974903914.

Gerhard Tutz and Harald Binder. Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics, 62(4):961–971, 2006.

Sara van de Geer. On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics, 8(1):543–574, 2014.

Matthew J. Vowels, Necati Cihan Camgoz, and Richard Bowden. D'ya like DAGs? A survey on structure learning and causal discovery. ACM Computing Surveys, 55(4), 2022.

Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990. doi: 10.1137/1.9781611970128.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
Anja Wille, Philip Zimmermann, Eva Vranová, Andreas Fürholz, Oliver Laule, Stefan Bleuler, Lars Hennig, Amela Prelić, Peter von Rohr, Lothar Thiele, et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):1–13, 2004.

Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7154–7163. PMLR, 2019.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Conference on Uncertainty in Artificial Intelligence, pages 804–813, Arlington, Virginia, USA, 2011. AUAI Press. ISBN 9780974903972.

Xun Zheng, Bryon Aragam, Pradeep Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 32, 2018.

Xun Zheng, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric Xing. Learning sparse nonparametric DAGs. In International Conference on Artificial Intelligence and Statistics, pages 3414–3425. Proceedings of Machine Learning Research, 2020.

Bruno Peter Zwahlen. Über die Eigenwerte der Summe zweier selbstadjungierter Operatoren. Commentarii Mathematici Helvetici, 40:81–116, 1965/66.