# Sample Complexity of Interventional Causal Representation Learning

Emre Acartürk (Rensselaer Polytechnic Institute), Burak Varıcı (Carnegie Mellon University), Karthikeyan Shanmugam (Google DeepMind), Ali Tajer (Rensselaer Polytechnic Institute)

Work was done while BV was a Ph.D. student at Rensselaer Polytechnic Institute.

Abstract: Consider a data-generation process that transforms low-dimensional latent causally related variables to high-dimensional observed variables. Causal representation learning (CRL) is the process of using the observed data to recover the latent causal variables and the causal structure among them. Despite the multitude of identifiability results under various interventional CRL settings, the existing guarantees apply exclusively to the infinite-sample regime (i.e., infinite observed samples). This paper establishes the first sample-complexity analysis for the finite-sample regime, in which the interactions between the number of observed samples and probabilistic guarantees on recovering the latent variables and structure are established. This paper focuses on general latent causal models, stochastic soft interventions, and a linear transformation from the latent to the observation space. The identifiability results ensure graph recovery up to ancestors and latent variables recovery up to mixing with parent variables. Specifically, O((log(1/δ))^4) samples suffice for latent graph recovery up to ancestors with probability 1 − δ, and O((1/δ)^4) samples suffice for latent causal variables recovery that is ϵ-close to the identifiability class with probability 1 − δ.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

1 Introduction

The observed data generated in a wide range of technological, social, and biological domains has high-dimensional, complex, and often unexplainable structures. Nevertheless, such complex structures can often be explained by latent generating factors that form lower-dimensional structures. The field of causal representation learning (CRL) is motivated by this premise and focuses on disentangling the causally related latent generating factors that underlie a given high-dimensional observable dataset. In particular, given a high-dimensional dataset, the objective of CRL is to learn (i) the underlying latent causal variables and (ii) the causal relationships among the latent variables. The learned representations then provide an explainable structure for the observed data and facilitate informed reasoning for downstream tasks [1].

Formally, CRL consists of a data generation and data transformation pipeline as follows. There exists a set of high-level latent variables Z ∈ R^n that are causally related. The causal interactions are captured by a directed acyclic graph (DAG) G. The latent variables Z go through an unknown transformation g and generate the observed variables X ∈ R^d, i.e., X = g(Z). The objective of CRL is to use the observed variables X to learn the latent graph G and the latent variables Z. The primary question facing CRL is identifiability, which refers to establishing the conditions under which recovering Z and G is possible. Identifiability is known to be impossible without proper inductive biases or additional supervision [2, 3]. One approach to address this issue is performing interventions, leading to interventional CRL, which has gained significant recent attention [4, 5, 6, 7, 8, 9, 10].
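To make this pipeline concrete, the following minimal sketch (not the authors' code; all names and parameter choices are illustrative) simulates the setting studied in the remainder of the paper, assuming only numpy: a linear-Gaussian latent causal model over a random DAG, single-node soft interventions, and a linear mixing X = G Z.

```python
# Minimal simulation sketch (not the authors' code): a linear-Gaussian latent SCM
# over a random DAG, single-node soft interventions, and linear mixing X = G Z.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 5, 15, 1000                      # latent dim, observed dim, samples per environment

# DAG with (1, ..., n) as a valid causal order: strictly lower-triangular edge weights.
A = np.tril(rng.uniform(0.5, 1.5, (n, n)) * rng.choice([-1.0, 1.0], (n, n)), k=-1)
A *= rng.random((n, n)) < 0.5              # keep each edge with probability 0.5

G = rng.standard_normal((d, n))            # full-column-rank mixing (with probability 1)

def sample_latent(num, intervened=None, scale=2.0):
    """Ancestral sampling of Z; a soft intervention rescales the noise of one node."""
    noise_std = np.ones(n)
    if intervened is not None:
        noise_std[intervened] = scale      # soft intervention: alter that node's causal mechanism
    Z = np.zeros((num, n))
    for i in range(n):                     # nodes are visited in a valid causal order
        Z[:, i] = Z @ A[i] + noise_std[i] * rng.standard_normal(num)
    return Z

# Observational environment E0 and single-node interventional environments E1, ..., En.
envs_Z = [sample_latent(N)] + [sample_latent(N, intervened=m) for m in range(n)]
envs_X = [Z @ G.T for Z in envs_Z]         # X = G Z, applied sample-wise
```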
Interventions on the latent causal variables create additional statistical diversity in the observed data, and access to enough interventional environments enables identifiability [1]. Despite the recent advances in establishing identifiability guarantees for interventional CRL, all the existing guarantees focus on the asymptotic infinite-sample regime, where one assumes access to an infinite number of observable samples X. In this paper, we establish conditions for finite-sample, non-asymptotic identifiability and performance guarantees for interventional CRL. We focus on CRL under general (non-parametric) latent causal models, soft interventions, and linear transformations from the latent to the observation space. This setting is well-studied in the infinite-sample regime [4, 5, 7, 11], where the existing identifiability results show that the causal graph can be recovered up to ancestral nodes and the latent variables can be learned up to mixing with their ancestors. In this paper, we provide the probabilistic finite-sample counterparts of these identifiability guarantees and establish the first sample complexity analysis for interventional CRL in the finite-sample regime. Our CRL approach falls in the category of score-based CRL [5], based on which our sample complexity analysis consists of two steps. In the first step, we delineate a general sample complexity for any desired consistent score estimator. Subsequently, we specialize these general results by adopting the reproducing kernel Hilbert space (RKHS)-based score estimator [12]. We establish the following identifiability guarantees for any desired pair of constants ϵ, δ ∈ R_+.

Latent graph recovery: Using the RKHS-based score estimator, O((log(1/δ))^4) samples suffice to recover the transitive closure of the latent graph with probability at least 1 − δ.

Latent variables recovery: Using the same score estimator, O((1/δ)^4) samples suffice to ensure that the mean squared error in the estimated latent causal variables is at most ϵ^2 with probability at least 1 − δ.

Dependence on model dimension: We further establish the explicit and implicit dependence of the sample complexity on the dimensions of the latent and observable spaces, n and d, respectively. The precise characterizations of these expressions involve additional model-dependent constants specified in Section 5.

Improved guarantees: Finally, our latent variables recovery result shows recovery up to mixing with parents, which improves upon the results in the existing literature that guarantee recovery only up to mixing with ancestors.

Methodology. We offer novel finite-sample CRL algorithms and characterize the sample complexity guarantees achievable by these algorithms. We design our algorithms using the properties of score functions (i.e., the gradient of the log density). This approach is inspired by the score-based CRL framework [5, 8]. However, the algorithms are significantly different, led by the need to integrate finite-sample score estimation routines and their imperfections. In our analysis, we first provide sample complexity upper bounds for a score-based framework that uses any desired consistent score difference estimator. Then, we adopt a specific score estimator [12] and provide explicit sample complexity guarantees.

Related work. All the existing studies on interventional CRL focus on the infinite-sample regime. We briefly review the studies that adopt a similar CRL model, i.e., linear transformations and single-node soft interventions.
In this setting, complete identifiability results for linear latent models are established in [7]. Similar results are shown for nonlinear latent causal models in [4, 11] and for linear non-Gaussian models in [13]. In the most closely related setting to ours, identifiability up to ancestors is shown to be possible using soft interventions on general latent causal models [8]. Other studies on interventional CRL include using do interventions for polynomial transformations [6, 14] and hard interventions for general transformations [10, 8]. On a partially related problem, error rates for using score functions on observed causal discovery are provided in [15]. Notations. For n N, we define [n] {1, . . . , n}. Vectors are represented by lower case bold letters, and element i of vector v is denoted by vi. Matrices are represented by upper case bold letters, and we denote row i and column j of matrix A by Ai,: and A:,j, respectively. The row permutation matrix for permutation π of [n] is denoted by Pπ. Random variables and their realizations are presented by upper and lower case letters, respectively. We denote the Moore-Penrose pseudoinverse of a matrix A by A . Given a symmetric matrix A Rd d, we denote the vector of eigenvalues of A ordered in ascending order by λ(A) Rd and the matrix of eigenvectors by Q(A) Rd d such that A = Q(A) diag(λ(A)) Q(A) . For any matrix A, we denote the rank, column, and null spaces of A by rank(A), col(A), and null(A), respectively. 2 Finite-sample data-generation process The existing studies on analyzing the identifiability guarantees of CRL inevitably necessitate the availability of an infinite number of samples as otherwise, perfect identifiability is rendered impossible. In this paper, our objective is identifiability analysis assuming only a finite number of samples are available a significant departure from the infinite-sample regime. In this section, we specify a general data-generating process consisting of a latent causal space and an unknown linear transformation that maps the latent variables to observed variables. Latent causal model. We have a latent space of n causally related random variables. We denote the latent causal variables by Z [Z1, . . . , Zn] . The causal relationships among the variables of Z are represented by a directed acyclic graph G with n nodes where the i-th node represents Zi. We denote the parents, children, and ancestors of a node i [n] in G by pa(i), ch(i), and an(i), respectively. We denote the probability density function (pdf) of Z by p Z, which is assumed to be well-defined without any zeros over lower-dimensional manifolds and to have full support on Rn. Given the DAG G, p Z factorizes according to p Z(z) = Y i [n] pi(zi | zpa(i)) . (1) We assume that the conditional pdfs {pi : i [n]} are continuously differentiable with respect to z. We call a permutation π = (π1, . . . , πn) of [n] a valid causal order if for all i, j [n], i pa(j) implies πi < πj. Without loss of generality, we assume that (1, . . . , n) is a valid causal order. Transformation model. The latent variables Z are mapped to the observed variables X [X1, . . . , Xd] through an unknown linear transformation G Rd n, where d n. Specifically, X = G Z , (2) where G has full column rank. We denote the pdf of X by p X. Note that owing to the generation process of X, the pdf p X is supported on an n-dimensional subspace embedded in Rd. We denote the L2 norm of any function f with finite variance under p X by f 2 p X Ep X f(x) 2 2. Intervention model. 
We consider CRL under interventions, and in particular focus on soft interventions, as the most general form of interventions. Hence, in addition to the observational model specified by (2), we consider n single-node interventional environments {Em : m [n]}. We assume that the node intervened in the environment Em is unknown and denote it by Im. Leaving one node unintervened renders identifiability impossible [7]. Hence, we inevitably have {Im : m [n]} = [n]. Applying a soft intervention on node i changes its causal mechanism specified by the observational conditional pdf pi(zi | zpa(i)) to a distinct interventional conditional pdf qi(zi | zpa(i)). The conditional pdfs {qi : i [n]} are assumed to be continuously differentiable with respect to z. Subsequently, the pdf of Z in the interventional environment Em, denoted by pm, factorizes according to pm Z (z) = qℓ(zℓ| zpa(ℓ)) Y i =ℓ pi(zi | zpa(i)) , where ℓ= Im . (3) To distinguish the observational and interventional data, we denote the latent and observed random variables in environment Em by Zm and Xm, respectively. Interventions do not alter the transformation matrix G, indicating that similarly to (2), we have Xm = G Zm. Throughout the paper, we adopt the convention that E0 refers to the observational environment. Score functions. The score function associated with a pdf is defined as the gradient of its logarithm. We denote the score functions associated with the observational distributions of Z and X by s Z and s X, respectively, i.e., s Z(z) z log p Z(z) , and s X(x) x log p X(x) . (4) Similarly, we denote the score functions associated with the distributions of Z and X in environment Em for m [n] by sm Z and sm X, respectively, i.e., sm Z (z) z log pm Z (z) , and sm X(x) x log pm X(x) . (5) In our algorithm and analysis, we will analyze the score variations across different interventions. For this purpose, we define the score difference functions of Z and X between environments Em and E0 as dm Z (z) sm Z (z) s Z(z) , and dm X(x) sm X(x) s X(x) , m [n] . (6) To ensure that interventions affect the target node and its parents distinctly, we adopt the following assumption. This assumption holds for a large class of models, e.g., additive noise models under stochastic hard interventions [5, Lemma 2]. Assumption 1 ([5, Assumption 1]). For any m [n], and for all k pa(Im), the term [dm Z ]k/[dm Z ]Im is not a constant function of z. Finite sample data. We assume that we have only N data samples under each of the environments. We denote the N samples of Zm and Xm by {zm j : j [N]} , and {xm j : j [N]} , m {0} [n] . (7) Accordingly, for each sample j, we concatenate all latent and observational samples under different environments to create sample matrices Zj [z0 j , . . . , zn j ] , and Xj [x0 j, . . . , xn j ] , j [N] . (8) Finally, we define the set of samples XN {Xj : j [N]} and ZN {Zj : j [N]}. 3 Finite-sample identifiability objectives The CRL framework specified in Section 2 has two objectives: use the N observational samples XN = {Xj : j [N]} to recover the causal graph G and the latent causal variables ZN = {Zj : j [N]}. In this section, we formalize the inference rules and their associated fidelity metrics for recovering G and ZN. Latent graph recovery. Define G as the set of all DAGs on n nodes. We define ˆG as a generic estimator of the latent graph G, i.e., ˆG : Rd (n+1) N G . 
(9) To quantify the accuracy of the estimate ˆG(XN), we provide the following probably approximately correct (PAC) guarantee, which is the probabilistic counterpart of the standard identifiability up to ancestor definition in the interventional CRL literature [5, 7, 11]. Definition 1 (δ-PAC graph recovery). Graph estimate ˆG(XN) achieves δ PAC latent graph recovery if, with probability at least 1 δ, the transitive closures of ˆG(XN) and G are isomorphic. Latent variables recovery. We investigate a two-step estimator for the latent variables ZN. First, we define a generic estimator for the pseudoinverse of G as H: Rd (n+1) N Rn d . (10) Then, given an estimate H(XN) for G , we estimate {Zj : j [N]} according to ˆZj = H(XN) Xj , j [N] . (11) We provide the following definition to quantify the fidelity of these estimates with respect to the ground truth variables. Definition 2 ((ϵ, δ) PAC variables recovery). The estimate H(XN) achieves (ϵ, δ) PAC latent variables recovery if the estimated causal variables { ˆZj : j [N]} satisfy ˆZj = PI (Cpa + Cerr) Zj , (12) where for all i pa(j), Cpa satisfies (Cpa)i,j = 0, and, with probability at least 1 δ, we have Cerr 2 ϵ. We note that Definition 2 is the probabilistic counterpart of the standard identifiability up to ancestor definition in interventional CRL literature. Noisy score model. The score-based CRL framework uses properties of score function differences to build the estimators for the latent graph and variables. Therefore, when we only have access to finite data, we need to estimate the score functions. We denote a generic score function estimator for environment Em by ˆsm X(x; XN): Rd Rd (n+1) N Rd . (13) When the dependence is clear from the context, we will drop the explicit dependence of ˆsm X on XN. Similarly to (6), we define the estimated score difference functions ˆd m X(x; XN) ˆsm X(x; XN) ˆs X(x; XN) , m [n] . (14) In this paper, we focus on consistent score difference estimators {ˆd m X : m [n]}, i.e., they satisfy convergence in probability. Specifically, for any ϵ, δ > 0, there exists N(ϵ, δ) N such that P max m [n] ˆd m X( ; XN) dm X p X ϵ 1 δ , N N(ϵ, δ) . (15) Notably, we note that many score estimators are known to be consistent [12, 16, 17, 18]. 4 Finite-sample CRL algorithms In this section, we design finite-sample interventional CRL algorithms through which we establish finite-sample identifiability guarantees as well as bounds on the associated sample complexities. Our algorithms fall within the score-based category of CRL algorithms [5]. The main intuition of this framework is that score functions in the observed space contain all the information needed for recovering the latent graph G and the inverse transform G . Specifically, two metrics are pivotal for retrieving the latent space: 1. Score differences: As shown in Lemma 1, the nonzero entries of the latent score differences dm Z encode the graph structure and the observed score differences dm X are generated using the inverse transform G . 2. Score difference correlations: As shown in Lemma 2, the column space of the correlation matrix of the score differences contains crucial information about the latent graph and the inverse transform, which we will leverage to form estimates for the latent graph G and the inverse transform G . To proceed, for m [n] we denote the correlation matrices of dm X and ˆd m X by Rm X Ep X dm X(x) (dm X(x)) , and ˆRm X Ep X ˆd m X(x) (ˆd m X(x)) . 
(16) Next, we formalize two critical properties of score differences and their correlation matrix. Specifically, the following lemma states that the sparsity pattern of the latent score differences dm Z exposes the structure of the latent graph, and the observed score differences dm X preserve this information through the pseudoinverse of G. Lemma 1 ([5]). Score function differences in the latent space satisfy, for all m [n], z Rn : [dm Z (z)]i = 0 i pa(Im) . (17) Furthermore, the score differences in the observed domain are given by dm X(x) = (G ) dm Z (z) , x = G z . (18) The next lemma specifies that the structure of the column space of correlation matrix Rm X is heavily constrained by the graph structure and the inverse transformation G . Lemma 2. For any m [n], we have col(Rm X) span (G ) :,i : i pa(Im) . While these two properties enable recovering the latent graph and variables, the ground truth score functions dm Z (z) and correlation matrix Rm X are unknown. We only have access to their noisy counterparts, as estimated from the observed data. In our algorithms, we use methods with soft decision rules that can counter the effects of the errors introduced by the score estimation procedure. These are facilitated by the following three key definitions. Algorithm 1 Causal order estimation 1: Input: ˆRm for all m [n], η 0 2: Vn {1, . . . , n} remaining unordered set 3: for t (n, . . . , 2) do 4: for k Vt do 5: if dim col(P m Vt\{k} ˆRm; η) = t 1 then 6: πt k Ik has no ancestors in IVt 7: Vt 1 Vt \ {k} remove the identified node from unordered set 8: break k loop 9: π1 m for m V1 10: Return π Algorithm 2 Graph estimation 1: Input: ˆRm for all m [n], π, η 0, γ 0 2: Initialize ˆG with empty graph over nodes {π1, . . . , πn} 3: Construct the Vt sets of Algorithm 1 for all t [n] using π 4: for t (n 1, . . . , 1) do 5: for j (t + 1, n) do 6: Mt,j Vj \ ({πt} ˆch(πt)) set for determining whether πt πj 7: if col(P m Vt ˆRm; η) γ null(P m Mt,j ˆRm; η) then 8: Add πt πj and πt l to ˆG for all l ˆch(πj) in ˆG 9: Return ˆG Definition 3 (Approximate column space). We define the η-approximate column space of a positive semidefinite matrix A Rd d, denoted by col(A; η), as the span of the eigenvectors of A associated with the eigenvalues that are strictly greater than η. Definition 4 (Approximate null space). We define the η-approximate null space of A, denoted by null(A; η), as the orthogonal complement of col(A; η). Definition 5 (Approximate subspace orthogonality). We say two subspaces A and B with orthonormal bases A and B are γ-approximately orthogonal, denoted by A γ B, if B A 2 γ. In these definitions, setting η = γ = 0 yields the standard definitions of column and null spaces and orthogonality. Next, we describe our algorithms for latent graph recovery (Algorithms 1 and 2) and latent variables recovery (Algorithm 3). We note that the algorithms for latent graph and latent variable recovery algorithms are fully decoupled, enabling the independent recovery of the graph and the latent variables. Algorithm 1 Causal order estimation. In this algorithm, we estimate a permutation π of [n] such that I π is a valid causal order. This serves as an intermediate step for estimating the latent graph. For this purpose, we use the noisy correlation matrices { ˆRm X : m [n]} to identify the leaf nodes with no children. In particular, the key property is that when Ik is a leaf node, by carefully selecting the threshold η, the approximate column space of the term (P m [n] ˆRm X ˆRk X) has dimension n 1. 
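The primitives in Definitions 3–5 and the correlation matrices in (16) reduce to eigendecompositions and Monte Carlo averages. The sketch below (assuming numpy; function names are illustrative, and the Gaussian score-difference estimator mirrors the experimental setting of Section 6 rather than the general estimator of (15)) shows one way to implement them.

```python
# Illustrative helpers (assuming numpy); not the paper's implementation.
import numpy as np

def approx_col_space(R, eta):
    """eta-approximate column space (Definition 3): eigenvectors with eigenvalue > eta."""
    eigvals, eigvecs = np.linalg.eigh(R)            # R is symmetric positive semidefinite
    return eigvecs[:, eigvals > eta]                # orthonormal basis (one column per kept direction)

def approx_null_space(R, eta):
    """eta-approximate null space (Definition 4): orthogonal complement of the above."""
    eigvals, eigvecs = np.linalg.eigh(R)
    return eigvecs[:, eigvals <= eta]

def approx_orthogonal(A_basis, B_basis, gamma):
    """gamma-approximate orthogonality (Definition 5) for two orthonormal bases."""
    return np.linalg.norm(B_basis.T @ A_basis, 2) <= gamma

def corr_matrix(score_diffs):
    """Monte Carlo estimate of R^m_X in (16) from rows d_hat^m_X(x_j), j = 1, ..., N."""
    return score_diffs.T @ score_diffs / score_diffs.shape[0]

def gaussian_score_diff(X0, Xm, X_eval):
    """Score-difference estimate for the linear-Gaussian case of Section 6 (assumes zero-mean X):
    s^m_X(x) - s_X(x) = -(Theta_m - Theta_0) x, where the precision matrices are replaced by
    pseudoinverses of sample covariances since p_X is supported on col(G)."""
    Theta_0 = np.linalg.pinv(np.cov(X0, rowvar=False))
    Theta_m = np.linalg.pinv(np.cov(Xm, rowvar=False))
    return -X_eval @ (Theta_m - Theta_0)            # row j estimates d^m_X(x_j)
```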
Precisely, with high probability, we have m [n]\{k} ˆRm X ; η = n 1 Ik is a leaf node . (19) After finding a leaf node, we iteratively identify the youngest node among the remaining set of nodes. Leveraging this, we construct the permutation π, which consists of nodes ordered from the eldest to the youngest. In Lemma 3, we show that I π is a valid causal order with high probability. Algorithm 2 Latent graph estimation. In Algorithm 2, we use the causal order found in Algorithm 1 and correlation matrices { ˆRm X : m [n]} to form a graph estimate ˆG. We build this graph iteratively by considering a candidate edge πt πj for all possible (t, j) pairs starting from a leaf node t. The key property that we leverage is that we can form two subsets Vt and Mt,j of [n] using π, Algorithm 3 Inverse transform estimation 1: Input: ˆRm for all m [n], η 0 2: H 0n d 3: for t (1, . . . , n) do 4: C col( ˆRπt; η) 5: Hπt,: v for v C and v 2 = 1 6: Return H t, and j such that, for sufficiently small γ, with high probability the following approximation holds. ˆRm X ; η γ null X ˆRm X ; η Iπt Iπj in G . (20) We check each edge πt πj and add the detected edges to the estimated graph ˆG. In Theorem 1, we show that this procedure guarantees a PAC latent graph recovery. Algorithm 3 Inverse transform estimation. In Algorithm 3, we build our inverse transform estimate H one row at a time. The key property we use is that for any m [n] and unit vector v col( ˆRm X), the following error term is small with high probability v X X i Zi 2 . (21) In other words v X is approximately equal to ZIm up to mixing with {Zi : i pa(Im)}, which conforms to our latent variables recovery objective. Note that in the noise-free setting, an exact equality holds due to Lemma 2 and property G G = In. Based on this observation, in our algorithm, we construct our inverse transform estimate by setting row m of H to a unit vector v col( ˆRm X) for all m [n]. In Theorem 2, we show that Algorithm 3 achieves a PAC latent variables recovery. 5 Sample complexity analysis In this section, we analyze the sample complexity of our CRL algorithm to establish an achievable sample complexity for CRL. Threshold selection. A critical step in our algorithm designs is choosing thresholds η and γ specified in Definitions 3 5. These thresholds determine the approximate ranks of noisy matrices and approximate orthogonality between subspaces, respectively. Specifically, we select η such that the approximate rank of ˆRm X tracks that of Rm X with high probability. To proceed, we denote the minimum nonzero eigenvalue among arbitrary Rm X sums by η min M [n] min n λi X m M Rm X : i [d] , λi X m M Rm X = 0 o , (22) where λ(A) Rd denotes the vector of eigenvalues of A. In our algorithms, we let η (0, η ) and show that this choice ensures that eigenvalues of the null and column spaces can be separated via η. For the approximate orthogonality test in Definition 5, we show that γ defined below serves as a bound on how close the non-orthogonal column spaces of Rm X sums become orthogonal. γ min i [n] G i,: 2 G:,i 2 . (23) Similarly, we let γ (0, γ ) in our algorithms. For any choice of η and γ, we define η min{η, η η} and γ min{γ, γ γ}. We note that we do not have to know η and γ a priori, and can include routines to estimate them in practice. These estimates do not have to be highly accurate since even rough estimates suffice to choose reliable thresholds η and γ. 
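In simulations where the ground-truth model is available, the constants η* in (22) and γ* in (23), as well as the row-wise construction of Algorithm 3, can be written down directly; a minimal sketch under the same numpy assumption follows (the paper itself does not assume access to these ground-truth quantities and instead estimates the thresholds, cf. Appendix G).

```python
# Illustrative sketch (assuming numpy); eta* and gamma* require ground truth and are
# only computable in simulation -- in practice they are estimated (Appendix G).
import itertools
import numpy as np

def eta_star(R_list, tol=1e-10):
    """Smallest nonzero eigenvalue over all subset sums of {R^m_X : m in [n]}, as in (22)."""
    eta = np.inf
    n = len(R_list)
    for r in range(1, n + 1):
        for subset in itertools.combinations(range(n), r):   # exponentially many subsets
            eigvals = np.linalg.eigvalsh(sum(R_list[m] for m in subset))
            nonzero = eigvals[eigvals > tol]
            if nonzero.size:
                eta = min(eta, nonzero.min())
    return eta

def gamma_star(G):
    """gamma* in (23): the minimum over i of ||(G^+)_{i,:}||_2 * ||G_{:,i}||_2."""
    G_pinv = np.linalg.pinv(G)
    return min(np.linalg.norm(G_pinv[i]) * np.linalg.norm(G[:, i]) for i in range(G.shape[1]))

def inverse_transform_estimate(R_hat_list, eta):
    """Algorithm 3 sketch: row m of H is a unit vector from the eta-approximate column space
    of R_hat^m_X; the top eigenvector is one valid choice."""
    d = R_hat_list[0].shape[0]
    H = np.zeros((len(R_hat_list), d))
    for m, R_hat in enumerate(R_hat_list):
        eigvals, eigvecs = np.linalg.eigh(R_hat)
        assert eigvals[-1] > eta, "threshold eta should keep at least one direction"
        H[m] = eigvecs[:, -1]                                  # unit-norm top eigenvector
    return H
```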
Specifically, based on estimates for η and γ , we can choose safe thresholds for η and γ, e.g., one-fourth of the estimates for η and γ , so that (i) with high probability we satisfy the requirement η (0, η ) and γ (0, γ ), and (ii) we can avoid collecting excessive samples and compromising the sample complexity bounds. We provide the details of constructing such estimates of η and γ in Appendix G. Sample complexity results. We provide two sets of sample complexity analyses for the latent graph and the latent variables. We estimate the latent graph in two steps: (i) causal order estimation (Algorithm 1) and (ii) latent graph estimation (Algorithm 2). Algorithm 1 only employs approximate rank tests. Therefore, we first show that the approximate rank of the sum of correlation matrices { ˆRm X : m [n]} tracks that of {Rm X : m [n]} with high probability. To proceed, we define the following constants that appear repeatedly in our analysis: β 4 max m [n] 1 and βmin 2 min m [n] dm X p X . (24) Lemma 3. Let η (0, η ). For any δ > 0, Nrank(δ) samples suffice to ensure that with probability at least 1 δ, M [n] : dim col X m M ˆRm X; η = rank X m M Rm X , (25) where Nrank(δ) N min nβη n , βmin o , δ . (26) Note that the function N( , ) is an inherent property of the score difference estimator specified in (15), and it is monotonically decreasing in its second argument. Using this lemma, we can show that Algorithm 1 returns a permutation π such that I π is a valid causal order with high probability. Lemma 4. Let η (0, η ). Under Assumption 1, for any δ > 0, Nrank(δ) samples suffice to ensure that with probability at least 1 δ, I π is a valid causal order, where π is the output of Algorithm 1. Next, we show that Algorithm 2 achieves latent graph recovery with high probability. Theorem 1 (Sample complexity Graph). Let η (0, η ) and γ (0, γ ). Under Assumption 1, for any δ > 0, NG(δ) samples suffice to ensure that the output ˆG(XN) collectively generated by Algorithms 1 and 2 satisfies δ PAC graph recovery, where NG(δ) N min βη We note that constants β and βmin are sample-independent. Then, (27) specifies the sample complexity as a function of the hyperparameters η, γ, target reliability δ (0, 1), as well as the latent dimension n. Next, we state the complementary result for recovering the latent variables. Theorem 2 (Sample complexity Variables). Let η (0, η ). For any ϵ > 0 and δ > 0, NZ(ϵ, δ) samples suffice to ensure that the output H(XN) of Algorithm 3 satisfies (ϵ, δ) PAC causal variables recovery, where NZ(ϵ, δ) N min n ϵβη n G 2 , βη , βmin o , δ . (28) Theorems 1 and 2 collectively specify an extent of identifiability achievable in the finite-sample regime. Importantly, we note that since we can establish (ϵ, δ) PAC identifiability guarantees for any vanishing ϵ, δ > 0 using finite samples, Algorithms 2 and 3 are consistent estimators of the latent graph and the inverse transformation up to the corresponding equivalence classes. Finally, we note that the latent variables are recovered up to mixing with parents, as specified in Theorem 2. This result is a refinement of the existing latent variable recovery results under soft interventions in existing literature [5, 7], which recover the latent variables up to mixing with ancestors. RKHS-based score estimator. So far, we have characterized the sample complexity for any consistent score difference estimator. Next, we specialize the results by adopting the RKHS-based score estimator of [12]. 
We adopt this particular choice since it has known non-asymptotic sample complexity properties. To our knowledge, it is the only score estimator equipped with such a guarantee. To use the sample complexity of this score estimator in our paper, we first show that it is consistent in the sense of (15). For the formal statement of the following property and its attendant assumptions, we refer to Appendix E. Lemma 5 (Informal). Assume that s X and sm X satisfy the conditions of the RKHS-based score estimator in [12]. Then, for any given δ and ϵ, the convergence specified in (15) holds when N(ϵ, δ) max n C 2κ2 o 4 log 8n where κ and C are sample independent constants that depends only on p X, pm X for m [n] and the structure of the RKHS. We use Lemma 5 to customize the general sample complexity bounds in Theorems 1 and 2 to the RKHS-based score estimator. For this purpose, we first provide our sample complexity result for recovering the latent graph. Theorem 3 (RKHS-based sample complexity Graph). Let η (0, η ) and γ (0, γ ). Under Assumption 1 and the conditions of Lemma 5, NG(δ) samples suffice to ensure that ˆG(XN) collectively generated by Algorithms 1 and 2 satisfies δ PAC graph recovery, where NG(δ) = max C 2κ2 4 log 8n 4 , and ϵG min βη (30) Remark 1 (RKHS-based error bound Graph). Theorem 3 implies that using N samples, the output ˆG(XN) of Algorithms 1 and 2 satisfies δG(N) PAC graph recovery, where δG(N) 8n exp N 1/4 max C Similarly, we leverage Lemma 5 to specialize the general sample complexity in Theorem 2 to the RKHS-based score estimator. Theorem 4 (RKHS-based sample complexity Variables). Let η (0, η ). Under the conditions of Lemma 5, NZ(ϵ, δ) samples suffice to ensure that the output H(XN) of Algorithm 3 satisfies (ϵ, δ) PAC causal variables recovery, where NZ(ϵ, δ) = max C ϵ ϵZ , 2 2κ2 4 log 8n and ϵZ min ϵβη n G 2 , βη , βmin We note that the results in Theorems 3 and 4 show a more explicit dependence of the sample complexity on the latent space dimension n. We observe that the sample complexity of latent graph recovery explicitly depends on δ and the latent dimension n according to O((n log n δ )4). Similarly, the sample complexity of latent variables recovery explicitly depends on ϵ, δ, and the latent dimension n according to O(( n δ )4). We note that constants β, βmin, η , and γ are model parameters, and their scaling behavior in terms of n and d are investigated numerically in the next section. 6 Experiments In this section, we perform numerical assessments of our analyses to provide complementary insight into the sample complexity results of Section 5. Specifically, we evaluate the variations of the model constants with respect to problem dimensions n and d. Furthermore, we also evaluate performance variations of the finite-sample algorithm in terms of the variations in the score estimation error. For this purpose, we focus on the mean squared error (MSE) of the score estimator, specified by E ˆd m X dm X 2 p X. These evaluations facilitate a better understanding of the properties of the CRL problem that are not explicit in our theoretical queries.2 Experimental details. We consider problem dimensions n {3, 5, 10} and d {n, 15} and generate G using Erd os-Rényi model with density 0.5 on n nodes. We adopt linear Gaussian models as the latent causal model. We consider N {102.5, 103, 103.5, 104, 104.5, 105} samples, and generate 100 latent models for each triplet (N, n, d). We generate N samples of Z from each environment for each latent model. 
Transformation G ∈ R^{d×n} is randomly sampled under full-rank and bounded condition number constraints, and the observed variables are generated as X = G Z. Due to the Gaussian structure of Z, the observed variables X also have multivariate Gaussian distributions with score function s_X(x) = −Θ x, where Θ is the precision matrix of X. Therefore, we estimate s_X using the sample precision matrix estimated from N samples, Θ̂_N, as ŝ_X(x; X_N) = −Θ̂_N x.

2 The codebase for the experiments can be found at https://github.com/acarturk-e/finite-sample-linear-crl.

Figure 1: Numerical evaluations. (a) Constants vs. problem dimension. (b) Graph recovery vs. score MSE, d = 15.

Results. The sample complexity results depend on the dimensions n and d through their explicit presence as well as their implicit effect on the model-dependent constants β, β_min, η*, and γ*, as established in Theorems 1 and 2. First, we observe that all these parameters are independent of d. This is also expected theoretically, since we can project the d-dimensional observed space down to the n-dimensional subspace col(G). In Figure 1a, we illustrate the dependence of these constants on the data dimensions n and d. We observe that γ* remains constant, while β and β_min decrease as n increases, but the decay is not steep. On the other hand, we observe that η* decays exponentially with n. γ* is implicitly controlled by the condition number, which is kept bounded. We recall that β_min is the minimum among {‖d^m_X‖_{p_X} : m ∈ [n]}, and similarly, β is the inverse of the maximum of {‖d^m_X‖_{p_X} : m ∈ [n]}, so they are expected to become smaller as n increases. Finally, for η*, we note that the definition in (22) is an order statistic over exponentially many (2^n) subsets.

Next, we note that all the imperfections in graph and variable recovery are due to the imperfections in the score estimates. In fact, this is the bottleneck in our decisions: the key impact of finite samples versus infinite samples is the degradation in the quality of the score estimates. To gain direct insight into how the score estimation noise power translates into decision imperfections, we assess the graph recovery success rate versus varying degrees of mean squared error of the score difference estimates E[‖d̂^m_X − d^m_X‖²_{p_X}] for different triplets (N, n, d). In Figure 1b, each data point corresponds to a different sample size N, and the legend entries {3, 5, 10} denote the latent dimension n. This figure illustrates the scatter plot of the error probability δ versus MSE for different parameter combinations and the corresponding local regression curves. It is observed that in the low-MSE regime, 1 − δ decays linearly with respect to log E[‖d̂^m_X − d^m_X‖²_{p_X}], and it plateaus as the MSE increases. This trend is due to our algorithms' high sensitivity to errors in estimating approximate matrix ranks.

7 Conclusion

In this paper, we have established the first sample complexity results for interventional causal representation learning. Specifically, we have characterized upper bounds on the sample complexity of latent graph and variables recovery in terms of the finite-sample performance of any consistent score difference estimator. We have, furthermore, adopted a particular score estimator [12] to derive explicit sample complexity statements. Our sample complexity results are given for partial identifiability (up to ancestors) in the soft intervention and linear transformation setting.
Establishing sample complexity results for perfect identifiability via hard interventions or considering general transformations are important future directions. Acknowledgements and disclosure of funding This work was supported by IBM through the IBM-Rensselaer Future of Computing Research Collaboration. [1] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612 634, May 2021. [2] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429 439, April 1999. [3] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proc. International Conference on Machine Learning, Long Beach, CA, June 2019. [4] Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Scorebased causal representation learning with interventions. ar Xiv:2301.08230, 2023. [5] Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Scorebased causal representation learning: Linear and general transformations. ar Xiv:2402.00849, 2024. [6] Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In Proc. International Conference on Machine Learning, Honolulu, Hawaii, July 2023. [7] Chandler Squires, Anna Seigal, Salil S. Bhate, and Caroline Uhler. Linear causal disentanglement via interventions. In Proc. International Conference on Machine Learning, Honolulu, Hawaii, July 2023. [8] Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, and Ali Tajer. General identifiability and achievability for causal representation learning. In Proc. International Conference on Artificial Intelligence and Statistics, Valencia, Spain, May 2024. [9] Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2023. [10] Julius von Kügelgen, Michel Besserve, Wendong Liang, Luigi Gresele, Armin Keki c, Elias Bareinboim, David M Blei, and Bernhard Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2023. [11] Jiaqi Zhang, Chandler Squires, Kristjan Greenewald, Akash Srivastava, Karthikeyan Shanmugam, and Caroline Uhler. Identifiability guarantees for causal disentanglement from soft interventions. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2023. [12] Yuhao Zhou, Jiaxin Shi, and Jun Zhu. Nonparametric score estimators. In Proc. International Conference on Machine Learning, virtual, July 2020. [13] Jikai Jin and Vasilis Syrgkanis. Learning causal representations from general environments: Identifiability and intrinsic ambiguity. ar Xiv:2311.12267, 2023. [14] Simon Bing, Urmi Ninad, Jonas Wahl, and Jakob Runge. Identifying linearly-mixed causal representations from multi-node interventions. In Proc. Causal Learning and Reasoning, pages 843 867, Los Angeles, CA, April 2024. [15] Zhenyu Zhu, Francesco Locatello, and Volkan Cevher. 
Sample complexity bounds for scorematching: Causal discovery and generative modeling. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2023. [16] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4):695 709, July 2005. [17] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Proc. Uncertainty in Artificial Intelligence, virtual, August 2020. [18] Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical bayes smoothing. ar Xiv:2402.07747, 2024. [19] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. Sample Complexity of Interventional Causal Representation Learning: Appendices Table of Contents A Proof structure 13 B Identifiability guarantees in infinite sample regime 14 B.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 B.2 Encoder recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.3 Equivalence of function rank to correlation rank . . . . . . . . . . . . . . . . . 15 B.4 Causal order recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.5 Graph recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 C Identifiability guarantees under bounded score estimation noise 19 D Sample complexity for generic consistent score estimator 25 E Sample complexity for RKHS-based score estimator 27 F Properties of positive semidefinite matrices 28 G Hyperparameter estimation 29 A Proof structure We structure the proofs of our main results as follows. Appendix B Infinite sample guarantees: First, we prove identifiability results in the noise-free setting, that is, using the ground truth Rm X matrices instead of using the estimated ˆRm X. These intermediate results lay the groundwork for the proofs for the finite sample setting. Appendix C Bounded noise guarantees: Next, we show that if estimation error for the score difference functions, ˆd m X dm X p X, is bounded, then the noise-free identifiability guarantees transfer to the noisy setting. Appendix D Sample complexity guarantees: We use the noisy identifiability guarantees to prove that for any consistent score difference estimator, we can guarantee (ϵ, δ) PAC guarantees for CRL identifiability objectives. Appendix E RKHS-based score estimator: Finally, we use the finite sample guarantees provided by the RKHS-based score estimator of [12] to adapt the sample complexity statements for a generic estimator to explicit guarantees in terms of model-dependent constants. To be used throughout the proof, we define the correlation matrices of dm Z as Rm Z Ep Z dm Z (z) (dm Z (z)) . (34) We recall the correlation matrices Rm X and ˆRm X specified in (16). For a subset M [n], we use the following shorthand notations for sums of the matrices in {Rm X : m M} and { ˆRm X : m M}. m M Rm X , and ˆRM X X m M ˆRm X , M [n] . (35) B Identifiability guarantees in infinite sample regime In this section, we provide the identifiability guarantees of Algorithms 1 3 with no error and probability 1 when using {Rm X : m [n]} as input. Denote the image of a function f : Rn Rk for any k Z+ by im(f), that is, im(f) f(z) : z Rn . (36) First, we note a property of the column space of correlation matrices. Lemma 6. 
For a continuous function f : Rn Rk, span of the image of f equals to span im(f) = col Ep Z[f(z) f(z) ] . (37) Proof: Note that two subspaces are equal if and only if their orthogonal complements are equal. We denote the orthogonal complement of span im(f) by S1 span im(f) = y Rk : y f(z) = 0 z Rn . (38) The orthogonal complement of the column space of a symmetric matrix is the null space, which we denote by S2 null Ep Z[f(z) f(z) ] = y Rk : Ep Z[f(z) f(z) ] y = 0 . (39) Since any correlation matrix is positive semidefinite, using Proposition 2, (39) can be written as S2 = y Rk : y Ep Z[f(z) f(z) ] y = 0 (40) = y Rk : Ep Z (y f(z))2 = 0 . (41) Since f is a continuous function, so is y f for any y Rk. We can then use [5, Proposition 2], which is stated below for the sake of conciseness. Proposition 1 ([5, Proposition 2]). Consider two continuous functions f, g : Rn R. Then, for any α > 0, z Rn f(z) = g(z) Ep Z h f(Z) g(Z) αi = 0 . (42) Under this proposition, (41) can be written as S2 = y Rk : y f(z) = 0 z Rn , (43) which is the same as the definition of S1 in (38). This concludes the proof. Corollary 1. For a continuous function f : Rd Rk, the span of the image of f G equals to the column space of Ep Z[f(G z) (f(G z)) ] = Ep X[f(x) (f(x)) ] . (44) We can specialize this lemma for dm Z and dm X for any m [n], which yields col(Rm Z ) = span dm Z (z) : z Rn , (45) and col(Rm X) = span dm X(x) : x col(G) . (46) B.1 Proof of Lemma 2 We can state (17) from Lemma 1 equivalently as z Rn dm Z (z) i = 0 i pa(Im) . (47) Recall from (18) in Lemma 1 that dm X is related to dm Z by dm X(x) = (G ) dm Z (z) = X i,: dm Z (z) i , x = G z . (48) Then, due to (47), for any given z Rn and corresponding x = G z, we have dm X(x) = X i,: dm Z (z) i span (G )i,: : i pa(Im) . (49) Since dm X(x) lies in the subspace span{(G )i,: : i pa(Im)} for any x col(G), we have span dm X(x) : x col(G) span (G )i,: : i pa(Im) . (50) Due to (46), we have span dm X(x) : x col(G) = col(Rm X) , (51) which concludes the proof. B.2 Encoder recovery We can use Lemma 2 to show that in the infinite sample regime, Algorithm 3 works correctly. Specifically, let us show the following. Lemma 7 (Infinite samples Encoder). For any v col(Rm X), we have v G i = 0 = i pa(Im) . (52) Proof: Due to Lemma 2, for any v col(Rm X), we have v span (G )i,: : i pa(Im) . (53) In other words, any v col(Rm X) can be expressed as i pa(Im) ci (G ) for some constants {ci : i pa(Im)}. Then, we have i pa(Im) ci (G )i,: i pa(Im) ci (G )i,: G . (55) Since G has full column rank, the pseudoinverse G satisfies G G = In, that is, (G )i,: G = e i , (56) where ei denotes the i-th standard basis vector. Substituting this, (55) becomes i pa(Im) ci e i . (57) Note that i-th entry of a row vector w Rn is given by w ei. Therefore, we have (v G)i = ci i pa(Im) , 0 otherwise . (58) This concludes the proof. B.3 Equivalence of function rank to correlation rank In this section, we adapt the workhorse lemmas of [5] to our setting. Lemma 8 ([5, Lemmas 5 and 6]). Consider set A [n] such that IA {Im : m A} is ancestrally closed, i.e., pa(IA) = IA. Then, rank RA X = |A| . (59) Also, under Assumption 1, for any k A, we have rank RA\{k} X = |A| if j A \ {k} such that Ik pa(Ij) , |A| 1 otherwise . (60) We first note that due to (18) and the definitions of Rm X and Rm Z in (16) and (34), we have Rm X = (G ) Rm Z G , (61) where G is a full row rank matrix. Therefore, for any B [n], rank RB X = rank RB Z . 
(62) Next, we note that for a function f : Rn Rk, the notation R(f) in [5] is used to denote the rank of a function, and it corresponds to dim span im(f) in the notation of this paper. Using Proposition 3 and Lemma 6, we can see that the notation R(F) for a set of functions F can be given as R(F) = k dim( \ f F (span im(f)) ) (63) f F null(Ep Z[f(z) f(z) ])) (64) = k dim null( X f F Ep Z[f(z) f(z) ]) (65) = dim col( X f F Ep Z[f(z) f(z) ]) . (66) Using (66) and (62), we can adopt [5, Lemmas 5 and 6] using {Rm X : m [n]} matrices. We note that (59) and (60) together with Lemma 2 imply that, for any ancestrally closed A [n] and k A for which there exists j A \ {k} such that Ik pa(Ij), we have col RA X = col RA\{k} X = span (G ) :,i : i pa(IA) . (67) We note that due to the property G G = In, we have, for any M [n], null (G ) :,i : i M = span G:,i : i M , (68) and null G:,i : i M = span (G ) :,i : i M . (69) B.4 Causal order recovery We will show that Algorithm 1 gives a valid causal order in the infinite sample regime using Lemma 8. Lemma 9 (Infinite samples Causal order). Setting η = 0, I π is a valid causal order, where π is the output of Algorithm 1 using {Rm X : m [n]}. Proof: The proof of this lemma follows closely to the proof of [5, Lemma 7]. Note that setting η = 0 means that approximate rank and column/null spaces are actually equal to the standard rank and column/null space definitions. We note that I π is a valid causal order if and only if, for any t [n], we have de(Iπt) Iπi : i > t . (70) We prove (70) by induction as follows. In the base case, for t = n, we have Vt = [n]. Therefore, using Lemma 8, we see that rank(RVt\{k} X ) = n de(Ik) IVt = , n 1 de(Ik) IVt = . (71) Therefore, for t = n, rank(RVt\{k} X ) = t 1 in Algorithm 1 if and only if k Vt satisfies = de(Ik) IVt = de(Ik) [n] = de(Ik) . (72) Since { Iπi : i > t } = for t = n, the base case is proved by choosing any such k as πt. For the induction step at time t {n 1, . . . , 1}, we assume that, for any u {t + 1, . . . , n}, we have de(Iπu) { Iπi : i > u }. By construction, Vt = [n] \ { πu : u > t }. Therefore, IVt is ancestrally closed: For any k IVt, k cannot be a descendant of I[n]\Vt = { Iπu : u > t } due to the induction hypothesis. In other words, pa(IVt) = IVt. Therefore, we can use Lemma 8 as rank(RVt\{k} X ) = t de(Ik) IVt = , t 1 de(Ik) IVt = . (73) Therefore, for any t { n 1, . . . , 1 }, rank(RVt\{k} X ) = t 1 in Algorithm 1 if and only if k Vt satisfies = de(Ik) IVt . (74) This is equivalent to de(Ik) [n] \ IVt = I[n]\Vt = { Iπu : u > t } . (75) Therefore, choosing any such k as πt satisfies the induction hypothesis for t as well, and the proof by induction is concluded. B.5 Graph recovery Lemma 10 (Infinite samples Graph). Setting η = γ = 0, the collective output of Algorithms 1 and 2 using {Rm X : m [n]} is equal to the graph isomorphism of the transitive closure of G by permutation I. Proof: The proof of this lemma follows closely to the proof of [5, Lemma 8]. Note that setting η = 0 means that approximate ranks and column/null spaces are equal to the standard definitions. Similarly, γ = 0 means that approximate orthogonality is equivalent to the standard orthogonality definition. We first note that the transitive closure of two DAGs are equal if and only if their transitive reductions are equal. Therefore, the lemma statement is equivalent to πt ˆpatr(πj) Iπt patr(Iπj) , t, j [n] . (76) We will prove this result by induction over the transitive reduction edges. 
In the rest of the proof, using Lemma 9, we will leverage that I π is a valid causal order. In the base case, we have t = n 1, and thus necessarily j = n. The only possible path between Iπt to Iπj is therefore a transitive reduction edge. Since we did not add any children to node πt yet, the set Mt,j is equal to Vj \ {πt}. Noting that IVj is ancestrally closed and the only possible child of Iπt is Iπj when I π is a valid causal order, we can use Lemma 8 to obtain rank(RMt,j X ) = n Iπj ch(Iπt) , n 1 Iπj ch(Iπt) . (77) Specifically, using (67), we can specify the exact column spaces we can obtain from Mt,j as col(RMt,j X ) = span (G ) :,i : i IVj Iπj ch(Iπt) , span (G ) :,i : i IVj\{πt} Iπj ch(Iπt) . (78) Then, using (68), the null space of RMt,j X is given by null(RMt,j X ) = span G:,i : i IVj Iπj ch(Iπt) , span G:,i : i IVj\{πt} Iπj ch(Iπt) . (79) The test in Algorithm 2 involves checking whether the column space of RVt X and the null space of RMt,j X are orthogonal. Note that IVt is ancestrally closed, therefore, using (67), we get col(RVt X ) = span (G ) :,i : i pa(IVt) = span (G ) :,i : i IVt . (80) Next, we note that the orthogonality test which uses orthonormal bases of subspaces (Definition 5) is equal to the following. Let A Rd k1 and B Rd k2 be full column rank matrices. Then, col(A) col(B) A A B B 2 = 0 . (81) This holds since A A is the orthogonal projection operator onto col(A). Therefore, given orthonormal bases A0 and B0 for A and B, we have A A = A0 A 0 and B B = B0 B 0 . Then, A A B B 2 = A0 A 0 B0 B 0 2 = A 0 B0 2 , (82) which is precisely the metric used in Definition 5. To use this property, let us first consider the matrix G:,A which is formed by stacking the independent column vectors in set {G:,i : i A} for some subset A [n]. Since G has full column rank, we see that the pseudoinverse of G:,A is exactly given by G:,A = G where the matrix G A,: is formed by stacking the independent row vectors in set {(G )i,: : i A}. Using this notation, if we have col(RMt,j X ) = span (G ) :,i : i pa(IMt,j) , (84) then the orthogonality metric is equal to (G ) :,IVt G IVt,: G:,I[n]\pa(Mt,j ) G I[n]\pa(Mt,j ),: 2 (85) = G:,IVt G IVt,: G:,I[n]\pa(Mt,j ) G I[n]\pa(Mt,j ),: 2 (86) = G:,IVt IIVt,I[n]\pa(Mt,j ) G I[n]\pa(Mt,j ),: 2 (87) i IVt\pa(IMt,j ) G:,i G i,: 2 . (88) Due to the construction of sets Vt and Mt,j, we have IVt \ IMt,j = Iπt. Therefore, the only entry in IVt that may be missing from pa(IMt,j) is Iπt. Therefore, previous equation is equal to (G ) :,IVt G IVt,: G:,I[n]\pa(Mt,j ) G I[n]\pa(Mt,j ),: 2 (89) = G:,Iπt 2 G Iπt,: 2 Iπt pa(IMt,j) , 0 Iπt pa(IMt,j) . (90) In summary, if (84) holds, the orthogonality inspected in Algorithm 2 happens if and only if Iπt pa(IMt,j) . (91) In the base case, since t = n 1, this holds if and only if Iπt patr(Iπj), and the base case is done. For the induction hypothesis, we assume that for any u { n 1, . . . , t + 1 }, all the transitive reductive edges has been correctly identified, that is, for any i [n], πu ˆpatr(πi) if and only if Iπu patr(Iπi). We show that the induction hypothesis is also satisfied for t and j [n] using induction on j. In the base case, we have j = t+1. This case works exactly the same way as the t = n 1 and j = n case. For the induction hypothesis, assume that for any u { n + 1, . . . , j 1 }, all the transitive reductive edges between Iπt and Iπu have been correctly identified, that is, πt ˆpatr(πu) if and only if Iπt patr(Iπu). We prove that the hypothesis also holds for j as follows. 
First, we claim that IMt,j {πt} is an ancestrally closed set. Note that IVj is an ancestrally closed set by construction. Since the induction hypothesis ensures that all transitive reduction relations between Iπu and Iπi for any u { t + 1, . . . , n 1 } and i [n] are recovered, this also means that transitive closure relations, that is, descendants, are also recovered. Namely, the induction hypothesis implies Iπi de(Iπu) πi ˆde(πu) , u { t + 1, . . . , n 1 } , i [n] . (92) Next, the induction hypothesis for j implies that for any k {t + 1, . . . , j 1 }, we have also identified all the transitive reductive edges Iπt Iπk. Therefore, for set Mt,j {πt} = Vj \ ˆch(πt), all the elements in ˆch(πt) has the property that their descendants have been fully identified, and therefore are excluded from the set ( ˆG keeps the transitive closure), which means the set IMt,j {πt} is ancestrally closed. Since this set is ancestrally closed, we have pa(IMt,j) = IMt,j Iπt pa(Iπj) , IMt,j {πt} Iπt pa(Iπj) , (93) or, using Lemma 8, col(RMt,j X ) = span (G ) :,i : i pa(IMt,j) . (94) In the base case t = n, we had shown that this was a sufficient condition for ensuring that Algorithm 2 detects the transitive reductive edges correctly. Therefore, we can detect any existing transitive reduction edges between Iπt and Iπj for any j and t, which concludes the proof. C Identifiability guarantees under bounded score estimation noise In this section, we show that as long as the error in the estimate of the score differences, maxm [n] ˆd m X dm X p X, is upper bounded by a model-dependent constant, the results in the infinite data regime still hold. The first key observation is that the error ˆRm X Rm X 2 is bounded in terms of ˆd m X dm X p X. Lemma 11. In high signal-to-noise ratio regime for score differences, that is, when ˆd m X dm X p X 2 dm X p X , (95) the error in correlation matrix Rm X is bounded by the error in dm X as ˆRm X Rm X 2 4 dm X p X ˆd m X dm X p X . (96) Proof: First, note that ˆRm X Rm X = Ep X ˆd m X(x) ˆd m X(x) dm X(x) dm X(x) (97) = Ep X (ˆd m X(x) dm X(x)) (ˆd m X(x) dm X(x)) + dm X(x) (ˆd m X(x) dm X(x)) + (ˆd m X(x) dm X(x)) dm X(x) . Next, we apply the spectral norm. ˆRm X Rm X 2 = Ep X (ˆd m X(x) dm X(x)) (ˆd m X(x) dm X(x)) + dm X(x) (ˆd m X(x) dm X(x)) + (ˆd m X(x) dm X(x)) dm X(x) 2 Ep X (ˆd m X(x) dm X(x)) (ˆd m X(x) dm X(x)) 2 + 2 Ep X dm X(x) (ˆd m X(x) dm X(x)) 2 . (100) Note that the first term is the spectral norm of a covariance matrix, and is upper bounded by Ep X (ˆd m X(x) dm X(x)) (ˆd m X(x) dm X(x)) 2 Ep X ˆd m X(x) dm X(x) 2 2 (101) = ˆd m X dm X 2 p X . (102) For the second term, we can upper bound the spectral norm by Frobenius norm as Ep X dm X(x) (ˆd m X(x) dm X(x)) 2 2 Ep X dm X(x) (ˆd m X(x) dm X(x)) 2 F (103) Ep X (dm X(x))i (ˆd m X(x) dm X(x))j 2 . (104) Using Cauchy Schwarz inequality, Ep X dm X(x) (ˆd m X(x) dm X(x)) 2 2 i,j=1 Ep X (dm X(x))2 i Ep X (ˆd m X(x) dm X(x))2 j = Ep X dm X(x) 2 2 Ep X ˆd m X(x) dm X(x) 2 2 (106) = dm X 2 p X ˆd m X dm X 2 p X . (107) Therefore, we can upper bound ˆRm X Rm X 2 by ˆRm X Rm X 2 ˆd m X dm X 2 p X + 2 dm X p X ˆd m X dm X p X . (108) Finally, we note that if (95) holds, then this bound simplifies to (96). Our finite sample CRL algorithms use approximate column and null spaces of (sums of) noisy score difference correlation matrices ˆRM X , where M [n]. Therefore, we need to ensure that these approximate subspaces (i) have the same rank as, and (ii) are close to their noise-free counterparts. 
First, we show that under bounded score difference estimation noise, the approximate rank of ˆRM X is equal to the rank of RM X . To show this, we will leverage Weyl s inequality. Lemma 12 (Weyl s inequality). For any two symmetric k k real matrices A, B, we have λ(A) λ(B) A B 2 , (109) where λ(A) for symmetric matrix A denotes the vector of eigenvalues of A ordered in ascending order, and v is the maximum absolute value among the entries of v. We can now show that the approximate rank of ˆRm X matrices and sums are equal to the rank of Rm X under proper threshold selection and low error. Lemma 13. Define η as the minimum nonzero eigenvalue among {RM X : M [n]}, that is, η min M [n] min λi RM X : i [d] , λi RM X = 0 . (110) (i) For η (0, η ), if max m [n] ˆRm X Rm X 2 < min{η, η η} = η , (111) then, for any m [n], we have dim col (Rm X) = dim col ˆRm X; η . (112) (ii) For η (0, η ), if max m [n] ˆRm X Rm X 2 < η then, for any M [n], we have dim col RM X = dim col ˆRM X ; η . (114) Proof: Let us investigate the approximate rank of ˆRM X under threshold η for any M [n]. If the eigenvalues of ˆRM X that correspond to the eigenvalues of the null space of RM X are below the threshold η, and the eigenvalues of ˆRM X that correspond to the eigenvalues of the column space of RM X are above η, then the approximate rank of ˆRM X is equal to the rank of RM X . Let us denote the maximum amount any eigenvalue of ˆRM X differs from the corresponding eigenvalue of RM X by χ λ( ˆRM X ) λ(RM X ) . (115) Thus, the eigenvalues of the noisy correlation matrix ˆRM X that correspond to the null space eigenvalues of RM X is at most χ. Similarly, eigenvalues of the noisy correlation matrix ˆRM X that correspond to the column space eigenvalues of RM X is at least min λi RM X : i [d] , λi RM X = 0 χ . (116) Therefore, if χ < η < min λi RM X : i [d] , λi RM X = 0 χ , (117) then using threshold η ensures that the approximate rank of ˆRM X is equal to the rank of RM X . We can generalize this statement as follows. First, note that, by definition, η min λi RM X : i [d] , λi RM X = 0 . (118) Then, the rank statement about ˆRM X is correct if χ satisfies χ < η < η χ . (119) This statement can be equivalently written as χ < min{η, η η} η . (120) Next, we find an upper bound on χ in terms of Rm X Rm X 2 using Weyl s inequality (Lemma 12). χ = λ( ˆRM X ) λ(RM X ) ˆRM X RM X 2 (121) ˆRm X Rm X 2 (122) |M| max m [n] ˆRm X Rm X 2 (123) n max m [n] ˆRm X Rm X 2 . (124) Therefore, for any η (0, η ), leveraging the condition (120), we have (i) If |M| = 1 and M = {m}, we can use (123) to show that the condition ˆRm X Rm X 2 < η , (125) ensures that dim col (Rm X) = dim col ˆRm X; η . (126) (ii) For any generic M, we can use (124) to show that the condition ˆRm X Rm X 2 < η ensures that dim col RM X = dim col ˆRM X ; η . (128) This concludes the proof. Since we have a sufficient condition to ensure that approximate ranks of { ˆRM X : M [n]} are equal to the ranks of {RM X : M [n]}, now we can state a sufficient condition for finding a valid causal order. Lemma 14 (Bounded noise Causal order). Let η (0, η ). If ˆRm X Rm X 2 < η then I π is a valid causal order, where π is the output of Algorithm 1. Proof: We note that Algorithm 1 only uses approximate rank statements for { ˆRM X : M [n]}. Therefore, if the approximate rank statements are equal to the ranks of {RM X : M [n]}, then, the infinite sample guarantees of Algorithm 1 provided in Lemma 9 immediately transfer to the noisy Rm X regime. 
Next, we note that Algorithms 2 and 3 also use the direction information of the approximate column and null spaces of $\{\hat{R}^M_X : M \subseteq [n]\}$. Therefore, we need to bound the distance of the estimated subspaces from their noise-free counterparts. The main results we will use for this purpose are the Davis-Kahan $\sin\theta$ theorems [19], specialized for approximate column and null spaces.

Lemma 15 ([19, $\sin\theta$ theorem]). Consider two $k \times k$ real symmetric matrices $A$ and $\hat{A}$. Denote the orthonormal bases of the column and null spaces of $A$ by $A_1$ and $A_0$, respectively, such that

$A = A_1 C_1 A_1^{\top}$ ,  (130)

where $C_1$ is a symmetric matrix with the same eigenvalues as $A$. Similarly, denote the orthonormal bases for the approximate column and null space of $\hat{A}$ by $\hat{A}_1$ and $\hat{A}_0$, such that

$\hat{A} = \hat{A}_0 \hat{C}_0 \hat{A}_0^{\top} + \hat{A}_1 \hat{C}_1 \hat{A}_1^{\top}$ ,  (131)

where the eigenvalues of $\hat{C}_0$ are those of the approximate null space of $\hat{A}$, and the eigenvalues of $\hat{C}_1$ are those of the approximate column space of $\hat{A}$. Define

$\chi \triangleq \min\big\{ \lambda_{\min}(\hat{C}_1) , \lambda_{\min}(C_1) - \lambda_{\max}(\hat{C}_0) \big\}$ ,  (132)

where $\lambda_{\max}$ and $\lambda_{\min}$ denote the maximum and minimum eigenvalues of a symmetric matrix. Then, given that the approximate rank of $\hat{A}$ is equal to the rank of $A$ and $\chi > 0$, we have

$\big\|\hat{A}_0^{\top} A_1\big\|_2 \le \dfrac{1}{\chi} \big\|\hat{A} - A\big\|_2$ ,  (133)

and

$\big\|\hat{A}_1^{\top} A_0\big\|_2 \le \dfrac{1}{\chi} \big\|\hat{A} - A\big\|_2$ .  (134)

Lemma 16 ([19, Symmetric $\sin\theta$ theorem]). Under the setting described in the $\sin\theta$ theorem above, we have

$\big\|\hat{A}_0 \hat{A}_0^{\top} - A_0 A_0^{\top}\big\|_2 \le \dfrac{1}{\chi} \big\|\hat{A} - A\big\|_2$ ,  (135)

and

$\big\|\hat{A}_1 \hat{A}_1^{\top} - A_1 A_1^{\top}\big\|_2 \le \dfrac{1}{\chi} \big\|\hat{A} - A\big\|_2$ .  (136)

We note that $\chi$ in these $\sin\theta$ theorems can be bounded using Weyl's inequality (Lemma 12) and the minimum nonzero eigenvalue of the matrix $A$ as follows:

$\lambda_{\max}(\hat{C}_0) \le \big\|\hat{A} - A\big\|_2$ , $\quad \lambda_{\min}(\hat{C}_1) \ge \lambda_{\min}(C_1) - \big\|\hat{A} - A\big\|_2$ .  (137)

Therefore,

$\chi \ge \lambda_{\min}(C_1) - \big\|\hat{A} - A\big\|_2$ .  (138)

Using $R^M_X$ for $A$ and $\hat{R}^M_X$ for $\hat{A}$, and using the definition of $\eta^*$ in (110), we obtain

$\chi \ge \eta^* - \big\|\hat{R}^M_X - R^M_X\big\|_2$ .  (139)

In order to ensure that the approximate rank of $\hat{R}^M_X$ is equal to the rank of $R^M_X$, it suffices that $\|\hat{R}^M_X - R^M_X\|_2$ satisfies

$\bar{\eta} > \big\|\hat{R}^M_X - R^M_X\big\|_2$ ,  (140)

due to (121). If this holds, we get

$\chi > \eta^* - \bar{\eta} = \eta^* - \min\{\eta, \eta^* - \eta\} = \max\{\eta, \eta^* - \eta\} \ge \min\{\eta, \eta^* - \eta\} = \bar{\eta}$ .  (141)

Therefore, the left-hand sides of (133)-(136) are upper bounded by

$\dfrac{1}{\bar{\eta}} \big\|\hat{R}^M_X - R^M_X\big\|_2$ .  (142)
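For intuition, the following minimal sketch (illustrative only; it assumes that thresholding at $\eta$ recovers the rank of $A$, as guaranteed by Lemma 13 under bounded noise, and that the resulting gap is positive) computes the projector distance appearing in Lemma 16 together with the bound $\|\hat{A} - A\|_2 / \chi$ for $\chi$ as in (132).

```python
import numpy as np

def split_spaces(M, eta):
    """Orthonormal bases of the approximate column space (eigenvalues > eta)
    and approximate null space (eigenvalues <= eta) of a symmetric matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, eigvals > eta], eigvecs[:, eigvals <= eta]

def sin_theta_check(A, A_hat, eta):
    """Compares the projector distance of Lemma 16 with the bound ||A_hat - A||_2 / chi,
    with chi as in (132). Assumes thresholding at eta recovers rank(A) and chi > 0."""
    ev_A = np.linalg.eigvalsh(A)
    ev_hat = np.linalg.eigvalsh(A_hat)
    A1, _ = split_spaces(A, 1e-12)         # exact column space of A
    A1_hat, _ = split_spaces(A_hat, eta)   # approximate column space of A_hat
    chi = min(ev_hat[ev_hat > eta].min(),                              # lambda_min of the approximate column block
              ev_A[ev_A > 1e-12].min() - ev_hat[ev_hat <= eta].max())  # gap to the approximate null block
    lhs = np.linalg.norm(A1_hat @ A1_hat.T - A1 @ A1.T, ord=2)
    return lhs, np.linalg.norm(A_hat - A, ord=2) / chi

# toy usage: the projector distance stays below the sin-theta bound
rng = np.random.default_rng(2)
U = rng.normal(size=(6, 2))
A = U @ U.T
E = rng.normal(size=(6, 6))
print(sin_theta_check(A, A + 0.005 * (E + E.T), eta=0.05))
```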
We can use the Davis-Kahan symmetric $\sin\theta$ theorem (Lemma 16) to prove that the encoder estimation error is upper bounded by the error in $\hat{R}^m_X$.

Lemma 17 (Bounded noise - Encoder). Let $\eta \in (0, \eta^*)$. If (125) holds, then the output $H$ of Algorithm 3 satisfies

$H G = P_I (C_{\mathrm{pa}} + C_{\mathrm{err}})$ ,  (143)

where $C_{\mathrm{pa}}$ satisfies $(C_{\mathrm{pa}})_{i,j} = 0$ for all $i \notin \mathrm{pa}(j)$, and

$\|C_{\mathrm{err}}\|_2 \le \dfrac{\sqrt{n}}{\bar{\eta}} \, \|G\|_2 \, \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2$ .  (144)

Proof: We prove this result by considering each row $m$ of $H$ separately. In Algorithm 3, we select $H_{m,:}$ from $\mathrm{col}(\hat{R}^m_X; \eta)$ for each $m \in [n]$. We start by noting that in Lemma 7, we have shown that, using the noise-free score difference correlation matrices, we get

$(v^{\top} G)_i = 0$ , $\forall i \notin \mathrm{pa}(I_m)$ , $\forall v \in \mathrm{col}(R^m_X)$ , $\|v\|_2 = 1$ .  (145)

We can decompose any $v \in \mathrm{col}(\hat{R}^m_X; \eta)$ as

$v = A_0 A_0^{\top} v + A_1 A_1^{\top} v$ ,  (146)

where $A_0$ and $A_1$ are the orthonormal bases of the null and column spaces of $R^m_X$, in line with the notation of the Davis-Kahan $\sin\theta$ theorem. Note that $A_1 A_1^{\top}$ is the orthogonal projection operator onto the column space of $R^m_X$; therefore, using (145), we have

$\big((A_1 A_1^{\top} v)^{\top} G\big)_i = 0$ , $\forall i \notin \mathrm{pa}(I_m)$ , $\forall v \in \mathrm{col}(\hat{R}^m_X; \eta)$ , $\|v\|_2 = 1$ .  (147)

Therefore, the erroneous entries in $v^{\top} G$ can be written as

$(v^{\top} G)_i = \big((A_0 A_0^{\top} v)^{\top} G\big)_i$ , $\forall i \notin \mathrm{pa}(I_m)$ , $\forall v \in \mathrm{col}(\hat{R}^m_X; \eta)$ , $\|v\|_2 = 1$ .  (148)

Note that if we choose $v \in \mathrm{col}(\hat{R}^m_X; \eta)$ as row $m$ of $H$, we have

$\big\|(P_I C_{\mathrm{err}})_{m,:}\big\|_2^2 = \sum_{i \notin \mathrm{pa}(I_m)} (v^{\top} G)_i^2 \le \big\|(A_0 A_0^{\top} v)^{\top} G\big\|_2^2$ .  (149)

Since $v \in \mathrm{col}(\hat{R}^m_X; \eta)$, we have $\hat{A}_1 \hat{A}_1^{\top} v = v$, where $\hat{A}_1$ is the orthonormal basis of the approximate column space of $\hat{R}^m_X$. Using this and $\|v\|_2 = 1$, we can write (149) as

$\big\|(P_I C_{\mathrm{err}})_{m,:}\big\|_2 \le \big\|A_0 A_0^{\top} v\big\|_2 \, \|G\|_2$  (150)
$= \big\|A_0 A_0^{\top} \hat{A}_1 \hat{A}_1^{\top} v\big\|_2 \, \|G\|_2$  (151)
$\le \big\|A_0 A_0^{\top} \hat{A}_1 \hat{A}_1^{\top}\big\|_2 \, \|G\|_2$  (152)
$= \big\|A_0^{\top} \hat{A}_1\big\|_2 \, \|G\|_2$ .  (153)

Using the Davis-Kahan $\sin\theta$ theorem (Lemma 15), when (125) holds, we can further upper bound the $m$-th row of the error term by

$\dfrac{1}{\bar{\eta}} \, \big\|\hat{R}^m_X - R^m_X\big\|_2 \, \|G\|_2$ .  (154)

Note that the spectral norm is upper bounded by the Frobenius norm. Therefore,

$\|C_{\mathrm{err}}\|_2 \le \|C_{\mathrm{err}}\|_F \le \sqrt{n} \, \dfrac{1}{\bar{\eta}} \, \|G\|_2 \, \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2$ .  (155)

Next, we investigate the approximate orthogonality test used by Algorithm 2 under noisy score difference correlation matrices.

Lemma 18. Let $A, B, \hat{A}, \hat{B}$ be $k \times k$ symmetric matrices such that

$A = A_0 C_0 A_0^{\top} + A_1 C_1 A_1^{\top}$ , $\quad \hat{A} = \hat{A}_0 \hat{C}_0 \hat{A}_0^{\top} + \hat{A}_1 \hat{C}_1 \hat{A}_1^{\top}$ ,  (156)
$B = B_0 D_0 B_0^{\top} + B_1 D_1 B_1^{\top}$ , $\quad \hat{B} = \hat{B}_0 \hat{D}_0 \hat{B}_0^{\top} + \hat{B}_1 \hat{D}_1 \hat{B}_1^{\top}$ ,  (157)

where $[A_0, A_1], [B_0, B_1], [\hat{A}_0, \hat{A}_1], [\hat{B}_0, \hat{B}_1]$ are all $k \times k$ orthogonal matrices, and the shapes of $A_0$, $\hat{A}_0$ and of $B_0$, $\hat{B}_0$ match. Given that the spectra of (i) $C_0$ and $\hat{C}_1$, (ii) $C_1$ and $\hat{C}_0$, (iii) $D_0$ and $\hat{D}_1$, and (iv) $D_1$ and $\hat{D}_0$ are all separated with gap $\chi$, we have the following.

(i) If $\|A_0^{\top} B_0\|_2 = 0$, that is, if $\mathrm{col}(A_0) \perp \mathrm{col}(B_0)$, then

$\big\|\hat{A}_0^{\top} \hat{B}_0\big\|_2 \le \dfrac{1}{\chi} \Big( \big\|\hat{A} - A\big\|_2 + \big\|\hat{B} - B\big\|_2 \Big)$ .  (158)

(ii) If $\|A_0^{\top} B_0\|_2 = \mu \neq 0$, that is, if $\mathrm{col}(A_0) \not\perp \mathrm{col}(B_0)$, then

$\big\|\hat{A}_0^{\top} \hat{B}_0\big\|_2 \ge \mu - \dfrac{1}{\chi} \Big( \big\|\hat{A} - A\big\|_2 + \big\|\hat{B} - B\big\|_2 \Big)$ .  (159)

Proof: We first prove a perturbation bound on the product of orthogonal projectors:

$\big\|\hat{A}_0 \hat{A}_0^{\top} \hat{B}_0 \hat{B}_0^{\top} - A_0 A_0^{\top} B_0 B_0^{\top}\big\|_2$  (160)
$= \big\|\hat{A}_0 \hat{A}_0^{\top} \big(\hat{B}_0 \hat{B}_0^{\top} - B_0 B_0^{\top}\big) - \big(A_0 A_0^{\top} - \hat{A}_0 \hat{A}_0^{\top}\big) B_0 B_0^{\top}\big\|_2$  (161)
$\le \big\|\hat{A}_0 \hat{A}_0^{\top}\big\|_2 \big\|\hat{B}_0 \hat{B}_0^{\top} - B_0 B_0^{\top}\big\|_2 + \big\|A_0 A_0^{\top} - \hat{A}_0 \hat{A}_0^{\top}\big\|_2 \big\|B_0 B_0^{\top}\big\|_2$  (162)
$\le \big\|\hat{B}_0 \hat{B}_0^{\top} - B_0 B_0^{\top}\big\|_2 + \big\|A_0 A_0^{\top} - \hat{A}_0 \hat{A}_0^{\top}\big\|_2$ .  (163)

Using the Davis-Kahan symmetric $\sin\theta$ theorem (Lemma 16) yields

$\big\|\hat{A}_0 \hat{A}_0^{\top} \hat{B}_0 \hat{B}_0^{\top} - A_0 A_0^{\top} B_0 B_0^{\top}\big\|_2 \le \dfrac{1}{\chi} \Big( \big\|\hat{A} - A\big\|_2 + \big\|\hat{B} - B\big\|_2 \Big)$ .  (164)

We use this result in two separate cases: $\mathrm{col}(A_0) \perp \mathrm{col}(B_0)$ and $\mathrm{col}(A_0) \not\perp \mathrm{col}(B_0)$.

(i) Case 1: $\mathrm{col}(A_0) \perp \mathrm{col}(B_0)$. In this case, (164) immediately yields (158).

(ii) Case 2: $\mathrm{col}(A_0) \not\perp \mathrm{col}(B_0)$. Let us define $\|A_0^{\top} B_0\|_2 = \mu \neq 0$. We can use (164) and the triangle inequality to get

$\big\|\hat{A}_0 \hat{A}_0^{\top} \hat{B}_0 \hat{B}_0^{\top}\big\|_2 + \dfrac{1}{\chi} \Big( \big\|\hat{A} - A\big\|_2 + \big\|\hat{B} - B\big\|_2 \Big)$  (165)
$\ge \big\|\hat{A}_0 \hat{A}_0^{\top} \hat{B}_0 \hat{B}_0^{\top}\big\|_2 + \big\|\hat{A}_0 \hat{A}_0^{\top} \hat{B}_0 \hat{B}_0^{\top} - A_0 A_0^{\top} B_0 B_0^{\top}\big\|_2$  (166)
$\ge \big\|A_0 A_0^{\top} B_0 B_0^{\top}\big\|_2$  (167)
$= \big\|A_0^{\top} B_0\big\|_2 = \mu$ .  (168)

Rearranging the terms yields (159).

Finally, we show that the graph estimation is correct under bounded noise.

Lemma 19 (Bounded noise - Graph). Let $\eta \in (0, \eta^*)$ and $\gamma \in (0, \gamma^*)$, where

$\gamma^* \triangleq \min_{t \in [n]} \dfrac{1}{\|G_{:,t}\|_2 \, \|(G^{\dagger})_{t,:}\|_2}$ , and $\bar{\gamma} \triangleq \min\{\gamma, \gamma^* - \gamma\}$ .  (169)

Under Assumption 1, the output $\hat{\mathcal{G}}$ of Algorithm 2 is the graph isomorphism of the transitive closure of the true latent graph $\mathcal{G}$ if

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \min\Big\{ \dfrac{\bar{\eta}}{n}, \dfrac{\bar{\eta} \bar{\gamma}}{2n} \Big\}$ .  (170)

Proof: Algorithm 2 uses the approximate column and null spaces of $R^M_X$ for different $M \subseteq [n]$. Therefore, the generic sufficient condition (113), that is,

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \dfrac{\bar{\eta}}{n}$ ,  (171)

ensures that the approximate ranks of all $\hat{R}^M_X$ matrices investigated in the algorithm are equal to the ranks of the corresponding $R^M_X$. To ensure that the estimated graph is correct, we must also guarantee that the approximate orthogonality tests give correct results. Using Lemma 18, we can show that the test metric is bounded in terms of $\max_{m \in [n]} \|\hat{R}^m_X - R^m_X\|_2$. We do so by setting $A = R^{V_t}_X$, $\hat{A} = \hat{R}^{V_t}_X$, $B = R^{M_{t,j}}_X$, $\hat{B} = \hat{R}^{M_{t,j}}_X$, with $A_0$ the orthonormal basis of the column space of $A$ and $B_0$ that of the null space of $B$. As shown in (141), (113) ensures that the spectral gap $\chi$ in the statement of Lemma 18 is at least $\chi > \bar{\eta}$. In Lemma 10, we show that in the infinite-sample, noise-free regime, when investigating the pair $(t, j)$, the test metric satisfies

$\big\|A_0^{\top} B_0\big\|_2 = 0$ if $I_t \to I_j \in \mathcal{G}$ , and $\big\|A_0^{\top} B_0\big\|_2 \ge \dfrac{1}{\|G_{:,t}\|_2 \, \|(G^{\dagger})_{t,:}\|_2}$ if $I_t \to I_j \notin \mathcal{G}$ .  (172)

Since the approximate orthogonality test needs to distinguish even the weakest instance of non-orthogonality from orthogonality, we define

$\gamma^* \triangleq \min_{t \in [n]} \dfrac{1}{\|G_{:,t}\|_2 \, \|(G^{\dagger})_{t,:}\|_2}$ .  (173)

Using Lemma 18, we see that for $(t, j)$ for which $I_t \to I_j$ exists in $\mathcal{G}$, if (113) holds, then

$\big\|\hat{A}_0^{\top} \hat{B}_0\big\|_2 < \dfrac{2n}{\bar{\eta}} \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2$ .  (174)

Similarly, for $(t, j)$ for which $I_t \to I_j$ does not exist in $\mathcal{G}$, if (113) holds, then

$\big\|\hat{A}_0^{\top} \hat{B}_0\big\|_2 > \gamma^* - \dfrac{2n}{\bar{\eta}} \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2$ .  (175)

Therefore, for a choice of $\gamma \in (0, \gamma^*)$ to yield correct orthogonality decisions, it suffices that

$\dfrac{2n}{\bar{\eta}} \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \gamma < \gamma^* - \dfrac{2n}{\bar{\eta}} \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2$ ,  (176)

or, equivalently,

$\dfrac{2n}{\bar{\eta}} \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \min\{\gamma, \gamma^* - \gamma\} \triangleq \bar{\gamma}$ .  (177)

In other words, when using $\gamma \in (0, \gamma^*)$ as the approximate orthogonality threshold, we can correctly distinguish orthogonal subspaces from non-orthogonal ones when

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \dfrac{\bar{\eta} \bar{\gamma}}{2n}$ .  (178)

Due to Lemma 10, if all the approximate rank and orthogonality tests are correct, then the output $\hat{\mathcal{G}}$ of Algorithm 2 is the graph isomorphism of the transitive closure of the true latent graph $\mathcal{G}$.

D Sample complexity for generic consistent score estimator

In this section, we transform the bounded-noise identifiability guarantees of Appendix C into sample complexity statements for a generic consistent score difference estimator. Specifically, recall the consistent score difference estimator model from (15):

$\mathbb{P}\Big( \max_{m \in [n]} \big\|\hat{d}^m_X(\cdot\,; X^N) - d^m_X\big\|_{p_X} \le \epsilon \Big) \ge 1 - \delta$ , $\forall N \ge N(\epsilon, \delta)$ .  (179)

Recall the definitions of the constants in (24),

$\beta \triangleq \Big( 4 \max_{m \in [n]} \|d^m_X\|_{p_X} \Big)^{-1}$ and $\beta_{\min} \triangleq 2 \min_{m \in [n]} \|d^m_X\|_{p_X}$ .  (180)

First, we derive the number of samples sufficient for $(\epsilon, \delta)$-approximating $\hat{R}^m_X$ using Lemma 11.

Lemma 20. For any $\epsilon > 0$ and $\delta > 0$, $N_R(\epsilon, \delta)$ samples suffice to ensure that

$\mathbb{P}\Big( \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 \le \epsilon \Big) \ge 1 - \delta$ , $\forall N \ge N_R(\epsilon, \delta)$ ,  (181)

where

$N_R(\epsilon, \delta) \triangleq N\big( \min\{\epsilon \beta, \beta_{\min}\}, \delta \big)$ .  (182)

Proof: In Lemma 11, in order to bound the error in $\hat{R}^m_X$ for a specific $m \in [n]$, we require

$\big\|d^m_X - \hat{d}^m_X\big\|_{p_X} < 2 \, \|d^m_X\|_{p_X}$ .  (183)

To ensure this for all $m \in [n]$, it suffices that

$\big\|\hat{d}^m_X - d^m_X\big\|_{p_X} < 2 \min_{m \in [n]} \|d^m_X\|_{p_X} \triangleq \beta_{\min}$ .  (184)

Under this condition, Lemma 11 shows that the error in the correlation matrices $R^m_X$ is bounded by the error in $d^m_X$ as

$\big\|\hat{R}^m_X - R^m_X\big\|_2 \le 4 \, \|d^m_X\|_{p_X} \, \big\|\hat{d}^m_X - d^m_X\big\|_{p_X}$ .  (185)

To get an $m$-agnostic bound, we can take the maximum of both sides to get

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 \le 4 \max_{m \in [n]} \|d^m_X\|_{p_X} \, \max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X}$ .  (186)

In other words, to achieve $\epsilon$ error in estimating $\hat{R}^m_X$, it suffices that the error in estimating $d^m_X$ is upper bounded as

$\epsilon \beta = \dfrac{\epsilon}{4 \max_{m \in [n]} \|d^m_X\|_{p_X}} > \max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X}$ .  (187)

In summary, (184) and (187) together give that

$\max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X} < \min\{\beta_{\min}, \epsilon \beta\}$  (188)

suffices to ensure

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 \le \epsilon$ .  (189)

Using the sample complexity of $\hat{d}^m_X$, this means that

$N \ge N_R(\epsilon, \delta) \triangleq N\big( \min\{\epsilon \beta, \beta_{\min}\}, \delta \big)$  (190)

samples suffice to ensure this error upper bound on $\hat{R}^m_X$ with probability at least $1 - \delta$.

Next, we restate and prove the sample complexity of estimating the causal order in Lemma 4.

Lemma 4 (Sample complexity - Causal order). Let $\eta \in (0, \eta^*)$. Under Assumption 1, for any $\delta > 0$, $N_{\mathrm{rank}}(\delta)$ samples suffice to ensure that, with probability at least $1 - \delta$, $I_{\pi}$ is a valid causal order, where $\pi$ is the output of Algorithm 1 and

$N_{\mathrm{rank}}(\delta) \triangleq N\Big( \min\Big\{ \dfrac{\beta \bar{\eta}}{n}, \beta_{\min} \Big\}, \delta \Big)$ .  (191)

Proof: Lemma 14 states that as long as the error in $\hat{R}^m_X$ is upper bounded by $\bar{\eta}/n$, the output of Algorithm 1 is correct. Then, using Lemma 20, we see that

$N \ge N_R\Big( \dfrac{\bar{\eta}}{n}, \delta \Big) = N\Big( \min\Big\{ \dfrac{\beta \bar{\eta}}{n}, \beta_{\min} \Big\}, \delta \Big)$  (192)

samples suffice to ensure that the estimated causal order is correct with probability at least $1 - \delta$.
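Before turning to the graph and variable recovery theorems, the compositions in Lemma 20 and Lemma 4 can be summarized programmatically. The sketch below uses hypothetical helper functions (not the paper's codebase), assumes the constants $\beta$, $\beta_{\min}$, $\bar{\eta}$, and $\bar{\gamma}$ are known, and takes a toy score-estimator rate purely for illustration; it maps a generic score-difference sample complexity $N(\epsilon, \delta)$ to the quantities in (182), (191), and (193) as reconstructed here.

```python
import numpy as np

def N_R(eps, delta, N_score, beta, beta_min):
    """Lemma 20 / (182): samples sufficient for max_m ||R_hat^m_X - R^m_X||_2 <= eps."""
    return N_score(min(eps * beta, beta_min), delta)

def N_rank(delta, N_score, beta, beta_min, eta_bar, n):
    """Lemma 4 / (191): samples sufficient for recovering a valid causal order."""
    return N_score(min(beta * eta_bar / n, beta_min), delta)

def N_graph(delta, N_score, beta, beta_min, eta_bar, gamma_bar, n):
    """Theorem 1 / (193): samples sufficient for delta-PAC graph recovery."""
    return N_score(min(beta * eta_bar / n,
                       beta * eta_bar * gamma_bar / (2 * n),
                       beta_min), delta)

# toy usage with a hypothetical estimator rate N(eps, delta) ~ (C/eps)^4 (log(1/delta))^4
C = 10.0
N_score = lambda eps, delta: int(np.ceil((C / eps) ** 4 * np.log(1.0 / delta) ** 4))
print(N_rank(0.05, N_score, beta=0.1, beta_min=0.5, eta_bar=0.2, n=5))
print(N_graph(0.05, N_score, beta=0.1, beta_min=0.5, eta_bar=0.2, gamma_bar=0.1, n=5))
```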
Next, we prove the sample complexity of estimating the latent graph.

Theorem 1 (Sample complexity - Graph). Let $\eta \in (0, \eta^*)$ and $\gamma \in (0, \gamma^*)$. Under Assumption 1, for any $\delta > 0$, $N_{\mathcal{G}}(\delta)$ samples suffice to ensure that the collective output $\hat{\mathcal{G}}(X^N)$ of Algorithms 1 and 2 satisfies $\delta$-PAC graph recovery, where

$N_{\mathcal{G}}(\delta) \triangleq N\Big( \min\Big\{ \dfrac{\beta \bar{\eta}}{n}, \dfrac{\beta \bar{\eta} \bar{\gamma}}{2n}, \beta_{\min} \Big\}, \delta \Big)$ .  (193)

Proof: Lemma 19 states that the estimated graph is correct if

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \min\Big\{ \dfrac{\bar{\eta}}{n}, \dfrac{\bar{\eta} \bar{\gamma}}{2n} \Big\}$ .  (194)

Therefore, Lemma 20 implies that

$N \ge N_R\Big( \min\Big\{ \dfrac{\bar{\eta}}{n}, \dfrac{\bar{\eta} \bar{\gamma}}{2n} \Big\}, \delta \Big) = N\Big( \min\Big\{ \dfrac{\beta \bar{\eta}}{n}, \dfrac{\beta \bar{\eta} \bar{\gamma}}{2n}, \beta_{\min} \Big\}, \delta \Big)$  (195)

samples suffice for correct graph recovery with probability at least $1 - \delta$.

Similarly, we prove the sample complexity of estimating the causal variables.

Theorem 2 (Sample complexity - Variables). Let $\eta \in (0, \eta^*)$. For any $\epsilon > 0$ and $\delta > 0$, $N_Z(\epsilon, \delta)$ samples suffice to ensure that the output $H(X^N)$ of Algorithm 3 satisfies $(\epsilon, \delta)$-PAC causal variables recovery, where

$N_Z(\epsilon, \delta) \triangleq N\Big( \min\Big\{ \dfrac{\epsilon \beta \bar{\eta}}{\sqrt{n} \, \|G\|_2}, \beta \bar{\eta}, \beta_{\min} \Big\}, \delta \Big)$ .  (196)

Proof: Lemma 17 states that if the input error is bounded as

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \bar{\eta}$ ,  (197)

then the estimated encoder achieves error rate $\epsilon$ when

$\|C_{\mathrm{err}}\|_2 \le \dfrac{\sqrt{n}}{\bar{\eta}} \, \|G\|_2 \, \max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 \le \epsilon$ .  (198)

Therefore, to ensure $(\epsilon, \delta)$-PAC causal variables recovery, it suffices that, with probability at least $1 - \delta$, the input error on $\hat{R}^m_X$ is upper bounded as

$\max_{m \in [n]} \big\|\hat{R}^m_X - R^m_X\big\|_2 < \min\Big\{ \dfrac{\epsilon \bar{\eta}}{\sqrt{n} \, \|G\|_2}, \bar{\eta} \Big\}$ .  (199)

Using Lemma 20, this implies that

$N \ge N\Big( \min\Big\{ \dfrac{\epsilon \beta \bar{\eta}}{\sqrt{n} \, \|G\|_2}, \beta \bar{\eta}, \beta_{\min} \Big\}, \delta \Big)$  (200)

samples suffice for $(\epsilon, \delta)$-PAC causal variables recovery.

E Sample complexity for RKHS-based score estimator

In this section, we adopt the RKHS-based score estimator of [12], state its assumptions, derive its noise model for score difference estimation, and use this model to make the sample complexity upper bounds derived in Appendix D explicit. The score estimator in [12] requires the support of the distribution to be an open subset of its domain [12, Assumption B.1]. Therefore, we estimate the score function of $X$ through its orthogonal projection onto the $n$-dimensional support manifold $\mathrm{col}(G)$. The projection matrix can be constructed from the observed data samples with probability 1 as long as $N \ge n$.

Next, we state Lemma 5 formally, including all the required assumptions.

Lemma 5 (Formal). Denote a trace-class matrix-valued kernel by $K \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$, denote its associated RKHS by $\mathcal{H}_K$, and define $\kappa \triangleq \sup_{x \in \mathrm{col}(G)} \mathrm{tr}\, K(x, x)$. Define an integral operator on $\mathcal{H}_K$ as

$L_{K,p} f \triangleq \displaystyle\int K(x, \cdot) f(x) p(x) \, \mathrm{d}x$ , $f \in \mathcal{H}_K$ ,  (201)

where $p$ is a pdf. Assume that there exist $f^0 \in \mathcal{H}_K$ and $f^m \in \mathcal{H}_K$ for all $m \in [n]$ such that $s_X = L_{K, p_X} f^0$ and $s^m_X = L_{K, p^m_X} f^m$. Then, using Tikhonov regularization, under [12, Assumptions B.1-B.5], for any $m \in [n]$, any $\delta \in (0, 1)$, and any $N \ge (2\sqrt{2} \kappa^2 \log(8n/\delta))^4$, with probability at least $1 - \delta$, the score estimator in [12, Theorem B.1] satisfies

$\big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le C \, N^{-1/4} \log(8n/\delta)$ ,  (202)

where $C$ is a sample-independent constant that depends only on $p_X$, $p^m_X$ for $m \in [n]$, and the structure of $\mathcal{H}_K$.

Proof: We use the fact that for a function $f \in \mathcal{H}_K$, we have

$\|f\|_{p_X} = \big\|L_{K, p_X} f\big\|_{\mathcal{H}_K} \le \big\|L_{K, p_X}\big\|_{\mathrm{op}} \, \|f\|_{\mathcal{H}_K}$ .  (203)

Therefore, even if we estimate $s^m_X$ using data from $p^m_X$ but evaluate it on data from $p_X$, we can still bound the MSE of the estimate using the Hilbert norm. The result in [12, Theorem B.1] states that, for any $m \in \{0\} \cup [n]$ and any $N \ge (2\sqrt{2} \kappa^2 \log(4/\delta))^4$,

$\mathbb{P}\Big( \big\|\hat{s}^m_X - s^m_X\big\|_{\mathcal{H}_K} \le C_1 \, N^{-1/4} \log(4/\delta) \Big) \ge 1 - \delta$ .  (204)

Since $\|\cdot\|_{p_X}$ is a norm, we have

$\big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le \big\|\hat{s}^m_X - s^m_X\big\|_{p_X} + \big\|\hat{s}_X - s_X\big\|_{p_X} \le \big\|L_{K, p_X}\big\|_{\mathrm{op}} \Big( \big\|\hat{s}^m_X - s^m_X\big\|_{\mathcal{H}_K} + \big\|\hat{s}_X - s_X\big\|_{\mathcal{H}_K} \Big)$ .  (205)

Combining this with the Hilbert norm error bound, taking a union bound, and defining $C \triangleq 2 C_1 \|L_{K, p_X}\|_{\mathrm{op}}$, we get

$\mathbb{P}\Big( \big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le C \, N^{-1/4} \log(4/\delta) \Big) \ge 1 - 2\delta$ .  (206)

Note that the noise model in (15) requires a bound on $\max_{m \in [n]} \|\hat{d}^m_X - d^m_X\|_{p_X}$. By taking a union bound, we get

$\mathbb{P}\Big( \max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le C \, N^{-1/4} \log(4/\delta) \Big) \ge 1 - 2n\delta$ .  (207)

By replacing $\delta$ with $\delta/(2n)$, we get, when $N \ge (2\sqrt{2} \kappa^2 \log(8n/\delta))^4$,

$\mathbb{P}\Big( \max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le C \, N^{-1/4} \log(8n/\delta) \Big) \ge 1 - \delta$ ,  (208)

which concludes the proof.

Corollary 2. The error bound in this result can be immediately transformed into a sample complexity statement. Specifically,

$\mathbb{P}\Big( \max_{m \in [n]} \big\|\hat{d}^m_X - d^m_X\big\|_{p_X} \le \epsilon \Big) \ge 1 - \delta$ , $\forall N \ge N(\epsilon, \delta)$ ,  (209)

where

$N(\epsilon, \delta) = \max\Big\{ \Big(\dfrac{C}{\epsilon}\Big)^4, \big(2\sqrt{2} \kappa^2\big)^4 \Big\} \Big( \log \dfrac{8n}{\delta} \Big)^4$ .  (210)

Using this $N(\cdot, \cdot)$ function, we can immediately make the sample complexity results of the previous section explicit. The theorem statements are also stated here for completeness.

Theorem 3 (RKHS-based sample complexity - Graph). Let $\eta \in (0, \eta^*)$ and $\gamma \in (0, \gamma^*)$. Under Assumption 1 and the conditions of Lemma 5, $N_{\mathcal{G}}(\delta)$ samples are sufficient to ensure that the collective output $\hat{\mathcal{G}}(X^N)$ of Algorithms 1 and 2 satisfies $\delta$-PAC graph recovery, where

$N_{\mathcal{G}}(\delta) = \max\Big\{ \Big(\dfrac{C}{\epsilon_{\mathcal{G}}}\Big)^4, \big(2\sqrt{2} \kappa^2\big)^4 \Big\} \Big( \log \dfrac{8n}{\delta} \Big)^4$ , and $\epsilon_{\mathcal{G}} \triangleq \min\Big\{ \dfrac{\beta \bar{\eta}}{n}, \dfrac{\beta \bar{\eta} \bar{\gamma}}{2n}, \beta_{\min} \Big\}$ .  (211)

Theorem 4 (RKHS-based sample complexity - Variables). Let $\eta \in (0, \eta^*)$. Under the conditions of Lemma 5, $N_Z(\epsilon, \delta)$ samples suffice to ensure that the output $H(X^N)$ of Algorithm 3 satisfies $(\epsilon, \delta)$-PAC causal variables recovery, where

$N_Z(\epsilon, \delta) = \max\Big\{ \Big(\dfrac{C}{\epsilon_Z}\Big)^4, \big(2\sqrt{2} \kappa^2\big)^4 \Big\} \Big( \log \dfrac{8n}{\delta} \Big)^4$ ,  (212)

$\epsilon_Z \triangleq \min\Big\{ \dfrac{\epsilon \beta \bar{\eta}}{\sqrt{n} \, \|G\|_2}, \beta \bar{\eta}, \beta_{\min} \Big\}$ .  (213)

F Properties of positive semidefinite matrices

In this section, we present properties of positive semidefinite (PSD) matrices that are commonly used in our proofs.

Proposition 2. For a real $k \times k$ PSD matrix $A$, the null space of $A$ can be specified using the quadratic form as

$\mathrm{null}(A) = \big\{ v \in \mathbb{R}^k : v^{\top} A v = 0 \big\}$ .  (214)

Proof: For a vector $v \in \mathbb{R}^k$, $Av = 0$ implies $v^{\top} A v = 0$ unconditionally. We prove the reverse direction by showing that $v^{\top} A v = 0$ implies $Av = 0$. First, since $A$ is PSD, it has a Cholesky decomposition $A = U^{\top} U$ for some matrix $U \in \mathbb{R}^{k \times k}$. If $v^{\top} A v = \|Uv\|_2^2 = 0$, then $Uv = 0$. This implies $Av = U^{\top} U v = 0$. This concludes the proof.

Proposition 3. If $A, B$ are real $k \times k$ PSD matrices, then

$\mathrm{null}(A + B) = \mathrm{null}(A) \cap \mathrm{null}(B)$ ,  (215)

and

$\mathrm{col}(A + B) = \mathrm{col}(A) + \mathrm{col}(B)$ ,  (216)

where $+$ denotes the Minkowski sum between sets, which is defined, for any two subsets $\mathcal{A}$ and $\mathcal{B}$ of a vector space, as

$\mathcal{A} + \mathcal{B} \triangleq \{ a + b : a \in \mathcal{A} , b \in \mathcal{B} \}$ .  (217)

Proof: Using Proposition 2, we can write $\mathrm{null}(A + B)$ as

$\mathrm{null}(A + B) = \big\{ v \in \mathbb{R}^k : v^{\top} (A + B) v = 0 \big\}$ .  (218)

Since $A$ and $B$ are both PSD, for any $v \in \mathbb{R}^k$, we have $v^{\top} A v \ge 0$ and $v^{\top} B v \ge 0$. Therefore, $v^{\top} (A + B) v = 0$ holds if and only if $v^{\top} A v = 0$ and $v^{\top} B v = 0$. That is,

$\mathrm{null}(A + B) = \big\{ v \in \mathbb{R}^k : v^{\top} A v = 0 \ \text{and} \ v^{\top} B v = 0 \big\}$ .  (219)

We note that, due to Proposition 2, the right-hand side of this expression is equal to $\mathrm{null}(A) \cap \mathrm{null}(B)$, which concludes the proof of (215). Next, note that the column space of a symmetric matrix is the orthogonal complement of its null space. Then, using (215), we have

$\mathrm{col}(A + B) = \big( \mathrm{null}(A + B) \big)^{\perp} = \big( \mathrm{null}(A) \cap \mathrm{null}(B) \big)^{\perp}$ .  (220)

Since $\mathrm{null}(A)$ and $\mathrm{null}(B)$ are subspaces of a finite-dimensional vector space, we have

$\big( \mathrm{null}(A) \cap \mathrm{null}(B) \big)^{\perp} = \big( \mathrm{null}(A) \big)^{\perp} + \big( \mathrm{null}(B) \big)^{\perp} = \mathrm{col}(A) + \mathrm{col}(B)$ .  (221)

This concludes the proof of (216).

G Hyperparameter estimation

In this section, we provide minor modifications to our algorithms that enable constructing rough estimates of the threshold upper bounds $\eta^*$ and $\gamma^*$ in practice.

Estimating $\eta^*$: We note that in Algorithm 1, at step $t$, all matrices investigated must have rank either $t$ or $t-1$. Therefore, the $(t-1)$-th largest eigenvalues of all of these matrices are nonzero and, for the noise-free counterparts, are at least $\eta^*$. This observation lets us iteratively build and refine an upper bound on $\eta^*$ during Algorithm 1, which we can use as a surrogate for $\eta^*$ in the subsequent Algorithms 2 and 3. For Algorithm 1 itself, at step $t$, we can pick the $k \in V_t$ with the minimal $t$-th eigenvalue; this is a threshold-free way of estimating the causal order.
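A minimal sketch of this $\eta^*$ surrogate is given below; the data structure and function name are hypothetical rather than the paper's interface, and the returned value is only a rough bound computed from the noisy matrices examined by Algorithm 1.

```python
import numpy as np

def eta_star_surrogate(mats_by_step):
    """Rough upper bound on eta*, built while running Algorithm 1 (Appendix G).

    mats_by_step: list indexed by step t = 1, 2, ...; entry t-1 is a list of the
    noisy correlation matrices examined at step t (a hypothetical interface).
    At step t these matrices have rank t or t-1, so their (t-1)-th largest
    eigenvalues are nonzero, and for the noise-free counterparts they are at
    least eta*; the minimum over all of them therefore upper bounds eta*.
    """
    bound = np.inf
    for t, mats in enumerate(mats_by_step, start=1):
        if t < 2:
            continue  # the (t-1)-th largest eigenvalue is only defined from t = 2 on
        for M in mats:
            eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]  # descending order
            bound = min(bound, eigvals[t - 2])              # (t-1)-th largest eigenvalue
    return bound
```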
Estimating $\gamma^*$: We first note that the definition in (23) depends entirely on the ground-truth decoder matrix $G$ and its pseudoinverse. Secondly, we note that Algorithm 3 can be used independently of Algorithm 2 to generate an estimate $H$ of $G^{\dagger}$, up to some estimation error and possible mixing among rows. We can compute the value

$\min_{i \in [n]} \dfrac{1}{\|H_{i,:}\|_2 \, \|(H^{\dagger})_{:,i}\|_2}$ ,  (222)

which would be equal to $\gamma^*$ had $H = G^{\dagger}$, and use it as our estimate of $\gamma^*$.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: Sections 4 and 5 establish the results claimed in the abstract and introduction.

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The limitations of the work, e.g., linear transformations, are clarified throughout the paper and discussed in Section 7.

Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3.
Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: The assumptions for the theoretical results are clearly given in the theorem statements, and the proofs of the results are provided in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The algorithm details are described in Section 4. The experimental procedure is described in Section 6. The codebase for the experiments can be found at https: //github.com/acarturk-e/finite-sample-linear-crl. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). 
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The codebase for the experiments can be found at https://github.com/ acarturk-e/finite-sample-linear-crl. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The experimental procedure is described in Section 6. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: The only experimental result with error bars is Figure 1a, in which the 95% error bars are reported but are indistinguishable from the plot lines. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. 
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Experiments are run on a single commercial laptop CPU. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The authors have reviewed the Neur IPS Code of Ethics and confirm that the research in this paper conforms with the code of ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This paper is mostly theoretical and does not pose potential negative societal impacts. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. 
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This paper is mostly theoretical and does not pose such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: Portions of the publicly available code of [5], available under Apache 2.0 license, is adopted in the code of our experiments. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. 
If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [Yes] Justification: The codebase for the experiments can be found at https://github.com/ acarturk-e/finite-sample-linear-crl, and is released under Apache 2.0 license. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: [NA] Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.