# functional_directed_acyclic_graphs__c0176410.pdf

Journal of Machine Learning Research 25 (2024) 1-48 Submitted 9/22; Revised 11/23; Published 3/24

Functional Directed Acyclic Graphs

Kuang-Yao Lee kuang-yao.lee@temple.edu Department of Statistics, Operations, and Data Science Temple University Philadelphia, PA 19122, USA

Lexin Li lexinli@berkeley.edu Department of Biostatistics and Epidemiology University of California at Berkeley Berkeley, CA 94720, USA

Bing Li bxl9@psu.edu Department of Statistics Pennsylvannia State University University Park, PA 16802, USA

Editor: Peter Spirtes

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to inﬁnity with the sample size. We demonstrate the eﬃcacy of our method through both simulations and an application to a time-course proteomic dataset.

Keywords: Graphical model, faithfulness, functional regression, linear operator, reproducing kernel Hilbert space, uniform consistency.

1. Introduction

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) based on multivariate functional data. Functional graphical modeling is becoming increasingly important, as multivariate functional data, where the observations are sampled from a vector of random functions, are fast emerging in a wide variety of scientiﬁc applications. Examples include molecular network modeling based on time-course gene or phosphoprotein data (Hill et al., 2016), and brain eﬀective connectivity analysis based on electrocorticography or functional magnetic resonance imaging data (Friston, 2011). A crucial problem in these applications is to investigate directional relations among the random functions, which is

c 2024 Kuang-Yao Lee, Lexin Li, and Bing Li.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1038.html.

Lee, Li, and Li

a challenging task. The DAG model oﬀers a tractable solution, and yet functional DAG modeling is a relatively underdeveloped topic. Consider a graph G = (V, E), where V = {1, . . . , p} denotes a set of vertices associated with a vector of random variables or functions, X1, . . . , Xp, and E {(i, j) V V : i = j} a set of edges. Let (V V)0 denote the set {(i, j) V V : i = j} that excludes the diagonal pairs. A pair (i, j) E denotes an edge between vertices i and j, and is said to be directed from i to j if (j, i) / E. In this relation, i is called a parent of j, and j a child of i. If both (i, j) and (j, i) belong to E, then the edge is said to be undirected. If there is a directed path i . . . j from i to j, then i is called an ancestor of j, and j a descendant of i. A DAG is a graph that contains only directed edges and no directed cycles. If two edges meet head-to-head at a vertex i on a path, say j i k, then i is called a collider on the path. For S V\{i, j}, the vertices i and j are said to be d-connected by S if and only if there exists a path connecting i and j that satisﬁes: (i) every collider in the path is either in S or has a descendant in S, and (ii) no non-collider in the path is in S. By convention, (i) includes the cases where the path has no collider at all. Also note that the descendant of a collider in the path does not belong to the path. We say i and j are d-separated by S if they are not d-connected by S. In the classical random variable setting, directional relations among the variables can be depicted by faithfulness. Formally, X = (X1, . . . , Xp)T Rp is said to be faithful with respect to a DAG G, if and only if the collection of the triplets {(i, j, S) T : i and j are d-separated by S} is the same as {(i, j, S) T : Xi Xj | XS}, where denotes statistical independence, and T = {(i, j, S) : (i, j) (V V)0, S V\{i, j}}. Moreover, when X follows a multivariate Gaussian distribution, we have the equivalence that Xi Xj | XS cov (Xi, Xj|XS) = 0. In the functional data setting, the notion of faithfulness is essentially the same. Speciﬁcally, suppose X = (X1, . . . , Xp) T is a p-dimensional Gaussian random element in a Hilbert space of functions deﬁned on an interval T in R. We say X is faithful with respect to a DAG G, if and only if the following equivalence holds:

i and j are d-separated by S Xi Xj | XS. (1)

The conditional independence in (1) is in terms of Hilbertian random elements, and is formally deﬁned in Section 2. Through the faithfulness in (1), a DAG is associated with a collection of conditional independence relations among p random functions. To evaluate the conditional independence in (1), we develop two linear operators, the conditional covariance operator (CCO), and the partial correlation operator (PCO). Under the assumption that the p-variate random function X follows a Gaussian distribution, the conditional independence can be completely characterized by CCO or PCO, in the sense that CCO or PCO is zero if and only if the conditional independence between the random functions holds true. This agrees with the classical result that the conditional independence between two Gaussian variables is equivalent to their conditional covariance or partial correlation being zero. Henceforth, CCO and PCO can be viewed as the functional counterparts of conditional covariance and partial correlation. We next estimate the DAG by repeatedly evaluating the Hilbert-Schmidt norms of CCO and PCO, whose computation is straightforward and only involves eigen-decomposition of linear operators. We further embed CCO and PCO in the commonly used PC-algorithm (Spirtes et al., 2000), and turn

Functional Directed Acyclic Graphs

the NP-hard problem of evaluating conditional independence for every pair (i, j) and every possible subset S V\{i, j} to a computationally eﬃcient algorithm of order pm, where m is the maximum degree of the DAG. We also establish the error bounds and the uniform convergence for the estimated CCO and PCO, as well as the uniform consistency of the estimated DAG. We carry out all these theoretical investigations by allowing the graph size to diverge to inﬁnity along with the sample size. We note that our theoretical analysis faces a number of challenges. The ﬁrst is, when characterizing the conditional independence between two random functions, we need to deal with diverging numbers of random variables from the Karhunen-Lo eve expansions of the functions, which requires a much more involved asymptotic analysis than in the classical setting. The second challenge is, most existing concentration inequalities as well as their suﬃcient conditions have been tailored for high-dimensional random variables, rather than high-dimensional random functions. To ﬁll this gap, we develop suitable new asymptotic tools, including functional versions of the sub-Gaussianity and Bernstein s inequality for Hilbert space-valued random elements. The third challenge is, because the covariances are replaced by the covariance operators, we need to establish several concentration bounds and uniform convergences for the relevant linear operators in the high-dimensional setting. In fact, the theoretical framework we develop here is fairly general, and we expect it to be useful in other functional data analysis settings as well, especially when the number of functions involved is large compared to the sample size. It is also noteworthy that there is some novelty in our presentation of the classical DAG theory: We describe the CPDAG as a member of the quotient space of the collection of all DAGs, which makes its subtle deﬁnition more transparent and explicit. Our work is a natural step forward in the current research on statistical graphical modeling. The majority of existing solutions focus on undirected or directed graphical models for random variables. Notable examples of the former include Yuan and Lin (2007); Friedman et al. (2008); Ravikumar et al. (2011); Cai et al. (2011); Guo et al. (2011); Ren et al. (2015); Fan and Lv (2016); Liu et al. (2021), among others, whereas examples of the latter include Chickering (2002); Kalisch and B uhlmann (2007); Harris and Drton (2013); van de Geer and B uhlmann (2013); Li et al. (2020), among others. More recently, Zhu et al. (2016), Li and Solea (2018), Qiao et al. (2019), Qiao et al. (2020), Solea and Li (2020), and Zhao et al. (2022) extended undirected graphical models to random functions. Even though the undirected and directed graphs are related, we discuss their diﬀerences in Section 7.3. Despite the recent progress, the problem of estimating functional directed graphical models remains largely underdeveloped. G omez et al. (2021) reformulated DAG estimation as sparse function-on-function regression (Fan et al., 2015; Luo and Qi, 2016), but required the causal order is known a prior, which can be unrealistic in practice. Lee and Li (2022) relaxed this requirement and proposed to estimate DAG through a functional structural equation model (SEM) with two man steps: order determination, then sparse functional regression. Our proposal is considerably diﬀerent in several ways. First, unlike G omez et al. (2021), our method does not assume the order is known, and thus the problem is more challenging; see also van de Geer and B uhlmann (2013) for the diﬀerences between DAG estimation with and without a known order in the random variable setting. Second, our method is built upon the proposed functional PC-algorithm, and belongs to the category of independence-based solutions, whereas Lee and Li (2022) is built upon function-on-function

Lee, Li, and Li

regression, and belongs to the category of SEM-based approaches; see also Peters et al. (2014) for the diﬀerences of these two categories of solutions under the random variable setting. Due to this methodological diﬀerence, the subsequent asymptotic analysis becomes utterly diﬀerent too. In Section 7.1, we establish a one-to-one correspondence between the functional DAG and the functional linear SEM, which in turn reveals how the functional DAG factorizes the joint distribution, and how to interpret the identiﬁed edges. We further develop in Section 7.2 the notion of do-intervention for possible causal interpretation in the functional setting. None of these results are available in Lee and Li (2022). We also remark that our proposal hinges on linear operators, which are being increasingly employed in graphical modeling (Li et al., 2014; Lee et al., 2016, 2020). Nevertheless, we use linear operators to study a completely diﬀerent problem. For instance, Lee et al. (2021) studied conditional undirected graphs that vary along with the external variables, whereas we target DAG estimation for multivariate random functions. Correspondingly, the estimation method and asymptotic theory are substantially diﬀerent. The rest of the article is organized as follows. We introduce DAG and its equivalence class in Section 2, and develop the linear operators, CCO and PCO, in Section 3. We derive their sample estimation and the modiﬁed PC-algorithm in Section 4, and develop the asymptotic theory in Section 5. We conduct numerical studies in Section 6, and further discuss the model in Section 7. We relegate all technical proofs and additional numerical results to the Appendix.

2. DAG and its equivalence class for functions

Let (Ω, F, P) denote a probability space. For an interval T R and t T, let X(t) = [X1(t), . . . , Xp(t)]T denote a vector of multivariate random functions of dimension p deﬁned on Ωand taking values in ΩX = ΩX1 ΩXp, where each ΩXi is a Hilbert space of R-valued functions on T, i V. Let , ΩXi denote the inner product in ΩXi, and ΩXi the norm induced by this inner product. We allow p to increase with the sample size n; that is, p = p(n), and limn p(n) = . Henceforth, V also depends on n implicitly. We abbreviate X(t) and Xi(t) as X and Xi whenever there is no confusion. Next, we introduce two assumptions on X.

Assumption 1 There exists M0 > 0 such that max{E Xi 2 ΩXi : i V} M0.

Assumption 2 The p-variate random function X is a zero-mean Gaussian random element in ΩX and is faithful with respect to a DAG G; i.e., X satisﬁes (1).

Assumption 1 is standard in high-dimensional functional data analysis, and ensures that the trace of the covariance operator of Xi is uniformly bounded. Assumption 2 is our main distributional assumption. The zero-mean condition is to simplify the development, and can be easily relaxed. It is also possible to relax the Gaussian condition, by employing the notions of functional additive conditional independence (Li and Solea, 2018) or copula graphical models (Liu et al., 2012; Solea and Li, 2020). Nevertheless, we feel the Gaussian case is important for its own sake and is worthy of a full investigation. We leave the non-Gaussian extension as future research.

Functional Directed Acyclic Graphs

Under Assumption 1, the bilinear form (f, g) 7 E( f, Xi ΩXi g, Xj ΩXj ) from ΩXi ΩXj to R is bounded, and induces a bounded linear operator ΣXi Xj : ΩXj ΩXi for each (i, j) V V. We call it the covariance operator from ΩXj to ΩXi. We then deﬁne the joint covariance operator ΣXX : ΩX ΩX as the p p matrix of operators whose (i, j)th entry is ΣXi Xj. This means, for any f = (f1, . . . , fp)T ΩX, we have ΣXXf = Pp j=1ΣX1Xjfj, . . . , Pp j=1ΣXp Xjfj T . Moreover, for any subsets A, B V, we deﬁne ΣXAXB to be the matrix of operators whose entries are {ΣXi Xj : i A, j B}, and its dimension is |A| |B|, where |A| denotes the cardinality of A. We note that, in Assumption 2, an ΩX-valued random element X is Gaussian if and only if f, X ΩX is a Gaussian random variable for every f = (f1, . . . , fp)T ΩX, or equivalently, E exp Pp i=1ι fi, Xi ΩXi = exp 1/2 Pp i=1 Pp j=1 fi, ΣXi Xjfj ΩXi , where ι = 1. Next, we formally deﬁne the notion of conditional independence of random functions. For the probability space (Ω, F, P), suppose (ΩX, FX), (ΩY , FY ), (ΩZ, FZ) are measurable spaces, and X : Ω ΩX, Y : Ω ΩY , Z : Ω ΩZ are random elements in (ΩX, FX), (ΩY , FY ), (ΩZ, FZ), respectively. We say that X and Y are conditionally independent given Z, if and only if, for every A FX, B FY ,

P(X A, Y B | Z) = P(X A | Z) P(Y B | Z) almost surely P.

In our case, ΩX, ΩY , ΩZ are separable Hilbert spaces, and FX, FY , FZ, are the Borel σ-ﬁelds generated by the open sets in ΩX, ΩY , ΩZ, respectively. The speciﬁc forms of ΩX, ΩY , ΩZ are determined by contexts. This general deﬁnition satisﬁes all the relevant axioms of conditional independence such as those described in Lauritzen (1996, Chapter 3). In particular, as in the setting of undirected graphs, under the Gaussian assumption, the pairwise Markov property is equivalent to the global Markov property. For a DAG G deﬁned on V, let H = {(i, j, S) : (i, j) V V, S V\{i, j}}, and H0 = {(i, j, S) H : i = j}. Moreover, let

D(G) = {(i, j, S) H0 : i and j are d-separated by S under G},

F = {(i, j, S) H0 : Xi Xj | XS}. (2)

By deﬁnition, X is faithful to G if D(G) = F. It is possible for two diﬀerent DAGs, say G1 and G2, to share the same D; i.e., D(G1) = D(G2). Consequently, by conditional independence and faithfulness, we can only determine the class of DAGs with the same D. Let G be the collection of all DAGs deﬁned on V. We say that G1, G2 G are equivalent, and write G1 G2 if and only if D(G1) = D(G2), where is an equivalence relation. This is called the Markov equivalence (Peters et al., 2014). Thus, each G G induces an equivalence class {G G : G G}. The collection of all Markov equivalence classes forms a partition of G, which is a quotient space of G, denoted by G/ (Kelley, 1955), and is referred to as the quotient space of Markov equivalence. A partially directed graph is a graph in which some edges are undirected. A completed partially acyclic directed graph (CPDAG) is a partially directed acyclic graph L such that there exists a member D of G/ satisfying the following properties: (i) each directed edge in L is an edge in every DAG in D; and (ii) for each undirected edge, say i j, in L, there is a DAG in D with i j being one of its edges, and another DAG in D with j i

Lee, Li, and Li

being one of its edges. Chickering (2002, Lemma 2) showed that two DAGs are Markov equivalent if and only if they have the same CPDAG. In other words, a member D of G/ corresponds to one and only one CPDAG. This also establishes a bijection between G/ and the collection of all CPDAGs.

3. Two linear operators

In this section, we formally develop the conditional covariance operator and the partial correlation operator to estimate the CPDAG, and derive their population properties. We begin with some notation. Let Ωand Ω be two Hilbert spaces, B (Ω, Ω ) the collection of all bounded linear operators from Ωto Ω , B 2(Ω, Ω ) the collection of all Hilbert-Schmidt operators from Ωto Ω , and B 1(Ω, Ω ) the collection of all trace class operators from Ωto Ω . When Ω = Ω, we write B (Ω, Ω) as B (Ω), B 2(Ω, Ω) as B 2(Ω), and B 1(Ω, Ω) as B 1(Ω). Let , HS, and TR be the operator norm in B (Ω), the Hilbert-Schmidt norm in B 2(Ω), and the trace norm in B 1(Ω) respectively. For a linear operator A, let ker(A), ran(A), and ran(A) denote the kernel, range, and the closure of the range of A; i.e., ker(A) = {f Ω: Af = 0}, ran(A) = {Af : f Ω}, ran(A) = cl [ran(A)], where cl( ) stands for the closure of a set. For a bounded and self-adjoint linear operator A, its restriction on ran(A) is an injective function from ran(A) to ran(A). We call the inverse of this function A|ran(A) as the Moore-Penrose inverse, and denote it by A . That is, A

is the mapping from ran(A) to ran(A) such that, for any x ran(A), A x is the unique member y ran(A) satisfying Ay = x. Let A denote the adjoint of A. For two positive sequences {an} and {bn}, we write an bn or bn an if an/bn is bounded, and write an bn or bn an if an/bn 0. If two sequences an and bn are ordered by , then we use an bn to denote the smaller sequence in terms of . Moreover, we write an bn if an bn and bn an. For each i V, let {(λa i , ηa i )}a N be the collection of eigenvalue-eigenfunction pairs of ΣXi Xi, with λ1 i λ2 i 0 and N being the set of natural numbers {1, 2, . . . , }. Then Xi = P

a Nca i ηa i holds almost surely, where ca i = Xi, ηa i are from i.i.d. N(0, λa i ). This expansion is known as the Karhunen-Lo eve (K-L) expansion, and c1 i, c2 i, . . . are called the functional principal component scores (Bosq, 2000). We note that the K-L expansion has been widely used in functional data analysis (Yao et al., 2005; Yao and M uller, 2010; Li and Guan, 2014; Chen and Lei, 2015). Particularly, Qiao et al. (2019) also used the K-L expansion for undirected functional graphic modeling. We next formally deﬁne the conditional covariance operator. Let

MXAXB = Σ XAXAΣXAXB, for A, B V.

Given its resemblance to the regression coeﬃcient matrix in the classical regression, we call MXAXB a regression operator, and the next assumption ensures it is well-deﬁned.

Assumption 3 Suppose ran(ΣXAXB) ran(ΣXAXA), and MXAXB is Hilbert-Schmidt.

The Hilbert-Schmidt assumption on MXAXB regulates the degree of smoothness in the dependence between XA and XB. That is, the output of ΣXAXB needs to suﬃciently concentrate on the leading eigenfunctions of ΣXAXA. Similar conditions are commonly imposed in the literature (see, e.g., Lee et al., 2016; Li, 2018).

Functional Directed Acyclic Graphs

For any A V, let ΩXA be the direct sum of {ΩXi : i A}; i.e., ΩXA is the Cartesian product of ΩXi, i A, and the inner product in ΩXA is the sum of the inner products in ΩXi, i A. The next proposition shows that the regression operator MXAXB in fact uniquely determines the mapping f 7 E( f, XB ΩXB | XA), which characterizes E(XB | XA). In other words, MXAXB has a one-to-one correspondence with E(XB | XA).

Proposition 1 Suppose Assumptions 1 to 3 hold. Then, for f ΩXB, MXAXBf, XA ΩXA = E( f, XB ΩXB | XA).

Deﬁnition 1 Suppose Assumptions 1 to 3 hold. For (i, j, S) H, let ΣXi Xj|XS : ΩXj ΩXi be the linear operator ΣXi Xj M XSXiΣXSXSMXSXj.

We call ΣXi Xj|XS the conditional covariance operator (CCO) of Xi and Xj given XS.

We note that, an equivalent form of CCO is ΣXi Xj|XS = ΣXi Xj ΣXi XSΣ XSXSΣXSXj. However, the form in Deﬁnition 1 involves the regression operator, and is more conducive for subsequent proofs. The next theorem establishes the properties of ΣXi Xj|XS.

Theorem 1 Suppose Assumptions 1 to 3 hold. Then, for every (i, j, S) H0,

(i) ΣXi Xj|XS = 0 if and only if Xi Xj | XS;

(ii) ΣXi Xj|XS = P

a,b N cov(ca i , cb j | XS)(ηa i ηb j), where ca i is the ath K-L coeﬃcient of Xi, and ηa i is the eigenfunction of ΣXi Xi associated with its ath largest eigenvalue.

Theorem 1 (i) generalizes the classical result when X is a vector of Gaussian random variables to the functional setting. Consequently, we can use the CCO to characterize the conditional independence between Xi and Xj given XS. We next deﬁne the partial correlation operator, which extends partial correlation to the functional setting. This is motivated by the observation that partial correlation achieves better scaling in the classical random variable setting (Peng et al., 2009; Lee et al., 2016; Liu, 2017). The next theorem establishes the existence of PCO and its connection with the conditional independence. Its proof is similar to that of Lee, Li, and Zhao (2016, Theorem 1), and is thus omitted.

Theorem 2 Suppose Assumptions 1 to 3 hold. Then, there exists a unique operator RXi Xj|XS B (ΩXj, ΩXi) such that,

(i) ΣXi Xj|XS = Σ 1/2 Xi Xi|XSRXi Xj|XSΣ 1/2 Xj Xj|XS;

(ii) RXi Xj|XS 1, for every (i, j, S) H0.

Moreover, RXi Xj|XS = 0 if and only if Xi Xj | XS.

Deﬁnition 2 We call RXi Xj|XS in Theorem 2 the partial correlation operator (PCO) of Xi and Xj given XS.

Lee, Li, and Li

4. Estimation

Theorems 1 and 2 suggest two linear operators, ΣXi Xj|XS and RXi Xj|XS, to characterize the conditional independence. In this section, we develop their sample estimators, ﬁrst at the operator level, then at the coordinate level. We then develop a procedure for evaluating Xi Xj | XS for a given triplet (i, j, S) H0, and a modiﬁed PC-algorithm for estimating the entire DAG.

4.1 Operator-level estimation

We ﬁrst derive the K-L expansion at the sample level. Suppose X1, . . . , Xn are i.i.d. samples from X, where Xk = (Xk 1 , . . . , Xk p)T, k = 1, . . . , n. Let En represent the sample mean operator; i.e., En(U) = Pn k=1U k/n, for a set of samples (U 1, . . . , U n). The covariance operator ΣXi Xj is estimated by

ˆΣXi Xj = En [(Xi En Xi) (Xj En Xj)] ,

for any (i, j) V V. For each i V, let {(ˆλa i , ˆηa i )}a N be the sequence of eigenvalueeigenfunction pairs of ˆΣXi Xi. Then, the sample-level K-L expansion of Xk i En(Xi) is Xk i En(Xi) = P

a Nˆc k,a i ˆηa i , where ˆc k,a i = Xk i En(Xi), ˆηa i ΩXi. To improve estimation eﬃciency, we further truncate this expansion at the dth term to obtain the approximation, Xk i En(Xi) Pd a=1ˆc k,a i ˆηa i , for all k = 1, . . . , n, and i V. Note that, we allow the truncation number d to depend on n. Correspondingly, the truncated estimate of ΣXi Xj is ˆΣd Xi Xj = Pd a,b=1En(ˆca i ˆcb j)(ˆηa i ˆηb j). We next estimate the regression operators MXSXi by

ˆ MXSXi = (ˆΣd XSXS + ϵI) 1 ˆΣd XSXi, (3)

for i V, where ϵ > 0 is a ridge-type tuning parameter, and I is the identity operator from ΩXS to ΩXS. The inverse is done by taking the inverse of the eigenvalues; see Proposition 3 in Section 4.2. Following Theorem 1(ii), we estimate the CCO by

ˆΣ d,ϵ Xi Xj|XS = Pd a,b=1{En(ˆca i ˆcb j) En[( ˆ MXSXi ˆηa i )(XS)( ˆ MXSXj ˆηb j)(XS)]}(ˆηa i ˆηb j).

Similarly, following Theorem 2(i), we estimate the PCO by

ˆR d,ϵ,δ Xi Xj|XS = (ˆΣ d,ϵ Xi Xi|XS + δI) 1/2 ˆΣ d,ϵ Xi Xj|XS(ˆΣ d,ϵ Xj Xj|XS + δI) 1/2, (4)

where δ > 0 is a ridge parameter that regularizes the inverses of the two matrices. Note that in (3) and (4), we both truncate the K-L expansion and employ the ridgetype regularization. We choose to do so for the following reasons. First of all, we note that the rank of ˆΣd XSXS is dns, where ns is the cardinality of S. This rank can be larger than the sample size n, and thus we need to introduce an extra ridge-type regularization in (3). Meanwhile, since the ranks of ˆΣ d,ϵ Xi Xi|XS and ˆΣ d,ϵ Xj Xj|XS are d, which is typically smaller than n, an alternative estimator of the PCO is not to employ the ridge regularization as in (3) and (4), which leads to

Rd Xi Xj|XS = ( Σd Xi Xi|XS) 1/2 Σd Xi Xj|XS( Σd Xj Xj|XS) 1/2, (5)

Functional Directed Acyclic Graphs

where Σd Xi Xj|XS = ˆΣd Xi Xj ˆΣd Xi XS ˆΣ d XSXS ˆΣd XSXj, and ˆΣ d XSXS = Pd a=1(λa S) 1(ηa S ηa S) is the Moore-

Penrose inverse of ˆΣd XSXS, for (i, j, S) H0. Comparing (5) to (4), this alternative PCO estimator has two fewer tuning parameters, and thus is computationally easier. However, as we remark after Theorem 4 in Section 5.2, the asymptotic analysis of (4) is much easier than that of (5). Moreover, when d = 1, (5) is closely related to one of the competing methods, linear-PC, that we numerically compare in Section 6.1, and we show that (4) achieves a better empirical performance than (5). Therefore, we propose (4) as our PCO estimator, and build our DAG estimation based on (4).

4.2 Coordinate-level evaluation for conditional independence

We begin with constructing the spaces ΩXi and ΩX using functional bases. Let H be a generic ﬁnite-dimensional Hilbert space spanned by a set of functions B = {b1, . . . , bm} deﬁned on T. For any h H, let [h]B = ([h]B,1 , . . . , [h]B,m)T denote the coordinate of h with respect to B; that is, h = Pm i=1 [h]B,i bi = b T 1:m [h]B, where b1:m denotes the vector of functions (b1, . . . , bm)T. Let KB = { bs, bt H} m s,t=1 be the Gram kernel matrix. Then the inner product h1, h2 H can be expressed as [h1] T B KB [h2]B. We next derive the coordinate representation of Xk, k = 1, . . . , n. Suppose each Xk

is measured on uk time points Tk = {tk1, . . . , tkuk}, k = 1, . . . , n. Let T1:n = n k=1Tk, τ1 < τ2 < < τℓbe all the time points in T1:n, and ℓ= |T1:n| be the total number of distinctive time points. Let ΩXi be the linear space spanned by the functions {κT( , τ) : τ T1:n}. Let KT = {κT(τi, τj)} ℓ i,j=1. Suppose KT is of rank r, and has the spectral decomposition KT = UTDTU T T , where DT Rr r is the diagonal matrix whose diagonal elements are the sorted nonzero eigenvalues of KT, and UT Rℓ r is the matrix whose columns are the eigenvectors corresponding to the eigenvalues in DT. Let (b1, . . . , br)T = D 1/2 T U T T [κT( , τ1), . . . , κT( , τℓ)]T. It is straightforward to verify that Br = {b1, . . . , br} is an orthonormal basis of ΩXi. Using this basis, each Xk i can be represented as Xk i ( ) = Pr t=1 [Xk i ]Br,t bt( ) = b T 1:r( ) [Xk i ]Br. Note that the kth individual Xk i is observed only at uk time points in Tk. Let Xk i (Tk) = Xk i (tk1), . . . , Xk i (tkuk) T, and b1:r(Tk) = b1:r(tk1), . . . , b1:r(tkuk) . Therefore, we have Xk i (Tk) = b1:r(Tk)T [Xk i ]Br, on both side of which we then multiply b1:r(Tk) to get b1:r(Tk)Xk i (Tk) = b1:r(Tk)b1:r(Tk)T [Xk i ]Br. Solving this linear equation with a ridge-type regularization, we obtain the following coordinate representation of [Xk i ]Br, for i V and k = 1, . . . , n,

[Xk i ]Br = [b1:r(Tk)b1:r(Tk)T + ϵk TIr] 1 b1:r(Tk)Xk i , (6)

where Ir is the r r identity matrix, and ϵk T is a ridge tuning parameter. We next derive the coordinate representations of the truncated sample covariance operators. Let H and H be two ﬁnite-dimensional Hilbert spaces spanned by B = {b1, . . . , bm} and B = {b 1, . . . , b m }, respectively. Let A : H H be a linear operator. The coordinate representation of A with respect to B and B is deﬁned as ([Ab1]B , . . . , [Abm]B ) B [A]B. If H is a third Hilbert space with basis B , and A is a linear operator from H to H , then

B [A A]B = (B [A ]B )(B [A]B). For simplicity, we abbreviate B [A]B by [A] when the bases B, B are obvious from the context.

Proposition 2 For each (i, j) V V, [ˆΣXi Xj] = En [Xi En Xi] [Xj En Xj] T .

Lee, Li, and Li

Therefore, by the deﬁnition of ˆ MXSXi, the coordinate representation of ˆ MXSXif is

[ ˆ MXSXif] = ([ˆΣXSXS] + ϵIrns) 1[ˆΣXSXi][f],

for each f ΩXi, i V, and S V, with ns being the cardinality of S. Let {(ˆλa i , [ˆηa i ])}r a=1 denote the collection of eigenvalue-eigenvector pairs of [ˆΣXi Xi], i V. Then the sample-level K-L expansion of Xk i En Xi is

Xk i En Xi = Pd a=1 Xk i En Xi, ˆηa i ΩXi ˆηa i = Pd a=1 [Xk i En Xi] T [ˆηa i ] ˆηa i . (7)

Finally, we derive the coordinate representations of the sample CCO and PCO. Recall that ˆc k,a i denotes the inner product Xk i En Xi, ˆηa i ΩXi. For i V, a = 1, . . . , d, let Ca i be the n-dimensional vector {c k,a i : k = 1, . . . , n}, and C1:d i be the n d matrix whose ath column is Ca i . For S = {a1, . . . , ans} V\{i, j}, let CS be the 1 ns block matrix C1:d a1 , . . . , C1:d ans

. The sample estimates of CCO and PCO are given in Proposition 3. Its proof can be derived from Proposition 2 and (7), and is omitted.

Proposition 3 The coordinate representations of ˆΣ d,ϵ Xi Xj|XS and ˆR d,ϵ,δ Xi Xj|XS with respect to B i = {ˆη1 i , . . . , ˆηd i } and B j = {ˆη1 j, . . . , ˆηd j } are:

B i [ˆΣ d,ϵ Xi Xj|XS]B j = n 1{(ca i )T[In D(S)]cb j}d a,b=1 Mi,j|S(ϵ), (8)

B i [ ˆR d,ϵ,δ Xi Xj|XS]B j = [Mi,i|S(ϵ) + δId] 1/2Mi,j|S(ϵ)[Mj,j|S(ϵ) + δId] 1/2, (9)

where D(S) = CS [(CT S CS + ϵInsd) 1CT S CS(CT S CS + ϵInsd) 1] CT S .

Based on (8), we compute the squared Hilbert-Schmidt (H-S) norm of ˆΣ d,ϵ Xi Xj|XS as

ˆΣ d,ϵ Xi Xj|XS 2 HS = Pd a=1 ˆΣ d,ϵ Xi Xj|XS ˆηa j , ˆΣ d,ϵ Xi Xj|XS ˆηa j ΩXi = Pd a=1 ˆηa j T

B j (B j [ˆΣ d,ϵ Xj Xi|XS]B i )(B i [ˆΣ d,ϵ Xi Xj|XS]B j ) ˆηa j

B j = B i [ˆΣ d,ϵ Xi Xj|XS]B j 2 F,

where F is the Frobenius norm. In other words, the H-S norm of ˆΣ d,ϵ Xi Xj|XS is simply

the Frobenius norm of the matrix B j [ˆΣ d,ϵ Xi Xj|XS]B i . Similarly, the H-S norm of ˆR d,ϵ,δ Xi Xj|XS is

the Frobenius norm of the matrix B i [ ˆR d,ϵ,δ Xi Xj|XS]B j in (9). We then threshold the H-S norms

ˆΣ d,ϵ Xi Xj|XS HS and ˆR d,ϵ,δ Xi Xj|XS HS to evaluate the conditional independence; that is, we declare Xi Xj | XS if

B j [ˆΣ d,ϵ Xi Xj|XS]B i F < ρCCO, B j [ ˆR d,ϵ,δ Xi Xj|XS]B i F < ρPCO, (10)

where ρCCO and ρPCO are the threshold values determined adaptively given the data. Observing that for the random variable case, the partial correlation can be tested using its Fisher z-transformation, whose variance is approximated by (n |S| 3) 1 (Harris and Drton, 2013), we take ρCCO and ρPCO to be proportional to (n |S| 3) 1/2. Our numerical experiments have found that the results are not overly sensitive to the choice of these threshold values as long as they are within a reasonable range. We summarize the above estimation procedure in Algorithm 1.

Functional Directed Acyclic Graphs

Algorithm 1 Evaluation of Xi Xj | XS for a given triplet (i, j, S) H0.

1: Choose a kernel κT; e.g., the Brownian motion function κT(s, t) = min(s, t), or the radial basis function κT(s, t) = exp { γT(s t)2}, for (s, t) T T, where γT = P

s<t|τs τt|/(ℓ 2) 2. Compute the Gram matrix KT, and perform spectral decomposition to obtain UT and DT.

2: Compute the coordinate [Xk i ] using (6), for i V, k = 1, . . . , n, where the ridge parameter is set as ϵk T = 0.01 σmax[b1:r(Tk)b1:r(Tk)T], k = 1, . . . , n, with σmax( ) denoting the largest eigenvalue.

3: Perform the spectral decomposition on En [Xi En Xi] [Xi En Xi] T to obtain its ath eigenvector [ˆηa i ], and K-L coeﬃcient ˆc k,a i for Xk i using (7), i V and a = 1, . . . , d, where the parameter d is set as d = [n1/5]. Then obtain Ca i , C1:d i , and CS.

4: Compute the coordinates of ˆΣ d,ϵ Xi Xj|XS and ˆR d,ϵ,δ Xi Xj|XS using (8) and (9), where the tuning parameters are set at ϵ = 0.1 σmax(CSCT S ), and δ = 0.5 max{σmax[Mi,i|S(ϵ)], σmax[Mj,j|S(ϵ)]}.

5: Evaluate Xi Xj | XS using (10), where we take ρCCO = c 1 (n |S| 3) 1/2, and ρPCO = Φ 1(1 c/2) (n |S| 3) 1/2, with Φ( ) being the normal cumulative distribution function, 0 < c < 1 being a constant and set as c = 0.05.

4.3 Functional PC-algorithm

Algorithm 1 allows us to evaluate (10) for all possible triplets (i, j, S) H0, which leads to an estimate of the collection F deﬁned in (2). By faithfulness, we then have an estimate of D(G) in (2), and hence an estimate for CPDAG. However, despite the simplicity of this idea, the amount of computation needed to achieve it can be prohibitively large, as one has to evaluate the conditional independence Xi Xj | XS for every triplet (i, j, S) H0, which is an NP-hard problem.

A solution to address this issue is the PC-algorithm, which involves two main steps. In the ﬁrst step, it recursively deletes edges from an initial complete undirected graph based on conditional independence evaluations. This results in a skeleton. In the second step, it extends the skeleton to a CPDAG. This algorithm brings down the computation time considerably by avoiding an exhaustive search: the amount of search is determined by the sparsity, rather than the size, of the graph. In the worst scenario, its runtime grows exponentially with p, but when the true DAG is sparse, the runtime reduces to the polynomial time.

Next, we extend the PC-algorithm to our DAG setting, which, like the classical PCalgorithm, also consists of two steps. For any undirected graph M (V V)0 and any i V, let V(i, M) = {k V : (i, k) M} be the neighborhood of i in M. For any j V, let V(i, j, M) = V(i, M)\{j} be the neighborhood of i in M with j removed. For ℓ= 0, 1, 2, . . ., let Λ(M, ℓ) = {(i, j) M : |V(i, j, M)| ℓ}. For (i, j) Λ(M, ℓ), let Γ(i, j, M, ℓ) = {S V(i, j, M) : |S| = ℓ}. Moreover, deﬁne two test functions:

φ1(i, j, S) = 1 if Xi Xj | XS is accepted by CCO, and 0 otherwise;

φ2(i, j, S) = 1 if Xi Xj | XS is accepted by PCO, and 0 otherwise.

Lee, Li, and Li

Algorithm 2 Step 1 of the PC-algorithm

initialize: set ℓ= 1 and M = the complete undirected graph repeat

set ℓ= ℓ+ 1 and Rℓ= , which denotes the edge set to be removed repeat

choose (i, j) Λ(M, ℓ) repeat

evaluate φ(i, j, S) for each S Γ(i, j, M, ℓ) until φ(i, j, S) = 1, then reset Rℓto Rℓ {(i, j)} and record Si,j = S for later use; if φ(i, j, S) = 0 for all S Γ(i, j, M, ℓ), then keep Rℓthe same until all (i, j) Λ(M, ℓ) are chosen, then reset M to M\Rℓ until Λ(M, ℓ) =

The ﬁrst step of our extended PC-algorithm is summarized in Algorithm 2, where the test function φ can be either φ1 or φ2. Also note that in Algorithm 2, we need to select a sequence of pairs (i, j) from Λ(M, ℓ), but the output does not depend on the choice of the sequence. See Colombo and Maathuis (2014) for more discussion on the order-dependent issue of the PC-algorithm. The output of Step 1 is a skeleton ˆESKE, along with a collection of sets of vertices {Si,j}. The second step of the PC-algorithm transforms the skeleton ˆESKE to a CPDAG by applying several deterministic operations based on ˆESKE and {Si,j}. This step is exactly the same as the classical PC-algorithm (see, e.g., Meek, 1995, Phase I-S2 and Phase II), and its presentation is omitted. We denote the resulting CPDAG as ˆECPDAG-f CCO or ˆECPDAG-f PCO, depending on the thresholding criterion used in (10). We also denote the number of iterations in the functional-PC algorithm as ˆℓf CCO or ˆℓf PCO. The next proposition shows that, at the population-level, the output of the PC-algorithm indeed recovers the true CPDAG, and the number of iterations needed is no greater than the maximum degree of the true skeleton. Deﬁne a population-level PC-algorithm, denoted by functional-PC0, where we replace the evaluation of functional conditional independence using (10) with the ground truth of Xi Xj | XS. Denote the true edge set of G by EDAG, the CPDAG of G by ECPDAG, and the skeleton of EDAG by ESKE. Also, denote the CPDAG from functional-PC0 by E0 CPDAG, and the number of iterations in functional-PC0 by ℓ0.

Proposition 4 Suppose Assumptions 1 to 3 hold. Then,

(i) E0 CPDAG = ECPDAG;

(ii) ℓ0 m, where m = maxi V |{j : (i, j) ESKE}| is the maximum degree of ESKE.

5. Asymptotic theory

We derive the uniform convergence rates of the sample estimates of CCO and PCO, and establish the uniform consistency of the estimated CPDAG. Our asymptotic theory allows both the number of functions p and the number of leading K-L expansion d to diverge with the sample size n. Moreover, many of our theoretical results only require the sub-Gaussian

Functional Directed Acyclic Graphs

distribution of the random functions, which is weaker than the Gaussian assumption. Nevertheless, the Gaussianity is still needed for CCO and PCO to characterize the conditional independence. We ﬁrst formally deﬁne a sub-Gaussian, Hilbertian random element.

Deﬁnition 3 Suppose X is an H-valued random element with a trace-class covariance operator Σ : H H, where H is a generic separable Hilbert space. We say X follows a sub Gaussian distribution, if there exists b > 0, such that E (exp f, X H) exp (b Σf, f H/2), for all f H. We denote it as X sub G(Σ, b).

The notion of sub-Gaussianity in Hilbert space was introduced by Antonini (1997, Deﬁnition 1.1). See also Chen and Yang (2021); Mirshani and Reimherr (2021); Zapata et al. (2021); Qin Fang and Qiao (2023); Waghmare et al. (2023). Our Deﬁnition is slightly diﬀerent: Antonini (1997) allows b = 0, which leads to a degenerate distribution, whereas we require b to be strictly positive.

5.1 Uniform convergence of CCO estimation

We ﬁrst study the CCO. We begin with an assumption on the smoothness level of Xi.

Assumption 4 There exists γ > 1, such that λa i a γ and λa i λa+1 i a 1 γ, as a , for every i V.

As our estimators are built on the leading K-L coeﬃcients, it is reasonable to assume the tail eigenvalues of ΣXi Xi diminish suﬃciently fast. The ﬁrst part of Assumption 4 implies that maxi V(P a=d+1λa i ) d γ. That is, the decaying rate of the tail eigenvalues of ΣXi Xi is in a polynomial order of d. The second part of Assumption 4 requires the decaying rate of the gaps of the adjacent eigenvalues, λa i λa+1 i , to be greater than a polynomial order of a, for a N. The parameter γ imposes a level of smoothness of the decaying rate; the larger the value of γ, the faster the decaying rate. For any m N {0}, let H0(m) = {(i, j, S) H0 : |S| m}, H(m) = {(i, j, S) H : |S| m}, and H1(m) = H(m)\H0(m). In the following development, we allow p , d , m , ϵ 0 as n . Moreover, given m N, let t(m) = min{ ΣXi Xj|XS HS : ΣXi Xj|XS = 0, (i, j, S) H0(m)}, and ζ(m, d, p, ϵ, n) = md3+γ(log p)1/2/ (n1/2ϵ) + mϵ 1d γ + ϵ1/2s(m). Here t(m) is the minimal H-S norm of the nonzero CCO, and thus t(m) 1. Next, we introduce an assumption on t(m).

Assumption 5 Suppose t(m) ζ(m, d, p, ϵ, n).

Assumption 5 places a lower bound on the order of t(m) to prevent it from diminishing too fast. Note that, in the random variable setting, the partial correlation-based PC-algorithm requires a strong faithfulness assumption to ensure the uniform consistency of the estimated graph (Uhler et al., 2013). Assumption 5 is similar and can be viewed as the functional version of strong faithfulness. Also note that, we require t(m) to go to 0 at a slower rate than ζ(m, d, p, ϵ, n), whose ﬁrst term is md3+γ(log p)1/2/(n1/2ϵ). By comparison, for the partial correlation-based PC-algorithm, the order of the minimal nonzero partial correlation has to be greater than (m log p/n)1/2 (Uhler et al., 2013).

Lee, Li, and Li

We next establish the uniform convergence rate of ˆΣ d,ϵ Xi Xj|XS, and the uniform consistency

of the estimated CPDAG, ˆECPDAG-f CCO, based on CCO.

Theorem 3 (i) Uniform convergence rate of ˆΣ d,ϵ Xi Xj|XS: Suppose Assumptions 1, 3, 4, and 5 hold, Xi sub G(ΣXi Xi, b0) with E(Xi) = 0, ϵ 1, md3+γ(log p)1/2/(n1/2ϵ) 1, and the threshold ρCCO = t(m)/2. Then,

max (i,j,S) H(m) ˆΣ d,ϵ Xi Xj|XS ΣXi Xj|XS HS = OP[md3+γ(log p)1/2/(n1/2ϵ) + mϵ 1d γ + ϵ1/2s(m)],

where s(m) = max(i,i,S) H1(m) MXSXi HS. (ii) Uniform graph consistency based on CCO: If we further assume Assumption 2 holds, then,

P(ˆECPDAG-f CCO = E0 CPDAG) 1 and P(ˆℓf CCO = ℓ0) 1, as n .

Theorem 3 requires the maximal degree m to satisfy md3+γ(log p)1/2/(n1/2ϵ) 1, which implies that m can grow at most at a polynomial rate, and thus in turn imposes a level of sparsity on the graph. This rate for m is consistent with the classical settings. For instance, in the sparse regression, the order of magnitude of the sparsity parameter can only grow in a polynomial order of n. The classical linear PC-algorithm (Kalisch and B uhlmann, 2007, condition A3) also requires m to grow in a polynomial order of n.

5.2 Uniform convergence of PCO estimation

We next study the PCO. We show that the norm of ˆR d,ϵ,δ Xi Xj|XS is no greater than 1, which resembles the property of the partial correlation. We then introduce two assumptions.

Proposition 5 For each (i, j, S) H0, we have ˆR d,ϵ,δ Xi Xj|XS 1.

Assumption 6 There exists c0 > 0, such that max P

b Mj,S(ρ a,b i,j,S)2/(µa i,Sµb j,S) : (i, j, S)

H0 c0, where ρ a,b i,j,S = cor( νa i,S, Xi ΩXi, νb j,S, Xj ΩXj ), Mi,S = {a N : µa i,S > 0}, µ1 i,S µ2 i,S and ν1 i,S, ν2 i,S are the eigenvalues and eigenfunctions of ΣXi Xi|XS.

Assumption 6 places a level of smoothness on the relation between Xi and Xj given XS. Under the Gaussian assumption, the correlation ρ a,b i,j,S measures the strength of dependency between Xi and Xj given XS. Moreover, because ΣXi Xi|XS is a trace-class operator, its eigenvalues decay to 0 suﬃciently fast so that P

a Nµa i,S < . Assumption 6 implies that ρ a,b i,j,S needs to converge to 0 faster than the product µa i,Sµb j,S, as a , b . Hence, intuitively, the dependency between Xi and Xj given XS has to be suﬃciently concentrated on the leading eigenfunctions. Similar to Assumption 5 on CCO, we also require the minimum H-S norm of the nonzero PCO to be suﬃciently large. For m N, let u(m) = min{ RXi Xj|XS HS : RXi Xj|XS = 0, (i, j, S) H0(m)}. The next assumption is the functional version of strong faithfulness based on PCO, which prevents u(m) to go to zero too fast.

Assumption 7 Suppose u(m) δ 3/2ζ(m, d, p, ϵ, n) + δ1/2.

Functional Directed Acyclic Graphs

We next establish the uniform convergence rate of ˆR d,ϵ,δ Xi Xj|XS, and the uniform consistency

of the estimated CPDAG, ˆECPDAG-f PCO, based on PCO.

Theorem 4 (i) Uniform convergence rate of PCO: Suppose Assumptions 1, 3, 4, 6, and 7 hold, Xi sub G(ΣXi Xi, b0) with E(Xi) = 0, ζ(m, d, p, ϵ, n) 1, δ 1, and the threshold ρPCO = u(m)/2. Then,

max (i,j,S) H0(m) ˆR d,ϵ,δ Xi Xj|XS RXi Xj|XS HS = OP[δ 3/2ζ(m, d, p, ϵ, n) + δ1/2].

(ii) Uniform graph consistency based on PCO: If we further assume Assumption 2 holds, then, P(ˆECPDAG-f PCO = ECPDAG) 1 and P(ˆℓf PCO = ℓ0) 1, as n .

B uhlmann and van de Geer (2011, Theorem 13.1) derived the consistency of the PCalgorithm of Kalisch and B uhlmann (2007) for the random variable setting. We next compare our Theorem 4 to theirs. Speciﬁcally, both results establish the uniform consistency of the parameter estimation and the uniform graph consistency: whereas B uhlmann and van de Geer (2011) was based on the partial correlation, our result is based on the proposed partial correlation operator. As such, Theorem 4 can be viewed as the functional extension of the above-mentioned Theorem 13.1, though such an extension is far from trivial. The conditions imposed by the two theorems are generally similar. Both allow the graph size p to diverge at an exponential order of the sample size n, and both require the maximum degree of the DAG m to diverge at a polynomial order of n. However, there are some minor diﬀerences. For example, B uhlmann and van de Geer (2011, condition A4) required the absolute value of the smallest non-zero partial correlation to go to zero at a rate slower than (m log p/n)1/2. By contrast, our Assumption 7 requires the minimal H-S norm of all non-zero partial correlation operators u(m) to converge to zero at a rate slower than δ 3/2md3+γ(log p)1/2/(n1/2ϵ). Our rate is slower than that of B uhlmann and van de Geer (2011), but we feel this is reasonable, as our setting involves inﬁnite-dimensional functions and is more complicated. Also, B uhlmann and van de Geer (2011, condition A4) imposed a regularization on the dependency among the random variables by upper-bounding the partial correlations. We impose a similar condition, by requiring the maximum of the H-S norms of the regression operator MXSXi, i.e., ϵ1/2s(m) 1, for any i V and any subset S V\{i} with |S| m. Note that ϵ goes to zero in a polynomial rate of n. If ϵ n b with some b (0, 1/2), then this implies that s(m) nb. As such, our condition also regulates the dependency among the random functions. Following up our discussion in the last paragraph of Section 4.1, using the estimator deﬁned in (4), Theorem 4 establishes the uniform convergence rate of our proposed PCO estimator ˆR d,ϵ,δ Xi Xj|XS in (4). By contrast, the theoretical analysis of the alternative estimator Rd Xi Xj|XS in (5) would have involved inversion of ˆΣ d,ϵ Xi Xi|XS, where the norm of this inverse is

identical to the smallest eigenvalue of ˆΣ d,ϵ Xi Xi|XS. Let µd i,S denote the smallest eigenvalue of Σd Xi Xi|XS, i.e., the population version of ˆΣ d,ϵ Xi Xi|XS. Then, it is easy to see that, the rate of convergence of Rd Xi Xj|XS in (5) depends on the rate at which µd i,S approaches to zero. On

the other hand, the norm of the ridge-type estimator (ˆΣ d,ϵ Xi Xi|XS + δI) 1 in ˆR d,ϵ,δ Xi Xj|XS in (4) is

Lee, Li, and Li

upper bounded by δ 1, regardless of the magnitude of µd i,S. Consequently, the uniform rate of convergence of ˆR d,ϵ,δ Xi Xj|XS in (4) is independent of µd i,S. Therefore, the asymptotic analysis of our PCO estimator is simpler than that of the alternative estimator deﬁned by (5).

6. Numerical studies

We ﬁrst evaluate the empirical performance of the proposed method through two simulation examples, then illustrate with an analysis of a time-course proteomic dataset.

6.1 Simulations

Without loss of generality, we assume that {1, 2, . . . , p} is the order of the true DAG. Moreover, let pa(i) = {j : (j, i) EDAG} denote the parents of i, and {u1, . . . uk} a k-grid in [0, 1] with u1 = 1/k, . . . , uk = 1. We generate the p-dimensional vector of random functions X(t) = [X1(t), . . . , Xp(t)]T in a sequential manner as,

Model I : X1(t) = ϵ1(t), Xi(t) = P

(j,i) EDAGXj(t) + ϵi(t), i = 2, . . . , p,

Model II : Xi(t) = Ui1η1(t) + Ui2η2(t), Ui2 = (|pa(i)| + 1) 1 P

j pa(i)Uj2 + ϵi , i = 1, . . . , p.

For Model I, the error functions ϵi(t) = Pv j=1ξjκ(t, sj), i = 1, . . . , p, are i.i.d. Gaussian random process with the Brownian motion covariance function, where κ(t, s) = min(t, s), {s1, . . . , sv} is a v-grid in [0, 1], ξ1, . . . , ξv are i.i.d. normal variables with mean zero and standard deviation 5, and v = 10. For Model II, U11, . . . , Up1, ϵ1, . . . , ϵp are i.i.d. standard normal variables, ηk(t) = akχk(t), χk(t) =

2 sin((k 0.5)πt) is the kth eigenfunction of the Brownian motion kernel, k = 1, 2, a1 = 2.5, and a2 = 1. We assume all subjects are observed at the same set of time points, Tk = {t1, . . . , tu}, which is taken to be an u-grid in [0, 1] with u = 50. We generate the edge set EDAG via the independent Bernoulli variable I[(i, j) EDAG], with P{I[(i, j) EDAG]} = 2q/(p 1). The expected number of edges in EDAG is qp, with q controlling the sparsity of the graph, and q = 1.05. We apply the proposed CCO and PCO estimators to the simulated data. We employ the Brownian motion kernel to construct ΩXi as the span of {κT( , ts) : s = 1, . . . , u}. Besides, we choose all the tuning parameters following the rules outlined in Algorithm 1. Our preliminary results have found that PCO outperforms CCO consistently, due to the beneﬁt of proper scaling. As such, we only report the PCO results subsequently. We also compare with some alternative methods. Speciﬁcally, we consider the PCalgorithm based on the partial correlation test (linear-PC, Spirtes et al., 2000), the PCalgorithm based on the rank correlation test (rank-PC, Harris and Drton, 2013), and the causal additive models based on high-dimensional penalized regressions (CAM, B uhlmann et al., 2014). To adapt them to the functional setting, for each subject k and node i, we ﬁrst extract from the observed function Xk i the ﬁrst K-L expansion coeﬃcient ˆc k,1 i = Xk i En Xi, ˆη1 i ΩXi using (7), which is the ﬁrst functional PCA score. We then apply the three competing methods to the sample of the p-dimensional vectors, {(ˆc k,1 1 , . . . , ˆck,1 p )T : k = 1, . . . , n}, to estimate the CPDAG. We comment that such an adaption is intuitive, but there is no theoretical guarantee. We also note that the linear-PC method corresponds to the alternative estimator Rd Xi Xj|XS in (5) with d = 1.

Functional Directed Acyclic Graphs

Structure Hamming distance True discovery rate

G n=50 n=100

G G G G G G

Figure 1: Empirical performance under Model I. Four methods are compared, from left to right, the modiﬁed PC-algorithm based on PCO, linear-PC, rank-PC, and CAM.

We evaluate the performance by two criteria, the structure Hamming distance (SHD, Tsamardinos et al., 2006), and the true discovery rate (TDR), which are deﬁned as,

SHD(ˆECPDAG, ECPDAG) = |ˆECPDAG ˆET CPDAG ECPDAG ET CPDAG|/2

+ |ECPDAG ET CPDAG ˆECPDAG ˆET CPDAG|/2

+ |ˆECPDAG (ECPDAG ˆET CPDAG ECPDAG ET CPDAG) ECPDAG|,

TDR(ˆECPDAG, ECPDAG) = |{(i, j) ˆECPDAG : (i, j) ECPDAG}|/|ˆECPDAG|,

where, for an edge set E, ET stands for {(j, i) : (i, j) E}, ECPDAG is the true CPDAG, and ˆECPDAG is its estimate. We note that the three terms in SHD(ˆECPDAG, ECPDAG) represent the numbers of deletions, insertions, and reorientations needed to transform ˆECPDAG to ECPDAG,

Lee, Li, and Li

Structure Hamming distance True discovery rate

Figure 2: Empirical performance under Model II. Four methods are compared, from left to right, the modiﬁed PC-algorithm based on PCO, linear-PC, rank-PC, and CAM.

respectively. A smaller SHD or a higher TDR indicates more accurate estimation. We choose to report the true discovery rate, instead of the true positive rate or false positive rate, because the underlying DAG is sparse, and the proportions of the true positives and negatives are highly imbalanced. We vary the sample size n in {50, 100}, and the graph size p in {25, 50, 75, 100}, resulting in 8 diﬀerent scenarios. Figures 1 and 2 report the box plots for SHD and TDR based on 80 data replications, for Model I and Model II, respectively. We observe that the proposed PCO-based method has a comparable performance as the alternative methods in Model I, but clearly outperforms the alternatives in Model II. This is because the ﬁrst PC captures most of variation in Model I, but not so in Model II. Moreover, the performance of PCO improves as the sample size increases, which agrees with our asymptotic theory.

Functional Directed Acyclic Graphs

We have also conducted additional simulations with more combinations of (n, p, q), different initializations, diﬀerent kernel functions, and comparison with the SEM method of Lee and Li (2022). We report those results in Section A.5 of the Appendix. Overall our proposed PCO-based method achieves a competitive empirical performance.

6.2 Proteomic application

We illustrate our method with a DREAM breast cancer proteomic dataset (https://www. synapse.org/#!Synapse:syn1720047/wiki/56213). The goal of the study is to estimate the directed relations among diﬀerent proteins given the time-course proteomic measurements. We consider the in silico data generated using a nonlinear dynamic model whose characteristic satisﬁes the Reverse Phase Protein Array (RPPA) quantitative proteomics technology (Hill et al., 2016). Based on various combinations of stimuli and inhibitors, the true network, as shown in the ﬁrst panel in Figure 3, is used to generate the time-courses of phosphoprotein abundance levels. There are a total of 20 diﬀerent conditions, and for each condition, 3 independent copies of 20 time-course protein levels were collected at time t = 0, 1, 2, 4, 6, 10, 15, 30, 45, 60, 120 minutes. After the removal of 4 conditions whose protein levels had unusual distributions, the data consists of n = 48 subjects, each with p = 20 protein levels measured at m = 11 time points. Figure 11 in the Appendix plots the time-course data for all 20 protein levels.

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Erb B2#P RAF#P SHC#P

Gab1#P MEK#P#PRas:GTP

Figure 3: True and estimated graphs for the time-course proteomic data. From left to right, top to bottom: the truth, the modiﬁed PC-algorithm based on PCO, linear-PC, rank-PC, CAM, and SEM.

Lee, Li, and Li

Table 1: The performance and comparison for the time-course proteomic data.

Method PCO linear-PC rank-PC CAM SEM Structure Hamming distance 44 50 48 51 50 True discovery rate 0.70 0.59 0.53 0.48 0.40

Figure 3 reports the estimated graphs by our method, linear-PC, rank-PC, CAM, and the SEM method of Lee and Li (2022), while Table 1 reports the corresponding structure Hamming distance and true discovery rate. We see that our PCO-based method performs the best, by achieving the smallest SHD and the highest TDR.

7. Discussions

In this section, we discuss the interpretation and implication of the functional DAG model, including its relation to the linear structural equation, the factorization of joint distribution, the causal interpretation, and the comparison to an undirected graph.

7.1 Relation to the functional linear structural equation model

We show that there is a one-to-one correspondence between the functional DAG and the functional linear structural equation model. Such a relation reveals how the functional DAG factorizes the joint distribution, and allows us to better interpret and understand the identiﬁed edges of the DAG. We ﬁrst formally deﬁne the linear structural equation model (SEM) for Hilbert spacevalued random functions. For node i V, subset S V\{i}, and linear operator B(i, S) B (ΩXi, ΩXS), let Bj(i, S) denote the jth suboperator of B(i, S).

Deﬁnition 4 We say that X = (X1, . . . , Xp)T follows a zero-mean, linear structural equation model with respect to a DAG G if, for each i V, there exists a B(i, pa(i)) B (ΩXi, ΩXpa(i)), such that

j pa(i)B j (i, pa(i))Xj + ϵi,

where ϵi is a zero-mean random element in ΩXi, and ϵ1, . . . , ϵp are independent.

We next recall the notion of the global Markov property from (1). That is, X = (X1, . . . , Xp) T satisﬁes the global Markov property with respect to G, if

i and j are d-separated by S in G Xi Xj | XS. (11)

The next theorem establishes the equivalence between the functional linear SEM in Deﬁnition 4 and the global Markov property in (11) under the Gaussian assumption.

Theorem 5 Suppose Assumptions 1 and 3 are satisﬁed, and X = (X1, . . . , Xp)T is a zeromean, Gaussian random element in ΩXi. Then the following two statements are equivalent: (i) X satisﬁes the global Markov property with respect to G, and (ii) X follows a linear structural equation model with respect to G.

Functional Directed Acyclic Graphs

The structural equation factorizes the joint distribution of X1, . . . , Xp into the product of the set of conditional distributions of Xi | Xpa(i), i = 1, . . . , p, under G, and provides an interpretation of the edge directions. Consequently, we can interpret the edges of the functional DAG following the functional linear SEM. For instance, in the proteomic application, let XSOS, XERK, XSHC and XERBB1 denote the random functions of the protein levels of SOS, ERK, SHC, and ERBB1, respectively, and suppose they all have zero-means. Because ERK, SHC, and ERBB1 are the parent nodes of SOS in the ground truth, we have,

XSOS = B 1XERK + B 2XSHC + B 3XERBB1 + ϵ,

where ϵ is a zero-mean, Gaussian random error function that is independent of XERK, XSHC and XERBB1, and B 1, B 2, B 3 are the linear operators from the ranges of XERK, XSHC and XERBB1, to the range of XSOS, respectively.

7.2 Potential causal interpretation

Next, we introduce the do-intervention under the functional setting, and discuss potential causal interpretation of the functional DAG model. For any j V, and any x = (x1, . . . , xp) ΩX, let xpa(j) = {xk : k pa(j)}, and let LSEj,G : ΩXpa(j) ΩXj ΩXj denote the mapping

LSEj,G(xpa(j), xj) = P

k pa(j)B k(j, pa(j))xk + xj.

For y ΩX, A V, i A, and xi ΩXi, let y A(yi xi) be the vector y A = {yk : k A} with its member yi replaced by xi. In other words, y A(yi xi) = (y A\{yi}) {xi}. Following Pearl (2009, Deﬁnition 3.2.1), we obtain the following deﬁnition.

Deﬁnition 5 Suppose X = (X1, . . . , Xp)T follows a linear structural equation model with respect to a DAG G, a set of regression operators B(j, pa(j)) B (ΩXj, ΩXpa(j)), j = 2, . . . , p, and the error random functions {ϵj : j V}. For a ﬁxed i V, and xi ΩXi, the causal eﬀect of Xi = xi on XV\{i} is the joint distribution of XV\{i} induced by the following p 1 equations:

( LSEj,G(Xpa(j)(Xi xi), ϵj), if pa(j) i, LSEj,G(Xpa(j), ϵj), if pa(j) i,

for all j V\{i}. We denote such a joint distribution as PXV\{i}|do(xi), and call this distribution the interventional distribution at Xi = xi.

For any j V\{i}, let PXj|do(xi) denote the marginal distribution of Xj in the interventional distribution at Xi = xi, PXj the marginal distribution of Xj, and PXj|xi the conditional distribution of Xj | Xi = xi. Then by Deﬁnition 5, we have,

PXj|do(xi) =

( PXj, if j is not a descendant of i, PXj|xi, if j is a descendant of i.

Deﬁnition 5 oﬀers one way to deﬁne the causal eﬀect in functional data, and is general and potentially useful for time-course interventional studies (Luo and Zhao, 2011). Based

Lee, Li, and Li

Undirected graph CPDAG

Figure 4: The induced undirected graph structure and the directed graph structure, based on the relation in model (12).

on this deﬁnition, it is possible to develop a full methodology and theory for modeling functional interventional data, following the lines of Maathuis et al. (2009); Hauser and B uhlmann (2015), who studied the random variable-based linear SEM for interventional data. However, it requires a substantial amount of additional eﬀorts, and to avoid too much digestion, we leave it for future research.

7.3 Comparison with undirected graph

Finally, we illustrate the diﬀerence between the undirected functional graphical model and our functional DAG model by a speciﬁc example.

Example 1 Suppose X = (X1, X2, X3)T is a random element in H H H, where H is a Hilbert space, and ϵ1, ϵ2, ϵ3 are i.i.d. random elements in H with zero-mean and the covariance operator Λ. Furthermore, suppose

X1 = ϵ1, X2 = ϵ2, X3 = X1 + X2 + ϵ3. (12)

Note that the conditional covariance operator between X1 and X2 given X3, by deﬁnition, is ΣX1X2|X3 = ΣX1X2 ΣX1X3Σ X3X3ΣX3X2 = ΛΛ Λ = Λ. Similarly, the conditional covariance operators ΣX1X3|X2 = ΣX2X3|X1 = Λ. Therefore, by Theorem 5, we have,

X1 X2 | X3, X1 X3 | X2, X2 X3 | X1.

Figure 4 shows the undirected graph structure and the CPDAG, both induced by the relation in (12). We see that the the two graphs are very diﬀerent. The undirected graphical model studied in Qiao et al. (2019); Li and Solea (2018) does not oﬀer any structural simpliﬁcation of the joint distribution of X1, X2, X3 in this example, because there is no zero entry in the precision operator. On the other hand, the directed graphical model targeted by the functional DAG does provide a simpliﬁcation of the joint distribution.

Appendix A. Appendix

In this Appendix, we ﬁrst present some supporting theoretical results in Section A.1. We then prove the two main theorems, Theorems 3 and 4, in Sections A.2 and A.3. We collect the proofs of the rest of the theoretical results in Section A.4. We present additional numerical results in A.5.

Functional Directed Acyclic Graphs

A.1 Supporting theoretical results

We ﬁrst derive a useful property of a zero-mean, Hilbert space-valued, Gaussian random element. Let π : V V denote a permutation, i.e., an injective mapping from V to V. Let [a] denote the vector (1, . . . , a)T for integer a 1.

Lemma 1 Suppose Assumptions 1 and 3 hold, and the p-variate random function X = (X1, . . . , Xp)T is a zero-mean Gaussian random element in ΩX. Then, for i = 2, . . . , p, there exists a linear operator B(i, [i i]) B2(ΩXi, ΩX[i 1]), and an ΩXi-valued random element ϵi, such that Xi = Pi 1 j=1B j (i, [i 1])Xj + ϵi. (13)

Moreover, for i = 2, . . . , p, B(i, [i i]) = MX[i 1]Xi, and X1, ϵ2, . . . , ϵp are independent, zeromean Gaussian random element in ΩXi, with E(ϵi ϵi) = ΣXi Xi|X[i 1] B 1(ΩXi). This statement remains true if we replace 1, . . . , p by π(1), . . . , π(p).

Proof of Lemma 1: We ﬁrst show that, for any i V and any S V\{i}, E(Xi | XS) ΩXi. This is because, under Assumptions 1 and 3, MXSXi is deﬁned and by Proposition 1, E(Xi | XS) = M XSXi XS , which is a member of ΩXi. For i = 2, . . . , p, let ϵi = Xi E(Xi | X[i 1]). Then Xi = E(Xi | X[i 1]) + ϵi. By Proposition 1, we have Xi = M X[i 1]Xi X[i 1] + ϵi = Pi 1 j=1(M X[i 1]Xi)j Xj + ϵi. Therefore, (13) holds with B(i, [i 1]) = MX[i 1]Xi. Next, we show that X1, ϵ2, . . . , ϵp are independent. For convenience, denote X1 by ϵ1. Since ϵ1, . . . , ϵp are jointly Gaussian, we only need to check

cov[ f, ϵi ΩXi, g, ϵj ΩXj ] = f, Σϵiϵjg ΩXi = 0,

for every f ΩXi, g ΩXj. Suppose i < j. We have that,

Σϵiϵj = E(ϵi ϵj) = E{E[(Xi E(Xi | X[i 1])) (Xj E(Xj | X[j 1])) | X[i]]}

= E{(Xi E(Xi | X[i 1])) E(Xj E(Xj | X[j 1]) | X[i])}

= E{(Xi E(Xi | X[i 1])) [E(Xj | X[i]) E(Xj | X[i])]} = 0.

It remains to show Σϵiϵi = ΣXi Xi|X[i 1] and Σϵiϵi is a member of B 1(ΩXi). By deﬁnition,

Σϵiϵi = E(Xi Xi) + E[E(Xi | X[i 1]) E(Xi | X[i 1])] 2E[Xi E(Xi | X[i 1])]

= E(Xi Xi) E[E(Xi | X[i 1]) E(Xi | X[i 1])], (14)

where the second equality holds because E[Xi E(Xi | X[i 1])] = E[E(Xi | X[i 1]) E(Xi | X[i 1])]. Since the second term on the right-hand-side of (14) is equal to

E[E(Xi | X[i 1]) E(Xi | X[i 1])] = E[(M X[i 1]Xi X[i 1]) (M X[i 1]Xi X[i 1])]

= M X[i 1]XiΣX[i 1]X[i 1]MX[i 1]Xi,

we have that E(ϵi ϵi) = ΣXi Xi|X[i 1].

Lee, Li, and Li

To show that Σϵiϵi is a member of B 1(ΩXi), note that, for any f ΩXi,

f, (ΣXi Xi ΣXi Xi|X[i 1])f ΩXi 0 f, ΣXi Xif ΩXi f, ΣXi Xi|X[i 1]f ΩXi,

which further implies that ΣXi Xi|X[i 1] TR ΣXi Xi TR < by Assumption 1. This completes the proof of Lemma 1. 2

Let H be a generic separable Hilbert space. The next lemma extends the classical Bernstein s inequality (Boucheron et al., 2013, Chapter 2) to the functional setting.

Lemma 2 (Bernstein s inequality in Hilbert space) Suppose X is a random element in H with E(X) = 0, and X1, . . . , Xn are i.i.d. samples of X. If

E( X ℓ H) bℓℓ!, for some b > 0, and each ℓ N, (15)

then, for any t > 0, P( En X H > t) 2 exp { n [t/(4b) t2/(8b2)]}.

Proof of Lemma 2: By (15), we have Pn i=1E( X ℓ H) nbℓℓ!. Therefore, by Bosq (2000, Theorem 2.5),

P( En X H > t) 2 exp nt2

2 exp[ nf(t)].

Moreover, note that f(t) > t/(4b) if t > 2b, and f(t) t2/(8b2) if t 2b. This completes the proof of Lemma 2. 2

Let {(λa i , ηa i )}a N denote the eigenvalue-eigenfunction pairs of ΣXi Xi, with λ1 i λ2 i 0. Let ca i = Xi, ηa i . The next lemma shows that, if X follows a sub-Gaussian distribution, then we can bound the moments E(ca i )2ℓ, for each a N and ℓ N.

Lemma 3 If Xi sub G(Σ, b), then E(ca i )2ℓ (4b λa i )ℓℓ!, for a N and ℓ N.

Proof of Lemma 3: By the deﬁnition of sub-Gaussianity of Xi, for any s, t > 0,

P(ca i > t) = P [exp(sca i ) > exp(st)] exp(bλa i s2/2 st), (16)

where the inequality follows from the Markov s inequality. Note that infs>0 exp(bλa i s2/2 st) exp [ t2/(2bλa i )], which, together with (16), further implies that P(ca i > t) exp [ t2/(2bλa i )] for each t > 0. Using a similar argument, we can show that P(ca i < t) exp [ t2/(2bλa i )], for t > 0. Therefore, we have

P(|ca i | > t) 2 exp [ t2/(2bλa i )] . (17)

Moreover, we note that, for ℓ N,

E(ca i )2ℓ= Z

0 P |ca i | > t1/(2ℓ) dt 2 Z

0 exp t1/ℓ/(2bλa i ) dt,

0 exp t1/ℓ/(2bλa i ) dt = 2(2bλa i )ℓℓ Z

0 exp( x)xℓ 1dx (4bλa i )ℓℓ!.

Functional Directed Acyclic Graphs

Combining with (17), we obtain the desired bound for E(ca i )2ℓ. This completes the proof of Lemma 3. 2

The next lemma provides some properties regarding the Tychonoﬀregression. Its proof is similar to Lee et al. (2020, Lemma A4), and is omitted.

Lemma 4 Let Σ and Γ be self-adjoint operators in B (H), and let I be the identity mapping. Then, for any ϵ > 0,

(i) (Σ + ϵI) 1 ϵ 1;

(ii) (Σ + ϵI) 1Γ 1 + ϵ 1 Σ Γ ;

(iii) (Σ + ϵI) 1Γ1/2 ϵ 1/2(1 + ϵ 1 Σ Γ )1/2.

The next lemma is about the perturbation of the covariance operators.

Lemma 5 For a given i V, let {(λa i , ηa i )}m+1 a=1 denote the leading m + 1 eigenvalueeigenfunction pairs of ΣXi Xi, with λ1 i > λ2 i > > λm+1 i . Then, maxa=1,...,m ˆηa i ηa i 4κ 1 m ˆΣXi Xi ΣXi Xi , where κm = min{λa i λa+1 i : a = 1, . . . , m, i V}.

Proof of Lemma 5: For a given i V, let λa i be the member of {ˆλ1 i, . . . , ˆλn i } that is closest to λa i . Then by Kato (1980, Theorem 4.10), max{| λa i λa i | : a = 1, . . . , n} ˆΣXi Xi ΣXi Xi . This implies, for all a = 1, . . . , n, λa i ˆΣXi Xi ΣXi Xi λa i λa i + ˆΣXi Xi ΣXi Xi . Therefore, for all a = m + 1, . . . , n,

λa i λa i + ˆΣXi Xi ΣXi Xi λm+1 i + ˆΣXi Xi ΣXi Xi . (18)

Similarly, for all a = 1, . . . , m,

λa i λm i ˆΣXi Xi ΣXi Xi . (19)

If κm 2 ˆΣXi Xi ΣXi Xi , then the asserted inequality holds automatically. If κm > 2 ˆΣXi Xi ΣXi Xi , then λm i ˆΣXi Xi ΣXi Xi > λm+1 i + ˆΣXi Xi ΣXi Xi , which, together with (18) and (19), implies that min{ λ1 i, . . . , λm i } > max{ λm+1 i , . . . , λn i }. Therefore, we have { λ1 i, . . . , λm i } = {ˆλ1 i, . . . , ˆλm i }. Moreover, for any a = 1, . . . , m 1,

λa+1 i λa+1 i + ˆΣXi Xi ΣXi Xi < λa i ˆΣXi Xi ΣXi Xi λa i ,

which implies that λa i = ˆλa i , for all a = 1, . . . , m and i V. Therefore, max{|ˆλa i λa i | : a = 1, . . . , m} ˆΣXi Xi ΣXi Xi , which, by Kazdan (1971, Lemma 2), leads to the asserted inequality. This completes the proof of Lemma 5. 2

The next theorem establishes the concentration bound and the uniform convergence rate of ˆΣXi Xj ΣXi Xj HS.

Lee, Li, and Li

Theorem 6 Suppose Assumption 1 holds, and Xi sub G(ΣXi Xi, b0) with E(Xi) = 0 for i V. Then, for any t 0 and (i, j) V V,

P( ˆΣXi Xj ΣXi Xj HS > t) 2 exp n t

where C0 = max(2M0, 8M0b0), and M0 is as deﬁned in Assumption 1. Moreover, if log p/n 0, as n , then, max i,j V ˆΣXi Xj ΣXi Xj HS = OP[(log p/n)1/2].

Proof of Theorem 6: For convenience, for two sets I, J, we use PI,J i,j to abbreviate the double sum P

j J. Similarly, for two integers r and s, we use Pr,s i,j to abbreviate the double sum Pr i=1 Ps j=1. We ﬁrst note that, by the triangular and Jensen s inequalities,

E Xi Xj E(Xi Xj) ℓ HS 2ℓ 1[E Xi Xj ℓ HS + E(Xi Xj) ℓ HS]

2ℓ 1[M1(ℓ) + M2(ℓ)].

We next bound M1(ℓ) and M2(ℓ), respectively. For, M1(ℓ), let Ni = {a N : λa i = 0}. For ℓ= 1,

M1(ℓ) E1/2[PN,N a,b (ca i )2(cb j)2] PN,N a,b E[(ca i )2(cb j)2] 1/2 M0 = M 1 0 1!.

For any ℓ Ni, ℓ 2, we have

M1(ℓ) E|PNi,Ni a,b Xi Xj, ηa i ηb j 2 HS|ℓ/2

= E[PNi,Ni a,b (ca i )2(cb j)2]ℓ/2 M ℓ 0 PNi,Ni a,b λa i λb j M 2 0 E[(ca i )2(cb j)2/(λa i λb j)]ℓ/2 ,

where the last inequality follows from Jensen s inequality as applied to the convex function f(u) = uℓ/2. By Lemma 3, we have E(ca i )2ℓ (4b0λa i )ℓℓ!. Substituting this into the righthand-side above, we obtain that M1(ℓ) (4b0M0)ℓℓ! for ℓ 2. Therefore, for any ℓ N, we have M1(ℓ) [M0 max(1, 4b0)]ℓℓ!. For M2(ℓ), by the Cauchy-Schwarz inequality,

M2(ℓ) = [PN,N a,b E2(ca i cb j)]ℓ/2 PN,N a,b λa i λb j ℓ/2 M ℓ 0. (20)

Putting the bounds for M1(ℓ) and M2(ℓ) together, we have E Xi Xj E(Xi Xj) ℓ HS [max(2M0, 8M0b0)]ℓℓ! Cℓ 0 ℓ!, which, by Lemma 2, implies the ﬁrst statement of this theorem. Next, note that, for any t > 0,

P max i,j V ˆΣXi Xj ΣXi Xj HS > t P

i,j VP ˆΣXi Xj ΣXi Xj HS > t

2p2 exp n t

which implies that the second statement of this theorem holds when log p/n 0. This completes the proof of Theorem 6. 2

Let Σd Xi Xj = Pd a=1 Pd b=1E(ca i cb j)(ηa i ηb j) be the truncated version of ΣXi Xj, (i, j) V V.

The next theorem establishes the uniform convergence of ˆΣd Xi Xj Σd Xi Xj HS.

Functional Directed Acyclic Graphs

Theorem 7 Suppose Assumption 1 holds, Xi sub G(ΣXi Xi, b0) with E(Xi) = 0 for i V, and dγ+1(log p)1/2/n1/2 1. Then, we have

max i,j V ˆΣd Xi Xj Σd Xi Xj HS = OP d3+γ(log p)1/2/n1/2 .

Proof of Theorem 7: Note that maxi,j V ˆΣd Xi Xj Σd Xi Xj HS is upper-bounded by

max i,j V Pd,d a,b[En(ˆca i ˆcb j) E(ca i cb j)] ˆηa i ˆηb j HS

+ max i,j V Pd,d a,b E(ca i cb j) [(ˆηa i ηa i ) ˆηb j + ηa i (ˆηb j ηb j)] HS M3(n) + M4(n).

Next, we derive the orders of magnitude for M3(n) and M4(n). For M3(n), we have

M3(n) max i,j V Pd,d a,b En[(ca i cb j) E(ca i cb j)] ˆηa i ˆηb j HS

+ max i,j V Pd,d a,b En[(ˆca i ca i )(ˆcb j cb j)] ˆηa i ˆηb j HS

+ max i,j V Pd,d a,b En[ca i (ˆcb j cb j)] ˆηa i ˆηb j HS

+ max i,j V Pd,d a,b En[(ˆca i ca i )cb j] ˆηa i ˆηb j HS M3,1(n) + + M3,4(n).

For M3,1(n), it can be bounded as,

M3,1(n) = max i,j V Pd,d a,b En Xi Xj, E(Xi Xj), ηa i ηb j HS ˆηa i ˆηb j HS,

max i,j V Pd,d a,b En Xi Xj E(Xi Xj), ηa i ηb j HS ˆηa i ˆηb j HS

max i,j V En[Xi Xj E(Xi Xj)] HS Pd,d a,b ˆηa i ˆηb j HS

=d2 max i,j V En[Xi Xj E(Xi Xj)] HS.

Therefore, by Theorem 6, M3,1(n) = OP[d2(log p/n)1/2]. For M3,2(n), it can be bounded as,

M3,2(n) max i,j V Pd,d a,b En Xi Xj E(Xi Xj), (ˆηa i ηa i ) (ˆηb j ηb j) HS ˆηa i ˆηb j HS

+ max i,j V Pd,d a,b E(Xi Xj), (ˆηa i ηa i ) (ˆηb j ηb j) HS ˆηa i ˆηb j HS

16d2κ 2 d max i,j V[ ˆΣXi Xi ΣXi Xi ˆΣXj Xj ΣXj Xj

( En[Xi Xj E(Xi Xj)] HS + E(Xi Xj) HS)],

where last inequality holds because, by Lemma 5, Pd,d a,b (ˆηa i ηa i ) (ˆηb j ηb j) HS 16d2κ 2 d ˆΣXi Xi ΣXi Xi ˆΣXj Xj ΣXj Xj . By (20), we have E(Xi Xj) HS M0. Therefore, M3,2(n) = OP[d2(log p)/(nκ2 d)]. For M3,3(n), by Lemma 5 again, it can be bounded by

M3,3(n) 4d2κ 1 d max i,j V{ ˆΣXj Xj ΣXj Xj

[ En[Xi Xj E(Xi Xj)] HS + E(Xi Xj) HS]}.

Lee, Li, and Li

By Theorem 6 again, M3,3(n) = OP[d2(log p)1/2/(n1/2κd)]. Similarly, we can show that M3,4(n) has the same order of magnitude as M3,3(n). For M4(n), we have

M4(n) max i,j V E(Xi Xj) HS Pd,d a,b( ˆηa i ηa i + ˆηb j ηb j )

4d2κ 1 d M0 max i,j V( ˆΣXi Xi ΣXi Xi + ˆΣXj Xj ΣXj Xj ),

which leads to M4(n) = OP[d2(log p)1/2/(n1/2κd)]. Combining the orders of magnitude of M3,1(n), . . . , M3,4(n), and M4(n), we obtain,

max i,j V ˆΣd Xi Xj Σd Xi Xj HS

= OP[d2(log p/n)2] + OP[d2 log p/(nκd)] + Op[d2(log p)1/2/(n1/2κd)].

Because κd 0 and d2(log p)1/2/(n1/2κd) 1, the third term on the right-hand-side is the dominating term, which is equal to the desired rate because κ 1 d d1+γ by Assumption 4. This completes the proof of Theorem 7. 2

Theorem 7 generalizes the convergence result for the high-dimensional covariance matrix (Bickel and Levina, 2008) to the high-dimensional covariance operator. We remark that, under the Gaussian assumption, Qiao et al. (2019) also established the concentration bound for the sample covariance of the leading K-L expansion coeﬃcients. However, Theorem 7 diﬀers from the result of Qiao et al. (2019), in that it provides the uniform convergence at the operator level, which can not be derived directly from their result. Moreover, Theorems 6 and 7 do not require the random process to be Gaussian, but only require the distribution to be sub-Gaussian.

A.2 Proof of Theorem 3

To prove this theorem, we ﬁrst introduce an intermediate operator Σ d,ϵ Xi Xj|XS. We next derive

the order of magnitude for the diﬀerences between Σ d,ϵ Xi Xj|XS and ˆΣ d,ϵ Xi Xj|XS in Lemma 6, and between Σ d,ϵ Xi Xj|XS and ΣXi Xj|XS in Lemma 7, respectively. These two Lemmas together lead

to the ﬁrst assertion, i.e. the uniform convergence of ˆΣ d,ϵ Xi Xj|XS ΣXi Xj|XS. Lastly, we derive the uniform convergence of the estimated graph. For any A, B V, let Σd XAXB be the matrix of operators {Σd Xi Xj}i A,j B, and let

Σ d,ϵ Xi Xj|XS = Σd Xi Xj Σd Xi XS[Σd XSXS(ϵ)] Σd XSXj,

where ϵ > 0 is a tuning constant that decreases to 0 as n , and [A(ϵ)] represents (A + ϵI) 1A(A + ϵI) 1. This term Σ d,ϵ Xi Xj|XS plays the role of an intermediate operator

between ΣXi Xj|XS and ˆΣ d,ϵ Xi Xj|XS.

Lemma 6 Suppose Assumptions 1 and 3 hold, Xi sub G(ΣXi Xi, b0) with E(Xi) = 0, for i V, m 1, ϵ 1, and md3+γ(log p)1/2/(n1/2ϵ) 1. Then,

max H(m) ˆΣ d,ϵ Xi Xj|XS Σ d,ϵ Xi Xj|XS HS = OP{md3+γ(log p)1/2/(n1/2ϵ)}.

Functional Directed Acyclic Graphs

Proof of Lemma 6: Because ˆ MXSXi = (ˆΣd XSXS + ϵI) 1 ˆΣd XSXi, we have ˆΣ d,ϵ Xi Xj|XS = ˆΣd Xi Xj ˆΣd Xi XS[ˆΣd XSXS(ϵ)] ˆΣd XSXj. Therefore,

max (i,j,S) H(m) ˆΣ d,ϵ Xi Xj|XS Σ d,ϵ Xi Xj|XS HS

max i,j V ˆΣd Xi Xj Σd Xi Xj HS + max (i,j,S) H(m) (ˆΣd Xi XS Σd Xi XS)[ˆΣd XSXS(ϵ)] ˆΣd XSXj HS

+ max (i,j,S) H(m) Σd Xi XS{[ˆΣd XSXS(ϵ)] [Σd XSXS(ϵ)] }ˆΣd XSXj HS

+ max (i,j,S) H(m) Σd Xi XS[Σd XSXS(ϵ)] (ˆΣd XSXj Σd XSXj) HS M5(n) + M6(n) + M7(n) + M8(n).

The order of magnitude of M5(n) is given in Theorem 7. We next derive the orders of magnitude of M6(n), M7(n) and M8(n), respectively. For M6(n), we have

M6(n) max (i,j,S) H(m) (ˆΣd Xi XS Σd Xi XS)(ˆΣd XSXS + ϵI) 1 ˆΣd XSXS(ˆΣd XSXS + ϵI) 1Σd XSXj HS

+ max (i,j,S) H(m) (ˆΣd Xi XS Σd Xi XS)(ˆΣd XSXS + ϵI) 1 ˆΣd XSXS

(ˆΣd XSXS + ϵI) 1(ˆΣd XSXj Σd XSXj) HS.

The ﬁrst term on the right, by Lemma 4 and ˆΣd XSXS(ˆΣd XSXS + ϵI) 1 1, is upper-bounded by

ϵ 1 max (i,j,S) H(m)[ ˆΣd Xi XS Σd Xi XS HS Σd XSXj HS]

ϵ 1 max (i,i,S) H1(m) ˆΣd Xi XS Σd Xi XS HS max (j,j,S) H1(m) Σd XSXj HS. (22)

By a special case of (20), ΣXj Xi HS M0. Henceforth,

max (j,j,S) H1(m) Σd XSXj HS = max (j,j,S) H1(m)

i S Σd Xi Xj 2 HS max (j,j,S) H1(m)

i S ΣXi Xj 2 HS m1/2M0.

So the right-hand-side of (22) is further bounded by ϵ 1m1/2M0(max(i,i,S) H1(m) ˆΣd Xi XS Σd Xi XS HS). Similarly, we can bound the second term on the right of (21) by

ϵ 1 max (i,j,S) H(m)( ˆΣd Xi XS Σd Xi XS HS ˆΣd XSXj Σd XSXj HS).

Therefore, we have,

M6(n) ϵ 1[m1/2 M0( max (i,i,S) H1(m) ˆΣd Xi XS Σd Xi XS HS)

+ max (i,j,S) H(m)( ˆΣd Xi XS Σd Xi XS HS ˆΣd XSXj Σd XSXj HS)].

For M7(n), let [A(ϵ)] denote (A + ϵI) 1, so that [A(ϵ)] = [A(ϵ)] A[A(ϵ)] . In these notations, the diﬀerence [ˆΣd XSXS(ϵ)] [Σd XSXS(ϵ)] can be decomposed as

{[ˆΣd XSXS(ϵ)] [Σd XSXS(ϵ)] }ˆΣd XSXS[ˆΣd XSXS(ϵ)] + [Σd XSXS(ϵ)] (ˆΣd XSXS Σd XSXS)[ˆΣd XSXS(ϵ)]

+[Σd XSXS(ϵ)] Σd XSXS{[ˆΣd XSXS(ϵ)] [Σd XSXS(ϵ)] } Γ1(n) + Γ2(n) + Γ3(n).

Lee, Li, and Li

Moreover, Σd Xi XSΓ1(n)ˆΣd XSXj HS is bounded by

Σd Xi XS[Σd XSXS(ϵ)] ˆΣd XSXS Σd XSXS HS [ˆΣd XSXS(ϵ)] ˆΣd XSXS[ˆΣd XSXS(ϵ)] ˆΣd XSXj . (23)

The ﬁrst norm in (23), by Baker (1973, Theorem 1) and Lemma 4, is upper-bounded by ϵ 1/2 Σd Xi Xi 1/2 [tr(ΣXi Xi)/ϵ]1/2 = (M0/ϵ)1/2. The third norm in (23) is upperbounded by ϵ 1/2[ (ˆΣd XSXS + ϵI) 1/2Σd XSXj + (ˆΣd XSXS + ϵI) 1/2(ˆΣd XSXj Σd XSXj) ]. By a similar argument as used to bound the ﬁrst norm, we can be further bound the above by ϵ 1/2[(M0 + ϵ 1M0 ˆΣd XSXS Σd XSXS )1/2 + ϵ 1/2 ˆΣd XSXj Σd XSXj ]. Therefore,

max (i,j,S) H(m) Σd Xi XSΓ1(n)ˆΣd XSXj HS M 1/2 0 ϵ 1( max (i,j,S) H(m) ˆΣd XSXS Σd XSXS HS)

max (i,j,S) H(m)[(M0 + ϵ 1M0 ˆΣd XSXS Σd XSXS )1/2 + ϵ 1/2 ˆΣd XSXj Σd XSXj ]. (24)

Moreover, max(i,j,S) H(m) Σd Xi XSΓ2(n)ˆΣd XSXj HS and max(i,j,S) H(m) Σd Xi XSΓ3(n)ˆΣd XSXj HS can be bounded by the right-hand-side of (24). Therefore, M7(n) is bounded by three times of the quantity on the right-hand-side of (24). For M8(n), similar to the derivation of the bound for M6(n), we can show that

M8(n) ϵ 1m1/2 M0( max (i,j,S) H(m) ˆΣd XSXj Σd XSXj HS).

On the other hand, note that

max (i,j,S) H(m) ˆΣd XSXS Σd XSXS HS = max (i,j,S) H(m)

i,j S ˆΣd Xi Xj Σd Xi Xj 2 HS,

which is no greater than m (maxi,j V ˆΣd Xi Xj Σd Xi Xj HS). By the same derivation, we

have max(i,i,S) H1(m) ˆΣd Xi XS Σd Xi XS HS is bounded by m1/2 (maxi,j V ˆΣd Xi Xj Σd Xi Xj HS). Combining the bounds for M6(n), M7(n), and M8(n), and applying Theorem 7 as well as the condition that md3+γ(log p)1/2/(n1/2ϵ) 1, we have

M6(n)+M7(n) + M8(n)

ϵ 1m(max i,j V ˆΣd Xi Xj Σd Xi Xj HS) + m(max i,j V ˆΣd Xi Xj Σd Xi Xj HS)2

+ ϵ 1/2m1/2(max i,j V ˆΣd Xi Xj Σd Xi Xj HS)

= OP[md3+γ(log p)1/2/(n1/2ϵ)] + Op[md6+2γ(log p)/n]

+ OP[m1/2d3+γ(log p)1/2/(n1/2ϵ1/2)].

Since d3+γ(log p)1/2/n1/2 1, the ﬁrst term on the right is the dominating term. This completes the proof of Lemma 6. 2

Lemma 7 Suppose Assumptions 1, 3, and 4 hold. Then, for any (i, j) V V and S V\(i, j), we have,

max (i,j,S) H(m) Σ d,ϵ Xi Xj|XS ΣXi Xj|XS HS = O{mϵ 1d γ + ϵ1/2s(m)}.

Functional Directed Acyclic Graphs

Proof of Lemma 7: Let Σϵ Xi Xj|XS = ΣXi Xj ΣXi XS[ΣXSXS(ϵ)] ΣXSXj be the intermediate operator between Σ d,ϵ Xi Xj|XS and ΣXi Xj|XS. By the triangular inequality,

max (i,j,S) H(m) Σ d,ϵ Xi Xj|XS ΣXi Xj|XS HS

max (i,j,S) H(m) Σ d,ϵ Xi Xj|XS Σϵ Xi Xj|XS HS + max (i,j,S) H(m) Σϵ Xi Xj|XS ΣXi Xj|XS HS

M9(n) + M10(n).

We next derive the orders of magnitude for M9(n) and M10(n). For M9(n), we have

M9(n) max i,j V Σd Xi Xj ΣXi Xj HS + max (i,j,S) H(m) (Σd Xi XS ΣXi XS)[Σd XSXS(ϵ)] Σd XSXj HS

+ max (i,j,S) H(m) ΣXi XS{[Σd XSXS(ϵ)] [ΣXSXS(ϵ)] }Σd XSXj HS

+ max (i,j,S) H(m) ΣXi XS[ΣXSXS(ϵ)] (Σd XSXj ΣXSXj) HS,

whose order of magnitude, by a similar argument as in the proof of Lemma 6, is no greater than that of

max i,j V Σd Xi Xj ΣXi Xj + ϵ 1/2 max (i,j,S) H(m)( Σd Xi XS ΣXi XS HS + Σd XSXj ΣXSXj HS)

+ϵ 1 max (i,j,S) H(m) Σd XSXS ΣXSXS .

Let Nd = {d + 1, d + 2, . . .}. Then the term max(i,j,S) H(m) Σd XSXS ΣXSXS HS equals

max (i,j,S) H(m)

i,j S PNd,Nd a,b E2(ca i cb j) max (i,j,S) H(m)

i,j S PNd,Nd a,b λa i λb j,

whose order is O(md γ) by Assumption 4. Similarly, we can show that

max (i,i,S) H1(m) Σd Xi XS ΣXi XS HS = O(m1/2d γ).

Therefore, M9(n) mϵ 1d γ. For M10(n), we note that, by the deﬁnitions of Σϵ Xi Xj|XS, ΣXi Xj|XS, and MXSXj,

Σϵ Xi Xj|XS ΣXi Xj|XS = ΣXi XS{[ΣXSXS(ϵ)] ΣXSXS I}MXSXj.

By the deﬁnition of [A(ϵ)] , we can rewrite the right-hand-side of the above equation as ΣXi XS ϵ2{[ΣXSXS(ϵ)] }2 2ϵ[ΣXSXS(ϵ)] MXSXj. Therefore,

M10(n) s(m) max (i,i,S) H1(m) ΣXi XS ϵ2{[ΣXSXS(ϵ)] }2 2ϵ[ΣXSXS(ϵ)] ,

whose order, by Lemma 4, is no greater than ϵ1/2s(m). Combining the orders of M9(n) and M10(n) completes the proof of Lemma 7. 2

Proof of Theorem 3: Combining Lemmas 6 and 7, we immediately obtain the uniform convergence rate of ˆΣ d,ϵ Xi Xj|XS.

Lee, Li, and Li

For the second assertion, we ﬁrst note that,

P {ˆECPDAG-f CCO = E0 CPDAG} {ˆℓf CCO = ℓ0}

P[ ˆΣ d,ϵ Xi Xj|XS HS > ρf CCO, ΣXi Xj|XS = 0, for some (i, j, S) H0(m)]

+ P[ ˆΣ d,ϵ Xi Xj|XS HS ρf CCO, ΣXi Xj|XS = 0, for some (i, j, S) H0(m)].

The ﬁrst term in (25) is further bounded by

P[max{ ˆΣ d,ϵ Xi Xj|XS ΣXi Xj|XS HS : (i, j, S) H0(m)} t(m)/2] p (m).

Moreover, by the deﬁnition of t(m), the second term in (25) is no greater than

P[ ˆΣ d,ϵ Xi Xj|XS ΣXi Xj|XS HS t(m)/2, ΣXi Xj|XS = 0, for some (i, j, S) H0(m)].

It is further bounded by p (m), which tends to 0 by the ﬁrst assertion and Assumption 5. We thus obtain the second assertion. This completes the proof of Theorem 3. 2

A.3 Proof of Theorem 4

Similar as the proof of Theorem 3, to prove this theorem, we ﬁrst introduce another intermediate operator Rδ Xi Xj|XS between ˆR d,ϵ,δ Xi Xj|XS and RXi Xj|XS,

Rδ Xi Xj|XS = (ΣXi Xi|XS + δI) 1/2ΣXi Xj|XS(ΣXi Xi|XS + δI) 1/2.

Then by the triangular inequality,

ˆR d,ϵ,δ Xi Xj|XS RXi Xj|XS HS ˆR d,ϵ,δ Xi Xj|XS Rδ Xi Xj|XS HS + Rδ Xi Xj|XS RXi Xj|XS HS.

We next derive the order of magnitude for the diﬀerences between ˆR d,ϵ,δ Xi Xj|XS and Rδ Xi Xj|XS in Lemma 8, and between Rδ Xi Xj|XS and RXi Xj|XS in Lemma 9, respectively.

Lemma 8 Suppose the conditions in Theorem 3(i) hold, ζ(m, d, p, ϵ, n) 1, and δ 1. Then, max (i,j,S) H0(m) ˆR d,ϵ,δ Xi Xj|XS Rδ Xi Xj|XS HS = OP[δ 3/2ζ(m, d, p, ϵ, n)].

Moreover, if δ 3/2ζ(m, d, p, ϵ, n) 1, then max(i,j,S) H0(m) ˆR d,ϵ,δ Xi Xj|XS Rδ Xi Xj|XS HS P 0.

Proof of Lemma 8: Let Γ4(n) = ˆΣ d,ϵ Xi Xi|XS, Γ5(n) = ΣXi Xi|XS, Γ6(n) = ˆΣ d,ϵ Xi Xj|XS, Γ7(n) =

ΣXi Xj|XS, Γ8(n) = ˆΣ d,ϵ Xj Xj|XS, and Γ9(n) = ΣXj Xj|XS. By the triangular inequality,

ˆR d,ϵ,δ Xi Xj|XS R d,ϵ,δ Xi Xj|XS HS

{Γ4(n) + δI] 1/2 [Γ5(n) + δI] 1/2}Γ6(n)[Γ8(n) + δI] 1/2 HS + [Γ5(n) + δI] 1/2[Γ6(n) Γ7(n)][Γ8(n) + δI] 1/2 HS + [Γ5(n) + δI] 1/2Γ7(n){[Γ8(n) + δI] 1/2 [Γ9(n) + δI] 1/2} HS M i,j,S 11 (n) + M i,j,S 12 (n) + M i,j,S 13 (n).

Functional Directed Acyclic Graphs

We next bound M i,j,S 11 (n), M i,j,S 12 (n), and M i,j,S 13 (n), respectively. For M i,j,S 11 (n), using the identity,

A1/2 B1/2 = A1/2(B 3/2 A 3/2)B3/2 + (A 1 B 1)B3/2

(Fukumizu et al., 2007), we have M i,j,S 11 (n) [M i,j,S 11,1 (n) + M i,j,S 11,2 (n)]M i,j,S 11,3 (n), where

M i,j,S 11,1 (n) = [Γ5(n) + δI] 1/2{[Γ5(n) + δI]3/2 [Γ4(n) + δI]3/2}[Γ4(n) + δI] 1 HS,

M i,j,S 11,2 (n) = [Γ4(n) Γ5][Γ4(n) + δI] 1 HS,

M i,j,S 11,3 (n) = [Γ4(n) + δI] 1/2Γ6(n)[Γ8(n) + δI] 1/2 .

By the inequality max(a1/2, b1/2) (a+b)1/2, Lemma 4, and Fukumizu et al. (2008, Lemma 7), we have

M i,j,S 11,1 (n) 3δ 3/2 max( Γ4(n) + δI 1/2, Γ5(n) + δI 1/2) Γ4(n) Γ5(n) HS 3δ 3/2[ Γ4(n) Γ5(n) + Γ5(n) + δI ]1/2 Γ4(n) Γ5(n) HS.

Because ΣXi Xi|XS ΣXi Xi, we have Γ5(n)+δI = ΣXi Xi|XS +δI ΣXi Xi +δI M0+δ. Therefore,

M i,j,S 11,1 (n) 3δ 3/2( Γ4(n) Γ5(n) + M0 + δ)1/2 Γ4(n) Γ5(n) HS.

For M i,j,S 11,2 (n), we have,

M i,j,S 11,2 (n) [Γ4(n) + δI] 1 Γ4(n) Γ5(n) HS δ 1 Γ4(n) Γ5(n) HS.

For M i,j,S 11,3 (n), by Proposition 5, we have M i,j,S 11,3 (n) 1. Combining the upper bounds for M i,j,S 11,1 (n), M i,j,S 11,2 (n), and M i,j,S 11,3 (n), and taking maximum over H0(m), we obtain that,

max (i,j,S) H0(m) M i,j,S 11 (n) ( max (i,i,S) H1(m) ˆΣ d,ϵ Xi Xi|XS ΣXi Xi|XS HS)

[3δ 3/2( max (i,i,S) H1(m) ˆΣ d,ϵ Xi Xi|XS ΣXi Xi|XS HS + M0 + δ)1/2 + δ 1].

By Theorem 3, the right-hand-side is dominated by ζ(m, d, p, ϵ, n){3δ 3/2[ζ(m, d, p, ϵ, n) + M0 + δ]1/2 + δ 1}. Therefore, if ζ(m, d, p, ϵ, n) 1, then

max (i,j,S) H0(m) M i,j,S 11 (n) = OP[δ 3/2ζ(m, d, p, ϵ, n)]. (26)

It remains to show that the orders of magnitude of the terms max(i,j,S) H0(m) M i,j,S 12 (n) and max(i,j,S) H0(m) M i,j,S 13 (n) are dominated by (26). Using similar derivations for (26),

max (i,j,S) H0(m) M i,j,S 12 (n) δ 1( max (i,j,S) H0(m) ˆΣ d,ϵ Xi Xj|XS ΣXi Xj|XS HS) = OP[δ 1ζ(m, d, p, ϵ, n)],

max (i,j,S) H0(m) M i,j,S 13 (n) = OP[δ 3/2ζ(m, d, p, ϵ, n)],

both of which are dominated by (26). This completes the proof of Lemma 8. 2

Lee, Li, and Li

Lemma 9 Suppose Assumptions 1, 3 and 6 hold, and δ 1. Then,

max (i,j,S) H0(m) Rδ Xi Xj|XS RXi Xj|XS HS = Op(δ1/2).

Proof of Lemma 9: By deﬁnition, Rδ Xi Xj|XS RXi Xj|XS 2 HS is equal to

PMi,S,Mj,S a,b νa i,S, [(ΣXi Xi|XS + δI) 1/2ΣXi Xj|XS(ΣXi Xi|XS + δI) 1/2 RXi Xj|XS]νb j,S 2 ΩXi

= PMi,S,Mj,S a,b [(µa i,S + δ)1/2(µb j,S + δ)1/2 (µa i,S)1/2(µb j,S)1/2]2 νa i,S, RXi Xj|XSνb j,S 2 ΩXi (µa i,S + δ)(µb j,S + δ) ,

where Mi,S, Mj,S, µa i,S, µb j,S, νa i,S, νb j,S are as deﬁned right before Assumption 6. Moreover,

νa i,S, RXi Xj|XSνb j,S 2 ΩXi = νa i,S (µa i,S)1/2 , ΣXi Xj|XS νb j,S (µb j,S)1/2

ΩXi = (ρa,b i,j,S)2,

which implies that Rδ Xi Xj|XS RXi Xj|XS 2 HS is upper-bounded by

PMi,S,Mj,S a,b [(µa i,S + δ)1/2(µb j,S + δ)1/2 (µa i,S)1/2(µb j,S)1/2]2(ρa,b i,j,S)2/(µa i,Sµb j,S).

By direct calculation, for any a Mi,S, b Mj,S,

[(µa i,S + δ)1/2(µb j,S + δ)1/2 (µa i,S)1/2(µb j,S)1/2]2 δ(µa i,S + µb j,S) + δ2,

which is no greater than 2δM0 + δ2 because µa i,S ΣXi Xi|XS ΣXi Xi M0. Therefore, by Assumption 6,

max (i,j,S) H0(m) Rδ Xi Xj|XS RXi Xj|XS HS c1/2 0 (2δM0 + δ2)1/2 = Op(δ1/2).

This completes the proof of Lemma 9. 2

Proof of Theorem 4: Combining Lemmas 8 and 9, we immediately obtain the uniform convergence rate of ˆR d,ϵ,δ Xi Xj|XS RXi Xj|XS HS. The second assertion can be obtained following a similar proof of Theorem 3. This completes the proof of Theorem 4. 2

A.4 Proofs of other theoretical results

Proof of Proposition 1: By Assumption 2, for f ΩXB, g ΩXA, and (t1, t2) R2,

E exp n ι h t1( f, XB ΩXB MXAXBf, XA ΩXA) + t2 g, XA ΩXA

t1f t2g t1MXAXBf

, ΣUU ΣUV ΣV U ΣV V

t1f t2g t1MXAXBf

where ι = 1. By direct calculation, the inner product in the above expression equals

t2 1 f, ΣUUf ΩXB 2t2 1 f, ΣUV MXAXBf ΩXB + 2t1t2 f, ΣUV g ΩXB 2t1t2 g, ΣV V MXAXBf ΩXA + t2 1 MXAXBf, ΣV V MXAXBf ΩXA + t2 2 g, ΣV V g ΩXA = t2 1 f, (ΣUU ΣUV Σ V V ΣV U)f ΩXB + t2 2 g, ΣV V g ΩXA t2 1σ2 f + t2 2σ2 g,

Functional Directed Acyclic Graphs

where we have used the relations MXAXBf, ΣV V MXAXBf ΩXA = f, ΣUV MXAXBf ΩXB = f, ΣUV Σ V V ΣV Uf ΩXB, and f, ΣUV MXAXBf ΩXB = f, ΣUV g ΩXB. This implies, f, XB ΩXB MXAXBf, XA ΩXA and g, XA ΩXA are independent Gaussian variables with variances σ2 f and σ2 g, respectively. Since this holds for all f ΩXB and g ΩXA, we have f, XB ΩXB MXAXBf, XA ΩXA XA, which completes the proof of Proposition 1. 2

Proof of Theorem 1: Following the proof of Proposition 1, we can show that, for any (f, g) ΩXi ΩXj, the conditional distribution of ( f, Xi , g, Xj ) given XS is

N MXSXif, XS MXSXjg, XS

, f, ΣXi Xi|XSf f, ΣXi Xj|XSg g, ΣXj Xi|XSf g, ΣXj Xj|XSg

which implies (i). Because, by Assumption 3, ran(ΣXi Xj|XS) ran(ΣXi Xi), and ran(ΣXj Xi|XS) ran(ΣXj Xj), it suﬃces to show that, for any f ran(ΣXi Xi) and g ran(ΣXj Xj),

f, ΣXi Xj|XSg ΩXi = f, P

a,b Ncov(ca i , cb j | XS)(ηa i ηb j) g ΩXi. (27)

Since f ran(ΣXi Xi) and g ran(ΣXj Xj), we have f = P

a N f, ηa i ΩXiηa i , and g = P

b N g, ηb j ΩXj ηb j. Substituting these into the left-hand-side of (27), we can obtain the right-hand-side of (27). Thus, (ii) holds. This completes the proof of Theorem 1. 2

Proof of Proposition 2: Note that, for any a, b span(Br),

[a b] = ([(a b)b1] , . . . , [(a b)br]) = ([a] b, b1 , . . . , [a] b, br ) ,

which equals [a] [b] T e1, . . . , [a] [b] T er , where ei is the r-dimensional vector with its ith element equal to 1 and other elements equal to 0. Noting that ˆΣXi Xj = En[(Xi En Xi) (Xj En Xj)] completes the proof of Proposition 2. 2

Proof of Proposition 4: We have, by the faithfulness, Xi Xj | XS if and only if i and j are d-separated by S. Following the proof in Kalisch and B uhlmann (2007, Proposition 1), we can show that the above equivalence implies that the output of Step 1 of the functional-PC0 is the true skeleton ESKE, which further implies ℓ0 m by the deﬁnition of m. Therefore (ii) holds. Moreover, by Meek (1995), the output from Step 2 of functional PC0 is the CPDAG of G, which implies (i). This completes the proof of Proposition 4. 2

Proof of Proposition 5: By the deﬁnition of ˆR d,ϵ,δ Xi Xj|XS,

ran( ˆR d,ϵ,δ Xi Xj|XS) ran(ˆΣ d,ϵ Xi Xi|XS) = span{ˆηa i : a = 1, . . . , d} = span{B i },

ker( ˆR d,ϵ,δ Xi Xj|XS) ker(ˆΣ d,ϵ Xj Xj|XS) = span{ˆηa j : a = 1, . . . , d} = span{B j} .

This implies that the operator norm of ˆR d,ϵ,δ Xi Xj|XS is the same as the largest singular value

of the coordinate representation of ˆR d,ϵ,δ Xi Xj|XS with respect to B j and B i . Therefore, by

Lee, Li, and Li

Proposition 3, ˆR d,ϵ,δ Xi Xj|XS can be computed via the optimization:

maximize : [fi]TAT i,SAj,S[fj],

subject to : [fi]T(AT i,SAi,S + δIn)[fi] = [fj]T(AT j,SAj,S + δIn)[fj] = 1,

[fi] Rd, [fj] Rd,

where Ai,S = [In D(S)]1/2C1:d i . By the Cauchy-Schwarz inequality, we have

[fi]TAT i,SAj,S[fj] ([fi]TAT i,SAi,S[fi])1/2([fj]TAT j,SAj,S[fj])1/2,

which is no greater than [fi]T(AT i,SAi,S + δIn)[fi] 1/2 [fj]T(AT j,SAj,S + δIn)[fj] 1/2 = 1. This completes the proof of Proposition 5. 2

Proof of Theorem 5: First, we show (a) (b). We pick a permutation π, such that, for any i = 2, . . . , p, pa(i) π([i 1]). For convenience, we reset π(1), . . . , π(p) to 1, . . . , p. By Lauritzen et al. (1990, Corollary 2), the global Markov property is equivalent to

Xi X[i 1] | Xpa(i),

which means that the conditional distributions of Xi | X[i 1] and Xi | Xpa(i) are identical. Moreover, following a similar proof as that of Proposition 1, we can show that,

Xi | X[i 1] N(M X[i 1]Xi X[i 1], ΣXi Xi|X[i 1]),

Xi | Xpa(i) N(M Xpa(i)Xi Xpa(i), ΣXi Xi|Xpa(i)).

Since the above two Gaussian distributions are the same, we have (MX[i 1]Xi)j = 0, for j / pa(i). Therefore, by Lemma 1, X satisﬁes the functional linear structural equation model with respect to G. Next, we show (b) (a). This holds because the global Markov property is implied by the local Markov property, which, under the Gaussian distribution, is implied by (b). This completes the proof of Theorem 5. 2

A.5 Additional numerical results

Additional combinations of (p, q, n): We carry out an additional simulation study for Model I with more combinations of (p, q, n). In our simulation, the true DAG is generated by a random graph, whose level of sparsity is controlled by the expected neighborhoods size q. Meanwhile, as q increases, we expect the maximal degree m to increase as well. Table 2 reports the average values of m for diﬀerent combinations of (p, q, n). This new study thus allows us to further examine the empirical performance of the proposed PCO method with denser graphs. Moreover, it oﬀers new insight into the condition that Theorem 3 requires about m that, md3+γ(log p)1/2/(n1/2ϵ) 1. (28)

This condition implies that m can grow at most at a polynomial rate, and thus in turn imposes a level of sparsity on the graph. We generate (p, q, n) following the relation,

q(n) = c1 + c2 n, logc3[p(n)] = c4 nc5,

Functional Directed Acyclic Graphs

n p q ave. m TDR TPR FM 20 4 1.18 2.150 0.956 0.139 0.195 40 8 1.26 3.237 0.742 0.399 0.505 60 17 1.34 4.400 0.730 0.498 0.586 80 32 1.42 5.150 0.687 0.530 0.594 100 57 1.50 5.937 0.689 0.568 0.619 120 100 1.58 6.600 0.684 0.574 0.622

Table 2: The true discovery rate (TDR), true positive rate (TPR) and F-measure (FM) between the estimated and true CPDAG, as p grows in an exponential order of n, and q grows in a polynomial order of n.

with c1 = 0.77, c2 = 0.003, c3 = 6, c4 = 0.09, and c5 = 0.7. As such, p grows in an exponential order of n, and q grows in a polynomial order of n. The resulting combinations are n = {20, 40, 60, 80, 100, 120}, p = {4, 8, 17, 32, 57, 100}, and q = {0.82, 0.88, 0.93, 0.99, 1.05, 1.10}. For the evaluation criteria, in addition to the structure Hamming distance (SHD) and the true discovery rate (TDR), we employ two additional criteria, the true positive rate (TPR) and the F-measure (FM), which are deﬁned as,

TPR(ˆECPDAG, ECPDAG) = |{(i, j) ˆECPDAG : (i, j) ECPDAG}|/|ECPDAG|,

FM(ˆECPDAG, ECPDAG) = 2 TDR(ˆECPDAG, ECPDAG) TPR(ˆECPDAG, ECPDAG)

TDR(ˆECPDAG, ECPDAG) + TPR(ˆECPDAG, ECPDAG) ,

where FM(ˆECPDAG, ECPDAG) is computed as the harmonic mean of TDR(ˆECPDAG, ECPDAG) and TPR(ˆECPDAG, ECPDAG). A motivation of adding the new criteria is that an estimator may sometimes have a high true discovery rate (TDR) but a low TPR (Rijsbergen, 1979). A higher TPR and a higher FM indicate a more accurate estimator. Figures 5 to 8 report the box plots of the SHD, TDR, TPR and FM, respectively, between the true CPDAG and the PCO estimate across 80 data replications. We see that, as sample size increase, the SHD decreases, and both TPR and FM increase with a stabilizing TDR. This shows that our method performs well for these new combinations, and, in particular, for the larger m s, as long as the sample size is reasonably large. Table 2 reports the average TDR, TPR, and FM for six combinations of (p, q, n), along with the average m. We see that, with the increasing sample size, while TDR slightly deceases, TPR and FM both increase substantially. This results indicates that the overall performance of our PCO method improves as n increases. It also provides additional support for our theoretical consistency and why we need condition (28).

Undirected screening: We also investigate the performance of our method when coupled with an undirected graph estimation as a starting point. In the classical setting of random variables, the undirected graph is a supergraph of the skeleton ESKE of the DAG G, and it is easy to show this statement continues to hold in the setting of random functions. Speciﬁcally, we employ Qiao et al. (2019) to estimate an undirected graph, then feed it as an initial graph into our proposed PCO-based PC-algorithm. We have chosen Qiao et al. (2019) over Li and Solea (2018), since both our method and Qiao et al. (2019) consider the

Lee, Li, and Li

q = 0.82 q = 0.88

G G G GGGGGGGGGGGGG

GGG G G G G G G G

G GG p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0 20 40 60 80

0 20 40 60 80

G p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

q = 0.93 q = 0.99

G GGG G G GGGG

GG GGGG GGGGGGG

G p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0 30 60 90 120

0 30 60 90 120

GGGG G GGG G G

GG GG G G G

G G G G GG G

G p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0 50 100 150

0 50 100 150

q = 1.05 q = 1.10

GGGG G GGG G G

GG GG G G G

G G G G GG G

G p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0 50 100 150

0 50 100 150

GGGGGGGGGGG

G G G GG GG

GG GG GG G G G GGG

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0 50 100 150

0 50 100 150

Figure 5: The structure Hamming distance (SHD) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.

Gaussian distribution. There is a penalty constant in Qiao et al. (2019) that determines the level of sparsity of the estimated undirected graph. We have experimented with a range of penalty values, resulting in diﬀerent percentage values of the selected edges among all edges, from 100% to 4%, for the initial estimation. When the percentage is 100%, there is no penalty in the undirected graph estimation, or eﬀectively, no pre-screening for our method. Table 3 reports the average SHD and TDR and the standard error (in the parenthesis) based on 80 data replications for Model I with (n, p, q) = (50, 50, 0.7). We see that, as the

Functional Directed Acyclic Graphs

q = 0.82 q = 0.88

GG GG GG GGG

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 0.93 q = 0.99

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 1.05 q = 1.10

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

Figure 6: The true discovery rate (TDR) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.

percentage of the pre-selected edges decreases to 10%, the performance of the combined algorithm improves. When this percentage drops below 10%, the performance begins to decline. This example shows the potential advantage of coupling an initial undirected graph estimation with our proposed DAG estimation method.

Eﬀect of the kernel function: In our simulations in Section 6.1, we employ the Brownian motion covariance function (BMC) kernel, κT = min(s, t). To investigate whether our method is sensitive to the choice of kernels, here we use the radial basis function (RBF) kernel, κT = exp{γG(s t)2}, where the bandwidth parameter is computed as

Lee, Li, and Li

q = 0.82 q = 0.88

G G G G G G

G G G G G G

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 0.93 q = 0.99

G p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 1.05 q = 1.10

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

Figure 7: The true positive rate (TPR) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.

s<t|τs τt|/(ℓ 2) 2 following Li and Song (2017). We compare the estimation results using these two diﬀerent kernels. We still use the simulation Model I in Section 6.1, with (n, p, q) = (50, 50, 1.05). Figure 9 shows the box plots of structure Hamming distance (SHD) and true positive rate (TPR) based on 80 data replications. We see that our method displays a relatively stable performance under the change of kernels.

Comparison with Lee and Li (2022) We analytically compare our proposed method with the SEM method of Lee and Li (2022) in Section 1; see the last paragraph of page 3. Here we numerically compare the two methods. We adopt the simulation Model I

Functional Directed Acyclic Graphs

q = 0.82 q = 0.88

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 0.93 q = 0.99

GGG GG G G G

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

q = 1.05 q = 1.10

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

p=32 p=57 p=100

p=4 p=8 p=17

20 40 60 80 100120 20 40 60 80 100120 20 40 60 80 100120

0.00 0.25 0.50 0.75 1.00

0.00 0.25 0.50 0.75 1.00

Figure 8: The F-measure (FM) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.

in Section 6.1, with (n, p, q) = (50, 50, 1.05). Figure 10 shows the box plots of structure Hamming distance (SHD) and true positive rate (TPR) based on 80 data replications. We see that our method performs better in this example, partly because our method is built upon the Gaussian assumption, which is satisﬁed in this simulated model. By comparison, the SEM method of Lee and Li (2022) does not require the Gaussian assumption. We also point out that, in this example, the computation of our method is much faster than SEM. On a 2 x E5-2630 v4 workstation, the average running time of our method is 6.31 seconds, and that of SEM is 18.64 seconds.

Lee, Li, and Li

ave. sparsity 100% 17% 15% 12% 10% 8% 5 % 6% 4% SHD 23.73 22.34 20.95 19.71 18.51 18.69 19.96 22.56 31.75 (s.e.) 6.60 6.29 6.11 6.14 6.19 5.98 5.99 5.82 3.98 TDR 0.71 0.75 0.77 0.79 0.81 0.81 0.80 0.77 0.62 (s.e.) 0.08 0.08 0.08 0.08 0.08 0.08 0.09 0.10 0.09

Table 3: The average and standard error (in the parenthesis) of the structure Hamming distance (SHD) and the true discovery rate (TDR), for the undirected graph pre-screening as the initialization + the proposed DAG estimation.

BMC (default) RBF

BMC (default) RBF

Figure 9: Empirical performance, in terms of structure Hamming distance (SHD) and true positive rate (TPR), under Brownian motion covariance function (BMC) kernel and radial basis function (RBF) kernel.

Figure 10: Empirical performance, in terms of structure Hamming distance (SHD) and true positive rate (TPR), between the proposed PCO method and the SEM method of Lee and Li (2022).

Plot of the proteomic data: Figure 11 plots the time-course measurements for all 20 protein levels in the DREAM breast cancer proteomic dataset.

Rita Giuliano Antonini. Subgaussian random variables in hilbert spaces. Rendiconti del Seminario Matematico della Universit a di Padova, 98:89 99, 1997.

Charles R Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273 289, 1973.

Functional Directed Acyclic Graphs

P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 2604, 2008.

D Bosq. Linear Processes in Function Spaces. Springer, 2000. ISBN 0387950524.

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

P B uhlmann, J Peters, and J Ernest. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526 2556, 2014.

Peter B uhlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, 1st edition, 2011. ISBN 3642201911.

T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594 607, 2011.

Kehui Chen and Jing Lei. Localized functional principal component analysis. Journal of the American Statistical Association, 110(511):1266 1275, 2015.

Xiaohui Chen and Yun Yang. Hanson Wright inequality in Hilbert spaces with application to K-means clustering for non-Euclidean data. Bernoulli, 27(1):586 614, 2021.

D M Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(3):445 498, 2002.

D. Colombo and M. H. Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, pages 3921 3962, 2014.

Y. Fan and J. Lv. Innovated scalable eﬃcient estimation in ultra-large Gaussian graphical models. The Annals of Statistics, 44(5):2098 2126, 2016.

Y. Fan, G.M. James, and P. Radchenko. Functional additive regression. The Annals of Statistics, 43:2296 2325, 10 2015.

Jerome H. Friedman, Trevor J. Hastie, and Robert J. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432 441, 2008.

Karl J. Friston. Functional and eﬀective connectivity: A review. Brain Connectivity, 1(1): 13 36, 2011.

K Fukumizu, F R Bach, and A Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361 383, 2007.

K. Fukumizu, A. Gretton, X. Sun, and B. Sch olkopf. Kernel Measures of Conditional Dependence. Advances in Neural Information Processing Systems, 20:489 496, 2008.

Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Joint estimation of multiple graphical models. Biometrika, 98(1):1 15, mar 2011.

Lee, Li, and Li

Ana Mar ıa Estrada G omez, Kamran Paynabar, and Massimo Pacella. Functional directed graphical models and applications in root-cause analysis and diagnosis. Journal of Quality Technology, 53(4):421 437, 2021.

Naftali Harris and Mathias Drton. PC algorithm for nonparanormal graphical models. Journal of Machine Learning Research, 14(1):3365 3383, 2013.

Alain Hauser and Peter B uhlmann. Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society, Series B., 77:291 318, 2015.

Hill et al. Inferring causal molecular networks: empirical assessment through a communitybased eﬀort. Nature Methods, 13(4):310 318, February 2016.

Markus Kalisch and Peter B uhlmann. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8:613 636, 2007.

T. Kato. Perturbation Theory of Linear Operators. Springer, 1980.

J. L. Kazdan. Perturbation of complete orthonormal sets and eigen-function expansions. Proceedings of the American Mathematical Society, 27:506 510, 1971.

John L. Kelley. General topology. Springer-Verlag, 1955.

S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H. G. Leimer. Independence properties of directed markov ﬁelds. Networks, 20(5):491 505, 1990.

Steﬀen L Lauritzen. Graphical Models. Oxford: Oxford University Press, 1996.

Kuang-Yao Lee and Lexin Li. Functional structural equation model. Journal of the Royal Statistical Society: Series B, 84(2):600 629, April 2022.

Kuang-Yao Lee, Bing Li, and Hongyu Zhao. On an additive partial correlation operator and nonparametric estimation of graphical models. Biometrika, 103:513 530, 2016.

Kuang-Yao Lee, Tianqi Liu, Bing Li, and Hongyu Zhao. Learning causal networks via additive faithfulness. Journal of Machine Learning Research, 21(51):1 38, 2020.

Kuang-Yao Lee, Dingjue Ji, Lexin Li, Todd Constable, and Hongyu Zhao. Conditional functional graphical models. Journal of the American Statistical Association, 2021.

B. Li and E. Solea. A nonparametric graphical model for functional data with application to brain networks based on fmri. Journal of the American Statistical Association, 113 (524):1637 1655, 2018.

Bing Li. Linear operator-based statistical analysis: A useful paradigm for big data. Canadian Journal of Statistics, 46:79 103, 2018.

Bing Li and Jun Song. Nonlinear suﬃcient dimension reduction for functional data. Annals of Statistics, 45(3):1059 1095, 2017. ISSN 00905364.

Functional Directed Acyclic Graphs

Bing Li, Hyonho Chun, and Hongyu Zhao. On an additive semi-graphoid model for statistical networks with application to pathway analysis. Journal of the American Statistical Association, 109:1188 1204, 2014.

Chunlin Li, Xiaotong Shen, and Wei Pan. Likelihood ratio tests for a large directed acyclic graph. Journal of the American Statistical Association, 115(531):1304 1319, 2020.

Yehua Li and Yongtao Guan. Functional principal component analysis of spatiotemporal point processes with applications in disease surveillance. Journal of the American Statistical Association, 109(507):1205 1215, 2014.

Han Liu, Fang Han, Ming Yuan, John Laﬀerty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293 2326, 2012.

Qingyang Liu, Yuping Zhang, and Zhengqing Ouyang. Structural inference of time-varying mixed graphical models. Stat, 10(1):e414, 2021.

Weidong Liu. Structural similarity and diﬀerence testing on multiple sparse Gaussian graphical models. The Annals of Statistics, 45(6):2680 2707, December 2017.

R Luo and X Qi. Function-on-function linear regression by signal compression. Journal of the American Statistical Association, 112(518):690 705, 2016.

Ruiyan Luo and Hongyu Zhao. Bayesian Hierarchical Modeling for Signaling Pathway Inference From Single Cell Interventional Data. The Annals of Applied Statistics, 5(2A): 725 745, 2011.

Marloes H. Maathuis, Markus Kalisch, and Peter B uhlmann. Estimating high-dimensional intervention eﬀects from observational data. The Annals of Statistics, 37(6):3133 3164, 2009.

Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artiﬁcial Intelligence, UAI 95, pages 403 410, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

Ardalan Mirshani and Matthew Reimherr. Adaptive function-on-scalar regression with a smoothing elastic net. Journal of Multivariate Analysis, 185:104765, 2021.

Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge: Cambridge University Press, 2nd Edition, 2009.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735 746, 2009.

J. Peters, J. M. Mooij, D. Janzing, and B. Sch olkopf. Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning research, 15:2009 2053, 2014.

Xinghao Qiao, Shaojun Guo, and Gareth M. James. Functional graphical models. Journal of the American Statistical Association, 114(525):211 222, 2019.

Lee, Li, and Li

Xinghao Qiao, Cheng Qian, Gareth M James, and Shaojun Guo. Doubly functional graphical models in high dimensions. Biometrika, 107(2):415 431, 02 2020. ISSN 0006-3444. doi: 10.1093/biomet/asz072.

Shaojun Guo Qin Fang and Xinghao Qiao. Adaptive functional thresholding for sparse covariance function estimation in high dimensions. Journal of the American Statistical Association, 0(0):1 13, 2023. doi: 10.1080/01621459.2023.2200522. URL https://doi. org/10.1080/01621459.2023.2200522.

Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. Highdimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat., 5:935 980, 2011.

Zhao Ren, Tingni Sun, Cun-Hui Zhang, and Harrison H. Zhou. Asymptotic normality and optimalities in estimation of large gaussian graphical models. The Annals of Statistics, 43(3):991 1026, 2015.

C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 2nd edition, 1979.

Eftychia Solea and Bing Li. Copula gaussian graphical models for functional data. Journal of the American Statistical Association, To appear, 2020.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2000.

Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31 78, 2006.

Caroline Uhler, Garvesh Raskutti, Peter B uhlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2):436 463, April 2013.

Sara van de Geer and Peter B uhlmann. l0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536 567, 2013.

Kartik G. Waghmare, Tomas Masak, and Victor M. Panaretos. The functional graphical lasso. ar Xiv, 2023.

F. Yao and H.-G. M uller. Functional quadratic regression. Biometrika, 97(1):49 64, 2010.

F. Yao, H.-G. M uller, and J.-L. Wang. Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33(6):2873 2903, 2005.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19 35, 2007.

J Zapata, S Y Oh, and A Petersen. Partial separability and functional graphical models for multivariate Gaussian processes. Biometrika, 109(3):665 681, 10 2021. ISSN 1464-3510. doi: 10.1093/biomet/asab046. URL https://doi.org/10.1093/biomet/asab046.

Functional Directed Acyclic Graphs

Boxin Zhao, Y. Samuel Wang, and Mladen Kolar. Fudge: A method to estimate a functional diﬀerential graph in a high-dimensional setting. Journal of Machine Learning Research, 23(82):1 82, 2022.

Hongxiao Zhu, Nate Strawn, and David B. Dunson. Bayesian graphical models for multivariate functional data. Journal of Machine Learning Research, 17(204):1 27, 2016. URL http://jmlr.org/papers/v17/16-164.html.

Lee, Li, and Li

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

400 600 800

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

400 550 700

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

400 800 1200

0 20 40 60 80 100 120

350 500 650

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

400 700 1000

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

0 20 40 60 80 100 120

Figure 11: Plots of 20 protein levels for the proteomic data.