Journal of Machine Learning Research 25 (2024) 1-48. Submitted 9/22; Revised 11/23; Published 3/24

Functional Directed Acyclic Graphs

Kuang-Yao Lee (kuang-yao.lee@temple.edu), Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA 19122, USA

Lexin Li (lexinli@berkeley.edu), Department of Biostatistics and Epidemiology, University of California at Berkeley, Berkeley, CA 94720, USA

Bing Li (bxl9@psu.edu), Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA

Editor: Peter Spirtes

Abstract

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to infinity with the sample size. We demonstrate the efficacy of our method through both simulations and an application to a time-course proteomic dataset.

Keywords: graphical model, faithfulness, functional regression, linear operator, reproducing kernel Hilbert space, uniform consistency

©2024 Kuang-Yao Lee, Lexin Li, and Bing Li. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1038.html.

1. Introduction

In this article, we introduce a new method to estimate a directed acyclic graph (DAG) based on multivariate functional data. Functional graphical modeling is becoming increasingly important, as multivariate functional data, where the observations are sampled from a vector of random functions, are fast emerging in a wide variety of scientific applications. Examples include molecular network modeling based on time-course gene or phosphoprotein data (Hill et al., 2016), and brain effective connectivity analysis based on electrocorticography or functional magnetic resonance imaging data (Friston, 2011). A crucial problem in these applications is to investigate directional relations among the random functions, which is a challenging task. The DAG model offers a tractable solution, and yet functional DAG modeling is a relatively underdeveloped topic.

Consider a graph G = (V, E), where V = {1, . . . , p} denotes a set of vertices associated with a vector of random variables or functions, X1, . . . , Xp, and E ⊆ {(i, j) ∈ V × V : i ≠ j} a set of edges. Let (V × V)_0 denote the set {(i, j) ∈ V × V : i ≠ j} that excludes the diagonal pairs. A pair (i, j) ∈ E denotes an edge between vertices i and j, and this edge is said to be directed from i to j if (j, i) ∉ E. In this relation, i is called a parent of j, and j a child of i. If both (i, j) and (j, i) belong to E, then the edge is said to be undirected. If there is a directed path i → · · · → j from i to j, then i is called an ancestor of j, and j a descendant of i. A DAG is a graph that contains only directed edges and no directed cycles.
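To make these definitions concrete, here is a minimal sketch, in Python and not part of the paper, of a graph stored as a set of ordered pairs, together with a check of the two defining properties of a DAG: every edge is directed, and there is no directed cycle.

```python
from collections import defaultdict

def is_dag(p, edges):
    """Check that (V, E) with V = {1, ..., p} is a DAG: every edge is
    directed ((j, i) absent whenever (i, j) is present), and a depth-first
    search finds no directed cycle."""
    edges = set(edges)
    if any((j, i) in edges for (i, j) in edges):
        return False                      # an undirected edge i -- j
    children = defaultdict(list)
    for i, j in edges:
        children[i].append(j)
    state = {v: 0 for v in range(1, p + 1)}   # 0 = new, 1 = active, 2 = done
    def visit(v):
        state[v] = 1
        for w in children[v]:
            if state[w] == 1 or (state[w] == 0 and not visit(w)):
                return False              # a back edge closes a directed cycle
        state[v] = 2
        return True
    return all(state[v] != 0 or visit(v) for v in list(state))

# Example: 1 -> 2 -> 3 is a DAG; adding 3 -> 1 creates a directed cycle.
assert is_dag(3, {(1, 2), (2, 3)})
assert not is_dag(3, {(1, 2), (2, 3), (3, 1)})
```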
If two edges meet head-to-head at a vertex i on a path, say j → i ← k, then i is called a collider on the path. For S ⊆ V\{i, j}, the vertices i and j are said to be d-connected by S if and only if there exists a path connecting i and j that satisfies: (i) every collider on the path is either in S or has a descendant in S, and (ii) no non-collider on the path is in S. By convention, (i) includes the case where the path has no collider at all. Also note that a descendant of a collider referred to in (i) need not itself lie on the path. We say i and j are d-separated by S if they are not d-connected by S.

In the classical random variable setting, directional relations among the variables can be depicted by faithfulness. Formally, X = (X1, . . . , Xp)^T ∈ R^p is said to be faithful with respect to a DAG G if and only if the collection of triplets {(i, j, S) ∈ T : i and j are d-separated by S} is the same as {(i, j, S) ∈ T : Xi ⊥⊥ Xj | XS}, where ⊥⊥ denotes statistical independence, and T = {(i, j, S) : (i, j) ∈ (V × V)_0, S ⊆ V\{i, j}}. Moreover, when X follows a multivariate Gaussian distribution, we have the equivalence that Xi ⊥⊥ Xj | XS ⟺ cov(Xi, Xj | XS) = 0.

In the functional data setting, the notion of faithfulness is essentially the same. Specifically, suppose X = (X1, . . . , Xp)^T is a p-dimensional Gaussian random element in a Hilbert space of functions defined on an interval T in R. We say X is faithful with respect to a DAG G if and only if the following equivalence holds:

    i and j are d-separated by S ⟺ Xi ⊥⊥ Xj | XS.  (1)

The conditional independence in (1) is in terms of Hilbertian random elements, and is formally defined in Section 2. Through the faithfulness in (1), a DAG is associated with a collection of conditional independence relations among p random functions.

To evaluate the conditional independence in (1), we develop two linear operators, the conditional covariance operator (CCO) and the partial correlation operator (PCO). Under the assumption that the p-variate random function X follows a Gaussian distribution, the conditional independence can be completely characterized by the CCO or the PCO, in the sense that the CCO or PCO is zero if and only if the conditional independence between the random functions holds true. This agrees with the classical result that the conditional independence between two Gaussian variables is equivalent to their conditional covariance or partial correlation being zero. Hence, the CCO and PCO can be viewed as the functional counterparts of the conditional covariance and the partial correlation. We next estimate the DAG by repeatedly evaluating the Hilbert-Schmidt norms of the CCO and PCO, whose computation is straightforward and only involves the eigen-decomposition of linear operators. We further embed the CCO and PCO in the commonly used PC-algorithm (Spirtes et al., 2000), and turn the NP-hard problem of evaluating conditional independence for every pair (i, j) and every possible subset S ⊆ V\{i, j} into a computationally efficient algorithm of order p^m, where m is the maximum degree of the DAG.
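For reference, the skeleton phase of the PC-algorithm that we build on can be sketched as follows, with `ci_test` a placeholder for any conditional independence test, e.g., the thresholded H-S norms of the estimated CCO or PCO developed later. This is a minimal sketch, not the paper's implementation; the conditioning sets are drawn only from current neighbors, which is what yields the p^m order of complexity.

```python
from itertools import combinations

def pc_skeleton(p, ci_test):
    """Estimate the skeleton of a DAG on vertices {1, ..., p}.
    ci_test(i, j, S) should return True when X_i and X_j are judged
    conditionally independent given X_S. Starting from the complete
    graph, edges are removed as separating sets are found; the search
    conditions only on current neighbors, so the cost is governed by
    the maximum degree rather than the full graph size."""
    adj = {i: set(range(1, p + 1)) - {i} for i in range(1, p + 1)}
    sepset = {}
    ell = 0
    while any(len(adj[i] - {j}) >= ell for i in adj for j in adj[i]):
        for i in range(1, p + 1):
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue
                for S in combinations(sorted(adj[i] - {j}), ell):
                    if ci_test(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
        ell += 1
    return adj, sepset
```

The recorded separating sets are what the subsequent orientation phase uses to identify v-structures and obtain the CPDAG.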
We also establish error bounds and the uniform convergence for the estimated CCO and PCO, as well as the uniform consistency of the estimated DAG. We carry out all these theoretical investigations while allowing the graph size to diverge to infinity along with the sample size. We note that our theoretical analysis faces a number of challenges. The first is that, when characterizing the conditional independence between two random functions, we need to deal with diverging numbers of random variables from the Karhunen-Loève expansions of the functions, which requires a much more involved asymptotic analysis than in the classical setting. The second challenge is that most existing concentration inequalities, as well as their sufficient conditions, have been tailored for high-dimensional random variables rather than high-dimensional random functions. To fill this gap, we develop suitable new asymptotic tools, including functional versions of sub-Gaussianity and Bernstein's inequality for Hilbert space-valued random elements. The third challenge is that, because the covariances are replaced by covariance operators, we need to establish several concentration bounds and uniform convergences for the relevant linear operators in the high-dimensional setting. In fact, the theoretical framework we develop here is fairly general, and we expect it to be useful in other functional data analysis settings as well, especially when the number of functions involved is large compared to the sample size. It is also noteworthy that there is some novelty in our presentation of the classical DAG theory: we describe the CPDAG as a member of the quotient space of the collection of all DAGs, which makes its subtle definition more transparent and explicit.

Our work is a natural step forward in the current research on statistical graphical modeling. The majority of existing solutions focus on undirected or directed graphical models for random variables. Notable examples of the former include Yuan and Lin (2007); Friedman et al. (2008); Ravikumar et al. (2011); Cai et al. (2011); Guo et al. (2011); Ren et al. (2015); Fan and Lv (2016); Liu et al. (2021), among others, whereas examples of the latter include Chickering (2002); Kalisch and Bühlmann (2007); Harris and Drton (2013); van de Geer and Bühlmann (2013); Li et al. (2020), among others. More recently, Zhu et al. (2016), Li and Solea (2018), Qiao et al. (2019), Qiao et al. (2020), Solea and Li (2020), and Zhao et al. (2022) extended undirected graphical models to random functions. Even though undirected and directed graphs are related, they differ in important ways, which we discuss in Section 7.3. Despite the recent progress, the problem of estimating functional directed graphical models remains largely underdeveloped. Gómez et al. (2021) reformulated DAG estimation as sparse function-on-function regression (Fan et al., 2015; Luo and Qi, 2016), but required that the causal order be known a priori, which can be unrealistic in practice. Lee and Li (2022) relaxed this requirement and proposed to estimate the DAG through a functional structural equation model (SEM) with two main steps: order determination, then sparse functional regression. Our proposal is considerably different in several ways. First, unlike Gómez et al. (2021), our method does not assume the order is known, and thus the problem is more challenging; see also van de Geer and Bühlmann (2013) for the differences between DAG estimation with and without a known order in the random variable setting. Second, our method is built upon the proposed functional PC-algorithm, and belongs to the category of independence-based solutions, whereas Lee and Li (2022) is built upon function-on-function regression, and belongs to the category of SEM-based approaches; see also Peters et al.
(2014) for the differences between these two categories of solutions in the random variable setting. Owing to this methodological difference, the subsequent asymptotic analysis becomes utterly different too. In Section 7.1, we establish a one-to-one correspondence between the functional DAG and the functional linear SEM, which in turn reveals how the functional DAG factorizes the joint distribution, and how to interpret the identified edges. We further develop in Section 7.2 the notion of do-intervention for possible causal interpretation in the functional setting. None of these results are available in Lee and Li (2022). We also remark that our proposal hinges on linear operators, which are being increasingly employed in graphical modeling (Li et al., 2014; Lee et al., 2016, 2020). Nevertheless, we use linear operators to study a completely different problem. For instance, Lee et al. (2021) studied conditional undirected graphs that vary along with external variables, whereas we target DAG estimation for multivariate random functions. Correspondingly, the estimation method and asymptotic theory are substantially different.

The rest of the article is organized as follows. We introduce the DAG and its equivalence class in Section 2, and develop the linear operators, the CCO and the PCO, in Section 3. We derive their sample estimators and the modified PC-algorithm in Section 4, and develop the asymptotic theory in Section 5. We conduct numerical studies in Section 6, and further discuss the model in Section 7. We relegate all technical proofs and additional numerical results to the Appendix.

2. DAG and its equivalence class for functions

Let (Ω, F, P) denote a probability space. For an interval T ⊆ R and t ∈ T, let X(t) = [X1(t), . . . , Xp(t)]^T denote a vector of multivariate random functions of dimension p, defined on Ω and taking values in Ω_X = Ω_{X1} × · · · × Ω_{Xp}, where each Ω_{Xi} is a Hilbert space of R-valued functions on T, i ∈ V. Let ⟨·, ·⟩_{ΩXi} denote the inner product in Ω_{Xi}, and ‖·‖_{ΩXi} the norm induced by this inner product. We allow p to increase with the sample size n; that is, p = p(n), and lim_{n→∞} p(n) = ∞. Henceforth, V also depends on n implicitly. We abbreviate X(t) and Xi(t) as X and Xi whenever there is no confusion. Next, we introduce two assumptions on X.

Assumption 1 There exists M0 > 0 such that max{E‖Xi‖²_{ΩXi} : i ∈ V} ≤ M0.

Assumption 2 The p-variate random function X is a zero-mean Gaussian random element in Ω_X and is faithful with respect to a DAG G; i.e., X satisfies (1).

Assumption 1 is standard in high-dimensional functional data analysis, and ensures that the trace of the covariance operator of Xi is uniformly bounded. Assumption 2 is our main distributional assumption. The zero-mean condition is to simplify the development, and can be easily relaxed. It is also possible to relax the Gaussian condition by employing the notions of functional additive conditional independence (Li and Solea, 2018) or copula graphical models (Liu et al., 2012; Solea and Li, 2020). Nevertheless, we feel the Gaussian case is important in its own right and is worthy of a full investigation. We leave the non-Gaussian extension to future research.

Under Assumption 1, the bilinear form (f, g) ↦ E(⟨f, Xi⟩_{ΩXi} ⟨g, Xj⟩_{ΩXj}) from Ω_{Xi} × Ω_{Xj} to R is bounded, and induces a bounded linear operator Σ_{XiXj} : Ω_{Xj} → Ω_{Xi} for each (i, j) ∈ V × V. We call it the covariance operator from Ω_{Xj} to Ω_{Xi}. We then define the joint covariance operator Σ_{XX} : Ω_X → Ω_X as the p × p matrix of operators whose (i, j)th entry is Σ_{XiXj}.
This means, for any f = (f1, . . . , fp)^T ∈ Ω_X, we have Σ_{XX} f = (Σ_{j=1}^p Σ_{X1Xj} fj, . . . , Σ_{j=1}^p Σ_{XpXj} fj)^T. Moreover, for any subsets A, B ⊆ V, we define Σ_{XAXB} to be the matrix of operators whose entries are {Σ_{XiXj} : i ∈ A, j ∈ B}; its dimension is |A| × |B|, where |A| denotes the cardinality of A. We note that, in Assumption 2, an Ω_X-valued random element X is Gaussian if and only if ⟨f, X⟩_{ΩX} is a Gaussian random variable for every f = (f1, . . . , fp)^T ∈ Ω_X, or equivalently,

    E exp(Σ_{i=1}^p ι⟨fi, Xi⟩_{ΩXi}) = exp(−(1/2) Σ_{i=1}^p Σ_{j=1}^p ⟨fi, Σ_{XiXj} fj⟩_{ΩXi}),

where ι = √(−1).

Next, we formally define the notion of conditional independence for random functions. For the probability space (Ω, F, P), suppose (Ω_X, F_X), (Ω_Y, F_Y), (Ω_Z, F_Z) are measurable spaces, and X : Ω → Ω_X, Y : Ω → Ω_Y, Z : Ω → Ω_Z are random elements in (Ω_X, F_X), (Ω_Y, F_Y), (Ω_Z, F_Z), respectively. We say that X and Y are conditionally independent given Z if and only if, for every A ∈ F_X and B ∈ F_Y, P(X ∈ A, Y ∈ B | Z) = P(X ∈ A | Z) P(Y ∈ B | Z) almost surely P. In our case, Ω_X, Ω_Y, Ω_Z are separable Hilbert spaces, and F_X, F_Y, F_Z are the Borel σ-fields generated by the open sets in Ω_X, Ω_Y, Ω_Z, respectively. The specific forms of Ω_X, Ω_Y, Ω_Z are determined by context. This general definition satisfies all the relevant axioms of conditional independence, such as those described in Lauritzen (1996, Chapter 3). In particular, as in the setting of undirected graphs, under the Gaussian assumption, the pairwise Markov property is equivalent to the global Markov property.

For a DAG G defined on V, let H = {(i, j, S) : (i, j) ∈ V × V, S ⊆ V\{i, j}}, and H_0 = {(i, j, S) ∈ H : i ≠ j}. Moreover, let

    D(G) = {(i, j, S) ∈ H_0 : i and j are d-separated by S under G},
    F = {(i, j, S) ∈ H_0 : Xi ⊥⊥ Xj | XS}.  (2)

By definition, X is faithful to G if D(G) = F. It is possible for two different DAGs, say G1 and G2, to share the same D; i.e., D(G1) = D(G2). Consequently, from conditional independence and faithfulness, we can only determine the class of DAGs with the same D. Let 𝒢 be the collection of all DAGs defined on V. We say that G1, G2 ∈ 𝒢 are equivalent, and write G1 ~ G2, if and only if D(G1) = D(G2); ~ is an equivalence relation. This is called Markov equivalence (Peters et al., 2014). Thus, each G ∈ 𝒢 induces an equivalence class {G′ ∈ 𝒢 : G′ ~ G}. The collection of all Markov equivalence classes forms a partition of 𝒢, which is a quotient space of 𝒢, denoted by 𝒢/~ (Kelley, 1955), and is referred to as the quotient space of Markov equivalence.

A partially directed graph is a graph in which some edges are undirected. A completed partially directed acyclic graph (CPDAG) is a partially directed acyclic graph L such that there exists a member D of 𝒢/~ satisfying the following properties: (i) each directed edge in L is an edge in every DAG in D; and (ii) for each undirected edge, say i — j, in L, there is a DAG in D with i → j being one of its edges, and another DAG in D with j → i being one of its edges. Chickering (2002, Lemma 2) showed that two DAGs are Markov equivalent if and only if they have the same CPDAG. In other words, a member D of 𝒢/~ corresponds to one and only one CPDAG. This also establishes a bijection between 𝒢/~ and the collection of all CPDAGs.
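By the classical characterization of Markov equivalence (two DAGs are equivalent exactly when they share the same skeleton and the same v-structures), membership in the same equivalence class can be checked directly; the following small sketch is our illustration of that criterion, not code from the paper.

```python
def skeleton(edges):
    """Undirected version of the edge set: each edge kept as an unordered pair."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All colliders j -> i <- k with j and k non-adjacent."""
    edges = set(edges)
    skel = skeleton(edges)
    return {(j, i, k)
            for (j, i) in edges for (k, i2) in edges
            if i == i2 and j < k and frozenset((j, k)) not in skel}

def markov_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

# 1 -> 2 -> 3 and 1 <- 2 <- 3 are Markov equivalent; 1 -> 2 <- 3 is not,
# since it contains the v-structure (1, 2, 3).
assert markov_equivalent({(1, 2), (2, 3)}, {(2, 1), (3, 2)})
assert not markov_equivalent({(1, 2), (2, 3)}, {(1, 2), (3, 2)})
```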
3. Two linear operators

In this section, we formally develop the conditional covariance operator and the partial correlation operator to estimate the CPDAG, and derive their population properties. We begin with some notation.

Let Ω and Ω′ be two Hilbert spaces, B(Ω, Ω′) the collection of all bounded linear operators from Ω to Ω′, B_2(Ω, Ω′) the collection of all Hilbert-Schmidt operators from Ω to Ω′, and B_1(Ω, Ω′) the collection of all trace-class operators from Ω to Ω′. When Ω′ = Ω, we write B(Ω, Ω) as B(Ω), B_2(Ω, Ω) as B_2(Ω), and B_1(Ω, Ω) as B_1(Ω). Let ‖·‖, ‖·‖_HS, and ‖·‖_TR be the operator norm in B(Ω), the Hilbert-Schmidt norm in B_2(Ω), and the trace norm in B_1(Ω), respectively. For a linear operator A, let ker(A) = {f ∈ Ω : Af = 0} and ran(A) = {Af : f ∈ Ω} denote the kernel and the range of A, and let cl[ran(A)] denote the closure of the range, where cl(·) stands for the closure of a set. For a bounded and self-adjoint linear operator A, its restriction to cl[ran(A)] is an injective function from cl[ran(A)] to ran(A). We call the inverse of this restriction the Moore-Penrose inverse, and denote it by A†. That is, A† is the mapping from ran(A) to cl[ran(A)] such that, for any x ∈ ran(A), A†x is the unique member y of cl[ran(A)] satisfying Ay = x. Let A* denote the adjoint of A. For two positive sequences {an} and {bn}, we write an ≾ bn or bn ≿ an if an/bn is bounded, and write an ≺ bn or bn ≻ an if an/bn → 0. If two sequences an and bn are ordered by ≾, then we use an ∧ bn to denote the smaller of the two in terms of ≾. Moreover, we write an ≍ bn if an ≾ bn and bn ≾ an.

For each i ∈ V, let {(λ^a_i, η^a_i)}_{a∈N} be the collection of eigenvalue-eigenfunction pairs of Σ_{XiXi}, with λ^1_i ≥ λ^2_i ≥ · · · ≥ 0 and N being the set of natural numbers {1, 2, . . .}. Then Xi = Σ_{a∈N} c^a_i η^a_i holds almost surely, where the coefficients c^a_i = ⟨Xi, η^a_i⟩ are independent N(0, λ^a_i) random variables. This expansion is known as the Karhunen-Loève (K-L) expansion, and c^1_i, c^2_i, . . . are called the functional principal component scores (Bosq, 2000). We note that the K-L expansion has been widely used in functional data analysis (Yao et al., 2005; Yao and Müller, 2010; Li and Guan, 2014; Chen and Lei, 2015). In particular, Qiao et al. (2019) also used the K-L expansion for undirected functional graphical modeling.

We next formally define the conditional covariance operator. Let M_{XAXB} = Σ†_{XAXA} Σ_{XAXB}, for A, B ⊆ V. Given its resemblance to the regression coefficient matrix in classical regression, we call M_{XAXB} a regression operator; the next assumption ensures that it is well defined.

Assumption 3 Suppose ran(Σ_{XAXB}) ⊆ ran(Σ_{XAXA}), and M_{XAXB} is Hilbert-Schmidt.

The Hilbert-Schmidt assumption on M_{XAXB} regulates the degree of smoothness in the dependence between XA and XB. That is, the output of Σ_{XAXB} needs to concentrate sufficiently on the leading eigenfunctions of Σ_{XAXA}. Similar conditions are commonly imposed in the literature (see, e.g., Lee et al., 2016; Li, 2018).

For any A ⊆ V, let Ω_{XA} be the direct sum of {Ω_{Xi} : i ∈ A}; i.e., Ω_{XA} is the Cartesian product of the Ω_{Xi}, i ∈ A, and the inner product in Ω_{XA} is the sum of the inner products in the Ω_{Xi}, i ∈ A. The next proposition shows that the regression operator M_{XAXB} in fact uniquely determines the mapping f ↦ E(⟨f, XB⟩_{ΩXB} | XA), which characterizes E(XB | XA). In other words, M_{XAXB} has a one-to-one correspondence with E(XB | XA).

Proposition 1 Suppose Assumptions 1 to 3 hold. Then, for f ∈ Ω_{XB}, ⟨M_{XAXB} f, XA⟩_{ΩXA} = E(⟨f, XB⟩_{ΩXB} | XA).

Definition 1 Suppose Assumptions 1 to 3 hold. For (i, j, S) ∈ H, let Σ_{XiXj|XS} : Ω_{Xj} → Ω_{Xi} be the linear operator Σ_{XiXj} − M*_{XSXi} Σ_{XSXS} M_{XSXj}. We call Σ_{XiXj|XS} the conditional covariance operator (CCO) of Xi and Xj given XS.

We note that an equivalent form of the CCO is Σ_{XiXj|XS} = Σ_{XiXj} − Σ_{XiXS} Σ†_{XSXS} Σ_{XSXj}. However, the form in Definition 1 involves the regression operator, and is more conducive to the subsequent proofs.
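For intuition, in any finite-dimensional coordinate system the CCO reduces to the familiar Schur complement of a partitioned covariance matrix; the following numpy sketch, ours and purely illustrative, computes this analogue with a pseudo-inverse standing in for Σ†_{XSXS}.

```python
import numpy as np

def conditional_cov(Sigma, i, j, S):
    """Finite-dimensional analogue of the CCO:
    Sigma_ij|S = Sigma_ij - Sigma_iS pinv(Sigma_SS) Sigma_Sj,
    where i, j, S are lists of coordinate indices into Sigma."""
    S = list(S)
    return (Sigma[np.ix_(i, j)]
            - Sigma[np.ix_(i, S)] @ np.linalg.pinv(Sigma[np.ix_(S, S)])
            @ Sigma[np.ix_(S, j)])

# For jointly Gaussian coordinates, this block is zero exactly when the
# i-block and the j-block are conditionally independent given the S-block.
```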
The next theorem establishes the properties of Σ_{XiXj|XS}.

Theorem 1 Suppose Assumptions 1 to 3 hold. Then, for every (i, j, S) ∈ H_0, (i) Σ_{XiXj|XS} = 0 if and only if Xi ⊥⊥ Xj | XS; (ii) Σ_{XiXj|XS} = Σ_{a,b∈N} cov(c^a_i, c^b_j | XS)(η^a_i ⊗ η^b_j), where c^a_i is the ath K-L coefficient of Xi, and η^a_i is the eigenfunction of Σ_{XiXi} associated with its ath largest eigenvalue.

Theorem 1(i) generalizes to the functional setting the classical result for the case where X is a vector of Gaussian random variables. Consequently, we can use the CCO to characterize the conditional independence between Xi and Xj given XS.

We next define the partial correlation operator, which extends the partial correlation to the functional setting. This is motivated by the observation that the partial correlation achieves better scaling in the classical random variable setting (Peng et al., 2009; Lee et al., 2016; Liu, 2017). The next theorem establishes the existence of the PCO and its connection with conditional independence. Its proof is similar to that of Lee, Li, and Zhao (2016, Theorem 1), and is thus omitted.

Theorem 2 Suppose Assumptions 1 to 3 hold. Then there exists a unique operator R_{XiXj|XS} ∈ B(Ω_{Xj}, Ω_{Xi}) such that (i) Σ_{XiXj|XS} = Σ^{1/2}_{XiXi|XS} R_{XiXj|XS} Σ^{1/2}_{XjXj|XS}; (ii) ‖R_{XiXj|XS}‖ ≤ 1, for every (i, j, S) ∈ H_0. Moreover, R_{XiXj|XS} = 0 if and only if Xi ⊥⊥ Xj | XS.

Definition 2 We call R_{XiXj|XS} in Theorem 2 the partial correlation operator (PCO) of Xi and Xj given XS.

4. Estimation

Theorems 1 and 2 suggest two linear operators, Σ_{XiXj|XS} and R_{XiXj|XS}, to characterize the conditional independence. In this section, we develop their sample estimators, first at the operator level, then at the coordinate level. We then develop a procedure for evaluating Xi ⊥⊥ Xj | XS for a given triplet (i, j, S) ∈ H_0, and a modified PC-algorithm for estimating the entire DAG.

4.1 Operator-level estimation

We first derive the K-L expansion at the sample level. Suppose X^1, . . . , X^n are i.i.d. samples of X, where X^k = (X^k_1, . . . , X^k_p)^T, k = 1, . . . , n. Let E_n represent the sample mean operator; i.e., E_n(U) = Σ_{k=1}^n U^k / n for a set of samples (U^1, . . . , U^n). The covariance operator Σ_{XiXj} is estimated by

    Σ̂_{XiXj} = E_n[(Xi − E_n Xi) ⊗ (Xj − E_n Xj)],

for any (i, j) ∈ V × V. For each i ∈ V, let {(λ̂^a_i, η̂^a_i)}_{a∈N} be the sequence of eigenvalue-eigenfunction pairs of Σ̂_{XiXi}. Then the sample-level K-L expansion of X^k_i − E_n(Xi) is X^k_i − E_n(Xi) = Σ_{a∈N} ĉ^{k,a}_i η̂^a_i, where ĉ^{k,a}_i = ⟨X^k_i − E_n(Xi), η̂^a_i⟩_{ΩXi}. To improve estimation efficiency, we further truncate this expansion at the dth term to obtain the approximation X^k_i − E_n(Xi) ≈ Σ_{a=1}^d ĉ^{k,a}_i η̂^a_i, for all k = 1, . . . , n and i ∈ V. Note that we allow the truncation number d to depend on n. Correspondingly, the truncated estimate of Σ_{XiXj} is Σ̂^d_{XiXj} = Σ_{a,b=1}^d E_n(ĉ^a_i ĉ^b_j)(η̂^a_i ⊗ η̂^b_j).

We next estimate the regression operator M_{XSXi} by

    M̂_{XSXi} = (Σ̂^d_{XSXS} + ϵI)^{−1} Σ̂^d_{XSXi},  (3)

for i ∈ V, where ϵ > 0 is a ridge-type tuning parameter, and I is the identity operator from Ω_{XS} to Ω_{XS}. The inversion is carried out by inverting the eigenvalues; see Proposition 3 in Section 4.2. Following Theorem 1(ii), we estimate the CCO by

    Σ̂^{d,ϵ}_{XiXj|XS} = Σ_{a,b=1}^d {E_n(ĉ^a_i ĉ^b_j) − E_n[⟨M̂_{XSXi} η̂^a_i, XS⟩_{ΩXS} ⟨M̂_{XSXj} η̂^b_j, XS⟩_{ΩXS}]}(η̂^a_i ⊗ η̂^b_j).

Similarly, following Theorem 2(i), we estimate the PCO by

    R̂^{d,ϵ,δ}_{XiXj|XS} = (Σ̂^{d,ϵ}_{XiXi|XS} + δI)^{−1/2} Σ̂^{d,ϵ}_{XiXj|XS} (Σ̂^{d,ϵ}_{XjXj|XS} + δI)^{−1/2},  (4)

where δ > 0 is a ridge parameter that regularizes the two inverses.
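On a common observation grid, the sample K-L step above amounts to a principal component analysis of each node's centered data matrix. The following numpy sketch, ours and simplified in that the grid quadrature is absorbed into a Euclidean inner product, extracts the truncated scores ĉ^{k,a}_i that feed into (3) and (4).

```python
import numpy as np

def fpca_scores(Xi, d):
    """Xi: n x u matrix of one node's curves on a common grid.
    Returns the n x d matrix of truncated sample K-L scores and the
    d leading eigenpairs of the sample covariance operator."""
    Xc = Xi - Xi.mean(axis=0)                 # center: X_i^k - E_n X_i
    # sample covariance (u x u); the Euclidean inner product stands in
    # for the functional one on a regular grid
    Sigma_hat = Xc.T @ Xc / Xi.shape[0]
    lam, eta = np.linalg.eigh(Sigma_hat)
    lam, eta = lam[::-1][:d], eta[:, ::-1][:, :d]   # d leading eigenpairs
    C = Xc @ eta                              # scores c-hat^{k,a} = <X^k - mean, eta^a>
    return C, lam, eta
```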
Note that in (3) and (4), we both truncate the K-L expansion and employ ridge-type regularization. We choose to do so for the following reasons. First, the rank of Σ̂^d_{XSXS} is d·n_s, where n_s is the cardinality of S. This rank can be larger than the sample size n, and thus we need to introduce the extra ridge-type regularization in (3). Meanwhile, since the ranks of Σ̂^{d,ϵ}_{XiXi|XS} and Σ̂^{d,ϵ}_{XjXj|XS} are d, which is typically smaller than n, an alternative estimator of the PCO dispenses with the ridge regularization in (3) and (4), leading to

    R̃^d_{XiXj|XS} = (Σ̃^d_{XiXi|XS})^{−1/2} Σ̃^d_{XiXj|XS} (Σ̃^d_{XjXj|XS})^{−1/2},  (5)

where Σ̃^d_{XiXj|XS} = Σ̂^d_{XiXj} − Σ̂^d_{XiXS} Σ̂^{d†}_{XSXS} Σ̂^d_{XSXj}, and Σ̂^{d†}_{XSXS} = Σ_{a=1}^d (λ̂^a_S)^{−1}(η̂^a_S ⊗ η̂^a_S) is the Moore-Penrose inverse of Σ̂^d_{XSXS}, for (i, j, S) ∈ H_0. Compared with (4), this alternative PCO estimator has two fewer tuning parameters, and is thus computationally easier. However, as we remark after Theorem 4 in Section 5.2, the asymptotic analysis of (4) is much easier than that of (5). Moreover, when d = 1, (5) is closely related to one of the competing methods, linear-PC, that we compare numerically in Section 6.1, and we show that (4) achieves a better empirical performance than (5). Therefore, we propose (4) as our PCO estimator, and build our DAG estimation on (4).

4.2 Coordinate-level evaluation for conditional independence

We begin by constructing the spaces Ω_{Xi} and Ω_X using functional bases. Let H be a generic finite-dimensional Hilbert space spanned by a set of functions B = {b1, . . . , bm} defined on T. For any h ∈ H, let [h]_B = ([h]_{B,1}, . . . , [h]_{B,m})^T denote the coordinate of h with respect to B; that is, h = Σ_{i=1}^m [h]_{B,i} b_i = b^T_{1:m} [h]_B, where b_{1:m} denotes the vector of functions (b1, . . . , bm)^T. Let K_B = {⟨b_s, b_t⟩_H}_{s,t=1}^m be the Gram kernel matrix. Then the inner product ⟨h1, h2⟩_H can be expressed as [h1]^T_B K_B [h2]_B.

We next derive the coordinate representation of X^k, k = 1, . . . , n. Suppose each X^k is measured at u_k time points T_k = {t_{k1}, . . . , t_{ku_k}}, k = 1, . . . , n. Let T_{1:n} = ∪_{k=1}^n T_k, let τ_1 < τ_2 < · · · < τ_ℓ be all the time points in T_{1:n}, and let ℓ = |T_{1:n}| be the total number of distinct time points. Let Ω_{Xi} be the linear space spanned by the functions {κ_T(·, τ) : τ ∈ T_{1:n}}, and let K_T = {κ_T(τ_i, τ_j)}_{i,j=1}^ℓ. Suppose K_T is of rank r, and has the spectral decomposition K_T = U_T D_T U^T_T, where D_T ∈ R^{r×r} is the diagonal matrix whose diagonal elements are the sorted nonzero eigenvalues of K_T, and U_T ∈ R^{ℓ×r} is the matrix whose columns are the corresponding eigenvectors. Let (b1, . . . , br)^T = D^{−1/2}_T U^T_T [κ_T(·, τ_1), . . . , κ_T(·, τ_ℓ)]^T. It is straightforward to verify that B_r = {b1, . . . , br} is an orthonormal basis of Ω_{Xi}. Using this basis, each X^k_i can be represented as X^k_i(·) = Σ_{t=1}^r [X^k_i]_{B_r,t} b_t(·) = b^T_{1:r}(·) [X^k_i]_{B_r}. Note that the kth individual curve X^k_i is observed only at the u_k time points in T_k. Let X^k_i(T_k) = (X^k_i(t_{k1}), . . . , X^k_i(t_{ku_k}))^T and b_{1:r}(T_k) = (b_{1:r}(t_{k1}), . . . , b_{1:r}(t_{ku_k})). Therefore, we have X^k_i(T_k) = b_{1:r}(T_k)^T [X^k_i]_{B_r}; multiplying both sides by b_{1:r}(T_k) gives b_{1:r}(T_k) X^k_i(T_k) = b_{1:r}(T_k) b_{1:r}(T_k)^T [X^k_i]_{B_r}.
Solving this linear equation with a ridge-type regularization, we obtain the following coordinate representation of [X^k_i]_{B_r}, for i ∈ V and k = 1, . . . , n:

    [X^k_i]_{B_r} = [b_{1:r}(T_k) b_{1:r}(T_k)^T + ϵ^k_T I_r]^{−1} b_{1:r}(T_k) X^k_i(T_k),  (6)

where I_r is the r × r identity matrix, and ϵ^k_T is a ridge tuning parameter.

We next derive the coordinate representations of the truncated sample covariance operators. Let H and H′ be two finite-dimensional Hilbert spaces spanned by B = {b1, . . . , bm} and B′ = {b′_1, . . . , b′_{m′}}, respectively. Let A : H → H′ be a linear operator. The coordinate representation of A with respect to B and B′ is defined as B′[A]B = ([Ab1]_{B′}, . . . , [Ab_m]_{B′}). If H″ is a third Hilbert space with basis B″, and A′ is a linear operator from H′ to H″, then B″[A′A]B = (B″[A′]B′)(B′[A]B). For simplicity, we abbreviate B′[A]B by [A] when the bases B, B′ are obvious from the context.

Proposition 2 For each (i, j) ∈ V × V, [Σ̂_{XiXj}] = E_n{[Xi − E_n Xi][Xj − E_n Xj]^T}.

Therefore, by the definition of M̂_{XSXi}, the coordinate representation of M̂_{XSXi} f is [M̂_{XSXi} f] = ([Σ̂_{XSXS}] + ϵ I_{rn_s})^{−1} [Σ̂_{XSXi}][f], for each f ∈ Ω_{Xi}, i ∈ V, and S ⊆ V, with n_s being the cardinality of S. Let {(λ̂^a_i, [η̂^a_i])}_{a=1}^r denote the collection of eigenvalue-eigenvector pairs of [Σ̂_{XiXi}], i ∈ V. Then the truncated sample-level K-L expansion of X^k_i − E_n Xi is

    X^k_i − E_n Xi ≈ Σ_{a=1}^d ⟨X^k_i − E_n Xi, η̂^a_i⟩_{ΩXi} η̂^a_i = Σ_{a=1}^d {[X^k_i − E_n Xi]^T [η̂^a_i]} η̂^a_i.  (7)

Finally, we derive the coordinate representations of the sample CCO and PCO. Recall that ĉ^{k,a}_i denotes the inner product ⟨X^k_i − E_n Xi, η̂^a_i⟩_{ΩXi}. For i ∈ V and a = 1, . . . , d, let C^a_i be the n-dimensional vector {ĉ^{k,a}_i : k = 1, . . . , n}, and let C^{1:d}_i be the n × d matrix whose ath column is C^a_i. For S = {a1, . . . , a_{n_s}} ⊆ V\{i, j}, let C_S be the 1 × n_s block matrix (C^{1:d}_{a1}, . . . , C^{1:d}_{a_{n_s}}). The sample estimates of the CCO and PCO are given in Proposition 3. Its proof can be derived from Proposition 2 and (7), and is omitted.

Proposition 3 The coordinate representations of Σ̂^{d,ϵ}_{XiXj|XS} and R̂^{d,ϵ,δ}_{XiXj|XS} with respect to B′_i = {η̂^1_i, . . . , η̂^d_i} and B′_j = {η̂^1_j, . . . , η̂^d_j} are:

    B′_i[Σ̂^{d,ϵ}_{XiXj|XS}]B′_j = n^{−1}{(C^a_i)^T [I_n − D(S)] C^b_j}_{a,b=1}^d ≡ M_{i,j|S}(ϵ),  (8)
    B′_i[R̂^{d,ϵ,δ}_{XiXj|XS}]B′_j = [M_{i,i|S}(ϵ) + δI_d]^{−1/2} M_{i,j|S}(ϵ) [M_{j,j|S}(ϵ) + δI_d]^{−1/2},  (9)

where D(S) = C_S [(C^T_S C_S + ϵ I_{n_s d})^{−1} C^T_S C_S (C^T_S C_S + ϵ I_{n_s d})^{−1}] C^T_S.

Based on (8), we compute the squared Hilbert-Schmidt (H-S) norm of Σ̂^{d,ϵ}_{XiXj|XS} as

    ‖Σ̂^{d,ϵ}_{XiXj|XS}‖²_HS = Σ_{a=1}^d ⟨Σ̂^{d,ϵ}_{XiXj|XS} η̂^a_j, Σ̂^{d,ϵ}_{XiXj|XS} η̂^a_j⟩_{ΩXi}
      = Σ_{a=1}^d [η̂^a_j]^T_{B′_j} (B′_j[Σ̂^{d,ϵ}_{XjXi|XS}]B′_i)(B′_i[Σ̂^{d,ϵ}_{XiXj|XS}]B′_j) [η̂^a_j]_{B′_j}
      = ‖B′_i[Σ̂^{d,ϵ}_{XiXj|XS}]B′_j‖²_F,

where ‖·‖_F is the Frobenius norm. In other words, the H-S norm of Σ̂^{d,ϵ}_{XiXj|XS} is simply the Frobenius norm of the matrix in (8). Similarly, the H-S norm of R̂^{d,ϵ,δ}_{XiXj|XS} is the Frobenius norm of the matrix in (9). We then threshold these H-S norms to evaluate the conditional independence; that is, we declare Xi ⊥⊥ Xj | XS if

    ‖B′_i[Σ̂^{d,ϵ}_{XiXj|XS}]B′_j‖_F < ρ_CCO, or ‖B′_i[R̂^{d,ϵ,δ}_{XiXj|XS}]B′_j‖_F < ρ_PCO,  (10)

for the CCO-based and the PCO-based procedures, respectively, where ρ_CCO and ρ_PCO are threshold values determined adaptively from the data. Observing that, in the random variable case, the partial correlation can be tested using its Fisher z-transformation, whose variance is approximated by (n − |S| − 3)^{−1} (Harris and Drton, 2013), we take ρ_CCO and ρ_PCO to be proportional to (n − |S| − 3)^{−1/2}.
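A direct transcription of (8) and (9), together with the thresholding rule (10), into numpy might look as follows; here `C` collects the score matrices C^{1:d}_i from the FPCA step, and all names are ours, not the authors' code.

```python
import numpy as np

def D_matrix(C_S, eps):
    """D(S) from Proposition 3; C_S is the n x (n_s * d) block score matrix."""
    G = C_S.T @ C_S
    Ginv = np.linalg.inv(G + eps * np.eye(G.shape[0]))
    return C_S @ (Ginv @ G @ Ginv) @ C_S.T

def inv_sqrt(M):
    """Symmetric inverse square root via the eigen-decomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def cco_pco_test(C, i, j, S, eps, delta):
    """Evaluate X_i _||_ X_j | X_S by thresholding the Frobenius norms in (10).
    C: dict mapping node -> n x d score matrix."""
    n = C[i].shape[0]
    P = np.eye(n) - (D_matrix(np.hstack([C[s] for s in S]), eps) if S
                     else np.zeros((n, n)))
    M = lambda a, b: C[a].T @ P @ C[b] / n        # M_{a,b|S}(eps) in (8)
    cco = M(i, j)
    pco = inv_sqrt(M(i, i) + delta * np.eye(cco.shape[0])) @ cco \
          @ inv_sqrt(M(j, j) + delta * np.eye(cco.shape[1]))
    rho = 1.0 / np.sqrt(n - len(S) - 3)           # threshold scale, up to a constant
    return np.linalg.norm(cco, 'fro') < rho, np.linalg.norm(pco, 'fro') < rho
```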
Our numerical experiments have found that the results are not overly sensitive to the choice of these threshold values as long as they are within a reasonable range. We summarize the above estimation procedure in Algorithm 1.

Algorithm 1 Evaluation of Xi ⊥⊥ Xj | XS for a given triplet (i, j, S) ∈ H_0.
1: Choose a kernel κ_T; e.g., the Brownian motion kernel κ_T(s, t) = min(s, t), or the radial basis kernel κ_T(s, t) = exp{−γ_T(s − t)²}, for (s, t) ∈ T × T, with γ_T a data-driven scale parameter.
2: Compute the coordinates of the observed curves via (6), the truncated K-L scores via (7), and the matrices in (8) and (9).
3: Declare Xi ⊥⊥ Xj | XS according to the thresholding rule (10), with ρ_CCO and ρ_PCO proportional to (n − |S| − 3)^{−1/2}.

5. Asymptotic theory

Definition 3 (sub-Gaussianity) Let H be a Hilbert space. An H-valued random element X is said to be sub-Gaussian if there exist a self-adjoint positive operator Σ ∈ B(H) and a constant b > 0 such that E(exp⟨f, X⟩_H) ≤ exp(b⟨Σf, f⟩_H / 2) for all f ∈ H. We denote this by X ~ subG(Σ, b).

The notion of sub-Gaussianity in Hilbert space was introduced by Antonini (1997, Definition 1.1). See also Chen and Yang (2021); Mirshani and Reimherr (2021); Zapata et al. (2021); Fang and Qiao (2023); Waghmare et al. (2023). Our definition is slightly different: Antonini (1997) allows b = 0, which leads to a degenerate distribution, whereas we require b to be strictly positive.

5.1 Uniform convergence of CCO estimation

We first study the CCO. We begin with an assumption on the smoothness level of Xi.

Assumption 4 There exists γ > 1 such that λ^a_i ≍ a^{−γ} and λ^a_i − λ^{a+1}_i ≿ a^{−1−γ}, as a → ∞, for every i ∈ V.

As our estimators are built on the leading K-L coefficients, it is reasonable to assume that the tail eigenvalues of Σ_{XiXi} diminish sufficiently fast. The first part of Assumption 4 implies that max_{i∈V}(Σ_{a=d+1}^∞ λ^a_i) ≾ d^{1−γ}; that is, the tail eigenvalues of Σ_{XiXi} decay at a polynomial rate in d. The second part of Assumption 4 requires the gaps between adjacent eigenvalues, λ^a_i − λ^{a+1}_i, to decay no faster than a polynomial rate in a, for a ∈ N. The parameter γ encodes the level of smoothness; the larger the value of γ, the faster the decay.

For any m ∈ N ∪ {0}, let H_0(m) = {(i, j, S) ∈ H_0 : |S| ≤ m}, H(m) = {(i, j, S) ∈ H : |S| ≤ m}, and H_1(m) = H(m)\H_0(m). In the following development, we allow p → ∞, d → ∞, m → ∞, and ϵ → 0 as n → ∞. Moreover, given m ∈ N, let

    t(m) = min{‖Σ_{XiXj|XS}‖_HS : Σ_{XiXj|XS} ≠ 0, (i, j, S) ∈ H_0(m)},
    ζ(m, d, p, ϵ, n) = m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ) + m ϵ^{−1} d^{−γ} + ϵ^{1/2} s(m).

Here t(m) is the minimal H-S norm of the nonzero CCOs, and thus t(m) ≾ 1. Next, we introduce an assumption on t(m).

Assumption 5 Suppose t(m) ≻ ζ(m, d, p, ϵ, n).

Assumption 5 places a lower bound on the order of t(m) to prevent it from diminishing too fast. Note that, in the random variable setting, the partial correlation-based PC-algorithm requires a strong faithfulness assumption to ensure the uniform consistency of the estimated graph (Uhler et al., 2013). Assumption 5 is similar, and can be viewed as a functional version of strong faithfulness. Also note that we require t(m) to go to 0 at a slower rate than ζ(m, d, p, ϵ, n), whose first term is m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ). By comparison, for the partial correlation-based PC-algorithm, the order of the minimal nonzero partial correlation has to be greater than (m log p/n)^{1/2} (Uhler et al., 2013).

We next establish the uniform convergence rate of Σ̂^{d,ϵ}_{XiXj|XS}, and the uniform consistency of the estimated CPDAG based on the CCO, Ê^{CCO}_{CPDAG-f}.

Theorem 3 (i) Uniform convergence rate of Σ̂^{d,ϵ}_{XiXj|XS}: Suppose Assumptions 1, 3, 4, and 5 hold, Xi ~ subG(Σ_{XiXi}, b0) with E(Xi) = 0, ϵ ≾ 1, m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ) ≾ 1, and the threshold ρ_CCO = t(m)/2. Then,

    max_{(i,j,S)∈H(m)} ‖Σ̂^{d,ϵ}_{XiXj|XS} − Σ_{XiXj|XS}‖_HS = O_P[m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ) + m ϵ^{−1} d^{−γ} + ϵ^{1/2} s(m)],

where s(m) = max_{(i,i,S)∈H_1(m)} ‖M_{XSXi}‖_HS.
(ii) Uniform graph consistency based on the CCO: If, in addition, Assumption 2 holds, then P(Ê^{CCO}_{CPDAG-f} = E^0_{CPDAG}) → 1 and P(L̂^{CCO}_f = L^0) → 1, as n → ∞, where L denotes the CPDAG.

Theorem 3 requires the maximal degree m to satisfy m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ) ≾ 1, which implies that m can grow at most at a polynomial rate, and thus in turn imposes a level of sparsity on the graph. This rate for m is consistent with the classical settings. For instance, in sparse regression, the order of magnitude of the sparsity parameter can only grow polynomially in n. The classical linear PC-algorithm (Kalisch and Bühlmann, 2007, condition A3) also requires m to grow polynomially in n.

5.2 Uniform convergence of PCO estimation

We next study the PCO. We first show that the norm of R̂^{d,ϵ,δ}_{XiXj|XS} is no greater than 1, which resembles the corresponding property of the partial correlation. We then introduce two assumptions.

Proposition 5 For each (i, j, S) ∈ H_0, we have ‖R̂^{d,ϵ,δ}_{XiXj|XS}‖ ≤ 1.

Assumption 6 There exists c0 > 0 such that

    max{Σ_{a∈M_{i,S}} Σ_{b∈M_{j,S}} (ρ^{a,b}_{i,j,S})²/(μ^a_{i,S} μ^b_{j,S}) : (i, j, S) ∈ H_0} ≤ c0,

where ρ^{a,b}_{i,j,S} = cor(⟨ν^a_{i,S}, Xi⟩_{ΩXi}, ⟨ν^b_{j,S}, Xj⟩_{ΩXj}), M_{i,S} = {a ∈ N : μ^a_{i,S} > 0}, and μ^1_{i,S} ≥ μ^2_{i,S} ≥ · · · and ν^1_{i,S}, ν^2_{i,S}, . . . are the eigenvalues and eigenfunctions of Σ_{XiXi|XS}.

Assumption 6 places a level of smoothness on the relation between Xi and Xj given XS. Under the Gaussian assumption, the correlation ρ^{a,b}_{i,j,S} measures the strength of the dependency between Xi and Xj given XS. Moreover, because Σ_{XiXi|XS} is a trace-class operator, its eigenvalues decay to 0 sufficiently fast that Σ_{a∈N} μ^a_{i,S} < ∞. Assumption 6 thus implies that ρ^{a,b}_{i,j,S} needs to converge to 0 faster than the product μ^a_{i,S} μ^b_{j,S}, as a, b → ∞. Hence, intuitively, the dependency between Xi and Xj given XS has to be sufficiently concentrated on the leading eigenfunctions.

Similar to Assumption 5 for the CCO, we also require the minimal H-S norm of the nonzero PCOs to be sufficiently large. For m ∈ N, let u(m) = min{‖R_{XiXj|XS}‖_HS : R_{XiXj|XS} ≠ 0, (i, j, S) ∈ H_0(m)}. The next assumption is the functional version of strong faithfulness based on the PCO, which prevents u(m) from going to zero too fast.

Assumption 7 Suppose u(m) ≻ δ^{−3/2} ζ(m, d, p, ϵ, n) + δ^{1/2}.

We next establish the uniform convergence rate of R̂^{d,ϵ,δ}_{XiXj|XS}, and the uniform consistency of the estimated CPDAG based on the PCO, Ê^{PCO}_{CPDAG-f}.

Theorem 4 (i) Uniform convergence rate of the PCO: Suppose Assumptions 1, 3, 4, 6, and 7 hold, Xi ~ subG(Σ_{XiXi}, b0) with E(Xi) = 0, ζ(m, d, p, ϵ, n) ≾ 1, δ ≾ 1, and the threshold ρ_PCO = u(m)/2. Then,

    max_{(i,j,S)∈H_0(m)} ‖R̂^{d,ϵ,δ}_{XiXj|XS} − R_{XiXj|XS}‖_HS = O_P[δ^{−3/2} ζ(m, d, p, ϵ, n) + δ^{1/2}].

(ii) Uniform graph consistency based on the PCO: If, in addition, Assumption 2 holds, then P(Ê^{PCO}_{CPDAG-f} = E^0_{CPDAG}) → 1 and P(L̂^{PCO}_f = L^0) → 1, as n → ∞.

Bühlmann and van de Geer (2011, Theorem 13.1) derived the consistency of the PC-algorithm of Kalisch and Bühlmann (2007) in the random variable setting. We next compare our Theorem 4 to theirs. Both results establish the uniform consistency of the parameter estimation and the uniform graph consistency; whereas Bühlmann and van de Geer (2011) was based on the partial correlation, our result is based on the proposed partial correlation operator. As such, Theorem 4 can be viewed as the functional extension of the above-mentioned Theorem 13.1, though such an extension is far from trivial. The conditions imposed by the two theorems are generally similar.
Both allow the graph size p to diverge at an exponential order of the sample size n, and both require the maximum degree m of the DAG to diverge at a polynomial order of n. However, there are some minor differences. For example, Bühlmann and van de Geer (2011, condition A4) required the absolute value of the smallest nonzero partial correlation to go to zero at a rate slower than (m log p/n)^{1/2}. By contrast, our Assumption 7 requires the minimal H-S norm of all nonzero partial correlation operators, u(m), to converge to zero at a rate slower than δ^{−3/2} m d^{3+γ}(log p)^{1/2}/(n^{1/2}ϵ). Our rate is slower than that of Bühlmann and van de Geer (2011), but we feel this is reasonable, as our setting involves infinite-dimensional functions and is more complicated. Also, Bühlmann and van de Geer (2011, condition A4) imposed a regularization on the dependency among the random variables by upper-bounding the partial correlations. We impose a similar condition by requiring that the maximum of the H-S norms of the regression operators M_{XSXi} satisfy ϵ^{1/2} s(m) ≾ 1, for any i ∈ V and any subset S ⊆ V\{i} with |S| ≤ m. Note that ϵ goes to zero at a polynomial rate in n. If ϵ ≍ n^{−b} for some b ∈ (0, 1/2), then this implies that s(m) ≾ n^{b/2}. As such, our condition also regulates the dependency among the random functions.

Following up on our discussion in the last paragraph of Section 4.1, Theorem 4 establishes the uniform convergence rate of our proposed PCO estimator R̂^{d,ϵ,δ}_{XiXj|XS} in (4). By contrast, the theoretical analysis of the alternative estimator R̃^d_{XiXj|XS} in (5) would involve the inversion of Σ̃^d_{XiXi|XS}, where the norm of this inverse is the reciprocal of the smallest nonzero eigenvalue of Σ̃^d_{XiXi|XS}. Let μ^d_{i,S} denote the smallest eigenvalue of Σ^d_{XiXi|XS}, the population version of Σ̃^d_{XiXi|XS}. Then it is easy to see that the rate of convergence of R̃^d_{XiXj|XS} in (5) depends on the rate at which μ^d_{i,S} approaches zero. On the other hand, the norm of the ridge-type inverse (Σ̂^{d,ϵ}_{XiXi|XS} + δI)^{−1} appearing in R̂^{d,ϵ,δ}_{XiXj|XS} in (4) is upper bounded by δ^{−1}, regardless of the magnitude of μ^d_{i,S}. Consequently, the uniform rate of convergence of R̂^{d,ϵ,δ}_{XiXj|XS} in (4) does not depend on μ^d_{i,S}. Therefore, the asymptotic analysis of our PCO estimator is simpler than that of the alternative estimator defined in (5).

6. Numerical studies

We first evaluate the empirical performance of the proposed method through two simulation examples, then illustrate it with an analysis of a time-course proteomic dataset.

6.1 Simulations

Without loss of generality, we assume that {1, 2, . . . , p} is the causal order of the true DAG. Moreover, let pa(i) = {j : (j, i) ∈ E_DAG} denote the parents of i, and let {u1, . . . , uk} be a k-grid in [0, 1] with u1 = 1/k, . . . , uk = 1. We generate the p-dimensional vector of random functions X(t) = [X1(t), . . . , Xp(t)]^T in a sequential manner as

    Model I: X1(t) = ϵ1(t); Xi(t) = Σ_{(j,i)∈E_DAG} Xj(t) + ϵi(t), i = 2, . . . , p;
    Model II: Xi(t) = U_{i1} η1(t) + U_{i2} η2(t); U_{i2} = (|pa(i)| + 1)^{−1} Σ_{j∈pa(i)} U_{j2} + ϵ_i, i = 1, . . . , p.

For Model I, the error functions ϵi(t) = Σ_{j=1}^v ξ_j κ(t, s_j), i = 1, . . . , p, are i.i.d. Gaussian random processes with the Brownian motion covariance function κ(t, s) = min(t, s), where {s1, . . . , s_v} is a v-grid in [0, 1], ξ1, . . . , ξ_v are i.i.d. normal variables with mean zero and standard deviation 5, and v = 10. For Model II, U_{11}, . . . , U_{p1}, ϵ_1, . . . , ϵ_p are i.i.d.
standard normal variables, η_k(t) = a_k χ_k(t), where χ_k(t) = √2 sin((k − 0.5)πt) is the kth eigenfunction of the Brownian motion kernel, k = 1, 2, a1 = 2.5, and a2 = 1. We assume all subjects are observed at the same set of time points, T_k = {t1, . . . , t_u}, which is taken to be a u-grid in [0, 1] with u = 50. We generate the edge set E_DAG via independent Bernoulli variables I[(i, j) ∈ E_DAG], with P{I[(i, j) ∈ E_DAG] = 1} = 2q/(p − 1). The expected number of edges in E_DAG is then qp, with q controlling the sparsity of the graph; we set q = 1.05.

We apply the proposed CCO and PCO estimators to the simulated data. We employ the Brownian motion kernel to construct Ω_{Xi} as the span of {κ_T(·, t_s) : s = 1, . . . , u}, and choose all the tuning parameters following the rules outlined in Algorithm 1. Our preliminary results have found that the PCO outperforms the CCO consistently, owing to the benefit of proper scaling. As such, we only report the PCO results subsequently.

We also compare with some alternative methods. Specifically, we consider the PC-algorithm based on the partial correlation test (linear-PC, Spirtes et al., 2000), the PC-algorithm based on the rank correlation test (rank-PC, Harris and Drton, 2013), and the causal additive models based on high-dimensional penalized regressions (CAM, Bühlmann et al., 2014). To adapt them to the functional setting, for each subject k and node i, we first extract from the observed function X^k_i the first K-L expansion coefficient ĉ^{k,1}_i = ⟨X^k_i − E_n Xi, η̂^1_i⟩_{ΩXi} using (7), which is the first functional PCA score. We then apply the three competing methods to the sample of p-dimensional vectors {(ĉ^{k,1}_1, . . . , ĉ^{k,1}_p)^T : k = 1, . . . , n} to estimate the CPDAG. We comment that such an adaptation is intuitive, but it comes with no theoretical guarantee. We also note that the linear-PC method corresponds to the alternative estimator R̃^d_{XiXj|XS} in (5) with d = 1.

Figure 1: Empirical performance under Model I (structural Hamming distance and true discovery rate, for n = 50 and n = 100). Four methods are compared, from left to right: the modified PC-algorithm based on PCO, linear-PC, rank-PC, and CAM.

We evaluate the performance by two criteria, the structural Hamming distance (SHD, Tsamardinos et al., 2006) and the true discovery rate (TDR), defined as

    SHD(Ê_CPDAG, E_CPDAG) = |(Ê_CPDAG ∪ Ê^T_CPDAG) \ (E_CPDAG ∪ E^T_CPDAG)|/2
      + |(E_CPDAG ∪ E^T_CPDAG) \ (Ê_CPDAG ∪ Ê^T_CPDAG)|/2
      + |{(i, j) ∈ Ê_CPDAG : (i, j) ∉ E_CPDAG, (j, i) ∈ E_CPDAG}|,
    TDR(Ê_CPDAG, E_CPDAG) = |{(i, j) ∈ Ê_CPDAG : (i, j) ∈ E_CPDAG}| / |Ê_CPDAG|,

where, for an edge set E, E^T stands for {(j, i) : (i, j) ∈ E}, E_CPDAG is the true CPDAG, and Ê_CPDAG is its estimate. The three terms in SHD(Ê_CPDAG, E_CPDAG) represent the numbers of deletions, insertions, and reorientations needed to transform Ê_CPDAG into E_CPDAG, respectively.

Figure 2: Empirical performance under Model II (structural Hamming distance and true discovery rate). Four methods are compared, from left to right: the modified PC-algorithm based on PCO, linear-PC, rank-PC, and CAM.

A smaller SHD or a higher TDR indicates more accurate estimation. We choose to report the true discovery rate, instead of the true positive rate or the false positive rate, because the underlying DAG is sparse, and the proportions of the true positives and negatives are highly imbalanced. We vary the sample size n in {50, 100} and the graph size p in {25, 50, 75, 100}, resulting in 8 different scenarios.
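For concreteness, the following sketch, our reconstruction of the stated setup with hypothetical function names, generates one dataset from Model I on the common 50-point grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dag(p, q=1.05):
    """Upper-triangular edge set on nodes 0, ..., p-1: each (i, j), i < j,
    is included independently with probability 2q/(p - 1), so E|edges| = q p."""
    prob = 2 * q / (p - 1)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if rng.random() < prob}

def model_one(n, p, edges, u=50, v=10, sd=5.0):
    """Model I: X_1 = eps_1, X_i = sum_{(j,i) in E} X_j + eps_i, with
    eps_i(t) = sum_j xi_j min(t, s_j) built from the Brownian kernel."""
    t = np.arange(1, u + 1) / u                  # u-grid in [0, 1]
    s = np.arange(1, v + 1) / v                  # v-grid in [0, 1]
    K = np.minimum.outer(t, s)                   # kernel slices min(t, s_j)
    X = np.zeros((n, p, u))
    for i in range(p):                           # nodes in causal order
        eps = rng.normal(0.0, sd, size=(n, v)) @ K.T
        X[:, i] = eps + sum(X[:, j] for j in range(i) if (j, i) in edges)
    return X

X = model_one(n=50, p=25, edges=random_dag(25))  # one simulated dataset
```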
Figures 1 and 2 report the box plots of SHD and TDR based on 80 data replications, for Models I and II, respectively. We observe that the proposed PCO-based method has a performance comparable to the alternative methods in Model I, but clearly outperforms the alternatives in Model II. This is because the first principal component captures most of the variation in Model I, but not in Model II. Moreover, the performance of the PCO improves as the sample size increases, which agrees with our asymptotic theory.

We have also conducted additional simulations with more combinations of (n, p, q), different initializations, different kernel functions, and a comparison with the SEM method of Lee and Li (2022). We report those results in Section A.5 of the Appendix. Overall, our proposed PCO-based method achieves a competitive empirical performance.

6.2 Proteomic application

We illustrate our method with a DREAM breast cancer proteomic dataset (https://www.synapse.org/#!Synapse:syn1720047/wiki/56213). The goal of the study is to estimate the directed relations among different proteins given the time-course proteomic measurements. We consider the in silico data generated using a nonlinear dynamic model whose characteristics match the Reverse Phase Protein Array (RPPA) quantitative proteomics technology (Hill et al., 2016). Based on various combinations of stimuli and inhibitors, the true network, shown in the first panel of Figure 3, is used to generate the time courses of phosphoprotein abundance levels. There are a total of 20 different conditions, and for each condition, 3 independent copies of 20 time-course protein levels were collected at times t = 0, 1, 2, 4, 6, 10, 15, 30, 45, 60, 120 minutes. After the removal of 4 conditions whose protein levels had unusual distributions, the data consist of n = 48 subjects, each with p = 20 protein levels measured at m = 11 time points. Figure 11 in the Appendix plots the time-course data for all 20 protein levels.

Figure 3: True and estimated graphs for the time-course proteomic data. From left to right, top to bottom: the truth, the modified PC-algorithm based on PCO, linear-PC, rank-PC, CAM, and SEM.

Table 1: The performance and comparison for the time-course proteomic data.

    Method                        PCO   linear-PC   rank-PC   CAM    SEM
    Structural Hamming distance    44        50        48      51     50
    True discovery rate          0.70      0.59      0.53    0.48   0.40

Figure 3 reports the estimated graphs by our method, linear-PC, rank-PC, CAM, and the SEM method of Lee and Li (2022), while Table 1 reports the corresponding structural Hamming distance and true discovery rate. We see that our PCO-based method performs the best, achieving the smallest SHD and the highest TDR.
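Under the convention that a CPDAG is stored as a set of ordered pairs, with an undirected edge recorded in both orientations, the two criteria can be computed as below; this is our sketch, mirroring the reconstruction of the SHD formula given above.

```python
def shd(E_hat, E):
    """Structural Hamming distance between two CPDAG edge sets:
    skeleton deletions + skeleton insertions (each counted once)
    plus reorientations (shared skeleton edges with wrong direction)."""
    sym = lambda A: A | {(j, i) for (i, j) in A}
    deletions = len(sym(E_hat) - sym(E)) / 2
    insertions = len(sym(E) - sym(E_hat)) / 2
    reorient = len({(i, j) for (i, j) in E_hat
                    if (i, j) not in E and (j, i) in E})
    return deletions + insertions + reorient

def tdr(E_hat, E):
    """True discovery rate: fraction of estimated edges present in the truth."""
    return len(E_hat & E) / len(E_hat)
```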
7. Discussions

In this section, we discuss the interpretation and implications of the functional DAG model, including its relation to the linear structural equation model, the factorization of the joint distribution, the causal interpretation, and the comparison with an undirected graph.

7.1 Relation to the functional linear structural equation model

We show that there is a one-to-one correspondence between the functional DAG and the functional linear structural equation model. Such a relation reveals how the functional DAG factorizes the joint distribution, and allows us to better interpret and understand the identified edges of the DAG.

We first formally define the linear structural equation model (SEM) for Hilbert space-valued random functions. For a node i ∈ V, a subset S ⊆ V\{i}, and a linear operator B(i, S) ∈ B(Ω_{Xi}, Ω_{XS}), let B_j(i, S) denote the jth suboperator of B(i, S).

Definition 4 We say that X = (X1, . . . , Xp)^T follows a zero-mean, linear structural equation model with respect to a DAG G if, for each i ∈ V, there exists a B(i, pa(i)) ∈ B(Ω_{Xi}, Ω_{Xpa(i)}) such that

    Xi = Σ_{j∈pa(i)} B*_j(i, pa(i)) Xj + ϵi,

where ϵi is a zero-mean random element in Ω_{Xi}, and ϵ1, . . . , ϵp are independent.

We next recall the notion of the global Markov property from (1). That is, X = (X1, . . . , Xp)^T satisfies the global Markov property with respect to G if

    i and j are d-separated by S in G ⟹ Xi ⊥⊥ Xj | XS.  (11)

The next theorem establishes the equivalence between the functional linear SEM in Definition 4 and the global Markov property in (11) under the Gaussian assumption.

Theorem 5 Suppose Assumptions 1 and 3 are satisfied, and X = (X1, . . . , Xp)^T is a zero-mean Gaussian random element in Ω_X. Then the following two statements are equivalent: (i) X satisfies the global Markov property with respect to G; (ii) X follows a linear structural equation model with respect to G.

The structural equation factorizes the joint distribution of X1, . . . , Xp into the product of the conditional distributions of Xi | X_{pa(i)}, i = 1, . . . , p, under G, and provides an interpretation of the edge directions. Consequently, we can interpret the edges of the functional DAG through the functional linear SEM. For instance, in the proteomic application, let X_SOS, X_ERK, X_SHC, and X_ERBB1 denote the random functions of the protein levels of SOS, ERK, SHC, and ERBB1, respectively, and suppose they all have zero means. Because ERK, SHC, and ERBB1 are the parent nodes of SOS in the ground truth, we have

    X_SOS = B*_1 X_ERK + B*_2 X_SHC + B*_3 X_ERBB1 + ϵ,

where ϵ is a zero-mean Gaussian random error function that is independent of X_ERK, X_SHC, and X_ERBB1, and B*_1, B*_2, B*_3 are the linear operators from the ranges of X_ERK, X_SHC, and X_ERBB1 to the range of X_SOS, respectively.

7.2 Potential causal interpretation

Next, we introduce the do-intervention in the functional setting, and discuss the potential causal interpretation of the functional DAG model. For any j ∈ V and any x = (x1, . . . , xp) ∈ Ω_X, let x_{pa(j)} = {x_k : k ∈ pa(j)}, and let LSE_{j,G} : Ω_{Xpa(j)} × Ω_{Xj} → Ω_{Xj} denote the mapping LSE_{j,G}(x_{pa(j)}, x_j) = Σ_{k∈pa(j)} B*_k(j, pa(j)) x_k + x_j. For y ∈ Ω_X, A ⊆ V, i ∈ A, and x_i ∈ Ω_{Xi}, let y_A(y_i → x_i) be the vector y_A = {y_k : k ∈ A} with its member y_i replaced by x_i. In other words, y_A(y_i → x_i) = (y_A \ {y_i}) ∪ {x_i}. Following Pearl (2009, Definition 3.2.1), we obtain the following definition.

Definition 5 Suppose X = (X1, . . . , Xp)^T follows a linear structural equation model with respect to a DAG G, a set of regression operators B(j, pa(j)) ∈ B(Ω_{Xj}, Ω_{Xpa(j)}), j = 2, . . . , p, and the error random functions {ϵj : j ∈ V}. For a fixed i ∈ V and x_i ∈ Ω_{Xi}, the causal effect of Xi = x_i on X_{V\{i}} is the joint distribution of X_{V\{i}} induced by the following p − 1 equations:

    Xj = LSE_{j,G}(X_{pa(j)}(Xi → x_i), ϵj), if pa(j) ∋ i,
    Xj = LSE_{j,G}(X_{pa(j)}, ϵj), if pa(j) ∌ i,

for all j ∈ V\{i}. We denote this joint distribution by P_{X_{V\{i}}|do(x_i)}, and call it the interventional distribution at Xi = x_i.
For any j ∈ V\{i}, let P_{Xj|do(x_i)} denote the marginal distribution of Xj under the interventional distribution at Xi = x_i, P_{Xj} the marginal distribution of Xj, and P_{Xj|x_i} the conditional distribution of Xj | Xi = x_i. Then, by Definition 5, we have

    P_{Xj|do(x_i)} = P_{Xj}, if j is not a descendant of i;
    P_{Xj|do(x_i)} = P_{Xj|x_i}, if j is a descendant of i.

Definition 5 offers one way to define the causal effect for functional data, and is general and potentially useful for time-course interventional studies (Luo and Zhao, 2011). Based on this definition, it is possible to develop a full methodology and theory for modeling functional interventional data, following the lines of Maathuis et al. (2009) and Hauser and Bühlmann (2015), who studied the random variable-based linear SEM for interventional data. However, this requires a substantial amount of additional effort, and to avoid too much digression, we leave it for future research.

7.3 Comparison with undirected graphs

Finally, we illustrate the difference between the undirected functional graphical model and our functional DAG model with a specific example.

Example 1 Suppose X = (X1, X2, X3)^T is a random element in H × H × H, where H is a Hilbert space, and ϵ1, ϵ2, ϵ3 are i.i.d. random elements in H with zero mean and covariance operator Λ. Furthermore, suppose

    X1 = ϵ1, X2 = ϵ2, X3 = X1 + X2 + ϵ3.  (12)

Note that the conditional covariance operator between X1 and X2 given X3 is, by definition,

    Σ_{X1X2|X3} = Σ_{X1X2} − Σ_{X1X3} Σ†_{X3X3} Σ_{X3X2} = 0 − Λ(3Λ)†Λ = −Λ/3 ≠ 0.

Similarly, the conditional covariance operators Σ_{X1X3|X2} = Σ_{X2X3|X1} = Λ. Therefore, by Theorem 1, none of the conditional independences X1 ⊥⊥ X2 | X3, X1 ⊥⊥ X3 | X2, and X2 ⊥⊥ X3 | X1 holds.

Figure 4: The induced undirected graph structure and the CPDAG, based on the relation in model (12).

Figure 4 shows the undirected graph structure and the CPDAG, both induced by the relation in (12). We see that the two graphs are very different. The undirected graphical model studied in Qiao et al. (2019) and Li and Solea (2018) does not offer any structural simplification of the joint distribution of X1, X2, X3 in this example, because there is no zero entry in the precision operator. On the other hand, the directed graphical model targeted by the functional DAG does provide a simplification of the joint distribution.

Appendix A.

In this Appendix, we first present some supporting theoretical results in Section A.1. We then prove the two main theorems, Theorems 3 and 4, in Sections A.2 and A.3. We collect the proofs of the remaining theoretical results in Section A.4, and present additional numerical results in Section A.5.

A.1 Supporting theoretical results

We first derive a useful property of a zero-mean, Hilbert space-valued Gaussian random element. Let π : V → V denote a permutation, i.e., an injective mapping from V to V. Let [a] denote the vector (1, . . . , a)^T for an integer a ≥ 1.

Lemma 1 Suppose Assumptions 1 and 3 hold, and the p-variate random function X = (X1, . . . , Xp)^T is a zero-mean Gaussian random element in Ω_X. Then, for i = 2, . . . , p, there exist a linear operator B(i, [i − 1]) ∈ B_2(Ω_{Xi}, Ω_{X[i−1]}) and an Ω_{Xi}-valued random element ϵi such that

    Xi = Σ_{j=1}^{i−1} B*_j(i, [i − 1]) Xj + ϵi.  (13)

Moreover, for i = 2, . . . , p, B(i, [i − 1]) = M_{X[i−1]Xi}, and X1, ϵ2, . . . , ϵp are independent, zero-mean Gaussian random elements, with E(ϵi ⊗ ϵi) = Σ_{XiXi|X[i−1]} ∈ B_1(Ω_{Xi}). This statement remains true if we replace 1, . . . , p by π(1), . . . , π(p).
Proof of Lemma 1: We first show that, for any i ∈ V and any S ⊆ V\{i}, E(Xi | XS) ∈ Ω_{Xi}. This is because, under Assumptions 1 and 3, M_{XSXi} is well defined, and by Proposition 1, E(Xi | XS) = M*_{XSXi} XS, which is a member of Ω_{Xi}. For i = 2, . . . , p, let ϵi = Xi − E(Xi | X_{[i−1]}). Then Xi = E(Xi | X_{[i−1]}) + ϵi. By Proposition 1, we have Xi = M*_{X[i−1]Xi} X_{[i−1]} + ϵi = Σ_{j=1}^{i−1} (M*_{X[i−1]Xi})_j Xj + ϵi. Therefore, (13) holds with B(i, [i − 1]) = M_{X[i−1]Xi}.

Next, we show that X1, ϵ2, . . . , ϵp are independent. For convenience, denote X1 by ϵ1. Since ϵ1, . . . , ϵp are jointly Gaussian, we only need to check that cov[⟨f, ϵi⟩_{ΩXi}, ⟨g, ϵj⟩_{ΩXj}] = ⟨f, Σ_{ϵiϵj} g⟩_{ΩXi} = 0 for every f ∈ Ω_{Xi} and g ∈ Ω_{Xj}. Suppose i < j. We have

    Σ_{ϵiϵj} = E(ϵi ⊗ ϵj)
      = E{E[(Xi − E(Xi | X_{[i−1]})) ⊗ (Xj − E(Xj | X_{[j−1]})) | X_{[i]}]}
      = E{(Xi − E(Xi | X_{[i−1]})) ⊗ E(Xj − E(Xj | X_{[j−1]}) | X_{[i]})}
      = E{(Xi − E(Xi | X_{[i−1]})) ⊗ [E(Xj | X_{[i]}) − E(Xj | X_{[i]})]} = 0.

It remains to show that Σ_{ϵiϵi} = Σ_{XiXi|X[i−1]} and that Σ_{ϵiϵi} is a member of B_1(Ω_{Xi}). By definition,

    Σ_{ϵiϵi} = E(Xi ⊗ Xi) + E[E(Xi | X_{[i−1]}) ⊗ E(Xi | X_{[i−1]})] − 2E[Xi ⊗ E(Xi | X_{[i−1]})]
             = E(Xi ⊗ Xi) − E[E(Xi | X_{[i−1]}) ⊗ E(Xi | X_{[i−1]})],  (14)

where the second equality holds because E[Xi ⊗ E(Xi | X_{[i−1]})] = E[E(Xi | X_{[i−1]}) ⊗ E(Xi | X_{[i−1]})]. Since the second term on the right-hand side of (14) is equal to

    E[E(Xi | X_{[i−1]}) ⊗ E(Xi | X_{[i−1]})] = E[(M*_{X[i−1]Xi} X_{[i−1]}) ⊗ (M*_{X[i−1]Xi} X_{[i−1]})] = M*_{X[i−1]Xi} Σ_{X[i−1]X[i−1]} M_{X[i−1]Xi},

we have E(ϵi ⊗ ϵi) = Σ_{XiXi|X[i−1]}.

To show that Σ_{ϵiϵi} is a member of B_1(Ω_{Xi}), note that, for any f ∈ Ω_{Xi}, ⟨f, (Σ_{XiXi} − Σ_{XiXi|X[i−1]}) f⟩_{ΩXi} ≥ 0, so that ⟨f, Σ_{XiXi} f⟩_{ΩXi} ≥ ⟨f, Σ_{XiXi|X[i−1]} f⟩_{ΩXi}, which further implies that ‖Σ_{XiXi|X[i−1]}‖_TR ≤ ‖Σ_{XiXi}‖_TR < ∞ by Assumption 1. This completes the proof of Lemma 1. □

Let H be a generic separable Hilbert space. The next lemma extends the classical Bernstein's inequality (Boucheron et al., 2013, Chapter 2) to the functional setting.

Lemma 2 (Bernstein's inequality in Hilbert space) Suppose X is a random element in H with E(X) = 0, and X1, . . . , Xn are i.i.d. samples of X. If

    E(‖X‖^ℓ_H) ≤ b^ℓ ℓ!, for some b > 0 and each ℓ ∈ N,  (15)

then, for any t > 0, P(‖E_n X‖_H > t) ≤ 2 exp{−n [t/(4b) ∧ t²/(8b²)]}.

Proof of Lemma 2: By (15), we have Σ_{i=1}^n E(‖X‖^ℓ_H) ≤ n b^ℓ ℓ!. Therefore, by Bosq (2000, Theorem 2.5),

    P(‖E_n X‖_H > t) ≤ 2 exp[−nt²/(4b² + 2bt)] ≡ 2 exp[−n f(t)].

Moreover, note that f(t) > t/(4b) if t > 2b, and f(t) ≥ t²/(8b²) if t ≤ 2b. This completes the proof of Lemma 2. □

Let {(λ^a_i, η^a_i)}_{a∈N} denote the eigenvalue-eigenfunction pairs of Σ_{XiXi}, with λ^1_i ≥ λ^2_i ≥ · · · ≥ 0, and let c^a_i = ⟨Xi, η^a_i⟩. The next lemma shows that, if Xi follows a sub-Gaussian distribution, then we can bound the moments E(c^a_i)^{2ℓ}, for each a ∈ N and ℓ ∈ N.

Lemma 3 If Xi ~ subG(Σ, b), then E(c^a_i)^{2ℓ} ≤ (4bλ^a_i)^ℓ ℓ!, for a ∈ N and ℓ ∈ N.

Proof of Lemma 3: By the definition of the sub-Gaussianity of Xi, for any s, t > 0,

    P(c^a_i > t) = P[exp(s c^a_i) > exp(st)] ≤ exp(bλ^a_i s²/2 − st),  (16)

where the inequality follows from Markov's inequality. Note that inf_{s>0} exp(bλ^a_i s²/2 − st) ≤ exp[−t²/(2bλ^a_i)], which, together with (16), implies that P(c^a_i > t) ≤ exp[−t²/(2bλ^a_i)] for each t > 0. By a similar argument, P(c^a_i < −t) ≤ exp[−t²/(2bλ^a_i)] for t > 0. Therefore,

    P(|c^a_i| > t) ≤ 2 exp[−t²/(2bλ^a_i)].  (17)

Moreover, for ℓ ∈ N,

    E(c^a_i)^{2ℓ} = ∫_0^∞ P(|c^a_i| > t^{1/(2ℓ)}) dt ≤ 2 ∫_0^∞ exp[−t^{1/ℓ}/(2bλ^a_i)] dt = 2(2bλ^a_i)^ℓ ℓ ∫_0^∞ exp(−x) x^{ℓ−1} dx ≤ (4bλ^a_i)^ℓ ℓ!,

where the first inequality uses (17). This gives the desired bound for E(c^a_i)^{2ℓ}, and completes the proof of Lemma 3. □
The next lemma provides some properties of the Tychonoff-regularized inverse. Its proof is similar to Lee et al. (2020, Lemma A4), and is omitted.

Lemma 4 Let $\Sigma$ and $\Gamma$ be self-adjoint operators in $\mathcal B(\mathcal H)$, and let $I$ be the identity mapping. Then, for any $\epsilon > 0$:
(i) $\|(\Sigma + \epsilon I)^{-1}\| \leq \epsilon^{-1}$;
(ii) $\|(\Sigma + \epsilon I)^{-1}\Gamma\| \leq 1 + \epsilon^{-1}\|\Sigma - \Gamma\|$;
(iii) $\|(\Sigma + \epsilon I)^{-1}\Gamma^{1/2}\| \leq \epsilon^{-1/2}(1 + \epsilon^{-1}\|\Sigma - \Gamma\|)^{1/2}$.

The next lemma is about the perturbation of the covariance operators.

Lemma 5 For a given $i \in V$, let $\{(\lambda_i^a, \eta_i^a)\}_{a=1}^{m+1}$ denote the leading $m+1$ eigenvalue-eigenfunction pairs of $\Sigma_{X_iX_i}$, with $\lambda_i^1 > \lambda_i^2 > \cdots > \lambda_i^{m+1}$. Then
$$\max_{a=1,\ldots,m}\|\hat\eta_i^a - \eta_i^a\| \leq 4\kappa_m^{-1}\|\hat\Sigma_{X_iX_i} - \Sigma_{X_iX_i}\|, \quad \text{where } \kappa_m = \min\{\lambda_i^a - \lambda_i^{a+1} : a = 1, \ldots, m,\ i \in V\}.$$

Proof of Lemma 5: Throughout this proof, write $\Delta_i = \|\hat\Sigma_{X_iX_i} - \Sigma_{X_iX_i}\|$. For a given $i \in V$, let $\tilde\lambda_i^a$ be the member of $\{\hat\lambda_i^1, \ldots, \hat\lambda_i^n\}$ that is closest to $\lambda_i^a$. Then, by Kato (1980, Theorem 4.10),
$$\max\{|\tilde\lambda_i^a - \lambda_i^a| : a = 1, \ldots, n\} \leq \Delta_i.$$
This implies, for all $a = 1, \ldots, n$,
$$\lambda_i^a - \Delta_i \leq \tilde\lambda_i^a \leq \lambda_i^a + \Delta_i.$$
Therefore, for all $a = m+1, \ldots, n$,
$$\tilde\lambda_i^a \leq \lambda_i^a + \Delta_i \leq \lambda_i^{m+1} + \Delta_i. \tag{18}$$
Similarly, for all $a = 1, \ldots, m$,
$$\tilde\lambda_i^a \geq \lambda_i^m - \Delta_i. \tag{19}$$
If $\kappa_m \leq 2\Delta_i$, then the asserted inequality holds automatically. If $\kappa_m > 2\Delta_i$, then $\lambda_i^m - \Delta_i > \lambda_i^{m+1} + \Delta_i$, which, together with (18) and (19), implies that $\min\{\tilde\lambda_i^1, \ldots, \tilde\lambda_i^m\} > \max\{\tilde\lambda_i^{m+1}, \ldots, \tilde\lambda_i^n\}$. Therefore, $\{\tilde\lambda_i^1, \ldots, \tilde\lambda_i^m\} = \{\hat\lambda_i^1, \ldots, \hat\lambda_i^m\}$. Moreover, for any $a = 1, \ldots, m-1$,
$$\tilde\lambda_i^{a+1} \leq \lambda_i^{a+1} + \Delta_i < \lambda_i^a - \Delta_i \leq \tilde\lambda_i^a,$$
which implies that $\tilde\lambda_i^a = \hat\lambda_i^a$, for all $a = 1, \ldots, m$ and $i \in V$. Therefore,
$$\max\{|\hat\lambda_i^a - \lambda_i^a| : a = 1, \ldots, m\} \leq \Delta_i,$$
which, by Kazdan (1971, Lemma 2), leads to the asserted inequality. This completes the proof of Lemma 5. □
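A quick numerical check of the two ingredients of Lemma 5, with matrices standing in for covariance operators: eigenvalues move by at most the operator-norm perturbation (the Kato step), and the leading eigenfunctions move by at most a multiple of that perturbation divided by the eigengap $\kappa_m$. This is an illustration under our own toy setup, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
A = rng.standard_normal((d, d)); Sigma = A @ A.T / d        # "true" operator
E = rng.standard_normal((d, d)); E = (E + E.T) / 2
E *= 0.01 / np.linalg.norm(E, 2)                            # small perturbation
Sigma_hat = Sigma + E

lam, eta = np.linalg.eigh(Sigma);       lam, eta = lam[::-1], eta[:, ::-1]
lam_h, eta_h = np.linalg.eigh(Sigma_hat); lam_h, eta_h = lam_h[::-1], eta_h[:, ::-1]

pert = np.linalg.norm(E, 2)
m = 5
gap = np.min(lam[:m] - lam[1:m + 1])                        # kappa_m
print("max |lam_hat - lam| =", np.max(np.abs(lam_h[:m] - lam[:m])), "<=", pert)
for a in range(m):
    v = eta_h[:, a] * np.sign(eta_h[:, a] @ eta[:, a])      # fix sign ambiguity
    print(f"||eta_hat_{a} - eta_{a}|| = {np.linalg.norm(v - eta[:, a]):.4f}",
          f"bound 4*pert/gap = {4 * pert / gap:.4f}")
```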
The next theorem establishes the concentration bound and the uniform convergence rate of $\|\hat\Sigma_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}}$.

Theorem 6 Suppose Assumption 1 holds, and $X_i \sim \mathrm{subG}(\Sigma_{X_iX_i}, b_0)$ with $E(X_i) = 0$ for $i \in V$. Then, for any $t \geq 0$ and $(i, j) \in V \times V$,
$$P\big(\|\hat\Sigma_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}} > t\big) \leq 2\exp\big\{-n\min\big[t/(4C_0),\ t^2/(8C_0^2)\big]\big\},$$
where $C_0 = \max(2M_0, 8M_0b_0)$, and $M_0$ is as defined in Assumption 1. Moreover, if $\log p/n \to 0$ as $n \to \infty$, then
$$\max_{i,j\in V}\|\hat\Sigma_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}} = O_P\big[(\log p/n)^{1/2}\big].$$

Proof of Theorem 6: For convenience, for two sets $I, J$, we use $\sum_{a,b}^{I,J}$ to abbreviate the double sum $\sum_{a\in I}\sum_{b\in J}$; similarly, for two integers $r$ and $s$, we use $\sum_{a,b}^{r,s}$ to abbreviate $\sum_{a=1}^{r}\sum_{b=1}^{s}$. We first note that, by the triangle and Jensen's inequalities,
$$E\|X_i\otimes X_j - E(X_i\otimes X_j)\|_{\mathrm{HS}}^{\ell} \leq 2^{\ell-1}\big[E\|X_i\otimes X_j\|_{\mathrm{HS}}^{\ell} + \|E(X_i\otimes X_j)\|_{\mathrm{HS}}^{\ell}\big] \equiv 2^{\ell-1}[M_1(\ell) + M_2(\ell)].$$
We next bound $M_1(\ell)$ and $M_2(\ell)$, respectively. For $M_1(\ell)$, let $N_i = \{a \in \mathbb N : \lambda_i^a \neq 0\}$. For $\ell = 1$,
$$M_1(1) \leq E^{1/2}\Big[\sum_{a,b}^{\mathbb N,\mathbb N}(c_i^a)^2(c_j^b)^2\Big] = \Big\{\sum_{a,b}^{\mathbb N,\mathbb N}E\big[(c_i^a)^2(c_j^b)^2\big]\Big\}^{1/2} \leq M_0 = M_0^1\,1!.$$
For any $\ell \geq 2$, we have
$$M_1(\ell) = E\Big|\sum_{a,b}^{N_i,N_j}\langle X_i\otimes X_j, \eta_i^a\otimes\eta_j^b\rangle^2\Big|^{\ell/2} = E\Big[\sum_{a,b}^{N_i,N_j}(c_i^a)^2(c_j^b)^2\Big]^{\ell/2} \leq M_0^{\ell-2}\sum_{a,b}^{N_i,N_j}\lambda_i^a\lambda_j^b\,E\big[(c_i^a)^2(c_j^b)^2/(\lambda_i^a\lambda_j^b)\big]^{\ell/2},$$
where the last inequality follows from Jensen's inequality applied to the convex function $f(u) = u^{\ell/2}$. By Lemma 3 and the Cauchy-Schwarz inequality, $E[(c_i^a)^2(c_j^b)^2/(\lambda_i^a\lambda_j^b)]^{\ell/2} \leq (4b_0)^{\ell}\ell!$. Substituting this into the right-hand side above, we obtain $M_1(\ell) \leq (4b_0M_0)^{\ell}\ell!$ for $\ell \geq 2$. Therefore, for any $\ell \in \mathbb N$, $M_1(\ell) \leq [M_0\max(1, 4b_0)]^{\ell}\ell!$. For $M_2(\ell)$, by the Cauchy-Schwarz inequality,
$$M_2(\ell) = \Big[\sum_{a,b}^{\mathbb N,\mathbb N}E^2(c_i^ac_j^b)\Big]^{\ell/2} \leq \Big(\sum_{a,b}^{\mathbb N,\mathbb N}\lambda_i^a\lambda_j^b\Big)^{\ell/2} \leq M_0^{\ell}. \tag{20}$$
Putting the bounds for $M_1(\ell)$ and $M_2(\ell)$ together, we have
$$E\|X_i\otimes X_j - E(X_i\otimes X_j)\|_{\mathrm{HS}}^{\ell} \leq [\max(2M_0, 8M_0b_0)]^{\ell}\ell! \equiv C_0^{\ell}\ell!,$$
which, by Lemma 2, implies the first statement of this theorem. Next, note that, for any $t > 0$,
$$P\Big(\max_{i,j\in V}\|\hat\Sigma_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}} > t\Big) \leq \sum_{i,j\in V}P\big(\|\hat\Sigma_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}} > t\big) \leq 2p^2\exp\big\{-n\min\big[t/(4C_0),\ t^2/(8C_0^2)\big]\big\},$$
which implies that the second statement of this theorem holds when $\log p/n \to 0$. This completes the proof of Theorem 6. □

Let $\Sigma^d_{X_iX_j} = \sum_{a=1}^{d}\sum_{b=1}^{d}E(c_i^ac_j^b)(\eta_i^a\otimes\eta_j^b)$ be the truncated version of $\Sigma_{X_iX_j}$, $(i, j) \in V\times V$. The next theorem establishes the uniform convergence of $\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}$.

Theorem 7 Suppose Assumption 1 holds, $X_i \sim \mathrm{subG}(\Sigma_{X_iX_i}, b_0)$ with $E(X_i) = 0$ for $i \in V$, and $d^{\gamma+1}(\log p)^{1/2}/n^{1/2} \leq 1$. Then we have
$$\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}} = O_P\big[d^{3+\gamma}(\log p)^{1/2}/n^{1/2}\big].$$

Proof of Theorem 7: Note that $\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}$ is upper bounded by
$$\max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}\big[E_n(\hat c_i^a\hat c_j^b) - E(c_i^ac_j^b)\big]\,\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} + \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}E(c_i^ac_j^b)\big[(\hat\eta_i^a - \eta_i^a)\otimes\hat\eta_j^b + \eta_i^a\otimes(\hat\eta_j^b - \eta_j^b)\big]\Big\|_{\mathrm{HS}} \equiv M_3(n) + M_4(n).$$
Next, we derive the orders of magnitude of $M_3(n)$ and $M_4(n)$. For $M_3(n)$, we have
$$M_3(n) \leq \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}E_n\big[c_i^ac_j^b - E(c_i^ac_j^b)\big]\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} + \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}E_n\big[(\hat c_i^a - c_i^a)(\hat c_j^b - c_j^b)\big]\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} + \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}E_n\big[c_i^a(\hat c_j^b - c_j^b)\big]\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} + \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}E_n\big[(\hat c_i^a - c_i^a)c_j^b\big]\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} \equiv M_{3,1}(n) + \cdots + M_{3,4}(n).$$
For $M_{3,1}(n)$, since $\|\hat\eta_i^a\otimes\hat\eta_j^b\|_{\mathrm{HS}} = 1$, it can be bounded as
$$M_{3,1}(n) = \max_{i,j\in V}\Big\|\sum_{a,b}^{d,d}\big\langle E_n[X_i\otimes X_j] - E(X_i\otimes X_j), \eta_i^a\otimes\eta_j^b\big\rangle\,\hat\eta_i^a\otimes\hat\eta_j^b\Big\|_{\mathrm{HS}} \leq \max_{i,j\in V}\big\|E_n[X_i\otimes X_j - E(X_i\otimes X_j)]\big\|_{\mathrm{HS}}\sum_{a,b}^{d,d}\|\hat\eta_i^a\otimes\hat\eta_j^b\|_{\mathrm{HS}} = d^2\max_{i,j\in V}\big\|E_n[X_i\otimes X_j - E(X_i\otimes X_j)]\big\|_{\mathrm{HS}}.$$
Therefore, by Theorem 6, $M_{3,1}(n) = O_P[d^2(\log p/n)^{1/2}]$. For $M_{3,2}(n)$, it can be bounded as
$$M_{3,2}(n) \leq \max_{i,j\in V}\sum_{a,b}^{d,d}\big|\big\langle E_n[X_i\otimes X_j] - E(X_i\otimes X_j), (\hat\eta_i^a - \eta_i^a)\otimes(\hat\eta_j^b - \eta_j^b)\big\rangle\big| + \max_{i,j\in V}\sum_{a,b}^{d,d}\big|\big\langle E(X_i\otimes X_j), (\hat\eta_i^a - \eta_i^a)\otimes(\hat\eta_j^b - \eta_j^b)\big\rangle\big| \leq 16d^2\kappa_d^{-2}\max_{i,j\in V}\Big[\|\hat\Sigma_{X_iX_i} - \Sigma_{X_iX_i}\|\,\|\hat\Sigma_{X_jX_j} - \Sigma_{X_jX_j}\|\big(\|E_n[X_i\otimes X_j - E(X_i\otimes X_j)]\|_{\mathrm{HS}} + \|E(X_i\otimes X_j)\|_{\mathrm{HS}}\big)\Big],$$
where the last inequality holds because, by Lemma 5, $\sum_{a,b}^{d,d}\|(\hat\eta_i^a - \eta_i^a)\otimes(\hat\eta_j^b - \eta_j^b)\|_{\mathrm{HS}} \leq 16d^2\kappa_d^{-2}\|\hat\Sigma_{X_iX_i} - \Sigma_{X_iX_i}\|\,\|\hat\Sigma_{X_jX_j} - \Sigma_{X_jX_j}\|$. By (20), $\|E(X_i\otimes X_j)\|_{\mathrm{HS}} \leq M_0$. Therefore, $M_{3,2}(n) = O_P[d^2(\log p)/(n\kappa_d^2)]$. For $M_{3,3}(n)$, by Lemma 5 again, it can be bounded by
$$M_{3,3}(n) \leq 4d^2\kappa_d^{-1}\max_{i,j\in V}\Big\{\|\hat\Sigma_{X_jX_j} - \Sigma_{X_jX_j}\|\big[\|E_n[X_i\otimes X_j - E(X_i\otimes X_j)]\|_{\mathrm{HS}} + \|E(X_i\otimes X_j)\|_{\mathrm{HS}}\big]\Big\}.$$
By Theorem 6 again, $M_{3,3}(n) = O_P[d^2(\log p)^{1/2}/(n^{1/2}\kappa_d)]$. Similarly, we can show that $M_{3,4}(n)$ has the same order of magnitude as $M_{3,3}(n)$. For $M_4(n)$, we have
$$M_4(n) \leq \max_{i,j\in V}\|E(X_i\otimes X_j)\|_{\mathrm{HS}}\sum_{a,b}^{d,d}\big(\|\hat\eta_i^a - \eta_i^a\| + \|\hat\eta_j^b - \eta_j^b\|\big) \leq 4d^2\kappa_d^{-1}M_0\max_{i,j\in V}\big(\|\hat\Sigma_{X_iX_i} - \Sigma_{X_iX_i}\| + \|\hat\Sigma_{X_jX_j} - \Sigma_{X_jX_j}\|\big),$$
which leads to $M_4(n) = O_P[d^2(\log p)^{1/2}/(n^{1/2}\kappa_d)]$. Combining the orders of magnitude of $M_{3,1}(n), \ldots, M_{3,4}(n)$, and $M_4(n)$, we obtain
$$\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}} = O_P\big[d^2(\log p/n)^{1/2}\big] + O_P\big[d^2\log p/(n\kappa_d^2)\big] + O_P\big[d^2(\log p)^{1/2}/(n^{1/2}\kappa_d)\big].$$
Because $\kappa_d \to 0$ and, under the condition of the theorem, $(\log p)^{1/2}/(n^{1/2}\kappa_d) \leq 1$, the third term on the right-hand side is the dominating term, which is equal to the desired rate because $\kappa_d^{-1} \lesssim d^{1+\gamma}$ by Assumption 4. This completes the proof of Theorem 7. □

Theorem 7 generalizes the convergence result for the high-dimensional covariance matrix (Bickel and Levina, 2008) to the high-dimensional covariance operator. We remark that, under the Gaussian assumption, Qiao et al. (2019) also established a concentration bound for the sample covariance of the leading K-L expansion coefficients. However, Theorem 7 differs from the result of Qiao et al. (2019), in that it provides uniform convergence at the operator level, which cannot be derived directly from their result. Moreover, Theorems 6 and 7 do not require the random process to be Gaussian, but only require the distribution to be sub-Gaussian.
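The truncated operator $\Sigma^d$ of Theorem 7 is straightforward to form empirically: project each sample onto the top-$d$ eigenfunctions of its own sample covariance operator and keep the $d \times d$ block of Karhunen-Loeve coefficient covariances. The sketch below does this with functions represented on a grid; the names, dimensions, and data-generating choices are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)
n, grid, d = 500, 40, 4
t = np.linspace(0, 1, grid)
basis = np.stack([np.sin((k + 1) * np.pi * t) / (k + 1) for k in range(10)])
Xi = rng.standard_normal((n, 10)) @ basis
Xj = Xi + 0.5 * rng.standard_normal((n, 10)) @ basis   # correlated with Xi

def top_eigenfunctions(X, d):
    """Leading d eigenfunctions of the empirical covariance operator."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc / np.sqrt(len(X)), full_matrices=False)
    return Vt[:d]                                      # rows are eta^1..eta^d

eta_i, eta_j = top_eigenfunctions(Xi, d), top_eigenfunctions(Xj, d)
ci = (Xi - Xi.mean(0)) @ eta_i.T                       # K-L scores c_i^a
cj = (Xj - Xj.mean(0)) @ eta_j.T
coef = ci.T @ cj / n                                   # E_n(c_i^a c_j^b), d x d
Sigma_d = eta_i.T @ coef @ eta_j                       # truncated Sigma^d on grid
print("HS norm of truncated Sigma^d:", np.linalg.norm(Sigma_d))
```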
A.2 Proof of Theorem 3

To prove this theorem, we first introduce an intermediate operator $\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}$. We then derive the order of magnitude of the difference between $\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ and $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ in Lemma 6, and of the difference between $\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ and $\Sigma_{X_iX_j|X_S}$ in Lemma 7, respectively. These two lemmas together lead to the first assertion, i.e., the uniform convergence of $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}$. Lastly, we derive the uniform consistency of the estimated graph.

For any $A, B \subseteq V$, let $\Sigma^d_{X_AX_B}$ be the matrix of operators $\{\Sigma^d_{X_iX_j}\}_{i\in A, j\in B}$, and let
$$\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S} = \Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_S}\big[\Sigma^d_{X_SX_S}(\epsilon)\big]^{\dagger}\Sigma^d_{X_SX_j},$$
where $\epsilon > 0$ is a tuning constant that decreases to $0$ as $n \to \infty$, and $[A(\epsilon)]^{\dagger}$ represents $(A + \epsilon I)^{-1}A(A + \epsilon I)^{-1}$. The term $\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ plays the role of an intermediate operator between $\Sigma_{X_iX_j|X_S}$ and $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}$.

Lemma 6 Suppose Assumptions 1 and 3 hold, $X_i \sim \mathrm{subG}(\Sigma_{X_iX_i}, b_0)$ with $E(X_i) = 0$ for $i \in V$, $m \geq 1$, $\epsilon \leq 1$, and $md^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon) \leq 1$. Then
$$\max_{(i,j,S)\in\mathcal H(m)}\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}\|_{\mathrm{HS}} = O_P\big\{md^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon)\big\}.$$

Proof of Lemma 6: Because $\hat M_{X_SX_i} = (\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}\hat\Sigma^d_{X_SX_i}$, we have $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} = \hat\Sigma^d_{X_iX_j} - \hat\Sigma^d_{X_iX_S}[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}\hat\Sigma^d_{X_SX_j}$. Therefore,
$$\max_{(i,j,S)\in\mathcal H(m)}\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq \max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|(\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S})[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}\hat\Sigma^d_{X_SX_j}\big\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|\Sigma^d_{X_iX_S}\big\{[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger} - [\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}\big\}\hat\Sigma^d_{X_SX_j}\big\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|\Sigma^d_{X_iX_S}[\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}(\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j})\big\|_{\mathrm{HS}} \equiv M_5(n) + M_6(n) + M_7(n) + M_8(n).$$
The order of magnitude of $M_5(n)$ is given in Theorem 7. We next derive the orders of magnitude of $M_6(n)$, $M_7(n)$, and $M_8(n)$, respectively. For $M_6(n)$, we have
$$M_6(n) \leq \max_{(i,j,S)\in\mathcal H(m)}\big\|(\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S})(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}\hat\Sigma^d_{X_SX_S}(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}\Sigma^d_{X_SX_j}\big\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|(\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S})(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}\hat\Sigma^d_{X_SX_S}(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}(\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j})\big\|_{\mathrm{HS}}. \tag{21}$$
The first term on the right, by Lemma 4 and $\|\hat\Sigma^d_{X_SX_S}(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1}\| \leq 1$, is upper bounded by
$$\epsilon^{-1}\max_{(i,j,S)\in\mathcal H(m)}\big[\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}\,\|\Sigma^d_{X_SX_j}\|_{\mathrm{HS}}\big] \leq \epsilon^{-1}\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}\,\max_{(j,j,S)\in\mathcal H_1(m)}\|\Sigma^d_{X_SX_j}\|_{\mathrm{HS}}. \tag{22}$$
By a special case of (20), $\|\Sigma_{X_jX_i}\|_{\mathrm{HS}} \leq M_0$. Hence,
$$\max_{(j,j,S)\in\mathcal H_1(m)}\|\Sigma^d_{X_SX_j}\|_{\mathrm{HS}} = \max_{(j,j,S)\in\mathcal H_1(m)}\Big(\sum_{i\in S}\|\Sigma^d_{X_iX_j}\|_{\mathrm{HS}}^2\Big)^{1/2} \leq \max_{(j,j,S)\in\mathcal H_1(m)}\Big(\sum_{i\in S}\|\Sigma_{X_iX_j}\|_{\mathrm{HS}}^2\Big)^{1/2} \leq m^{1/2}M_0.$$
So the right-hand side of (22) is further bounded by $\epsilon^{-1}m^{1/2}M_0\,(\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}})$. Similarly, we can bound the second term on the right of (21) by $\epsilon^{-1}\max_{(i,j,S)\in\mathcal H(m)}(\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}\,\|\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j}\|_{\mathrm{HS}})$. Therefore, we have
$$M_6(n) \leq \epsilon^{-1}\Big[m^{1/2}M_0\Big(\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}\Big) + \max_{(i,j,S)\in\mathcal H(m)}\big(\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}\,\|\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j}\|_{\mathrm{HS}}\big)\Big].$$
For $M_7(n)$, let $[A(\epsilon)]^{-}$ denote $(A + \epsilon I)^{-1}$, so that $[A(\epsilon)]^{\dagger} = [A(\epsilon)]^{-}A[A(\epsilon)]^{-}$. In this notation, the difference $[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger} - [\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}$ can be decomposed as
$$\big\{[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-} - [\Sigma^d_{X_SX_S}(\epsilon)]^{-}\big\}\hat\Sigma^d_{X_SX_S}[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-} + [\Sigma^d_{X_SX_S}(\epsilon)]^{-}(\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S})[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-} + [\Sigma^d_{X_SX_S}(\epsilon)]^{-}\Sigma^d_{X_SX_S}\big\{[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-} - [\Sigma^d_{X_SX_S}(\epsilon)]^{-}\big\} \equiv \Gamma_1(n) + \Gamma_2(n) + \Gamma_3(n).$$
Moreover, $\|\Sigma^d_{X_iX_S}\Gamma_1(n)\hat\Sigma^d_{X_SX_j}\|_{\mathrm{HS}}$ is bounded by
$$\big\|\Sigma^d_{X_iX_S}[\Sigma^d_{X_SX_S}(\epsilon)]^{-}\big\|\,\big\|\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S}\big\|_{\mathrm{HS}}\,\big\|[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-}\hat\Sigma^d_{X_SX_S}[\hat\Sigma^d_{X_SX_S}(\epsilon)]^{-}\hat\Sigma^d_{X_SX_j}\big\|. \tag{23}$$
The first norm in (23), by Baker (1973, Theorem 1) and Lemma 4, is upper bounded by $\epsilon^{-1/2}\|\Sigma^d_{X_iX_i}\|^{1/2} \leq [\mathrm{tr}(\Sigma_{X_iX_i})/\epsilon]^{1/2} \leq (M_0/\epsilon)^{1/2}$. The third norm in (23) is upper bounded by
$$\epsilon^{-1/2}\big[\big\|(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1/2}\Sigma^d_{X_SX_j}\big\| + \big\|(\hat\Sigma^d_{X_SX_S} + \epsilon I)^{-1/2}(\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j})\big\|\big].$$
By a similar argument to that used to bound the first norm, we can further bound the above by
$$\epsilon^{-1/2}\big[\big(M_0 + \epsilon^{-1}M_0\|\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S}\|\big)^{1/2} + \epsilon^{-1/2}\|\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j}\|\big].$$
Therefore,
$$\max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^d_{X_iX_S}\Gamma_1(n)\hat\Sigma^d_{X_SX_j}\|_{\mathrm{HS}} \leq M_0^{1/2}\epsilon^{-1}\Big(\max_{(i,j,S)\in\mathcal H(m)}\|\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S}\|_{\mathrm{HS}}\Big)\max_{(i,j,S)\in\mathcal H(m)}\big[\big(M_0 + \epsilon^{-1}M_0\|\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S}\|\big)^{1/2} + \epsilon^{-1/2}\|\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j}\|\big]. \tag{24}$$
Moreover, $\max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^d_{X_iX_S}\Gamma_2(n)\hat\Sigma^d_{X_SX_j}\|_{\mathrm{HS}}$ and $\max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^d_{X_iX_S}\Gamma_3(n)\hat\Sigma^d_{X_SX_j}\|_{\mathrm{HS}}$ can be bounded by the right-hand side of (24). Therefore, $M_7(n)$ is bounded by three times the quantity on the right-hand side of (24). For $M_8(n)$, similarly to the derivation of the bound for $M_6(n)$, we can show that
$$M_8(n) \leq \epsilon^{-1}m^{1/2}M_0\Big(\max_{(i,j,S)\in\mathcal H(m)}\|\hat\Sigma^d_{X_SX_j} - \Sigma^d_{X_SX_j}\|_{\mathrm{HS}}\Big).$$
On the other hand, note that
$$\max_{(i,j,S)\in\mathcal H(m)}\|\hat\Sigma^d_{X_SX_S} - \Sigma^d_{X_SX_S}\|_{\mathrm{HS}} = \max_{(i,j,S)\in\mathcal H(m)}\Big(\sum_{i,j\in S}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}^2\Big)^{1/2},$$
which is no greater than $m\,(\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}})$. By the same derivation, $\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^d_{X_iX_S} - \Sigma^d_{X_iX_S}\|_{\mathrm{HS}}$ is bounded by $m^{1/2}\,(\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}})$. Combining the bounds for $M_6(n)$, $M_7(n)$, and $M_8(n)$, and applying Theorem 7 as well as the condition $md^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon) \leq 1$, we have
$$M_6(n) + M_7(n) + M_8(n) \lesssim \epsilon^{-1}m\Big(\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}\Big) + m\Big(\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}\Big)^2 + \epsilon^{-1/2}m^{1/2}\Big(\max_{i,j\in V}\|\hat\Sigma^d_{X_iX_j} - \Sigma^d_{X_iX_j}\|_{\mathrm{HS}}\Big) = O_P\big[md^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon)\big] + O_P\big[md^{6+2\gamma}(\log p)/n\big] + O_P\big[m^{1/2}d^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon^{1/2})\big].$$
Since $d^{3+\gamma}(\log p)^{1/2}/n^{1/2} \leq 1$, $\epsilon \leq 1$, and $m \geq 1$, the first term on the right is the dominating term. This completes the proof of Lemma 6. □

Lemma 7 Suppose Assumptions 1, 3, and 4 hold. Then
$$\max_{(i,j,S)\in\mathcal H(m)}\|\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}} = O\big\{m\epsilon^{-1}d^{-\gamma} + \epsilon^{1/2}s(m)\big\}.$$

Proof of Lemma 7: Let $\Sigma^{\epsilon}_{X_iX_j|X_S} = \Sigma_{X_iX_j} - \Sigma_{X_iX_S}[\Sigma_{X_SX_S}(\epsilon)]^{\dagger}\Sigma_{X_SX_j}$ be the intermediate operator between $\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ and $\Sigma_{X_iX_j|X_S}$. By the triangle inequality,
$$\max_{(i,j,S)\in\mathcal H(m)}\|\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq \max_{(i,j,S)\in\mathcal H(m)}\|\tilde\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma^{\epsilon}_{X_iX_j|X_S}\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^{\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}} \equiv M_9(n) + M_{10}(n).$$
We next derive the orders of magnitude of $M_9(n)$ and $M_{10}(n)$. For $M_9(n)$, we have
$$M_9(n) \leq \max_{i,j\in V}\|\Sigma^d_{X_iX_j} - \Sigma_{X_iX_j}\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|(\Sigma^d_{X_iX_S} - \Sigma_{X_iX_S})[\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger}\Sigma^d_{X_SX_j}\big\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|\Sigma_{X_iX_S}\big\{[\Sigma^d_{X_SX_S}(\epsilon)]^{\dagger} - [\Sigma_{X_SX_S}(\epsilon)]^{\dagger}\big\}\Sigma^d_{X_SX_j}\big\|_{\mathrm{HS}} + \max_{(i,j,S)\in\mathcal H(m)}\big\|\Sigma_{X_iX_S}[\Sigma_{X_SX_S}(\epsilon)]^{\dagger}(\Sigma^d_{X_SX_j} - \Sigma_{X_SX_j})\big\|_{\mathrm{HS}},$$
whose order of magnitude, by a similar argument as in the proof of Lemma 6, is no greater than that of
$$\max_{i,j\in V}\|\Sigma^d_{X_iX_j} - \Sigma_{X_iX_j}\| + \epsilon^{-1/2}\max_{(i,j,S)\in\mathcal H(m)}\big(\|\Sigma^d_{X_iX_S} - \Sigma_{X_iX_S}\|_{\mathrm{HS}} + \|\Sigma^d_{X_SX_j} - \Sigma_{X_SX_j}\|_{\mathrm{HS}}\big) + \epsilon^{-1}\max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^d_{X_SX_S} - \Sigma_{X_SX_S}\|.$$
Let $N_d = \{d+1, d+2, \ldots\}$. Then
$$\max_{(i,j,S)\in\mathcal H(m)}\|\Sigma^d_{X_SX_S} - \Sigma_{X_SX_S}\|_{\mathrm{HS}} = \max_{(i,j,S)\in\mathcal H(m)}\Big(\sum_{i,j\in S}\sum_{a,b}^{N_d,N_d}E^2(c_i^ac_j^b)\Big)^{1/2} \leq \max_{(i,j,S)\in\mathcal H(m)}\Big(\sum_{i,j\in S}\sum_{a,b}^{N_d,N_d}\lambda_i^a\lambda_j^b\Big)^{1/2},$$
whose order is $O(md^{-\gamma})$ by Assumption 4. Similarly, we can show that $\max_{(i,i,S)\in\mathcal H_1(m)}\|\Sigma^d_{X_iX_S} - \Sigma_{X_iX_S}\|_{\mathrm{HS}} = O(m^{1/2}d^{-\gamma})$. Therefore, $M_9(n) \lesssim m\epsilon^{-1}d^{-\gamma}$. For $M_{10}(n)$, we note that, by the definitions of $\Sigma^{\epsilon}_{X_iX_j|X_S}$, $\Sigma_{X_iX_j|X_S}$, and $M_{X_SX_j}$,
$$\Sigma^{\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S} = \Sigma_{X_iX_S}\big\{[\Sigma_{X_SX_S}(\epsilon)]^{\dagger}\Sigma_{X_SX_S} - I\big\}M_{X_SX_j}.$$
By the definition of $[A(\epsilon)]^{\dagger}$, we can rewrite the right-hand side of the above equation as
$$\Sigma_{X_iX_S}\big[\epsilon^2(\Sigma_{X_SX_S} + \epsilon I)^{-2} - 2\epsilon(\Sigma_{X_SX_S} + \epsilon I)^{-1}\big]M_{X_SX_j}.$$
Therefore,
$$M_{10}(n) \leq s(m)\max_{(i,i,S)\in\mathcal H_1(m)}\big\|\Sigma_{X_iX_S}\big[\epsilon^2(\Sigma_{X_SX_S} + \epsilon I)^{-2} - 2\epsilon(\Sigma_{X_SX_S} + \epsilon I)^{-1}\big]\big\|,$$
whose order, by Lemma 4, is no greater than $\epsilon^{1/2}s(m)$. Combining the orders of $M_9(n)$ and $M_{10}(n)$ completes the proof of Lemma 7. □
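In coordinates, the estimator analyzed in Lemma 6 is simple to compute: after truncating each function to its top-$d$ Karhunen-Loeve scores, every block of $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}$ is a small matrix, and the regularized pseudo-inverse $[A(\epsilon)]^{\dagger} = (A+\epsilon I)^{-1}A(A+\epsilon I)^{-1}$ is a matrix product. The following sketch is our own illustration of this recipe, with placeholder names, not the authors' code.

```python
import numpy as np

def kl_scores(X, d):
    """Top-d Karhunen-Loeve scores of each sample of a random function."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc / np.sqrt(len(X)), full_matrices=False)
    return Xc @ Vt[:d].T

def cond_cov_hs(scores, i, j, S, eps=1e-2):
    """HS norm of the regularized conditional covariance estimate (Lemma 6)."""
    n = len(scores[i])
    cS = np.hstack([scores[k] for k in S])            # stacked scores over S
    S_ij = scores[i].T @ scores[j] / n
    S_iS = scores[i].T @ cS / n
    S_Sj = cS.T @ scores[j] / n
    S_SS = cS.T @ cS / n
    R = np.linalg.inv(S_SS + eps * np.eye(len(S_SS)))  # (A + eps I)^{-1}
    return np.linalg.norm(S_ij - S_iS @ R @ S_SS @ R @ S_Sj)

# usage: scores = {k: kl_scores(X[k], d) for k in range(p)}; then
# cond_cov_hs(scores, 0, 1, [2]) approximates the HS norm of the
# estimated conditional covariance operator of X_0 and X_1 given X_2.
```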
Proof of Theorem 3: Combining Lemmas 6 and 7, we immediately obtain the uniform convergence rate of $\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}$, which is the first assertion. For the second assertion, we first note that
$$P\big(\{\hat E^{\mathrm{f\text{-}CCO}}_{\mathrm{CPDAG}} \neq E^0_{\mathrm{CPDAG}}\} \cup \{\hat\ell^{\mathrm{f\text{-}CCO}} \neq \ell^0\}\big) \leq P\big[\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}\|_{\mathrm{HS}} > \rho^{\mathrm{f\text{-}CCO}},\ \Sigma_{X_iX_j|X_S} = 0,\ \text{for some } (i,j,S)\in\mathcal H_0(m)\big] + P\big[\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq \rho^{\mathrm{f\text{-}CCO}},\ \Sigma_{X_iX_j|X_S} \neq 0,\ \text{for some } (i,j,S)\in\mathcal H_0(m)\big]. \tag{25}$$
The first term in (25) is further bounded by
$$p^*(m) \equiv P\big[\max\{\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}} : (i,j,S)\in\mathcal H_0(m)\} \geq t(m)/2\big].$$
Moreover, by the definition of $t(m)$, the second term in (25) is no greater than
$$P\big[\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}} \geq t(m)/2,\ \Sigma_{X_iX_j|X_S} \neq 0,\ \text{for some } (i,j,S)\in\mathcal H_0(m)\big],$$
which is again bounded by $p^*(m)$. By the first assertion and Assumption 5, $p^*(m) \to 0$. We thus obtain the second assertion. This completes the proof of Theorem 3. □

A.3 Proof of Theorem 4

Similarly to the proof of Theorem 3, to prove this theorem we first introduce another intermediate operator $R^{\delta}_{X_iX_j|X_S}$ between $\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}$ and $R_{X_iX_j|X_S}$:
$$R^{\delta}_{X_iX_j|X_S} = (\Sigma_{X_iX_i|X_S} + \delta I)^{-1/2}\Sigma_{X_iX_j|X_S}(\Sigma_{X_jX_j|X_S} + \delta I)^{-1/2}.$$
Then, by the triangle inequality,
$$\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq \|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R^{\delta}_{X_iX_j|X_S}\|_{\mathrm{HS}} + \|R^{\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|_{\mathrm{HS}}.$$
We next derive the order of magnitude of the difference between $\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}$ and $R^{\delta}_{X_iX_j|X_S}$ in Lemma 8, and of the difference between $R^{\delta}_{X_iX_j|X_S}$ and $R_{X_iX_j|X_S}$ in Lemma 9, respectively.

Lemma 8 Suppose the conditions in Theorem 3(i) hold, $\zeta(m,d,p,\epsilon,n) \leq 1$, and $\delta \leq 1$. Then
$$\max_{(i,j,S)\in\mathcal H_0(m)}\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R^{\delta}_{X_iX_j|X_S}\|_{\mathrm{HS}} = O_P\big[\delta^{-3/2}\zeta(m,d,p,\epsilon,n)\big].$$
Moreover, if $\delta^{-3/2}\zeta(m,d,p,\epsilon,n) \to 0$, then $\max_{(i,j,S)\in\mathcal H_0(m)}\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R^{\delta}_{X_iX_j|X_S}\| \to 0$ in probability.

Proof of Lemma 8: Let $\Gamma_4(n) = \hat\Sigma^{d,\epsilon}_{X_iX_i|X_S}$, $\Gamma_5(n) = \Sigma_{X_iX_i|X_S}$, $\Gamma_6(n) = \hat\Sigma^{d,\epsilon}_{X_iX_j|X_S}$, $\Gamma_7(n) = \Sigma_{X_iX_j|X_S}$, $\Gamma_8(n) = \hat\Sigma^{d,\epsilon}_{X_jX_j|X_S}$, and $\Gamma_9(n) = \Sigma_{X_jX_j|X_S}$. By the triangle inequality,
$$\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R^{\delta}_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq \big\|\big\{[\Gamma_4(n) + \delta I]^{-1/2} - [\Gamma_5(n) + \delta I]^{-1/2}\big\}\Gamma_6(n)[\Gamma_8(n) + \delta I]^{-1/2}\big\|_{\mathrm{HS}} + \big\|[\Gamma_5(n) + \delta I]^{-1/2}[\Gamma_6(n) - \Gamma_7(n)][\Gamma_8(n) + \delta I]^{-1/2}\big\|_{\mathrm{HS}} + \big\|[\Gamma_5(n) + \delta I]^{-1/2}\Gamma_7(n)\big\{[\Gamma_8(n) + \delta I]^{-1/2} - [\Gamma_9(n) + \delta I]^{-1/2}\big\}\big\|_{\mathrm{HS}} \equiv M_{11}^{i,j,S}(n) + M_{12}^{i,j,S}(n) + M_{13}^{i,j,S}(n).$$
We next bound $M_{11}^{i,j,S}(n)$, $M_{12}^{i,j,S}(n)$, and $M_{13}^{i,j,S}(n)$, respectively. For $M_{11}^{i,j,S}(n)$, using the identity
$$A^{-1/2} - B^{-1/2} = A^{-1/2}(B^{3/2} - A^{3/2})B^{-3/2} + (A - B)B^{-3/2}$$
(Fukumizu et al., 2007), we have $M_{11}^{i,j,S}(n) \leq [M_{11,1}^{i,j,S}(n) + M_{11,2}^{i,j,S}(n)]\,M_{11,3}^{i,j,S}(n)$, where
$$M_{11,1}^{i,j,S}(n) = \big\|[\Gamma_5(n) + \delta I]^{-1/2}\big\{[\Gamma_5(n) + \delta I]^{3/2} - [\Gamma_4(n) + \delta I]^{3/2}\big\}[\Gamma_4(n) + \delta I]^{-1}\big\|_{\mathrm{HS}},$$
$$M_{11,2}^{i,j,S}(n) = \big\|[\Gamma_4(n) - \Gamma_5(n)][\Gamma_4(n) + \delta I]^{-1}\big\|_{\mathrm{HS}}, \quad M_{11,3}^{i,j,S}(n) = \big\|[\Gamma_4(n) + \delta I]^{-1/2}\Gamma_6(n)[\Gamma_8(n) + \delta I]^{-1/2}\big\|.$$
By the inequality $\max(a^{1/2}, b^{1/2}) \leq (a + b)^{1/2}$, Lemma 4, and Fukumizu et al. (2008, Lemma 7), we have
$$M_{11,1}^{i,j,S}(n) \leq 3\delta^{-3/2}\max\big(\|\Gamma_4(n) + \delta I\|^{1/2}, \|\Gamma_5(n) + \delta I\|^{1/2}\big)\|\Gamma_4(n) - \Gamma_5(n)\|_{\mathrm{HS}} \leq 3\delta^{-3/2}\big[\|\Gamma_4(n) - \Gamma_5(n)\| + \|\Gamma_5(n) + \delta I\|\big]^{1/2}\|\Gamma_4(n) - \Gamma_5(n)\|_{\mathrm{HS}}.$$
Because $\Sigma_{X_iX_i|X_S} \preceq \Sigma_{X_iX_i}$, we have $\|\Gamma_5(n) + \delta I\| = \|\Sigma_{X_iX_i|X_S} + \delta I\| \leq \|\Sigma_{X_iX_i}\| + \delta \leq M_0 + \delta$. Therefore,
$$M_{11,1}^{i,j,S}(n) \leq 3\delta^{-3/2}\big(\|\Gamma_4(n) - \Gamma_5(n)\| + M_0 + \delta\big)^{1/2}\|\Gamma_4(n) - \Gamma_5(n)\|_{\mathrm{HS}}.$$
For $M_{11,2}^{i,j,S}(n)$, we have $M_{11,2}^{i,j,S}(n) \leq \|[\Gamma_4(n) + \delta I]^{-1}\|\,\|\Gamma_4(n) - \Gamma_5(n)\|_{\mathrm{HS}} \leq \delta^{-1}\|\Gamma_4(n) - \Gamma_5(n)\|_{\mathrm{HS}}$. For $M_{11,3}^{i,j,S}(n)$, by Proposition 5, we have $M_{11,3}^{i,j,S}(n) \leq 1$. Combining the upper bounds for $M_{11,1}^{i,j,S}(n)$, $M_{11,2}^{i,j,S}(n)$, and $M_{11,3}^{i,j,S}(n)$, and taking the maximum over $\mathcal H_0(m)$, we obtain
$$\max_{(i,j,S)\in\mathcal H_0(m)}M_{11}^{i,j,S}(n) \leq \Big(\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^{d,\epsilon}_{X_iX_i|X_S} - \Sigma_{X_iX_i|X_S}\|_{\mathrm{HS}}\Big)\Big[3\delta^{-3/2}\Big(\max_{(i,i,S)\in\mathcal H_1(m)}\|\hat\Sigma^{d,\epsilon}_{X_iX_i|X_S} - \Sigma_{X_iX_i|X_S}\|_{\mathrm{HS}} + M_0 + \delta\Big)^{1/2} + \delta^{-1}\Big].$$
By Theorem 3, the right-hand side is dominated by $\zeta(m,d,p,\epsilon,n)\{3\delta^{-3/2}[\zeta(m,d,p,\epsilon,n) + M_0 + \delta]^{1/2} + \delta^{-1}\}$. Therefore, if $\zeta(m,d,p,\epsilon,n) \leq 1$, then
$$\max_{(i,j,S)\in\mathcal H_0(m)}M_{11}^{i,j,S}(n) = O_P\big[\delta^{-3/2}\zeta(m,d,p,\epsilon,n)\big]. \tag{26}$$
It remains to show that the orders of magnitude of the terms $\max_{(i,j,S)\in\mathcal H_0(m)}M_{12}^{i,j,S}(n)$ and $\max_{(i,j,S)\in\mathcal H_0(m)}M_{13}^{i,j,S}(n)$ are dominated by (26). Using derivations similar to those for (26),
$$\max_{(i,j,S)\in\mathcal H_0(m)}M_{12}^{i,j,S}(n) \leq \delta^{-1}\Big(\max_{(i,j,S)\in\mathcal H_0(m)}\|\hat\Sigma^{d,\epsilon}_{X_iX_j|X_S} - \Sigma_{X_iX_j|X_S}\|_{\mathrm{HS}}\Big) = O_P\big[\delta^{-1}\zeta(m,d,p,\epsilon,n)\big],$$
$$\max_{(i,j,S)\in\mathcal H_0(m)}M_{13}^{i,j,S}(n) = O_P\big[\delta^{-3/2}\zeta(m,d,p,\epsilon,n)\big],$$
both of which are dominated by (26). This completes the proof of Lemma 8. □

Lemma 9 Suppose Assumptions 1, 3, and 6 hold, and $\delta \leq 1$. Then
$$\max_{(i,j,S)\in\mathcal H_0(m)}\|R^{\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|_{\mathrm{HS}} = O(\delta^{1/2}).$$

Proof of Lemma 9: By definition, $\|R^{\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|^2_{\mathrm{HS}}$ is equal to
$$\sum_{a,b}^{M_{i,S},M_{j,S}}\big\langle \nu^a_{i,S}, \big[(\Sigma_{X_iX_i|X_S} + \delta I)^{-1/2}\Sigma_{X_iX_j|X_S}(\Sigma_{X_jX_j|X_S} + \delta I)^{-1/2} - R_{X_iX_j|X_S}\big]\nu^b_{j,S}\big\rangle^2_{\Omega_{X_i}} = \sum_{a,b}^{M_{i,S},M_{j,S}}\frac{\big[(\mu^a_{i,S} + \delta)^{1/2}(\mu^b_{j,S} + \delta)^{1/2} - (\mu^a_{i,S})^{1/2}(\mu^b_{j,S})^{1/2}\big]^2\,\langle \nu^a_{i,S}, R_{X_iX_j|X_S}\nu^b_{j,S}\rangle^2_{\Omega_{X_i}}}{(\mu^a_{i,S} + \delta)(\mu^b_{j,S} + \delta)},$$
where $M_{i,S}$, $M_{j,S}$, $\mu^a_{i,S}$, $\mu^b_{j,S}$, $\nu^a_{i,S}$, $\nu^b_{j,S}$ are as defined right before Assumption 6. Moreover,
$$\langle \nu^a_{i,S}, R_{X_iX_j|X_S}\nu^b_{j,S}\rangle^2_{\Omega_{X_i}} = \Big\langle \frac{\nu^a_{i,S}}{(\mu^a_{i,S})^{1/2}}, \Sigma_{X_iX_j|X_S}\frac{\nu^b_{j,S}}{(\mu^b_{j,S})^{1/2}}\Big\rangle^2_{\Omega_{X_i}} = (\rho^{a,b}_{i,j,S})^2,$$
which implies that $\|R^{\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|^2_{\mathrm{HS}}$ is upper bounded by
$$\sum_{a,b}^{M_{i,S},M_{j,S}}\big[(\mu^a_{i,S} + \delta)^{1/2}(\mu^b_{j,S} + \delta)^{1/2} - (\mu^a_{i,S})^{1/2}(\mu^b_{j,S})^{1/2}\big]^2(\rho^{a,b}_{i,j,S})^2/(\mu^a_{i,S}\mu^b_{j,S}).$$
By direct calculation, for any $a \in M_{i,S}$, $b \in M_{j,S}$,
$$\big[(\mu^a_{i,S} + \delta)^{1/2}(\mu^b_{j,S} + \delta)^{1/2} - (\mu^a_{i,S})^{1/2}(\mu^b_{j,S})^{1/2}\big]^2 \leq \delta(\mu^a_{i,S} + \mu^b_{j,S}) + \delta^2,$$
which is no greater than $2\delta M_0 + \delta^2$, because $\mu^a_{i,S} \leq \|\Sigma_{X_iX_i|X_S}\| \leq \|\Sigma_{X_iX_i}\| \leq M_0$. Therefore, by Assumption 6,
$$\max_{(i,j,S)\in\mathcal H_0(m)}\|R^{\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|_{\mathrm{HS}} \leq c_0^{1/2}(2\delta M_0 + \delta^2)^{1/2} = O(\delta^{1/2}).$$
This completes the proof of Lemma 9. □

Proof of Theorem 4: Combining Lemmas 8 and 9, we immediately obtain the uniform convergence rate of $\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S} - R_{X_iX_j|X_S}\|_{\mathrm{HS}}$. The second assertion can be obtained following a proof similar to that of Theorem 3. This completes the proof of Theorem 4. □

A.4 Proofs of other theoretical results

Proof of Proposition 1: By Assumption 2, for $f \in \Omega_{X_B}$, $g \in \Omega_{X_A}$, and $(t_1, t_2) \in \mathbb R^2$,
$$E\exp\Big\{\iota\Big[t_1\big(\langle f, X_B\rangle_{\Omega_{X_B}} - \langle M_{X_AX_B}f, X_A\rangle_{\Omega_{X_A}}\big) + t_2\langle g, X_A\rangle_{\Omega_{X_A}}\Big]\Big\} = \exp\Big\{-\frac{1}{2}\Big\langle\begin{pmatrix}t_1f\\ t_2g - t_1M_{X_AX_B}f\end{pmatrix}, \begin{pmatrix}\Sigma_{UU} & \Sigma_{UV}\\ \Sigma_{VU} & \Sigma_{VV}\end{pmatrix}\begin{pmatrix}t_1f\\ t_2g - t_1M_{X_AX_B}f\end{pmatrix}\Big\rangle\Big\},$$
where $\iota = \sqrt{-1}$, $U = X_B$, and $V = X_A$. By direct calculation, the inner product in the above expression equals
$$t_1^2\langle f, \Sigma_{UU}f\rangle_{\Omega_{X_B}} - 2t_1^2\langle f, \Sigma_{UV}M_{X_AX_B}f\rangle_{\Omega_{X_B}} + 2t_1t_2\langle f, \Sigma_{UV}g\rangle_{\Omega_{X_B}} - 2t_1t_2\langle g, \Sigma_{VV}M_{X_AX_B}f\rangle_{\Omega_{X_A}} + t_1^2\langle M_{X_AX_B}f, \Sigma_{VV}M_{X_AX_B}f\rangle_{\Omega_{X_A}} + t_2^2\langle g, \Sigma_{VV}g\rangle_{\Omega_{X_A}} = t_1^2\langle f, (\Sigma_{UU} - \Sigma_{UV}\Sigma^{\dagger}_{VV}\Sigma_{VU})f\rangle_{\Omega_{X_B}} + t_2^2\langle g, \Sigma_{VV}g\rangle_{\Omega_{X_A}} \equiv t_1^2\sigma_f^2 + t_2^2\sigma_g^2,$$
where we have used the relations $\langle M_{X_AX_B}f, \Sigma_{VV}M_{X_AX_B}f\rangle_{\Omega_{X_A}} = \langle f, \Sigma_{UV}M_{X_AX_B}f\rangle_{\Omega_{X_B}} = \langle f, \Sigma_{UV}\Sigma^{\dagger}_{VV}\Sigma_{VU}f\rangle_{\Omega_{X_B}}$ and $\langle g, \Sigma_{VV}M_{X_AX_B}f\rangle_{\Omega_{X_A}} = \langle f, \Sigma_{UV}g\rangle_{\Omega_{X_B}}$. This implies that $\langle f, X_B\rangle_{\Omega_{X_B}} - \langle M_{X_AX_B}f, X_A\rangle_{\Omega_{X_A}}$ and $\langle g, X_A\rangle_{\Omega_{X_A}}$ are independent Gaussian variables with variances $\sigma_f^2$ and $\sigma_g^2$, respectively. Since this holds for all $f \in \Omega_{X_B}$ and $g \in \Omega_{X_A}$, we have $\langle f, X_B\rangle_{\Omega_{X_B}} - \langle M_{X_AX_B}f, X_A\rangle_{\Omega_{X_A}} \perp\!\!\!\perp X_A$, which completes the proof of Proposition 1. □

Proof of Theorem 1: Following the proof of Proposition 1, we can show that, for any $(f, g) \in \Omega_{X_i}\times\Omega_{X_j}$, the conditional distribution of $(\langle f, X_i\rangle, \langle g, X_j\rangle)$ given $X_S$ is
$$N\left(\begin{pmatrix}\langle M_{X_SX_i}f, X_S\rangle\\ \langle M_{X_SX_j}g, X_S\rangle\end{pmatrix}, \begin{pmatrix}\langle f, \Sigma_{X_iX_i|X_S}f\rangle & \langle f, \Sigma_{X_iX_j|X_S}g\rangle\\ \langle g, \Sigma_{X_jX_i|X_S}f\rangle & \langle g, \Sigma_{X_jX_j|X_S}g\rangle\end{pmatrix}\right),$$
which implies (i). Because, by Assumption 3, $\mathrm{ran}(\Sigma_{X_iX_j|X_S}) \subseteq \mathrm{ran}(\Sigma_{X_iX_i})$ and $\mathrm{ran}(\Sigma_{X_jX_i|X_S}) \subseteq \mathrm{ran}(\Sigma_{X_jX_j})$, it suffices to show that, for any $f \in \mathrm{ran}(\Sigma_{X_iX_i})$ and $g \in \mathrm{ran}(\Sigma_{X_jX_j})$,
$$\langle f, \Sigma_{X_iX_j|X_S}g\rangle_{\Omega_{X_i}} = \Big\langle f, \Big[\sum_{a,b\in\mathbb N}\mathrm{cov}(c_i^a, c_j^b \mid X_S)(\eta_i^a\otimes\eta_j^b)\Big]g\Big\rangle_{\Omega_{X_i}}. \tag{27}$$
Since $f \in \mathrm{ran}(\Sigma_{X_iX_i})$ and $g \in \mathrm{ran}(\Sigma_{X_jX_j})$, we have $f = \sum_{a\in\mathbb N}\langle f, \eta_i^a\rangle_{\Omega_{X_i}}\eta_i^a$ and $g = \sum_{b\in\mathbb N}\langle g, \eta_j^b\rangle_{\Omega_{X_j}}\eta_j^b$. Substituting these into the left-hand side of (27), we obtain the right-hand side of (27). Thus, (ii) holds. This completes the proof of Theorem 1. □
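Building on the earlier conditional covariance sketch, the regularized partial correlation operator of Theorem 4 can be formed by sandwiching the conditional covariance block between inverse square roots of the two regularized conditional variance blocks. The following is a hedged illustration with placeholder names, not the authors' implementation.

```python
import numpy as np

def cond_cov(scores, i, j, S, eps=1e-2):
    """Regularized conditional covariance block in K-L coordinates."""
    n = len(scores[i])
    cS = np.hstack([scores[k] for k in S])
    Sab = lambda A, B: A.T @ B / n
    R = np.linalg.inv(Sab(cS, cS) + eps * np.eye(cS.shape[1]))
    return (Sab(scores[i], scores[j])
            - Sab(scores[i], cS) @ R @ Sab(cS, cS) @ R @ Sab(cS, scores[j]))

def inv_sqrt(A, delta):
    """(A + delta I)^{-1/2} via eigendecomposition of a symmetric matrix."""
    lam, U = np.linalg.eigh((A + A.T) / 2)
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative eigenvalues
    return U @ np.diag((lam + delta) ** -0.5) @ U.T

def pco_hs(scores, i, j, S, eps=1e-2, delta=1e-2):
    """HS norm of the estimated partial correlation operator R^{d,eps,delta}."""
    Sij = cond_cov(scores, i, j, S, eps)
    Li = inv_sqrt(cond_cov(scores, i, i, S, eps), delta)
    Lj = inv_sqrt(cond_cov(scores, j, j, S, eps), delta)
    return np.linalg.norm(Li @ Sij @ Lj)
```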
Proof of Proposition 2: Note that, for any $a, b \in \mathrm{span}(\mathcal B_r)$,
$$[a\otimes b] = \big([(a\otimes b)b_1], \ldots, [(a\otimes b)b_r]\big) = \big([a]\langle b, b_1\rangle, \ldots, [a]\langle b, b_r\rangle\big) = \big([a][b]^{\mathsf T}e_1, \ldots, [a][b]^{\mathsf T}e_r\big) = [a][b]^{\mathsf T},$$
where $e_i$ is the $r$-dimensional vector with its $i$th element equal to $1$ and the other elements equal to $0$. Noting that $\hat\Sigma_{X_iX_j} = E_n[(X_i - E_nX_i)\otimes(X_j - E_nX_j)]$ completes the proof of Proposition 2. □

Proof of Proposition 4: By faithfulness, $X_i \perp\!\!\!\perp X_j \mid X_S$ if and only if $i$ and $j$ are d-separated by $S$. Following the proof of Kalisch and Bühlmann (2007, Proposition 1), we can show that this equivalence implies that the output of Step 1 of the functional-PC0 is the true skeleton $E_{\mathrm{SKE}}$, which further implies $\ell^0 \leq m$ by the definition of $m$. Therefore (ii) holds. Moreover, by Meek (1995), the output from Step 2 of functional-PC0 is the CPDAG of $G$, which implies (i). This completes the proof of Proposition 4. □

Proof of Proposition 5: By the definition of $\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}$,
$$\mathrm{ran}(\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}) \subseteq \mathrm{ran}(\hat\Sigma^{d,\epsilon}_{X_iX_i|X_S}) = \mathrm{span}\{\hat\eta_i^a : a = 1, \ldots, d\} = \mathrm{span}\{\mathcal B_i\},$$
$$\mathrm{ker}(\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}) \supseteq \mathrm{ker}(\hat\Sigma^{d,\epsilon}_{X_jX_j|X_S}) \supseteq \big[\mathrm{span}\{\hat\eta_j^a : a = 1, \ldots, d\}\big]^{\perp} = \big[\mathrm{span}\{\mathcal B_j\}\big]^{\perp}.$$
This implies that the operator norm of $\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}$ is the same as the largest singular value of the coordinate representation of $\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}$ with respect to $\mathcal B_j$ and $\mathcal B_i$. Therefore, by Proposition 3, $\|\hat R^{d,\epsilon,\delta}_{X_iX_j|X_S}\|$ can be computed via the optimization:
$$\text{maximize: } [f_i]^{\mathsf T}A_{i,S}^{\mathsf T}A_{j,S}[f_j], \quad \text{subject to: } [f_i]^{\mathsf T}(A_{i,S}^{\mathsf T}A_{i,S} + \delta I)[f_i] = [f_j]^{\mathsf T}(A_{j,S}^{\mathsf T}A_{j,S} + \delta I)[f_j] = 1, \ [f_i] \in \mathbb R^d, \ [f_j] \in \mathbb R^d,$$
where $A_{i,S} = [I_n - D(S)]^{1/2}C_i^{1:d}$. By the Cauchy-Schwarz inequality, we have
$$[f_i]^{\mathsf T}A_{i,S}^{\mathsf T}A_{j,S}[f_j] \leq \big([f_i]^{\mathsf T}A_{i,S}^{\mathsf T}A_{i,S}[f_i]\big)^{1/2}\big([f_j]^{\mathsf T}A_{j,S}^{\mathsf T}A_{j,S}[f_j]\big)^{1/2},$$
which is no greater than $\{[f_i]^{\mathsf T}(A_{i,S}^{\mathsf T}A_{i,S} + \delta I)[f_i]\}^{1/2}\{[f_j]^{\mathsf T}(A_{j,S}^{\mathsf T}A_{j,S} + \delta I)[f_j]\}^{1/2} = 1$. This completes the proof of Proposition 5. □

Proof of Theorem 5: First, we show (a) $\Rightarrow$ (b). We pick a permutation $\pi$ such that, for any $i = 2, \ldots, p$, $\mathrm{pa}(i) \subseteq \pi([i-1])$. For convenience, we relabel $\pi(1), \ldots, \pi(p)$ as $1, \ldots, p$. By Lauritzen et al. (1990, Corollary 2), the global Markov property is equivalent to $X_i \perp\!\!\!\perp X_{[i-1]} \mid X_{\mathrm{pa}(i)}$, which means that the conditional distributions of $X_i \mid X_{[i-1]}$ and $X_i \mid X_{\mathrm{pa}(i)}$ are identical. Moreover, following a proof similar to that of Proposition 1, we can show that
$$X_i \mid X_{[i-1]} \sim N\big(M^*_{X_{[i-1]}X_i}X_{[i-1]},\ \Sigma_{X_iX_i|X_{[i-1]}}\big), \quad X_i \mid X_{\mathrm{pa}(i)} \sim N\big(M^*_{X_{\mathrm{pa}(i)}X_i}X_{\mathrm{pa}(i)},\ \Sigma_{X_iX_i|X_{\mathrm{pa}(i)}}\big).$$
Since the above two Gaussian distributions are the same, we have $(M_{X_{[i-1]}X_i})_j = 0$ for $j \notin \mathrm{pa}(i)$. Therefore, by Lemma 1, $X$ satisfies the functional linear structural equation model with respect to $G$. Next, we show (b) $\Rightarrow$ (a). This holds because the global Markov property is implied by the local Markov property, which, under the Gaussian distribution, is implied by (b). This completes the proof of Theorem 5. □
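The optimization in the proof of Proposition 5 can be solved in closed form after whitening by the constraint matrices: the optimal value is the largest singular value of a $d \times d$ matrix, and the Cauchy-Schwarz argument above guarantees that it is at most 1. The matrices below are random placeholders for $A_{i,S}$ and $A_{j,S}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, delta = 100, 5, 1e-2
Ai, Aj = rng.standard_normal((n, d)), rng.standard_normal((n, d))

def whiten(A, delta):
    """Inverse square root of A^T A + delta I."""
    lam, U = np.linalg.eigh(A.T @ A + delta * np.eye(A.shape[1]))
    return U @ np.diag(lam ** -0.5) @ U.T

Wi, Wj = whiten(Ai, delta), whiten(Aj, delta)
M = Wi @ Ai.T @ Aj @ Wj                             # whitened cross-product
sigma_max = np.linalg.svd(M, compute_uv=False)[0]   # value of the program
print("largest singular value (<= 1 by Cauchy-Schwarz):", sigma_max)
```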
A.5 Additional numerical results

Additional combinations of (p, q, n): We carry out an additional simulation study for Model I with more combinations of $(p, q, n)$. In our simulation, the true DAG is generated as a random graph, whose level of sparsity is controlled by the expected neighborhood size $q$. Meanwhile, as $q$ increases, we expect the maximal degree $m$ to increase as well. Table 2 reports the average values of $m$ for different combinations of $(p, q, n)$. This new study thus allows us to further examine the empirical performance of the proposed PCO method with denser graphs. Moreover, it offers new insight into the condition that Theorem 3 requires of $m$, namely
$$md^{3+\gamma}(\log p)^{1/2}/(n^{1/2}\epsilon) \leq 1. \tag{28}$$
This condition implies that $m$ can grow at most at a polynomial rate, and thus in turn imposes a level of sparsity on the graph. We generate $(p, q, n)$ following the relations
$$q(n) = c_1 + c_2 n, \quad \log_{c_3}[p(n)] = c_4 n^{c_5},$$
with $c_1 = 0.77$, $c_2 = 0.003$, $c_3 = 6$, $c_4 = 0.09$, and $c_5 = 0.7$. As such, $p$ grows at an exponential order of $n$, and $q$ grows at a polynomial order of $n$. The resulting combinations are $n = \{20, 40, 60, 80, 100, 120\}$, $p = \{4, 8, 17, 32, 57, 100\}$, and $q = \{0.82, 0.88, 0.93, 0.99, 1.05, 1.10\}$.

    n     p     q      ave. m   TDR     TPR     FM
    20    4     1.18   2.150    0.956   0.139   0.195
    40    8     1.26   3.237    0.742   0.399   0.505
    60    17    1.34   4.400    0.730   0.498   0.586
    80    32    1.42   5.150    0.687   0.530   0.594
    100   57    1.50   5.937    0.689   0.568   0.619
    120   100   1.58   6.600    0.684   0.574   0.622

Table 2: The true discovery rate (TDR), true positive rate (TPR), and F-measure (FM) between the estimated and true CPDAG, as p grows at an exponential order of n, and q grows at a polynomial order of n.

For the evaluation criteria, in addition to the structure Hamming distance (SHD) and the true discovery rate (TDR), we employ two additional criteria, the true positive rate (TPR) and the F-measure (FM), which are defined as
$$\mathrm{TPR}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}}) = |\{(i,j) \in \hat E_{\mathrm{CPDAG}} : (i,j) \in E_{\mathrm{CPDAG}}\}|/|E_{\mathrm{CPDAG}}|,$$
$$\mathrm{FM}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}}) = \frac{2\,\mathrm{TDR}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}})\,\mathrm{TPR}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}})}{\mathrm{TDR}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}}) + \mathrm{TPR}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}})},$$
where $\mathrm{FM}(\hat E_{\mathrm{CPDAG}}, E_{\mathrm{CPDAG}})$ is the harmonic mean of TDR and TPR. A motivation for adding the new criteria is that an estimator may sometimes have a high true discovery rate but a low TPR (Rijsbergen, 1979). A higher TPR and a higher FM indicate a more accurate estimator.

Figures 5 to 8 report the box plots of the SHD, TDR, TPR, and FM, respectively, between the true CPDAG and the PCO estimate across 80 data replications. We see that, as the sample size increases, the SHD decreases, and both TPR and FM increase with a stabilizing TDR. This shows that our method performs well for these new combinations, and, in particular, for the larger values of $m$, as long as the sample size is reasonably large. Table 2 reports the average TDR, TPR, and FM for the six combinations of $(p, q, n)$, along with the average $m$. We see that, with increasing sample size, while TDR decreases slightly, TPR and FM both increase substantially. This indicates that the overall performance of our PCO method improves as $n$ increases. It also provides additional support for our theoretical consistency and for why we need condition (28).

[Figure 5: The structure Hamming distance (SHD) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q. Box plots over panels q = 0.82, 0.88, 0.93, 0.99, 1.05, 1.10 and p = 4, 8, 17, 32, 57, 100.]

[Figure 6: The true discovery rate (TDR) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.]

[Figure 7: The true positive rate (TPR) between the estimated and true CPDAG for combinations of the sample size n, graph size p, and sparsity rate q.]
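The criteria above reduce to a few set operations on edge lists; a minimal sketch follows, with hypothetical edge-set inputs in the usage line.

```python
# TDR, TPR, and their harmonic mean FM, from estimated and true CPDAG edges.
def cpdag_metrics(est_edges, true_edges):
    est, true = set(est_edges), set(true_edges)
    hits = len(est & true)
    tdr = hits / len(est) if est else 0.0    # fraction of estimated edges that are true
    tpr = hits / len(true) if true else 0.0  # fraction of true edges recovered
    fm = 2 * tdr * tpr / (tdr + tpr) if tdr + tpr > 0 else 0.0
    return tdr, tpr, fm

# usage:
print(cpdag_metrics([(0, 1), (1, 2), (2, 3)], [(0, 1), (1, 2), (3, 4)]))
# -> (0.666..., 0.666..., 0.666...)
```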
Undirected screening: We also investigate the performance of our method when coupled with an undirected graph estimation as a starting point. In the classical setting of random variables, the undirected graph is a supergraph of the skeleton $E_{\mathrm{SKE}}$ of the DAG $G$, and it is easy to show that this statement continues to hold in the setting of random functions. Specifically, we employ the method of Qiao et al. (2019) to estimate an undirected graph, then feed it as an initial graph into our proposed PCO-based PC-algorithm. We have chosen Qiao et al. (2019) over Li and Solea (2018), since both our method and Qiao et al. (2019) consider the Gaussian distribution. There is a penalty constant in Qiao et al. (2019) that determines the level of sparsity of the estimated undirected graph. We have experimented with a range of penalty values, resulting in different percentages of selected edges among all possible edges, from 100% down to 4%, for the initial estimation. When the percentage is 100%, there is no penalty in the undirected graph estimation, or effectively, no pre-screening for our method. Table 3 reports the average SHD and TDR and the standard errors (in parentheses) based on 80 data replications for Model I with $(n, p, q) = (50, 50, 0.7)$. We see that, as the percentage of pre-selected edges decreases to 10%, the performance of the combined algorithm improves. When this percentage drops below 10%, the performance begins to decline. This example shows the potential advantage of coupling an initial undirected graph estimation with our proposed DAG estimation method.
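A sketch of how such a pre-screen plugs into a PC-style skeleton search: the estimated undirected graph supplies the initial candidate edge set, and the search only tests conditional independences among surviving edges. Both the screening output and the conditional independence test are abstracted as inputs here; this is our own schematic of the combined procedure, not the paper's algorithm.

```python
from itertools import combinations

def pc_skeleton(p, ci_test, init_edges=None, max_cond=3):
    """PC-style skeleton search restricted to an initial candidate edge set."""
    edges = ({frozenset(e) for e in init_edges} if init_edges is not None
             else {frozenset(e) for e in combinations(range(p), 2)})
    for k in range(max_cond + 1):              # size of the conditioning set
        for e in list(edges):                  # snapshot; edges shrink below
            i, j = sorted(e)
            # neighbors of i in the current graph, excluding j
            adj = {v for f in edges if i in f for v in f} - {i, j}
            for S in combinations(sorted(adj), k):
                if ci_test(i, j, list(S)):     # e.g., small PCO norm => CI
                    edges.discard(e)
                    break
    return edges

# usage: skeleton = pc_skeleton(p, pco_based_ci_test, init_edges=screened_edges)
```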
Effect of the kernel function: In our simulations in Section 6.1, we employ the Brownian motion covariance function (BMC) kernel, $\kappa_T(s, t) = \min(s, t)$. To investigate whether our method is sensitive to the choice of kernel, here we use the radial basis function (RBF) kernel, $\kappa_T(s, t) = \exp\{-\gamma_G(s - t)^2\}$, where the bandwidth parameter is computed as
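For reference, the two kernels can be compared directly through their Gram matrices on a grid of time points. The median-heuristic bandwidth below is an assumed illustrative choice for $\gamma_G$, not the paper's rule.

```python
import numpy as np

def bmc_gram(t):
    """Brownian motion covariance kernel min(s, t) on a grid."""
    return np.minimum.outer(t, t)

def rbf_gram(t, gamma=None):
    """RBF kernel exp{-gamma (s - t)^2}; assumed median-heuristic bandwidth."""
    sq = (t[:, None] - t[None, :]) ** 2
    if gamma is None:
        gamma = 1.0 / np.median(sq[sq > 0])
    return np.exp(-gamma * sq)

t = np.linspace(0, 1, 6)
print(np.round(bmc_gram(t), 2))
print(np.round(rbf_gram(t), 2))
```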