# Direct Estimation of Differential Functional Graphical Models

Boxin Zhao, Department of Statistics, The University of Chicago, Chicago, IL 60637. boxinz@uchicago.edu
Y. Samuel Wang, Booth School of Business, The University of Chicago, Chicago, IL 60637. swang24@uchicago.edu
Mladen Kolar, Booth School of Business, The University of Chicago, Chicago, IL 60637. mkolar@chicagobooth.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We consider the problem of estimating the difference between two functional undirected graphical models with shared structures. In many applications, data are naturally regarded as high-dimensional random function vectors rather than multivariate scalars. For example, electroencephalography (EEG) data are more appropriately treated as functions of time. In these problems, not only can the number of functions measured per sample be large, but each function is itself an infinite-dimensional object, making estimation of model parameters challenging. We develop a method that directly estimates the difference of graphs, avoiding separate estimation of each graph, and show that it is consistent in certain high-dimensional settings. We illustrate finite-sample properties of our method through simulation studies. Finally, we apply our method to EEG data to uncover differences in functional brain connectivity between alcoholic and control subjects.

1 Introduction

Undirected graphical models are widely used to compactly represent pairwise conditional independence in complex systems. Let $G = \{V, E\}$ denote an undirected graph, where $V$ is the set of vertices with $|V| = p$ and $E \subseteq V^2$ is the set of edges. For a random vector $X = (X_1, \ldots, X_p)^\top$, we say that $X$ satisfies the pairwise Markov property with respect to $G$ if $X_v \perp X_w \mid \{X_u\}_{u \in V \setminus \{v,w\}}$ implies $\{v,w\} \notin E$. When $X$ follows a multivariate Gaussian distribution with covariance $\Sigma = \Theta^{-1}$, then $\Theta_{vw} = 0$ implies $\{v,w\} \notin E$. Thus, recovering the structure of the undirected graph is equivalent to estimating the support of the precision matrix $\Theta$ [10, 13, 4, 24, 25].

We consider a setting where we observe two samples, $X$ and $Y$, from (possibly) different distributions, and the primary object of interest is the difference between the conditional dependencies of the two populations, rather than the conditional dependencies within each population. For example, in Section 4.3 we analyze neuroscience data sampled from a control group and a group of alcoholics, and seek to understand how the functional brain connectivity patterns of the alcoholic group differ from those of the control group. Thus, in this paper, the object of interest is the differential graph, $G_\Delta = \{V, E_\Delta\}$, defined through the difference between the precision matrix of $X$, $\Theta^X$, and the precision matrix of $Y$, $\Theta^Y$: $\Delta = \Theta^X - \Theta^Y$. When $\Delta_{vw} \neq 0$, then $\{v,w\} \in E_\Delta$. This type of differential model has been adopted in [30, 22, 3].

In this paper, we are interested in estimating the differential graph in a more complicated setting: instead of observing vector-valued data, we assume the data are random vector-valued functions (see [5] for a detailed exposition of random functions). Indeed, we aim to estimate the difference between two functional graphical models, and the method we propose combines ideas from graphical models for functional data and direct estimation of differential graphs.
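To make the differential graph concrete, here is a toy sketch (ours, not from the paper; the matrices are purely illustrative) that reads the differential edge set off the support of $\Delta = \Theta^X - \Theta^Y$. Note that $\Delta$ can be sparse even when both individual graphs are dense.

```python
import numpy as np

# Two illustrative 4-node Gaussian precision matrices; only the
# conditional dependence between nodes 1 and 2 differs across groups.
theta_x = np.array([[2.0, 0.6, 0.0, 0.3],
                    [0.6, 2.0, 0.5, 0.0],
                    [0.0, 0.5, 2.0, 0.4],
                    [0.3, 0.0, 0.4, 2.0]])
theta_y = theta_x.copy()
theta_y[0, 1] = theta_y[1, 0] = 0.0  # remove one conditional dependency

delta = theta_x - theta_y
# Differential edges = upper-triangular support of Delta.
print(np.argwhere(np.triu(np.abs(delta) > 1e-10, k=1)))  # [[0 1]]
```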
Multivariate observations measured across time can be modeled as arising from distinct, but similar, distributions [9]. However, in some cases it may be more natural to assume the data are measurements of an underlying continuous process [31, 18, 11, 28]. [31, 18] treat the data as curves distributed according to a multivariate Gaussian process (MGP); [31] shows that Markov properties hold for Gaussian processes, while [18] shows how to consistently estimate the underlying conditional independencies. We adopt the functional data point of view and assume the data are curves distributed according to an MGP. However, we consider two samples from distinct populations, with the primary goal of characterizing the difference between the conditional cross-covariance functions of the two populations.

Naively, one could apply the procedure of [18] to each sample and then directly compare the resulting estimated conditional independence structures. However, this approach would require sparsity in both of the underlying conditional independence graphs and would preclude many practical cases; e.g., neither graph could contain hub nodes with large degree. We develop a novel procedure that directly learns the difference between the conditional independence structures underlying two MGPs. Under the assumption that the difference is sparse, we can consistently learn the structure of the differential graph, even in settings where the individual graphs are dense and separate estimation would suffer.

Our paper builds on the recent literature on graphical models for vector-valued data, which suggests that direct estimation of the difference between the parameters of the underlying distributions may yield better results. [12] considers data arising from pairwise interaction exponential families and proposes the Kullback-Leibler Importance Estimation Procedure (KLIEP) to explicitly estimate the ratio of densities. [21] uses KLIEP as a first step to directly estimate the difference between two directed graphs. Alternatively, [30, 26] consider two multivariate Gaussian samples and directly estimate the difference between the two precision matrices; when the difference is sparse, it can be consistently estimated even in the high-dimensional setting with dense underlying precision matrices. [22] extends this approach to Gaussian copula models.

The rest of the paper is organized as follows. In Section 2 we introduce our method for Functional Differential Graph Estimation (FuDGE). In Section 3 we provide conditions under which FuDGE consistently recovers the true differential graph. Simulations and real data analysis are provided in Section 4; the code is available at https://github.com/boxinz17/FuDGE. Discussion is provided in Section 5. The appendix contains all technical proofs and additional simulation results.

We briefly introduce notation used throughout the rest of the paper. Let $|\cdot|_p$ denote the vector $p$-norm and $\|\cdot\|_p$ the matrix/operator $p$-norm. For example, for a $p \times p$ matrix $A$ with entries $a_{jk}$, $|A|_1 = \sum_{j,k} |a_{jk}|$, $\|A\|_1 = \max_k \sum_j |a_{jk}|$, $|A|_\infty = \max_{j,k} |a_{jk}|$, and $\|A\|_\infty = \max_j \sum_k |a_{jk}|$. Let $a_n \asymp b_n$ denote that $z_1 \leq \inf_n |a_n/b_n| \leq \sup_n |a_n/b_n| \leq z_2$ for some positive constants $z_1$ and $z_2$. Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the minimum and maximum eigenvalues of $A$, respectively. For a bivariate function $g(s,t)$, we define the Hilbert-Schmidt norm of $g$ (or equivalently, the norm of the integral operator it corresponds to) as $\|g\|_{HS}^2 = \int\!\int \{g(s,t)\}^2\, ds\, dt$.
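As a quick numerical illustration of the Hilbert-Schmidt norm just defined, the following sketch approximates $\|g\|_{HS}$ on a grid; the particular $g$ is an arbitrary example of ours, not from the paper.

```python
import numpy as np

# Grid over [0, 1]^2 and an illustrative bivariate function g(s, t).
s = t = np.linspace(0.0, 1.0, 201)
S, T = np.meshgrid(s, t, indexing="ij")
g = np.sin(np.pi * S) * np.cos(np.pi * T)

# ||g||_HS^2 = int int g(s, t)^2 ds dt, via an iterated trapezoidal rule.
hs_sq = np.trapz(np.trapz(g**2, t, axis=1), s)
print(np.sqrt(hs_sq))  # close to 0.5 for this g
```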
2 Methodology

2.1 Functional differential graphical model

Let $X_i(t) = (X_{i1}(t), \ldots, X_{ip}(t))^\top$, $i = 1, \ldots, n_X$, and $Y_i(t) = (Y_{i1}(t), \ldots, Y_{ip}(t))^\top$, $i = 1, \ldots, n_Y$, be iid $p$-dimensional multivariate Gaussian processes with mean zero and common domain $T$, drawn from two different, but connected, population distributions, where $T$ is a closed subset of the real line. Although both $X_i(t)$ and $Y_i(t)$ are indexed by $i$, they are not paired observations and are completely independent; we assume mean zero and a common domain $T$ to simplify notation, but the methodology and theory generalize to non-zero means and different time domains $T_X$ and $T_Y$ when fixing some bijection $T_X \mapsto T_Y$. Also, assume that for $j = 1, \ldots, p$, $X_{ij}(t)$ and $Y_{ij}(t)$ are random elements of a separable Hilbert space $\mathbb{H}$. For brevity, we generally only explicitly define notation for $X_i(t)$; all notation for $Y_i(t)$ is defined analogously.

Following [18], we define the conditional cross-covariance function for $X_i(t)$ as

$C^X_{jl}(s,t) = \mathrm{Cov}\big(X_{ij}(s), X_{il}(t) \mid \{X_{ik}(\cdot)\}_{k \neq j,l}\big).$ (2.1)

If $C^X_{jl}(s,t) = 0$ for all $s, t \in T$, then the random functions $X_{ij}(t)$ and $X_{il}(t)$ are conditionally independent given the other random functions. The graph $G_X = \{V, E_X\}$ represents the pairwise Markov properties of $X_i(t)$ if

$E_X = \{(j,l) \in V^2 : j \neq l \text{ and } \exists\, (s,t) \in T^2 \text{ such that } C^X_{jl}(s,t) \neq 0\}.$ (2.2)

In this paper, the object of interest is $C^\Delta(s,t)$, where $C^\Delta_{jl}(s,t) = C^X_{jl}(s,t) - C^Y_{jl}(s,t)$. We define the differential graph to be $G_\Delta = \{V, E_\Delta\}$, where

$E_\Delta = \{(j,l) \in V^2 : j \neq l \text{ and } \|C^\Delta_{jl}\|_{HS} \neq 0\}.$ (2.3)

That is, we include an edge between $j$ and $l$ if the conditional dependence between $X_{ij}(t)$ and $X_{il}(t)$ given all the other curves differs from that of $Y_{ij}(t)$ and $Y_{il}(t)$ given all the other curves.

2.2 Functional principal component analysis

Since $X_i(t)$ and $Y_i(t)$ are infinite-dimensional objects, for practical estimation we reduce the dimensionality using functional principal component analysis (FPCA). Similar to the way principal component analysis provides an $L^2$-optimal lower-dimensional representation of vector-valued data, FPCA provides an $L^2$-optimal finite-dimensional representation of functional data. As in [18], for simplicity of exposition, we assume that we fully observe the functions $X_i(t)$ and $Y_i(t)$. However, FPCA can also be applied to both densely and sparsely observed functional data, as well as data containing measurement errors; such an extension is straightforward, cf. [23] and [20] for a recent overview.

Let $K^X_{jj}(s,t) = \mathrm{Cov}(X_{ij}(s), X_{ij}(t))$ denote the covariance function for $X_{ij}$. Then there exist orthonormal eigenfunctions and eigenvalues $\{\phi^X_{jk}(t), \lambda^X_{jk}\}_{k \in \mathbb{N}}$ such that, for all $k \in \mathbb{N}$ [5],

$\int_T K^X_{jj}(s,t)\, \phi^X_{jk}(t)\, dt = \lambda^X_{jk}\, \phi^X_{jk}(s).$ (2.4)

Without loss of generality, assume $\lambda^X_{j1} \geq \lambda^X_{j2} \geq \cdots \geq 0$. By the Karhunen-Loève expansion [5, Theorem 7.3.5], $X_{ij}(t)$ can be expressed as $X_{ij}(t) = \sum_{k=1}^{\infty} a^X_{ijk}\, \phi^X_{jk}(t)$, where the principal component scores satisfy $a^X_{ijk} = \int_T X_{ij}(t)\, \phi^X_{jk}(t)\, dt$ and $a^X_{ijk} \sim N(0, \lambda^X_{jk})$, with $E(a^X_{ijk} a^X_{ijl}) = 0$ if $k \neq l$. Because the eigenfunctions are orthonormal, the $L^2$ projection of $X_{ij}$ onto the span of the first $M$ eigenfunctions is

$X^M_{ij}(t) = \sum_{k=1}^{M} a^X_{ijk}\, \phi^X_{jk}(t).$ (2.5)

FPCA constructs estimators $\hat\phi^X_{jk}(t)$ and $\hat a^X_{ijk}$ through the following procedure. First, we form an empirical estimate of the covariance function,

$\hat K^X_{jj}(s,t) = \frac{1}{n_X} \sum_{i=1}^{n_X} \big(X_{ij}(s) - \bar X_j(s)\big)\big(X_{ij}(t) - \bar X_j(t)\big), \quad \text{where } \bar X_j(t) = n_X^{-1} \sum_{i=1}^{n_X} X_{ij}(t).$

An eigendecomposition of $\hat K^X_{jj}(s,t)$ then directly provides the estimates $\hat\lambda^X_{jk}$ and $\hat\phi^X_{jk}$, which allow for computation of $\hat a^X_{ijk} = \int_T X_{ij}(t)\, \hat\phi^X_{jk}(t)\, dt$.
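For densely observed curves on a common, equally spaced grid, the eigenproblem (2.4) and the score integrals can be approximated by discretizing the empirical covariance. Below is a minimal sketch under that assumption; the function name and interface are ours, and practical FPCA implementations (smoothing, sparse designs) are more involved.

```python
import numpy as np

def fpca_scores(curves, times, M):
    """Estimate the first M eigenfunctions and FPCA scores for one variable.

    curves: (n, T) array holding X_ij(t_1), ..., X_ij(t_T) for each sample i;
    times: (T,) equally spaced grid. Discretizes eq. (2.4) and the score
    integral a_ijk = integral of X_ij(t) * phi_jk(t) dt.
    """
    n, _ = curves.shape
    dt = times[1] - times[0]
    centered = curves - curves.mean(axis=0)
    K = centered.T @ centered / n            # estimated K_jj(s, t) on the grid
    # Scaling by dt turns the matrix eigenproblem into a discretization of
    # the integral operator, so eigenfunctions come out L2-orthonormal.
    evals, evecs = np.linalg.eigh(K * dt)
    order = np.argsort(evals)[::-1][:M]
    phi = evecs[:, order] / np.sqrt(dt)      # estimated phi_jk on the grid
    scores = centered @ phi * dt             # estimated scores a_ijk
    return scores, phi
```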
, a X ij M)T RM and a X,M i = ((a X,M i1 )T , . . . , (a X,M ip )T )T Rp M with corresponding estimates ˆa X,M ij and ˆa X,M i . Since XM i (t) are p-dimensional MGP, a X,M i will have a multivariate Gaussian distribution with p M p M covariance matrix which we denote as ΣX,M = (ΘX,M) 1. In practice, M can be selected by cross validation as in [18]. For (j, l) V 2, let ΘX,M jl be the M M matrix corresponding to (j, l)th submatrix of ΘX,M. Let M = ΘX,M ΘY,M be the difference between the precision matrices of the first M principal component scores where M jl denotes the (j, l)th submatrix of M. In addition, let E M := {(j, l) V 2 : j = l and M jl F = 0}, (2.6) denote the set of non-zero blocks of the difference matrix M. In general E M = E ; however, we will see that for certain M, by constructing a suitable estimator of M we can still recover E . 2.3 Functional differential graph estimation We now describe our method, Fu DGE, for functional differential graph estimation. Let SX,M and SY,M denote the sample covariances of ˆa X,M i and ˆa Y,M i . To estimate M, we solve the following problem with the group lasso penalty, which promotes blockwise sparsity in ˆ M [27]: ˆ M arg min Rp M p M L( ) + λn X {i,j} V 2 ij F , (2.7) where L( ) = tr 1 2SY,M T SX,M T SY,M SX,M . Note that although the true M is symmetric, we do not enforce symmetry in ˆ M. The design of the loss function L( ) in equation (2.7) is based on [15], where in order to construct a consistent M-estimator, we want the true parameter value M to minimize the population loss E [L( )]. For a differentiable and convex loss function, this is equivalent to selecting L such that E L( M) = 0. Since M = ΣX,M 1 ΣY,M 1, it satisfies ΣX,M MΣY,M (ΣY,M ΣX,M) = 0. By this observation, a choice for L( ) is L( M) = SX,M MSY,M SY,M SX,M , (2.8) for which E L( M) = ΣX,M MΣY,M (ΣY,M ΣX,M) = 0. Using properties of the differential of the trace function, this choice of L( ) yields L( ) in (2.7). The chosen loss is quadratic (see (B.10) in supplement) and leads to an efficient algorithm. Such loss has been used in [22, 26, 14] and [30]. Finally, to form ˆE , we threshold ˆ M by ϵn > 0 so that: ˆE = {(j, l) V 2 : j = l and ˆ M jl F > ϵn or ˆ M lj F > ϵn}. (2.9) 2.4 Optimization algorithm for Fu DGE Algorithm 1 Functional differential graph estimation Input: SX,M, SY,M, λn, η. Output: ˆ M. Initialize (0) = 0p M. repeat A = η L( ) = η h S(M) X S(M) Y (S(M) Y S(M) X ) i for 1 i, j p do jl Ajl F λnη + Ajl end for until Converge The optimization problem (2.7) can be solved by a proximal gradient method [17], summarized in Algorithm 1. Specifically, in each iteration step, we update the current value of , denoted as old, by solving the following problem: new = arg min old η L old 2 F + η λn where L( ) is defined in (2.8) and η is a user specified step size. Note that L( ) is Lipschitz continuous with the Lipschitz constant SY,M SX,M 2 = λmax(SY,M)λmax(SX,M). Thus, for any η such that 0 < η 1/λS max, the proximal gradient method is guaranteed to converge [1], where λS max = λmax(SY,M)λmax(SX,M) is the largest eigenvalue of SX,M SY,M. The update in (2.10) has a closed-form solution: new jl = Aold jl F λnη / Aold jl F + Aold jl , 1 j, l p, (2.11) where Aold = old η L( old) and x+ = max{0, x}, x R represents the positive part of x. Detailed derivations are given in the appendix. After performing FPCA, the proximal gradient descent method converges in O λS max/tol iterations, where tol is error tolerance, each iteration takes O((p M)3) operations. 
3 Theoretical properties

In this section, we present theoretical properties of the proposed method. Again, we state the assumptions explicitly for $X_i(t)$, but we also require the same conditions to hold for $Y_i(t)$.

Assumption 3.1. Recall that $\lambda^X_{jk}$ and $\phi^X_{jk}(t)$ are the eigenvalues and eigenfunctions of $K^X_{jj}(s,t)$, the covariance function for $X_{ij}(t)$, and that $\lambda^X_{jk} > \lambda^X_{jk'}$ for all $k' > k$.
(i) Assume $\max_{j \in V} \sum_{k=1}^{\infty} \lambda^X_{jk} < \infty$, and that there exists some constant $\beta_X > 1$ such that, for each $k \in \mathbb{N}$, $\lambda^X_{jk} \asymp k^{-\beta_X}$ and $d^X_{jk} \lambda^X_{jk} = O(k)$ uniformly in $j \in V$, where $d^X_{jk} = 2 \max\left\{\big(\lambda^X_{j(k-1)} - \lambda^X_{jk}\big)^{-1}, \big(\lambda^X_{jk} - \lambda^X_{j(k+1)}\big)^{-1}\right\}$.
(ii) Assume that for all $k \in \mathbb{N}$ the $\phi^X_{jk}(t)$ are continuous on the compact set $T$ and satisfy $\max_{j \in V} \sup_{s \in T} \sup_{k \geq 1} |\phi^X_{jk}(s)| = O(1)$.

The parameter $\beta_X$ controls the decay rate of the eigenvalues, and $d^X_{jk} \lambda^X_{jk} = O(k)$ controls the decay rate of the eigen-gaps (see [2] for more details).

To recover the exact functional differential graph structure, we need further assumptions on the difference operator $C^\Delta = \{C^X_{jl}(s,t) - C^Y_{jl}(s,t)\}_{j,l \in V}$. Let

$\nu = \nu(M) = \max_{(j,l) \in V^2} \left| \|C^\Delta_{jl}\|_{HS} - \|\Delta^M_{jl}\|_F \right|,$

and let $\tau = \min_{(j,l) \in E_\Delta} \|C^\Delta_{jl}\|_{HS}$, where $\tau > 0$ by the definition in (2.3). Roughly speaking, $\nu(M)$ measures the bias due to using an $M$-dimensional approximation, and $\tau$ measures the strength of the signal in the differential graph. A smaller $\tau$ implies that the graph is harder to recover, and in Theorem 3.1 we require the bias to be small compared to the signal.

Assumption 3.2. Assume that $\lim_{M \to \infty} \nu(M) = 0$.

We also require Assumption 3.3, which assumes sparsity of $E_\Delta$. Again, this does not preclude the case where $E_X$ and $E_Y$ are dense, as long as the difference between the two graphs is sparse. This assumption is common in the scalar setting; e.g., Condition 1 in [30].

Assumption 3.3. There are $s$ edges in the differential graph; i.e., $|E_\Delta| = s$.

Before we give conditions for recovering the differential graph with high probability, we introduce some additional notation. Let $n = \min\{n_X, n_Y\}$, $\sigma_{\max} = \max\{|\Sigma^{X,M}|_\infty, |\Sigma^{Y,M}|_\infty\}$, $\beta = \min\{\beta_X, \beta_Y\}$, and $\lambda^*_{\min} = \lambda_{\min}(\Sigma^{X,M})\, \lambda_{\min}(\Sigma^{Y,M})$. Given a positive constant $c_1$, denote

$\delta = \frac{1}{\sqrt{c_1}}\, M^{1+\beta} \sqrt{\frac{2(\log p + \log M + \log n)}{n}}$ (3.1)

and

$\Gamma = \frac{9 \lambda_n^2 s}{\kappa_L^2} + \frac{2 \lambda_n}{\kappa_L}\big(\omega_L^2 + 2 p^2 \nu\big),$ (3.2)

where $\lambda_n = 2M\big(\delta^2 + 2\delta\sigma_{\max}\big)\big(M^{-1} + 2\delta\big)$, $\kappa_L = \frac{1}{2}\lambda^*_{\min} - 8 M^2 s \big(\delta^2 + 2\delta\sigma_{\max}\big)$, and $\omega_L = 4 M p^2 \nu \sqrt{\delta^2 + 2\delta\sigma_{\max}}$. Note that $\Gamma$ implicitly depends on $n$ through $\lambda_n$, $\kappa_L$, $\omega_L$, and $\delta$.

Theorem 3.1. There exist positive constants $c_1$ and $c_2$ such that, for $n$ and $M$ large enough to simultaneously satisfy

$0 < \Gamma < \frac{1}{2}\tau - \nu(M)$ and $\delta < \min\left\{\frac{1}{4}\sqrt{\frac{\lambda^*_{\min} + 16 M^2 s\, \sigma_{\max}^2}{M^2 s}} - \sigma_{\max},\; c_1\right\},$

setting $\epsilon_n \in \big(\Gamma + \nu(M),\; \tau - (\Gamma + \nu(M))\big)$ ensures that $P\big(\hat E_\Delta = E_\Delta\big) \geq 1 - 2c_2/n^2$.

[18] assumed that, for some finite $M$ and all $j \in V$, $\lambda^X_{jm} = 0$ for all $m > M$. Under this assumption, $X_{ij}(t) = X^M_{ij}(t)$, and $E_X$ corresponds exactly to the set of $(j,l) \in V^2$ such that $\|\Theta^{X,M}_{jl}\|_F \neq 0$ [18, Lemma 1]. If the same eigenvalue condition holds for $Y_i(t)$, then in our setting $E_\Delta = E^M_\Delta$. When this holds and we can fix $M$, we obtain consistency even in the high-dimensional setting, since $\nu = 0$ and $\min\{s \log(pn)\, |\Delta^M|_1^2/n,\; s\sqrt{\log(pn)/n}\} \to 0$ implies consistent estimation. However, even with an infinite number of positive eigenvalues, high-dimensional consistency is still possible for quickly decaying $\nu$; e.g., if $\nu = o(p^{-2} M^{-1})$, the same rate is achievable as when $\nu(M) = 0$.

4 Experiments

4.1 Simulation study

In this section, we demonstrate properties of our method through simulations.
In each setting, we generate $n_X$ observations of the $p$ functional variables from graph $G_X$ via $X_{ij}(t) = b(t)^\top \delta^X_{ij}$, where $b(t)$ is a five-dimensional basis with disjoint supports over $[0,1]$:

$b_k(t) = \begin{cases} \cos\big(10\pi\,(t - (2k-1)/10)\big) + 1, & (k-1)/5 \leq t < k/5, \\ 0, & \text{otherwise}, \end{cases} \qquad k = 1, \ldots, 5.$

Here $\delta^X_i = \big((\delta^X_{i1})^\top, \ldots, (\delta^X_{ip})^\top\big)^\top \in \mathbb{R}^{5p}$ follows a multivariate Gaussian distribution with precision matrix $\Omega^X$, and $Y_{ij}(t)$ is generated in the same way with precision matrix $\Omega^Y$. We consider three models with different graph structures, and for each model, data are generated with $n_X = n_Y = 100$ and $p = 30, 60, 90, 120$. We repeat this 30 times for each $p$ and model setting.

Model 1: This model is similar to the setting considered in [30], but modified to the functional case. We generate the support of $\Omega^X$ according to a graph with $p(p-1)/10$ edges and a power-law degree distribution with expected power parameter 2. Although the graph is sparse, with only 20% of all possible edges present, the power-law structure mimics certain real-world graphs [16] by creating hub nodes with large degree. For each nonzero block, we set $\Omega^X_{jl} = \delta' I_5$, where $\delta'$ is sampled uniformly from $[0.2, 0.5]$. To ensure positive definiteness, we further scale each off-diagonal block by $1/2, 1/3, 1/4, 1/5$ for $p = 30, 60, 90, 120$, respectively. Each diagonal element of $\Omega^X$ is set to 1, and the matrix is symmetrized by averaging it with its transpose. To get $\Omega^Y$, we first select the largest hub nodes in $G_X$ (i.e., the nodes with largest degree), and for each hub node we select the top (by magnitude) 20% of its edges. For each selected edge, we set $\Omega^Y_{jl} = \Omega^X_{jl} + W$, where $W_{km} = 0$ for $|k - m| \geq 2$ and $W_{km} = c$ otherwise, with $c$ generated in the same way as $\delta'$. For all other blocks, $\Omega^Y_{jl} = \Omega^X_{jl}$.

Model 2: We first generate a block-banded matrix $\Omega'_X$ with $\Omega'_{X,jj} = I_5$, $\Omega'_{X,j,j+1} = \Omega'_{X,j+1,j} = 0.6\, I_5$, and $\Omega'_{X,j,j+2} = \Omega'_{X,j+2,j} = 0.4\, I_5$ for $j = 1, \ldots, p$; all other blocks are set to 0. We then set $\Omega'_{Y,j,j+3} = \Omega'_{Y,j+3,j} = W$ for $j = 1, 2, 3, 4$, and let $\Omega'_{Y,jl} = \Omega'_{X,jl}$ for all other blocks. Thus, we form $G_Y$ by adding four edges to $G_X$. We let $W_{km} = 0$ when $|k - m| \leq 1$ and $W_{km} = c$ otherwise, with $c = 1/10$ for $p = 30$, $c = 1/15$ for $p = 60$, $c = 1/20$ for $p = 90$, and $c = 1/25$ for $p = 120$. Finally, we set $\Omega^X = \Omega'_X + \delta I$ and $\Omega^Y = \Omega'_Y + \delta I$, where $\delta = \max\{|\min(\lambda_{\min}(\Omega'_X), 0)|, |\min(\lambda_{\min}(\Omega'_Y), 0)|\} + 0.05$.

Model 3: We generate $\Omega'_X$ according to an Erdős–Rényi graph. We first set $\Omega'_{X,jj} = I_5$. With probability 0.8 we set $\Omega'_{X,jl} = \Omega'_{X,lj} = 0.1\, I_5$, and we set it to 0 otherwise; thus, we expect 80% of all possible edges to be present. We then form $G_Y$ by randomly adding $s$ new edges to $G_X$, where $s = 3$ for $p = 30$, $s = 4$ for $p = 60$, $s = 5$ for $p = 90$, and $s = 6$ for $p = 120$. We set each corresponding block $\Omega'_{Y,jl} = W$, where $W_{km} = 0$ when $|k - m| \leq 1$ and $W_{km} = c$ otherwise. We let $c = 2/5$ for $p = 30$, $c = 4/15$ for $p = 60$, $c = 1/5$ for $p = 90$, and $c = 4/25$ for $p = 120$. Finally, we set $\Omega^X = \Omega'_X + \delta I$ and $\Omega^Y = \Omega'_Y + \delta I$, where $\delta > \max\{|\min(\lambda_{\min}(\Omega'_X), 0)|, |\min(\lambda_{\min}(\Omega'_Y), 0)|\} + 0.05$.

Although the theory assumes fully observed functional data, in order to mimic a realistic setting we use noisy observations at discrete time points, so that the actual data corresponding to $X_{ij}$ are $h^X_{ijk} = X_{ij}(t_k) + e_{ijk}$, $e_{ijk} \sim N(0, 0.5^2)$, for 200 evenly spaced time points $0 = t_1 \leq \cdots \leq t_{200} = 1$; the $h^Y_{ijk}$ are obtained in the same way.
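The data-generating mechanism above can be sketched as follows; here Omega is assumed to be built as in Models 1-3, basis(t) should return the 5-dimensional vector $b(t)$, and the function name and interface are ours.

```python
import numpy as np

def generate_sample(Omega, basis, times, n, noise_sd=0.5, seed=0):
    """Noisy discrete observations h_ijk = X_ij(t_k) + e_ijk,
    where X_ij(t) = b(t)^T delta_ij and delta_i ~ N(0, Omega^{-1}).
    """
    rng = np.random.default_rng(seed)
    p = Omega.shape[0] // 5
    Sigma = np.linalg.inv(Omega)                  # coefficient covariance
    delta = rng.multivariate_normal(np.zeros(5 * p), Sigma, size=n)
    delta = delta.reshape(n, p, 5)                # coefficient blocks delta_ij
    B = np.array([basis(t) for t in times])       # (T, 5) basis evaluations
    curves = np.einsum("ijk,tk->ijt", delta, B)   # X_ij(t_k), shape (n, p, T)
    return curves + noise_sd * rng.standard_normal(curves.shape)
```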
For each observation, we first estimate each curve by fitting an $L$-dimensional B-spline basis. We then use these estimated functions for FPCA and our direct estimation procedure. Both $M$ and $L$ are chosen by 5-fold cross-validation, as discussed in [18]. Since $\epsilon_n$ in (2.9) is usually very small in practice, we simply let $\hat E_\Delta = \{(j,l) \in V^2 : j \neq l \text{ and } \|\hat\Delta^M_{jl}\|_F + \|\hat\Delta^M_{lj}\|_F > 0\}$. We form a receiver operating characteristic (ROC) curve for recovery of $E_\Delta$ by varying the group lasso penalty $\lambda_n$ defined in (2.7).

We compare FuDGE to three competing methods. The first two separately estimate two functional graphical models using the fglasso of [18]. Specifically, we use fglasso to estimate $\hat\Theta^{X,M}$ and $\hat\Theta^{Y,M}$, and then set $\hat E_\Delta$ to be all edges $(j,l) \in V^2$ such that $\|\hat\Theta^{X,M}_{jl} - \hat\Theta^{Y,M}_{jl}\|_F > \zeta$. For each separate fglasso problem, the penalization parameter is selected by maximizing AIC in the first competing method and by maximizing BIC in the second; we define the degrees of freedom for both AIC and BIC to be the number of edges included in the graph times $M^2$. We form an ROC curve by varying $\zeta$.

The third competing method ignores the functional nature of the data. We select 15 equally spaced time points and implement a direct estimation method at each time point. Specifically, for each $t$, $X_i(t)$ and $Y_i(t)$ are simply $p$-dimensional random vectors, and we use their sample covariances in (2.7) to obtain a $p \times p$ matrix $\hat\Delta$. This produces 15 differential graphs, and we use a majority vote to form a single differential graph (a sketch of this vote step is given at the end of this subsection). The ROC curve is obtained by varying $\lambda_n$, the $L_1$ penalty used for all time points.

[Figure 1: Average ROC curves across 30 simulations. Columns correspond to models; rows correspond to dimensions.]

For each setting and method, the ROC curve averaged across the 30 replications is shown in Figure 1. We see that FuDGE clearly has the best overall performance in recovering the support of the differential graph. Among the competing methods, ignoring the functional structure and using a majority vote generally performs better than separately estimating two functional graphs. A table with the average area under the ROC curve is given in the appendix.
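As promised above, here is a minimal sketch of the majority-vote step used by the third competitor, assuming the per-time-point edge indicators have already been estimated; the function name is ours.

```python
import numpy as np

def majority_vote_graph(edge_indicators):
    """Combine differential graphs estimated at several time points.

    edge_indicators: boolean array of shape (T, p, p); entry [t, j, l]
    indicates that edge (j, l) was selected at time point t (T = 15 in
    the experiments). An edge enters the combined graph when it is
    selected at a strict majority of the time points.
    """
    edge_indicators = np.asarray(edge_indicators)
    return edge_indicators.sum(axis=0) > edge_indicators.shape[0] / 2
```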
4.2 An example where combining networks at discrete time points works better

By construction, the simulations presented in Section 4.1 estimate $E_\Delta$ defined in (2.3), which is not equivalent to

$E_\Delta(t) = \big\{(j,l) \in V^2 : j \neq l,\; |C^X_{jl}(t) - C^Y_{jl}(t)| \neq 0\big\},$ (4.1)

where

$C^X_{jl}(t) = \mathrm{Cov}\big(X_{ij}(t), X_{il}(t) \mid \{X_{ik}(t)\}_{k \neq j,l}\big)$ (4.2)

and $C^Y_{jl}(t)$ is defined similarly. However, when $E_\Delta(t) = E_\Delta$ for all $t$, the differential structure can be recovered by considering individual time points. Since considering time points individually requires estimating fewer parameters than the functional version, the multiple-networks strategy then performs better than FuDGE.

Here, data are generated with $n_X = n_Y = 100$ and $p = 30, 60, 90, 120$, and we repeat the simulation 30 times for each $p$. The model setting is similar to Model 2 in Section 4.1, with two major changes. First, when generating the functional variables, we use a 5-dimensional Fourier basis, so that all basis functions are supported on the entire interval rather than having disjoint supports as in Section 4.1. Second, we set the matrix $W$ to be diagonal. Specifically, we let $W_{kk} = c$ for $k = 1, \ldots, 5$ and $W_{km} = 0$ for $k \neq m$, where $c$ is drawn uniformly from $[0.6, 1]$ and scaled by $1/2$ for $p = 30$, $1/3$ for $p = 60$, and $1/4$ for $p = 90$. All other settings are the same.

[Figure 2: Average ROC curves across 30 simulations for the example where the multiple-networks strategy works better.]

The average ROC curves are shown in Figure 2, and the mean areas under the curves are shown in Table 2 in Section D.2 of the supplementary material. In Section 4.1 we considered extreme settings where the data must be treated as functions, and here we consider an extreme setting where the functional nature is irrelevant. In practice, the data may often lie between these two settings, and which method performs better should depend on how much the differential structure varies across time. Since this variation may be hard to measure in practice, treating the data as functional objects should be the more robust choice.

4.3 Neuroscience application

We apply our method to electroencephalogram (EEG) data obtained from an alcoholism study [29, 6, 18], which included 122 subjects: 77 in the alcoholic group and 45 in the control group. The EEG data were measured by placing $p = 64$ electrodes at various locations on each subject's scalp and recording voltage values across time. We follow the preprocessing procedure in [8, 31], which filters the EEG signals at the $\alpha$ frequency band between 8 and 12.5 Hz. [18] separately estimates functional graphs for both groups, whereas we directly estimate the differential graph using FuDGE. We choose $\lambda_n$ so that the estimated differential graph has approximately 1% of the possible edges.

[Figure 3: Estimated differential graph for the EEG data. The anterior region is at the top of the figure and the posterior region is at the bottom.]

The estimated edges of the differential graph are shown in Figure 3. We see that edges generally connect nodes located in the same region, either the anterior region or the posterior region, and no edge crosses between the two regions. This observation is consistent with the result in [18], where there are no connections between the frontal and back regions for either group. We also note that electrode CZ, lying in the central region, has high degree in the estimated differential graph. While there is no direct connection between the anterior and posterior regions, the central region may play a role in helping the two parts communicate.

5 Discussion

In this paper, we propose a method to directly estimate the differential graph for functional graphical models. In certain settings, direct estimation allows the differential graph to be recovered consistently, even if each underlying graph cannot be consistently recovered. Experiments on simulated data also show that preserving the functional nature of the data, rather than treating the data as multivariate scalars, can result in better estimation of the differential graph. A key step in the procedure is first representing the functions with an $M$-dimensional basis using FPCA; Assumption 3.2 ensures that there exists some $M$ large enough so that the signal, $\tau$, is larger than the bias, $\nu$, due to using a finite-dimensional representation. Intuitively, $\nu$ is tied to the eigenvalue decay rate; however, we defer derivation of the explicit connection to future work. Finally, we have provided a method for direct estimation of the differential graph, but the development of methods that allow for inference and hypothesis testing in functional differential graphs would be a fruitful avenue for future work. For example, [7] has developed inference tools for high-dimensional Markov networks; future work may extend those results to the functional graph setting.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci., 2:183–202, 2009.
[2] D. Bosq. Linear Processes in Function Spaces: Theory and Applications, volume 149 of Lecture Notes in Statistics. Springer-Verlag, New York, 2000.
[3] T. T. Cai. Global testing and large-scale multiple testing for high-dimensional covariance structures. Annual Review of Statistics and Its Application, 4(1):423–446, 2017.
[4] M. Drton and M. H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.
[5] T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester, 2015.
[6] L. Ingber. Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography. Physical Review E, 55(4):4578–4593, 1997.
[7] B. Kim, S. Liu, and M. Kolar. Two-sample inference for high-dimensional Markov networks. arXiv preprint arXiv:1905.00466, 2019.
[8] G. G. Knyazev. Motivation, emotion, and their inhibitory control mirrored in brain oscillations. Neuroscience & Biobehavioral Reviews, 31(3):377–395, 2007.
[9] M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Ann. Appl. Stat., 4(1):94–123, 2010.
[10] S. L. Lauritzen. Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press, New York, 1996.
[11] B. Li and E. Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. J. Amer. Statist. Assoc., 113(524):1637–1655, 2018.
[12] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Comput., 26(6):1169–1197, 2014.
[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.
[14] S. Na, M. Kolar, and O. Koyejo. Estimating differential latent variable graphical models with applications to brain connectivity. arXiv preprint arXiv:1909.05892, 2019.
[15] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538–557, 2012.
[16] M. E. J. Newman. The structure and function of complex networks. SIAM Rev., 45(2):167–256, 2003.
[17] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[18] X. Qiao, S. Guo, and G. M. James. Functional graphical models. J. Amer. Statist. Assoc., 114(525):211–222, 2019.
[19] R. Tibshirani. Proximal gradient descent and acceleration. Lecture notes, 2010.
[20] J.-L. Wang, J.-M. Chiou, and H.-G. Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3(1):257–295, 2016.
[21] Y. Wang, C. Squires, A. Belyaeva, and C. Uhler. Direct estimation of differences in causal graphs. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 3774–3785, 2018.
[22] P. Xu and Q. Gu. Semiparametric differential graph models. In Advances in Neural Information Processing Systems 29, pages 1064–1072. Curran Associates, Inc., 2016.
[23] F. Yao and T. C. M. Lee. Penalized spline models for functional principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):3–25, 2006.
[24] M. Yu, V. Gupta, and M. Kolar. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
[25] M. Yu, V. Gupta, and M. Kolar. Simultaneous inference for pairwise graphical models with generalized score matching. arXiv preprint arXiv:1905.06261, 2019.
[26] H. Yuan, R. Xi, C. Chen, and M. Deng. Differential network analysis via lasso penalized D-trace loss. Biometrika, 104(4):755–770, 2017.
[27] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B, 68:49–67, 2006.
[28] C. Zhang, H. Yan, S. Lee, and J. Shi. Dynamic multivariate functional data modeling via sparse subspace learning. arXiv preprint arXiv:1804.03797, 2018.
[29] X. L. Zhang, H. Begleiter, B. Porjesz, W. Wang, and A. Litke. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.
[30] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. Biometrika, 101(2):253–268, 2014.
[31] H. Zhu, N. Strawn, and D. B. Dunson. Bayesian graphical models for multivariate functional data. J. Mach. Learn. Res., 17(204):1–27, 2016.