# A Fractional Graph Laplacian Approach to Oversmoothing

Sohir Maskey (Department of Mathematics, LMU Munich, maskey@math.lmu.de), Raffaele Paolino (Department of Mathematics & MCML, LMU Munich, paolino@math.lmu.de), Aras Bacho (Department of Mathematics, LMU Munich), Gitta Kutyniok (Department of Mathematics & MCML, LMU Munich). Equal contribution. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Graph neural networks (GNNs) have shown state-of-the-art performance in various applications. However, GNNs often struggle to capture long-range dependencies in graphs due to oversmoothing. In this paper, we generalize the concept of oversmoothing from undirected to directed graphs. To this aim, we extend the notion of Dirichlet energy by considering a directed symmetrically normalized Laplacian. As vanilla graph convolutional networks are prone to oversmooth, we adopt a neural graph ODE framework. Specifically, we propose fractional graph Laplacian neural ODEs, which describe non-local dynamics. We prove that our approach allows propagating information between distant nodes while maintaining a low probability of long-distance jumps. Moreover, we show that our method is more flexible with respect to the convergence of the graph's Dirichlet energy, thereby mitigating oversmoothing. We conduct extensive experiments on synthetic and real-world graphs, both directed and undirected, demonstrating our method's versatility across diverse graph homophily levels. Our code is available on GitHub.

1 Introduction

Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2009; Bronstein et al., 2017) have emerged as a powerful class of machine learning models capable of effectively learning representations of structured data. GNNs have demonstrated state-of-the-art performance in a wide range of applications, including social network analysis (Monti et al., 2019), molecular property prediction (Gilmer et al., 2017), and recommendation systems (J. Wang et al., 2018; Fan et al., 2019). The majority of existing work on GNNs has focused on undirected graphs (Defferrard et al., 2016; Kipf et al., 2017; Hamilton et al., 2017), where edges have no inherent direction. However, many real-world systems, such as citation networks, transportation systems, and biological pathways, are inherently directed, necessitating the development of methods explicitly tailored to directed graphs. Despite their success, most existing GNN models struggle to capture long-range dependencies, which can be critical for specific tasks, such as node classification and link prediction, and for specific graphs, such as heterophilic graphs. This shortcoming also arises from the problem of oversmoothing, where increasing the depth of GNNs results in the node features converging to similar values that only convey information about the node's degree (Oono et al., 2019; Cai et al., 2020). Consequently, scaling the depth of GNNs is not sufficient to broaden receptive fields, and other approaches are necessary to address this limitation. While these issues have been extensively studied in undirected graphs (Q. Li et al., 2018; G. Li et al., 2019; Luan, M. Zhao, et al., 2019; D. Chen et al., 2020; Rusch et al., 2022), their implications for directed graphs remain largely unexplored. Investigating these challenges and developing effective solutions is crucial for applying GNNs to real-world scenarios.
Oversmoothing has been shown to be intimately related to the graph's Dirichlet energy, defined as
$$E(x) \;=\; \frac{1}{4}\sum_{i,j=1}^{N} a_{i,j}\left\|\frac{x_i}{\sqrt{d_i}} - \frac{x_j}{\sqrt{d_j}}\right\|_2^2\,,$$
where $A = (a_{i,j})_{i,j=1}^N$ represents the adjacency matrix of the underlying graph, $x \in \mathbb{R}^{N \times K}$ denotes the node features, and $d_i \in \mathbb{R}$ the degree of node $i$. Intuitively, the Dirichlet energy measures the smoothness of node features. Therefore, a GNN that minimizes the Dirichlet energy is expected to perform well on homophilic graphs, where similar nodes are likely to be connected. Conversely, a GNN that ensures high Dirichlet energy should lead to better performance on heterophilic graphs, for which the nodes' features are less smooth. This paper aims to bridge the gap in understanding oversmoothing for directed graphs. To this aim, we generalize the concept of Dirichlet energy, providing a rigorous foundation for analyzing oversmoothing. Specifically, we consider the directed symmetrically normalized Laplacian, which accommodates directed graph structures and recovers the usual definition in the undirected case. Even though the directed symmetrically normalized Laplacian has already been used (Zou et al., 2022), its theoretical properties remain widely unexplored. However, a vanilla graph convolutional network (GCN) (Kipf et al., 2017) implementing this directed Laplacian alone is not able to prevent oversmoothing. For this reason, we adopt a graph neural ODE framework, which has been shown to effectively alleviate oversmoothing in undirected graphs (Bodnar et al., 2022; Rusch et al., 2022; Di Giovanni et al., 2023).

1.1 Graph Neural ODEs

The concept of neural ODE was introduced by Haber et al. (2018) and R. T. Q. Chen et al. (2018), who first interpreted the layers in neural networks as the time variable in ODEs. Building on this foundation, Poli et al. (2021), Chamberlain et al. (2021), and Eliasof et al. (2021) extended the connection to the realm of GNNs, resulting in the development of graph neural ODEs. In this context, each node $i$ of the underlying graph is described by a state variable $x_i(t) \in \mathbb{R}^K$, representing the node $i$ at time $t$. We can define the dynamics of $x(t)$ via the node-wise ODE
$$x'(t) = f_w(x(t))\,, \qquad t \in [0, T]\,,$$
subject to the initial condition $x(0) = x_0 \in \mathbb{R}^{N \times K}$, where the function $f_w : \mathbb{R}^{N \times K} \to \mathbb{R}^{N \times K}$ is parametrized by the learnable parameters $w$. The graph neural ODE can be seen as a continuous learnable architecture on the underlying graph, which computes the final node representation $x(T)$ from the input node features $x_0$. Typical choices for $f_w$ include attention-based functions (Chamberlain et al., 2021), which generalize graph attention networks (GATs) (Velickovic et al., 2018), or convolutional-like functions (Di Giovanni et al., 2023) that generalize GCNs (Kipf et al., 2017). How can we choose the learnable function $f_w$ to accommodate both directed and undirected graphs, as well as different levels of homophily? We address this question in the following subsection.

1.2 Fractional Laplacians

The continuous fractional Laplacian, denoted by $(-\Delta)^\alpha$ for $\alpha > 0$, is used to model non-local interactions. For instance, the fractional heat equation $\partial_t u + (-\Delta)^\alpha u = 0$ provides a flexible and accurate framework for modeling anomalous diffusion processes. Similarly, the fractional diffusion-reaction, quasi-geostrophic, Cahn-Hilliard, porous medium, Schrödinger, and ultrasound equations are more sophisticated models to represent complex anomalous systems (Pozrikidis, 2018). Similarly to the continuous case, the fractional graph Laplacian (FGL) (Benzi et al., 2020) models non-local network dynamics.
In general, the FGL does not inherit the sparsity of the underlying graph, allowing a random walker to leap rather than walk solely between adjacent nodes. Hence, the FGL is able to build long-range connections, making it well-suited for heterophilic graphs.

1.3 Main Contributions

We present a novel approach to the fractional graph Laplacian by defining it in the singular value domain, instead of the frequency domain (Benzi et al., 2020). This formulation bypasses the need for computing the Jordan decomposition of the graph Laplacian, which lacks reliable numerical methods. We show that our version of the FGL can still capture long-range dependencies, and we prove that its entries remain reasonably bounded. We then propose two FGL-based neural ODEs: the fractional heat equation and the fractional Schrödinger equation. Importantly, we demonstrate that solutions to these FGL-based neural ODEs offer increased flexibility in terms of the convergence of the Dirichlet energy. Notably, the exponent of the fractional graph Laplacian becomes a learnable parameter, allowing our network to adaptively determine the optimal exponent for the given task and graph. We show that this can effectively alleviate oversmoothing in undirected and directed graphs. To validate the effectiveness of our approach, we conduct extensive experiments on synthetic and real-world graphs, with a specific focus on supervised node classification. Our experimental results indicate the advantages offered by fractional graph Laplacians, particularly in non-homophilic and directed graphs.

2 Preliminaries

We denote a graph as G = (V, E), where V is the set of nodes, E is the set of edges, and N = |V| is the number of nodes. The adjacency matrix $A := (a_{i,j})$ encodes the edge information, with $a_{i,j} = 1$ if there is an edge directed from node $j$ to $i$, and 0 otherwise. The in- and out-degree matrices are then defined as $D_{\mathrm{in}} = \mathrm{diag}(A\mathbf{1})$ and $D_{\mathrm{out}} = \mathrm{diag}(A^{T}\mathbf{1})$, respectively. The node feature matrix $x \in \mathbb{R}^{N \times K}$ contains for every node its feature in $\mathbb{R}^K$. Given any matrix $M \in \mathbb{C}^{n \times n}$, we denote its spectrum by $\lambda(M) := \{\lambda_i(M)\}_{i=1}^n$ in ascending order w.r.t. the real part, i.e., $\Re\lambda_1(M) \le \Re\lambda_2(M) \le \dots \le \Re\lambda_n(M)$. Furthermore, we denote by $\|M\|_2$ and $\|M\|$ the Frobenius and spectral norm of $M$, respectively. Lastly, we denote by $I_n$ the identity matrix, where we omit the dimension $n$ when it is clear from the context.

Homophily and Heterophily. Given a graph G = (V, E) with labels $y = \{y_i\}_{i \in V}$, the homophily of the graph indicates whether connected nodes are likely to have the same labels; formally,
$$\mathcal{H}(G) \;=\; \frac{1}{N}\sum_{i \in V} \frac{\bigl|\{\, j \in \{1, \dots, N\} : a_{i,j} = 1 \wedge y_i = y_j \,\}\bigr|}{\bigl|\{\, j \in \{1, \dots, N\} : a_{i,j} = 1 \,\}\bigr|}\,,$$
where the numerator represents the number of neighbors of node $i \in V$ that have the same label $y_i$ (Pei et al., 2019). We say that G is homophilic if $\mathcal{H}(G) \approx 1$ and heterophilic if $\mathcal{H}(G) \approx 0$.

3 Dirichlet Energy and Laplacian for (Directed) Graphs

In this section, we introduce the concept of Dirichlet energy and demonstrate its relationship to a directed Laplacian, thereby generalizing well-known results for undirected graphs.

Definition 3.1. The Dirichlet energy is defined on the node features $x \in \mathbb{R}^{N \times K}$ of a graph G as
$$E(x) \;:=\; \frac{1}{4}\sum_{i,j=1}^{N} a_{i,j}\left\|\frac{x_i}{\sqrt{d_i^{\mathrm{in}}}} - \frac{x_j}{\sqrt{d_j^{\mathrm{out}}}}\right\|_2^2\,. \qquad (1)$$

The Dirichlet energy measures how much the features change over the nodes of G, by quantifying the disparity between the normalized outflow of information from node $j$ and the normalized inflow of information to node $i$.

Definition 3.2. We define the symmetrically normalized adjacency (SNA) as $L := D_{\mathrm{in}}^{-1/2} A D_{\mathrm{out}}^{-1/2}$.
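For concreteness, the following minimal NumPy sketch builds the SNA of Definition 3.2 for a small directed graph and evaluates the Dirichlet energy of Definition 3.1 on random node features. The toy graph, array names, and helper function are our own illustration and not part of the paper's codebase.

```python
import numpy as np

# Toy directed graph on 4 nodes: a_{i,j} = 1 means an edge directed from node j to node i.
A = np.array([[0, 1, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

d_in = A.sum(axis=1)    # in-degrees  (row sums),    D_in  = diag(A 1)
d_out = A.sum(axis=0)   # out-degrees (column sums), D_out = diag(A^T 1)

# Symmetrically normalized adjacency L = D_in^{-1/2} A D_out^{-1/2} (Definition 3.2).
L = np.diag(d_in ** -0.5) @ A @ np.diag(d_out ** -0.5)

def dirichlet_energy(x):
    """Dirichlet energy of node features x (N x K), formula (1)."""
    inflow = x / np.sqrt(d_in)[:, None]    # x_i / sqrt(d_i^in)
    outflow = x / np.sqrt(d_out)[:, None]  # x_j / sqrt(d_j^out)
    sq = ((inflow[:, None, :] - outflow[None, :, :]) ** 2).sum(axis=-1)
    return 0.25 * (A * sq).sum()

x = np.random.default_rng(0).normal(size=(4, 2))
print(dirichlet_energy(x))
# Same value via the trace identity 0.5 * Re trace(x^H (I - L) x), cf. Proposition 3.6 below.
print(0.5 * np.trace(x.T @ (np.eye(4) - L) @ x))
# Eigenvalues of L lie in the closed unit disk, cf. Proposition 3.3 below.
print(np.abs(np.linalg.eigvals(L)).max() <= 1 + 1e-9)
```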
Figure 1: Spectrum $\lambda(L)$ of common directed real-world graphs. The Perron-Frobenius eigenvalue is $\lambda_{\mathrm{PF}} \approx 0.94$ for Chameleon, and $\lambda_{\mathrm{PF}} \approx 0.89$ for Squirrel.

Figure 2: Examples of non-weakly balanced (left), weakly balanced (center), and balanced (right) directed graphs. The Perron-Frobenius eigenvalue of the left graph is $\lambda_{\mathrm{PF}} \approx 0.97 \neq 1$, while for the middle and right graphs $\lambda_{\mathrm{PF}} = 1$.

Note that L is symmetric if and only if G is undirected; the term "symmetrically" refers to the both-sided normalization rather than the specific property of the matrix itself. It is well known that the SNA's spectrum of a connected undirected graph lies within $[-1, 1]$ (Chung, 1997). We extend this result to directed graphs, which generally exhibit complex-valued spectra.

Proposition 3.3. Let G be a directed graph with SNA L. For every $\lambda \in \lambda(L)$, it holds $|\lambda| \le 1$.

Proposition 3.3 provides an upper bound for the largest eigenvalue of any directed graph, irrespective of its size. However, many other spectral properties do not carry over easily from the undirected to the directed case. For example, the SNA may not possess a one-eigenvalue, even if the graph is strongly connected (see, e.g., Figures 1 to 2). The one-eigenvalue is of particular interest since its eigenvector $v$ corresponds to zero Dirichlet energy $E(v) = 0$. Therefore, studying when $1 \in \lambda(L)$ is crucial to understanding the behavior of the Dirichlet energy. We fully characterize the set of graphs for which $1 \in \lambda(L)$; this is the scope of the following definition.

Definition 3.4. A graph G = (V, E) is said to be balanced if $d_i^{\mathrm{in}} = d_i^{\mathrm{out}}$ for all $i \in \{1, \dots, N\}$, and weakly balanced if there exists $k \in \mathbb{R}^N$ such that $k \neq 0$ and
$$\sum_{j=1}^{N} \frac{a_{i,j}}{\sqrt{d_j^{\mathrm{out}}}}\, k_j \;-\; \sqrt{d_i^{\mathrm{in}}}\, k_i \;=\; 0\,, \qquad i \in \{1, \dots, N\}\,.$$

It is straightforward to see that a balanced graph is weakly balanced since one can choose $k_i = \sqrt{d_i^{\mathrm{in}}}$. Hence, all undirected graphs are also weakly balanced. However, as shown in Figure 2, the set of balanced graphs is a proper subset of the set of weakly balanced graphs.

Proposition 3.5. Let G be a directed graph with SNA L. Then, $1 \in \lambda(L)$ if and only if the graph is weakly balanced. Suppose the graph is strongly connected; then $-1 \in \lambda(L)$ if and only if the graph is weakly balanced with an even period.

Proposition 3.5 generalizes a well-known result for undirected graphs: $-1 \in \lambda(L)$ if and only if the graph is bipartite, i.e., has even period. The next result shows that the Dirichlet energy defined in (1) and the SNA are closely connected.

Proposition 3.6. For every $x \in \mathbb{C}^{N \times K}$, it holds $E(x) = \frac{1}{2}\Re\,\mathrm{trace}\bigl(x^{H}(I - L)x\bigr)$. Moreover, there exists $x \neq 0$ such that $E(x) = 0$ if and only if the graph is weakly balanced.

Proposition 3.6 generalizes the well-known result from the undirected (see, e.g., Cai et al., 2020, Definition 3.1) to the directed case. This result is an important tool for analyzing the evolution of the Dirichlet energy in graph neural networks.

Figure 3: Visual representation of long-range edges built by the fractional Laplacian. (a) Synthetic cycle graph: the values of the fractional Laplacian can also be negative. (b) Real-world graphs: we count the number of virtual edges built by the fractional Laplacian based on the distance d(i, j) in the original graph; the number of virtual edges increases as α decreases.

4 Fractional Graph Laplacians

We introduce the fractional graph Laplacian through the singular value decomposition (SVD).
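Anticipating the definition given below, the construction amounts to taking fractional powers of the singular values of the SNA. The following sketch (assuming the toy adjacency `A` and SNA `L` from the previous snippet; `alpha` and the thresholding used to count "virtual" edges are illustrative choices, not the paper's reference implementation) makes this concrete.

```python
import numpy as np

def fractional_graph_laplacian(L, alpha):
    """alpha-FGL via the SVD: L^alpha = U @ diag(sigma^alpha) @ V^H."""
    U, S, Vh = np.linalg.svd(L)   # L = U @ diag(S) @ Vh
    return U @ np.diag(S ** alpha) @ Vh

# Example with the SNA L built above.
L_alpha = fractional_graph_laplacian(L, alpha=0.5)

# The FGL is dense in general: nodes that are not adjacent in the original
# graph receive small but non-zero "virtual" edge weights.
virtual = (np.abs(L_alpha) > 1e-8) & (A == 0) & ~np.eye(A.shape[0], dtype=bool)
print(int(virtual.sum()), "virtual edges")
```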
This approach has two key advantages over the traditional definition (Pozrikidis, 2018; Benzi et al., 2020) in the spectral domain. First, it allows defining the fractional Laplacian based on any choice of graph Laplacian, including those with negative or complex spectrum such as the SNA. Secondly, the SVD is computationally more efficient and numerically more stable than the Jordan decomposition, which would be necessary if the fractional Laplacian was defined in the spectral domain. Consider a directed graph with SNA L and its SVD $L = U\Sigma V^{H}$, where $U, V \in \mathbb{C}^{N \times N}$ are unitary matrices and $\Sigma \in \mathbb{R}^{N \times N}$ is a diagonal matrix. Given $\alpha \in \mathbb{R}$, we define the α-fractional graph Laplacian[2] (α-FGL in short) as
$$L^{\alpha} := U\Sigma^{\alpha}V^{H}\,.$$
In undirected graphs, the α-FGL preserves the sign of the eigenvalues $\lambda$ of L while modifying their magnitudes, i.e., $\lambda \mapsto \mathrm{sign}(\lambda)\,|\lambda|^{\alpha}$.[3]

[2] We employ the term "fractional graph Laplacian" instead of "fractional symmetrically normalized adjacency" to highlight that, in the singular value domain, one can construct fractional powers of any matrix representation of the underlying graph (commonly referred to as "graph Laplacians"). This fact is not generally true in the eigenvalue domain because eigenvalues can be negative or complex numbers.

[3] In fact, this property can be generalized to all directed graphs with a normal SNA matrix, including, among others, directed cycle graphs. See Lemma C.1 for more details.

The α-FGL is generally less sparse than the original SNA, as it connects nodes that are not adjacent in the underlying graph. The next theorem proves that the weight of such virtual edges is bounded.

Theorem 4.1. Let G be a directed graph with SNA L. For $\alpha > 0$, if the distance d(i, j) between nodes i and j is at least 2, then
$$\bigl|(L^{\alpha})_{i,j}\bigr| \;\le\; \bigl(1 + \pi^{2}\bigr)\,\frac{\|L\|_{2}}{d(i,j) - 1}\,.$$

We provide a proof of Theorem 4.1 in Appendix C. In Figure 3a, we visually represent the cycle graph with eight nodes and the corresponding α-FGL entries. We also refer to Figure 3b, where we depict the distribution of α-FGL entries for the real-world graphs Cora (undirected) and Chameleon (directed) with respect to the distance in the original graph. Our empirical findings align with our theoretical results presented in Theorem 4.1.

5 Fractional Graph Laplacian Neural ODE

This section explores two fractional Laplacian-based graph neural ODEs. First, we consider the fractional heat equation,
$$x'(t) = -L^{\alpha}x(t)W\,, \qquad x(0) = x_0\,, \qquad (2)$$
where $x_0 \in \mathbb{R}^{N \times K}$ is the initial condition, $x(t) \in \mathbb{R}^{N \times K}$ for $t > 0$ and $\alpha \in \mathbb{R}$. We assume that the channel mixing matrix $W \in \mathbb{R}^{K \times K}$ is a symmetric matrix. Second, we consider the fractional Schrödinger equation,
$$x'(t) = i\,L^{\alpha}x(t)W\,, \qquad x(0) = x_0\,, \qquad (3)$$
where $x_0, x(t) \in \mathbb{C}^{N \times K}$ and $W \in \mathbb{C}^{K \times K}$ is unitarily diagonalizable. Both (2) and (3) can be analytically solved. For instance, the solution of (2) is given by $\mathrm{vec}(x)(t) = \exp(-t\,W \otimes L^{\alpha})\,\mathrm{vec}(x_0)$, where $\otimes$ denotes the Kronecker product and $\mathrm{vec}(\cdot)$ represents the vectorization operation. However, calculating the exact solution is computationally infeasible since the memory required to store $W \otimes L^{\alpha}$ alone grows as $(NK)^2$. Therefore, we rely on numerical schemes to solve (2) and (3). In the remainder of this section, we analyze the Dirichlet energy for solutions to (2) and (3). We begin with the definition of oversmoothing.

Definition 5.1. Neural ODE-based GNNs are said to oversmooth if the normalized Dirichlet energy decays exponentially fast. That is, for any initial value $x_0$, the solution $x(t)$ satisfies for every $t > 0$
$$E\!\left(\frac{x(t)}{\|x(t)\|_2}\right) - \frac{1}{2}\min \Re\,\lambda(I - L) \;\le\; \exp(-Ct)\,, \qquad C > 0\,.$$
Definition 5.1 captures the actual smoothness of features by considering the normalized Dirichlet energy, which mitigates the impact of feature amplitude (Cai et al., 2020; Di Giovanni et al., 2023). Additionally, Proposition 3.6 shows that the normalized Dirichlet energy is intimately related to the numerical range of $I - L$ of the underlying graph. This shows that the Dirichlet energy and eigenvalues (or frequencies) of the SNA are intertwined, and one can equivalently talk about Dirichlet energy or frequencies (see also Lemma D.2). In particular, it holds that
$$E\!\left(\frac{x}{\|x\|_2}\right) \;=\; \frac{\Re\,\mathrm{trace}\bigl(x^{H}(I - L)\,x\bigr)}{2\,\|x\|_2^{2}}\,.$$
As seen in Section 3, the minimal possible value attained by the normalized Dirichlet energy is often strictly greater than 0 for directed graphs. This indicates that GNNs on general directed graphs inherently cannot oversmooth to the same extent as in the undirected case. However, we prove that a vanilla GCN implementing the directed SNA oversmooths with respect to Definition 5.1, see Appendix E.3.

5.1 Frequency Analysis for Graphs with Normal SNA

This subsection focuses on the frequency analysis of FGL-based neural ODEs for undirected graphs. Most classical GNNs (Kipf et al., 2017; Velickovic et al., 2018) and also graph neural ODEs (Chamberlain et al., 2021; Eliasof et al., 2021) have been shown to oversmooth. Di Giovanni et al. (2023) proved that the normalized Dirichlet energy for GNNs based on (2) with α = 1 can not only converge to its minimal value but also to its maximal possible value. A GNN exhibiting this property is then termed Highest-Frequency-Dominant (HFD). However, in real-world scenarios, most graphs are not purely homophilic nor purely heterophilic but fall somewhere in between. Intuitively, this suggests that mid-range frequencies might be more suitable. To illustrate this intuition, consider the cycle graph as an example. If we have a homophily of 1, low frequencies are optimal; with a homophily equal to 0, high frequencies are optimal. Interestingly, for a homophily of 1/2, the mid-range frequency is optimal, even though the eigendecomposition is label-independent. More information on this example can be found in Figure 4 and Appendix F. Based on this observation, we propose the following definition to generalize the concept of HFD, accommodating not only the lowest or highest frequency but all possible frequencies.

Figure 4: Eigendecomposition of L for the cycle graph C8 (see Appendix F). The first two rows show the eigenvectors corresponding to the eigenvalues λ. The last row shows how the (label-unaware) eigendecomposition can be used to study homophily, whose definition requires the labels.

Definition 5.2. Let $\lambda \ge 0$. Neural ODE-based GNNs initialized at $x_0$ are λ-Frequency-Dominant (λ-FD) if the solution $x(t)$ satisfies
$$E\!\left(\frac{x(t)}{\|x(t)\|_2}\right) \;\longrightarrow\; \frac{\lambda}{2} \quad \text{as } t \to \infty\,.$$
Suppose λ is the smallest or the largest eigenvalue with respect to the real part. In that case, we call it Lowest-Frequency-Dominant (LFD) or Highest-Frequency-Dominant (HFD), respectively.

In the following theorem, we show that (2) and (3) are not limited to being LFD or HFD, but can also be mid-frequency dominant.

Theorem 5.3. Let G be an undirected graph with SNA L. Consider the initial value problem in (2) with $W \in \mathbb{R}^{K \times K}$ and $\alpha \in \mathbb{R}$. Then, for almost all initial values $x_0 \in \mathbb{R}^{N \times K}$ the following holds.
(α > 0) The solution to (2) is either HFD or LFD.
(α < 0) Let $\lambda_+(L)$ and $\lambda_-(L)$ be the smallest positive and negative non-zero eigenvalue of L, respectively. The solution to (2) is either $(1 - \lambda_+(L))$-FD or $(1 - \lambda_-(L))$-FD.
Furthermore, the previous results also hold for solutions to the Schrödinger equation (3) if $W \in \mathbb{C}^{K \times K}$ has at least one eigenvalue with non-zero imaginary part.

Theorem 5.3 (α > 0) generalizes the result by Di Giovanni et al. (2023) for α = 1 to arbitrary positive values of α. The convergence speed in Theorem 5.3 (α > 0) depends on the choice of $\alpha \in \mathbb{R}$. By selecting a variable α (e.g., as a learnable parameter), we establish a flexible learning framework capable of adapting the convergence speed of the Dirichlet energy. A slower or more adjustable convergence speed facilitates broader frequency exploration as it converges more gradually to its maximal or minimal value. Consequently, the frequency component contributions (for finite time, i.e., in practice) are better balanced, which is advantageous for graphs with different homophily levels. Theorem 5.3 (α < 0) shows that solutions of the fractional neural ODEs in (2) and (3) are not limited to being LFD or HFD. To demonstrate this and the other results of Theorem 5.3, we solve (2) using an explicit Euler scheme for different choices of α and W on the Cora and Chameleon graphs. The resulting evolution of the Dirichlet energy with respect to time is illustrated in Figure 5. Finally, we refer to Theorem D.5 in Appendix D.1 for the full statement and proof of Theorem 5.3.

Remark 5.4. Theorem 5.3 is stated for the analytical solutions of (2) and (3), respectively. As noted in Section 5, calculating the analytical solution is infeasible in practice. However, we show in Appendices D.2 to D.3 that approximations of the solution of (2) and (3) via explicit Euler schemes satisfy the same Dirichlet energy convergence properties if the step size is sufficiently small.

Remark 5.5. Theorem 5.3 can be generalized to all directed graphs with normal SNA, i.e., satisfying the condition $LL^{H} = L^{H}L$. For the complete statement, see Appendix D.1.

Figure 5: Convergence of the Dirichlet energy for the solution of equation (2) using an explicit Euler scheme with a step size of $h = 10^{-1}$. We consider different α-FGL in (2) and choose W as a random diagonal matrix. In the left plot, W has only a negative spectrum, while in the right plot, W has only a positive spectrum. The black horizontal line represents the theoretical limit based on Theorem 5.3. (a) Cora (undirected). (b) Chameleon (directed).

5.2 Frequency Dominance for Directed Graphs

Section 5.1 analyzes the Dirichlet energy in graphs with normal SNA. However, the situation becomes significantly more complex when considering generic directed graphs. In our experiments (see Figure 5), we observe that the solution to (2) and (3) does not necessarily lead to oversmoothing. On the contrary, the solution can be controlled to exhibit either LFD or HFD for α > 0, and mid-frequency dominance for α < 0, as proven for undirected graphs in Theorem 5.3. We present an initial theoretical result for directed graphs, specifically in the case of α = 1.

Theorem 5.6. Let G be a directed graph with SNA L. Consider the initial value problem in (2) with diagonal channel mixing matrix $W \in \mathbb{R}^{K \times K}$ and α = 1. Suppose $\lambda_1(L)$ is unique. For almost all initial values $x_0 \in \mathbb{R}^{N \times K}$, the solution to (2) is either HFD or LFD.

The proof of Theorem 5.6 is given in Appendix E.1. Finally, we refer to Appendix E.2 for the analogous statement and proof when the solution of (2) is approximated via an explicit Euler scheme.
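As a rough sketch of the kind of experiment behind Figure 5 (the step size, number of steps, and random diagonal W are illustrative choices of ours, and the sign convention follows equation (2) as stated above), the fractional heat equation can be discretized with an explicit Euler scheme while the normalized Dirichlet energy is tracked over time. The snippet assumes the `L` and `L_alpha` arrays from the earlier sketches.

```python
import numpy as np

def euler_fractional_heat(L, L_alpha, x0, W, h=0.1, steps=100):
    """Explicit Euler scheme for x'(t) = -L^alpha x(t) W, cf. Remark 5.4."""
    I = np.eye(L.shape[0])
    x = x0
    energies = []
    for _ in range(steps):
        x = x - h * L_alpha @ x @ W
        xn = x / np.linalg.norm(x)  # Frobenius normalization of the features
        # Normalized Dirichlet energy: 0.5 * Re trace(x^H (I - L) x) / ||x||^2.
        energies.append(0.5 * np.real(np.trace(xn.conj().T @ (I - L) @ xn)))
    return x, energies

rng = np.random.default_rng(0)
W = np.diag(-rng.uniform(0.1, 1.0, size=2))  # random diagonal W with negative spectrum
x0 = rng.normal(size=(L.shape[0], 2))
_, energies = euler_fractional_heat(L, L_alpha, x0, W)
print(energies[0], energies[-1])  # evolution of the normalized Dirichlet energy
```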
6 Numerical Experiments

This section evaluates the fractional Laplacian ODEs in node classification by approximating (2) and (3) with an explicit Euler scheme. This leads to the following update rules
$$x_{t+1} = x_t - h\,L^{\alpha}x_t W\,, \qquad x_{t+1} = x_t + i\,h\,L^{\alpha}x_t W\,, \qquad (4)$$
for the heat and Schrödinger equation, respectively. In both cases, W, α, and h are learnable parameters, t is the layer, and $x_0$ is the initial node feature matrix. In accordance with the results in Section 5, we select W as a diagonal matrix. The initial features $x_0$ in (4) are encoded through an MLP, and the output is decoded using a second MLP. We refer to the resulting model as FLODE (fractional Laplacian ODE). In Appendix A, we present details on the baseline models, the training setup, and the exact hyperparameters.

Table 1: Test accuracy (Film, Squirrel, Chameleon, Citeseer) and test AUROC (Minesweeper, Tolokers, Questions) on node classification, top three models. The thorough comparison is reported in Table 4, Appendix A: FLODE consistently outperforms the baseline models GCN and GRAFF, and it achieves results comparable to the state of the art.

(a) Undirected graphs.

| Rank | Squirrel | Chameleon | Citeseer |
|---|---|---|---|
| 1st | FLODE 64.23 ± 1.84 | FLODE 73.60 ± 1.55 | FLODE 78.07 ± 1.62 |
| 2nd | GREAD 59.22 ± 1.44 | GREAD 71.38 ± 1.30 | Geom-GCN 78.02 ± 1.15 |
| 3rd | GRAFFNL 59.01 ± 1.31 | GRAFFNL 71.38 ± 1.47 | GREAD 77.60 ± 1.81 |

(b) Heterophily-specific graphs.

| Rank | Minesweeper | Tolokers | Questions |
|---|---|---|---|
| 1st | GAT-sep 93.91 ± 0.35 | FLODE 84.17 ± 0.58 | FSGNN 78.86 ± 0.92 |
| 2nd | GraphSAGE 93.51 ± 0.57 | GAT-sep 83.78 ± 0.43 | FLODE 78.39 ± 1.22 |
| 3rd | FLODE 92.43 ± 0.51 | GAT 83.70 ± 0.47 | GT-sep 78.05 ± 0.93 |

(c) Directed graphs.

| Rank | Film | Squirrel | Chameleon |
|---|---|---|---|
| 1st | FLODE 37.41 ± 1.06 | HLP 74.17 ± 1.83 | FSGNN 78.14 ± 1.25 |
| 2nd | GRAFF 37.11 ± 1.08 | FLODE 74.03 ± 1.58 | FLODE 77.98 ± 1.05 |
| 3rd | ACM 36.89 ± 1.18 | FSGNN 73.48 ± 2.13 | HLP 77.48 ± 1.50 |

Ablation Study. In Appendix A.3, we investigate the influence of each component (learnable exponent, ODE framework, directionality via the SNA) on the performance of FLODE. The adjustable fractional power in the FGL is a crucial component of FLODE, as it alone outperforms the model employing the ODE framework with a fixed α = 1. Further, Appendix A.3 includes ablation studies that demonstrate FLODE's capability to efficiently scale to large depths, as depicted in Figure 8.

Real-World Graphs. We report results on 6 undirected datasets consisting of both homophilic graphs, i.e., Cora (McCallum et al., 2000), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012), and heterophilic graphs, i.e., Film (Tang et al., 2009), Squirrel and Chameleon (Rozemberczki et al., 2021). We evaluate our method on the directed and undirected versions of Squirrel, Film, and Chameleon. In all datasets, we use the standard 10 splits from (Pei et al., 2019). The choice of the baseline models and their results are taken from (Di Giovanni et al., 2023). Further, we test our method on heterophily-specific graph datasets, i.e., Roman-empire, Minesweeper, Tolokers, and Questions (Platonov et al., 2023). The splits, baseline models, and results are taken from (Platonov et al., 2023). The top three models are shown in Table 1, and the thorough comparison is reported in Table 4. Due to memory limitations, we compute only 30% of the singular values for Pubmed, Roman-empire, and Questions, which serve as the best low-rank approximation of the original SNA.

Synthetic Directed Graph. We consider the directed stochastic block model (DSBM) datasets (Zhang et al., 2021). The DSBM divides nodes into 5 clusters and assigns probabilities for interactions between vertices.
It considers two sets of probabilities: $\{\alpha_{i,j}\}$ for undirected edge creation and $\{\beta_{i,j}\}$ for assigning edge directions, $i, j \in \{1, \dots, 5\}$. The objective is to classify vertices based on their clusters. In the first experiment, $\alpha_{i,j} = \alpha^*$ varies, altering the importance of neighborhood information. In the second experiment, $\beta_{i,j} = \beta^*$ varies, changing directional information. The results are shown in Figure 6 and Table 6. The splits, baseline models, and results are taken from (Zhang et al., 2021).

Results. The experiments showcase the flexibility of FLODE, as it can accommodate various types of graphs, both directed and undirected, as well as a broad range of homophily levels. While other methods, such as MagNet (Zhang et al., 2021), perform similarly to our approach, they face limitations when applied to certain graph configurations. For instance, when applied to undirected graphs, MagNet reduces to ChebNet, making it unsuitable for heterophilic graphs. Similarly, GRAFF (Di Giovanni et al., 2023) performs well on undirected graphs but falls short on directed graphs. We note that oftentimes FLODE learns a non-trivial exponent α ≠ 1, highlighting the advantages of FGL-based GNNs (see, e.g., Table 5). Furthermore, as shown in Table 9 and Appendix A.3, our empirical results align closely with the theoretical results in Section 5.

Figure 6: Experiments on the directed stochastic block model (compared models: ChebNet, GCN-D, APPNP-D, GraphSAGE-D, GIN-D, GAT-D, DGCN, DiGraph, DiGraph IB, MagNet, FLODE). Unlike other models, FLODE's performance does not deteriorate as much when changing the inter-cluster edge density α*.

Figure 7: Effect of truncated SVD on test accuracy (orange) for standard directed real-world graphs. The explained variance, defined as $\sum_{i=1}^{k} \sigma_i^2 / \sum_{j=1}^{N} \sigma_j^2$, measures the variability the first k singular values explain. For Chameleon, the accuracy stabilized after 570 (25%) singular values, corresponding to an explained variance of 0.998. For Squirrel, after 1600 (31%) singular values, which correspond to an explained variance of 0.999, the improvement in test accuracy is only marginal.

7 Conclusion

In this work, we introduce the concepts of Dirichlet energy and oversmoothing for directed graphs and demonstrate their relation with the SNA. Building upon this foundation, we define fractional graph Laplacians in the singular value domain, resulting in matrices capable of capturing long-range dependencies. To address oversmoothing in directed graphs, we propose fractional Laplacian-based graph ODEs, which are provably not limited to LFD behavior. We finally show the flexibility of our method to accommodate various graph structures and homophily levels in node-level tasks.

Limitations and Future Work. The computational cost of the SVD grows cubically in N, while the storage of the singular vectors grows quadratically in N. Both costs can be significantly reduced by computing only $k \ll N$ singular values via truncated SVD (Figure 7), giving the best rank-k approximation of the SNA. Moreover, the SVD can be computed offline as a preprocessing step. The frequency analysis of α-FGL neural ODEs in directed graphs is an exciting future direction. It would also be worthwhile to investigate the impact of choosing α ≠ 1 on the convergence speed of the Dirichlet energy.
Controlling the speed could facilitate the convergence of the Dirichlet energy to an optimal value, which has been shown to exist in synthetic settings (Keriven, 2022; X. Wu et al., 2022). Another interesting future direction would be to analyze the dynamics when approximating the solution to the FGL neural ODEs using alternative numerical solvers, such as adjoint methods. Acknowledgments S. M. acknowledges partial support by the NSF-Simons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (Mo DL) (NSF DMS 2031985) and DFG SPP 1798, KU 1446/27-2. G. K. acknowledges partial support by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. G. Kutyniok also acknowledges support from the Munich Center for Machine Learning (MCML) as well as the German Research Foundation under Grants DFG-SPP-2298, KU 1446/31-1 and KU 1446/32-1 and under Grant DFG-SFB/TR 109 and Project C09. Benzi, M., Bertaccini, D., Durastante, F., and Simunec, I. (2020). Non-Local Network Dynamics via Fractional Graph Laplacians . In: Journal of Complex Networks 8.3. Bo, D., Wang, X., Shi, C., and Shen, H. (2021). Beyond Low-frequency Information in Graph Convolutional Networks . In: Proceedings of the AAAI Conference on Artificial Intelligence 35.5, pp. 3950 3957. Bodnar, C., Di Giovanni, F., Chamberlain, B., Lió, P., and Bronstein, M. (2022). Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs . In: Advances in Neural Information Processing Systems 35, pp. 18527 18541. Bronstein, M. M., Bruna, J., Le Cun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric Deep Learning: Going beyond Euclidean Data . In: IEEE Signal Processing Magazine 34.4, pp. 18 42. Cai, C. and Wang, Y. (2020). A Note on Over-Smoothing for Graph Neural Networks. ar Xiv: 2006. 13318 [cs, stat]. Chamberlain, B., Rowbottom, J., Gorinova, M. I., Bronstein, M., Webb, S., and Rossi, E. (2021). GRAND: Graph Neural Diffusion . In: Proceedings of the 38th International Conference on Machine Learning. PMLR, pp. 1407 1418. Chen, D., Lin, Y., Li, W., Li, P., Zhou, J., and Sun, X. (2020). Measuring and Relieving the Over Smoothing Problem for Graph Neural Networks from the Topological View . In: Proceedings of the AAAI Conference on Artificial Intelligence 34.04, pp. 3438 3445. Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. (2020). Simple and Deep Graph Convolutional Networks . In: Proceedings of the 37th International Conference on Machine Learning. PMLR, pp. 1725 1735. Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural Ordinary Differential Equations . In: Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc. Chien, E., Peng, J., Li, P., and Milenkovic, O. (2021). Adaptive Universal Generalized Page Rank Graph Neural Network . In: International Conference on Learning Representations. Choi, J., Hong, S., Park, N., and Cho, S.-B. (2023). GREAD: Graph Neural Reaction-Diffusion Networks . In: Proceedings of the 40th International Conference on Machine Learning. Vol. 202. ICML 23. Honolulu, Hawaii, USA: JMLR.org, pp. 5722 5747. Chung, F. R. K. (1997). Spectral Graph Theory. Regional Conference Series in Mathematics no. 92. Providence, R.I: Published for the Conference Board of the mathematical sciences by the American Mathematical Society. Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). 
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering . In: Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc. Di Giovanni, F., Rowbottom, J., Chamberlain, B. P., Markovich, T., and Bronstein, M. M. (2023). Understanding convolution on graphs via energies . In: Transactions on Machine Learning Research. Du, L., Shi, X., Fu, Q., Ma, X., Liu, H., Han, S., and Zhang, D. (2022). GBK-GNN: Gated Bi-Kernel Graph Neural Networks for Modeling Both Homophily and Heterophily . In: Proceedings of the ACM Web Conference 2022. WWW 22. New York, NY, USA: Association for Computing Machinery, pp. 1550 1558. Eliasof, M., Haber, E., and Treister, E. (2021). PDE-GCN: Novel Architectures for Graph Neural Networks Motivated by Partial Differential Equations . In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., pp. 3836 3849. Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., and Yin, D. (2019). Graph Neural Networks for Social Recommendation . In: The World Wide Web Conference. San Francisco CA USA: ACM, pp. 417 426. Fey, M. and Lenssen, J. E. (2019). Fast Graph Representation Learning with Py Torch Geometric . In: ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds. Gasteiger, J., Bojchevski, A., and Günnemann, S. (2018). Predict Then Propagate: Graph Neural Networks Meet Personalized Page Rank . In: International Conference on Learning Representations. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural Message Passing for Quantum Chemistry . In: Proceedings of the 34th International Conference on Machine Learning. PMLR, pp. 1263 1272. Gori, M., Monfardini, G., and Scarselli, F. (2005). A New Model for Learning in Graph Domains . In: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. Vol. 2, 729 734 vol. 2. Haber, E. and Ruthotto, L. (2018). Stable Architectures for Deep Neural Networks . In: Inverse Problems 34.1, p. 014004. eprint: 1705.03341 (cs, math). Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive Representation Learning on Large Graphs . In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition . In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770 778. Keriven, N. (2022). Not Too Little, Not Too Much: A Theoretical Analysis of Graph (over)Smoothing . In: Advances in Neural Information Processing Systems. Kingma, D. P. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. Ed. by Y. Bengio and Y. Le Cun. Kipf, T. N. and Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks . In: International Conference on Learning Representations. Li, G., Muller, M., Thabet, A., and Ghanem, B. (2019). Deep GCNs: Can GCNs Go As Deep As CNNs? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267 9276. Li, Q., Han, Z., and Wu, X.-m. (2018). Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning . In: Proceedings of the AAAI Conference on Artificial Intelligence 32.1. Li, X., Zhu, R., Cheng, Y., Shan, C., Luo, S., Li, D., and Qian, W. (2022). Finding Global Homophily in Graph Neural Networks When Meeting Heterophily . In: Proceedings of the 39th International Conference on Machine Learning. PMLR, pp. 13242 13256. Lingam, V., Ragesh, R., Iyer, A., and Sellamanickam, S. (2021). 
Simple Truncated SVD Based Model for Node Classification on Heterophilic Graphs. eprint: 2106.12807 (cs). Luan, S., Hua, C., Lu, Q., Zhu, J., Zhao, M., Zhang, S., Chang, X.-W., and Precup, D. (2022). Revisiting Heterophily For Graph Neural Networks . In: Advances in Neural Information Processing Systems 35, pp. 1362 1375. Luan, S., Zhao, M., Chang, X.-W., and Precup, D. (2019). Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks . In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc. Maurya, S. K., Liu, X., and Murata, T. (2021). Improving Graph Neural Networks with Simple Architecture Design. ar Xiv: 2105.07634 [cs, stat]. Mc Callum, A. K., Nigam, K., Rennie, J., and Seymore, K. (2000). Automating the Construction of Internet Portals with Machine Learning . In: Information Retrieval 3.2, pp. 127 163. Monti, F., Frasca, F., Eynard, D., Mannion, D., and Bronstein, M. M. (2019). Fake News Detection on Social Media Using Geometric Deep Learning. ar Xiv: 1902.06673. Namata, G. M., London, B., Getoor, L., and Huang, B. (2012). Query-driven Active Surveying for Collective Classification . In: Workshop on Mining and Learning with Graphs. Oono, K. and Suzuki, T. (2019). Graph Neural Networks Exponentially Lose Expressive Power for Node Classification . In: International Conference on Learning Representations. Paszke, A. et al. (2019). Py Torch: An Imperative Style, High-Performance Deep Learning Library . In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc. Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. (2019). Geom-GCN: Geometric Graph Convolutional Networks . In: International Conference on Learning Representations. Perko, L. (2001). Differential Equations and Dynamical Systems. Ed. by J. E. Marsden, L. Sirovich, and M. Golubitsky. Vol. 7. Texts in Applied Mathematics. New York, NY: Springer New York. Platonov, O., Kuznedelev, D., Diskin, M., Babenko, A., and Prokhorenkova, L. (2023). A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In: The Eleventh International Conference on Learning Representations. Poli, M., Massaroli, S., Park, J., Yamashita, A., Asama, H., and Park, J. (2021). Graph Neural Ordinary Differential Equations. ar Xiv: 1911.07532 [cs, stat]. Pozrikidis, C. (2018). The Fractional Laplacian. First. Boca Raton : Taylor & Francis, 2016. | A CRC title. : Chapman and Hall/CRC. Rong, Y., Huang, W., Xu, T., and Huang, J. (2020). Drop Edge: Towards Deep Graph Convolutional Networks on Node Classification . In: International Conference on Learning Representations. Rozemberczki, B., Allen, C., and Sarkar, R. (2021). Multi-Scale Attributed Node Embedding . In: Journal of Complex Networks 9.2. Rusch, T. K., Chamberlain, B., Rowbottom, J., Mishra, S., and Bronstein, M. (2022). Graph-Coupled Oscillator Networks . In: Proceedings of the 39th International Conference on Machine Learning. PMLR, pp. 18888 18909. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2009). The Graph Neural Network Model . In: IEEE Transactions on Neural Networks 20.1, pp. 61 80. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective Classification in Network Data . In: AI Magazine 29.3, p. 93. Shi, Y., Huang, Z., Feng, S., Zhong, H., Wang, W., and Sun, Y. (2021). Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification . 
In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. Montreal, Canada: International Joint Conferences on Artificial Intelligence Organization, pp. 1548 1554. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting . In. Tang, J., Sun, J., Wang, C., and Yang, Z. (2009). Social Influence Analysis in Large-Scale Networks . In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Paris France: ACM, pp. 807 816. Tong, Z., Liang, Y., Sun, C., Li, X., Rosenblum, D., and Lim, A. (2020). Digraph Inception Convolutional Networks . In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., pp. 17907 17918. Tong, Z., Liang, Y., Sun, C., Rosenblum, D. S., and Lim, A. (2020). Directed Graph Convolutional Network. ar Xiv: 2004.13970 [cs, stat]. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks . In: International Conference on Learning Representations. Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., and Lee, D. L. (2018). Billion-Scale Commodity Embedding for E-commerce Recommendation in Alibaba . In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD 18. New York, NY, USA: Association for Computing Machinery, pp. 839 848. Wang, X. and Zhang, M. (2022). How Powerful Are Spectral Graph Neural Networks . In: Proceedings of the 39th International Conference on Machine Learning. PMLR, pp. 23341 23362. Wang, Y., Yi, K., Liu, X., Wang, Y. G., and Jin, S. (2022). ACMP: Allen-Cahn Message Passing with Attractive and Repulsive Forces for Graph Neural Networks . In: The Eleventh International Conference on Learning Representations. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. (2019). Simplifying Graph Convolutional Networks . In: Proceedings of the 36th International Conference on Machine Learning. PMLR, pp. 6861 6871. Wu, X., Chen, Z., Wang, W. W., and Jadbabaie, A. (2022). A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks . In: The Eleventh International Conference on Learning Representations. Xhonneux, L.-P., Qu, M., and Tang, J. (2020). Continuous Graph Neural Networks . In: Proceedings of the 37th International Conference on Machine Learning. PMLR, pp. 10432 10441. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How Powerful Are Graph Neural Networks? In: International Conference on Learning Representations. Yan, Y., Hashemi, M., Swersky, K., Yang, Y., and Koutra, D. (2022). Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks . In: 2022 IEEE International Conference on Data Mining (ICDM). Orlando, FL, USA: IEEE, pp. 1287 1292. Zhang, X., He, Y., Brugnone, N., Perlmutter, M., and Hirn, M. (2021). Mag Net: A Neural Network for Directed Graphs . In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., pp. 27003 27015. Zhao, L. and Akoglu, L. (2019). Pair Norm: Tackling Oversmoothing in GNNs . In: International Conference on Learning Representations. Zhu, J., Rossi, R. A., Rao, A., Mai, T., Lipka, N., Ahmed, N. K., and Koutra, D. (2021). Graph Neural Networks with Heterophily . In: Proceedings of the AAAI Conference on Artificial Intelligence 35.12, pp. 11168 11176. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and Koutra, D. (2020). 
Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs . In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., pp. 7793 7804. Zou, C., Han, A., Lin, L., and Gao, J. (2022). A Simple Yet Effective SVD-GCN for Directed Graphs. ar Xiv: 2205.09335 [cs]. AUROC Area under the ROC curve DSBM Directed Stochastic Block Model FD Frequency Dominant FGL Fractional Graph Laplacian GAT Graph Attention Network GCN Graph Convolutional Network GNN Graph Neural Network HFD Highest Frequency Dominant LCC Largest Connected Components LFD Lowest Frequency Dominant MLP Multi-Layer Perceptron ODE Ordinary Differential Equation SNA Symmetrically Normalized Adjacency SVD Singular Value Decomposition i Imaginary unit ℜ(z) Real part of z C ℑ(z) Imaginary part of z C diag(x) Diagonal matrix with x on the diagonal. 1 Constant vector of all 1s. MT Transpose of M M Conjugate of M MH Conjugate transpose of M M Spectral norm of M M 2 Frobenius norm of M λ (M) Spectrum of M σ (M) Singular values of M E (x) Dirichlet energy computed on x H (G) Homophily coefficient of the graph G A B Kronecker product between A and B vec (M) Vector obtained stacking columns of M. A Implementation Details In this section, we give the details on the numerical results in Section 6. We begin by describing the exact model. Model architecture. Let G be a directed graphs and x0 RN K the node features. Our architecture first embeds the input node features x via a multi-layer perceptron (MLP). We then evolve the features x0 according to a slightly modified version of (3), i.e, x (t) = i Lαx(t)W for some time t [0, T]. In our experiments, we approximate the solution with an Explicit Euler scheme with step size h > 0. This leads to the following update rule xt+1 = xt ih Lαxt W . The channel mixing matrix is a diagonal learnable matrix W CK K, and α R, h C are also learnable parameters. The features at the last time step x T are then fed into a second MLP, whose output is used as the final output. Both MLPs use Leaky Re LU as non-linearity and dropout (Srivastava et al., 2014). On the contrary, the graph layers do not use any dropout nor non-linearity. A sketch of the algorithm is reported in f Lode. Algorithm 1: f Lode % A, x0 are given. % Preprocessing 1 Din = diag (A1) 2 Dout = diag AT1 3 L = D 1/2 in AD 1/2 out 4 U, Σ, VH = svd(L) % The core of the algorithm is very simple 5 def training_step(x0): 6 x0 = input_MLP(x0) 7 for t {1, . . . , T} do 8 xt = xt 1 i h UΣαVHxt 1W 9 x T = output_MLP(x T ) 10 return x T Complexity. The computation of the SVD is O(N 3). However, one can compute only the first p N singular values: this cuts down the cost to O(N 2 p). The memory required to store the singular vectors is O(N 2), since they are not sparse in general. Each training step has a cost of O(N 2 K). Experimental details. Our model is implemented in Py Torch (Paszke et al., 2019), using Py Torch geometric (Fey et al., 2019). The computation of the SVD for the fractional Laplacian is implemented using the library linalg provided by Py Torch. In the case of truncated SVD, we use the function randomized_svd provided by the library extmath from sklearn. The code and instructions to reproduce the experiments are available on Git Hub. Hyperparameters were tuned using grid search. All experiments were run on an internal cluster with NVIDIA Ge Force RTX 2080 Ti and NVIDIA TITAN RTX GPUs with 16 and 24 GB of memory, respectively. Training details. 
All models were trained for 1000 epochs using Adam (Kingma et al., 2015) as optimizer with a fixed learning rate. We perform early stopping if the validation metric does not increase for 200 epochs. A.1 Real-World Graphs Undirected graphs We conducted 10 repetitions using data splits obtained from (Pei et al., 2019). For each split, 48% of the nodes are used for training, 32% for validation and 20% for testing. In all datasets, we considered the largest connected component (LCC). Chameleon, Squirrel, and Film are directed graphs; hence, we converted them to undirected. Cora, Citeseer, and Pubmed are already undirected graphs: to these, we added self-loops. We normalized the input node features for all graphs. As baseline models, we considered the same models as in (Di Giovanni et al., 2023). The results were provided by Pei et al. (2019) and include standard GNNs, such as GAT (Velickovic et al., 2018), GCN (Kipf et al., 2017), and Graph SAGE (Hamilton et al., 2017). We also included models designed to address oversmoothing and heterophilic graphs, such as Pair Norm (L. Zhao et al., 2019), GGCN (Yan et al., 2022), Geom-GCN (Pei et al., 2019), H2GCN (Zhu, Yan, et al., 2020), GPRGNN (Chien et al., 2021), and Sheaf (Bodnar et al., 2022). Furthermore, we included the graph neural ODE-based approaches, CGNN (Xhonneux et al., 2020) and GRAND (Chamberlain et al., 2021), as in (Di Giovanni et al., 2023), and the model GRAFF from (Di Giovanni et al., 2023) itself. Finally, we included GREAD (Choi et al., 2023), Graph CON (Rusch et al., 2022), ACMP (Y. Wang et al., 2022) and GCN and GAT equipped with Drop Edge (Rong et al., 2020). Heterophily-specific Models For heterophily-specific datasets, we use the same models and results as in (Platonov et al., 2023). As baseline models we considered the topology-agnostic Res Net (He et al., 2016) and two graph-aware modifications: Res Net+SGC(F. Wu et al., 2019) where the initial node features are multiplied by powers of the SNA, and Res Net+adj, where rows of the adjacency matrix are used as additional node features; GCN (Kipf et al., 2017), Graph SAGE (Hamilton et al., 2017); GAT (Velickovic et al., 2018) and GT (Shi et al., 2021) as well as their modification GAT-sep and GT-sep which separate egoand neighbor embeddings; H2GCN (Zhu, Yan, et al., 2020), CPGNN (Zhu, Rossi, et al., 2021), GPRGNN (Chien et al., 2021), FSGNN (Maurya et al., 2021), Glo GNN (X. Li et al., 2022), FAGCN (Bo et al., 2021), GBK-GNN (Du et al., 2022), and Jacobi Conv (X. Wang et al., 2022). The exact hyperparameters for FLODE are provided in Table 5. A.2 Synthetic Directed Graphs The dataset and code are taken from (Zhang et al., 2021). As baseline models, we considered the ones in (Zhang et al., 2021) for which we report the corresponding results. The baseline models include standard GNNs, such as Cheb Net (Defferrard et al., 2016), GCN (Kipf et al., 2017), Graph SAGE (Hamilton et al., 2017), APPNP (Gasteiger et al., 2018), GIN (Xu et al., 2018), GAT (Velickovic et al., 2018), but also models specifically designed for directed graphs, such as DGCN (Tong, Liang, Sun, Rosenblum, et al., 2020), Di Graph and Di Graph IB (Tong, Liang, Sun, X. Li, et al., 2020), Mag Net (Zhang et al., 2021)). The DSBM dataset. The directed stochastic block model (DSBM) is described in detail in (Zhang et al., 2021, Section 5.1.1). To be self-contained, we include a short explanation. The DSBM model is defined as follows. 
There are N vertices, which are divided into nc clusters (C1, C2, ...Cnc), each having an equal number of vertices. An interaction is defined between any two distinct vertices, u and v, based on two sets of probabilities: {αi,j}nc i,j=1 and {βi,j}nc i,j=1. The set of probabilities {αi,j} is used to create an undirected edge between any two vertices u and v, where u belongs to cluster Ci and v belongs to cluster Cj. The key property of this probability set is that αi,j = αj,i, which means the chance of forming an edge between two clusters is the same in either direction. The set of probabilities {βi,j} is used to assign a direction to the undirected edges. For all i, j {1, . . . , nc}, we assume that βi,j + βj,i = 1 holds. Then, to the undirected edge (u, v) is assigned the direction from u to v with probability βi,j if u belongs to cluster Ci and v belongs to cluster Cj, and the direction from v to u with probability βj,i. The primary objective here is to classify the vertices based on their respective clusters. There are several scenarios designed to test different aspects of the baseline models and our model. In the experiments, the total number of nodes is fixed at N = 2500 and the number of clusters is fixed at nc = 5. In all experiments, the training set contains 20 nodes per cluster, 500 nodes for validation, and the rest for testing. The results are averaged over 5 different seeds and splits. Table 4: Test accuracy on node classification: top three models indicated as 1st , 2nd , 3rd. (a) Undirected graphs. Film Squirrel Chameleon Citeseer Pubmed Cora GGCN 37.54 1.56 55.17 1.58 71.14 1.84 77.14 1.45 89.15 0.37 87.95 1.05 GPRGNN 34.63 1.22 31.61 1.24 46.58 1.71 77.13 1.67 87.54 0.38 87.95 1.18 FAGCN 35.70 1.00 36.48 1.86 60.11 2.15 77.11 1.57 89.49 0.38 87.87 1.20 GCNII 37.44 1.30 38.47 1.58 63.86 3.04 77.33 1.48 90.15 0.43 88.37 1.25 Geom-GCN 31.59 1.15 38.15 0.92 60.00 2.81 78.02 1.15 89.95 0.47 85.35 1.57 Pair Norm 27.40 1.24 50.44 2.04 62.74 2.82 73.59 1.47 87.53 0.44 85.79 1.01 Graph SAGE 34.23 0.99 41.61 0.74 58.73 1.68 76.04 1.30 88.45 0.50 86.90 1.04 GCN 27.32 1.10 53.43 2.01 64.82 2.24 76.50 1.36 88.42 0.50 86.98 1.27 GAT 27.44 0.89 40.72 1.55 60.26 2.50 76.55 1.23 87.30 1.10 86.33 0.48 MLP 36.53 0.70 28.77 1.56 46.21 2.99 74.02 1.90 75.69 2.00 87.16 0.37 CGNN 35.95 0.86 29.24 1.09 46.89 1.66 76.91 1.81 87.70 0.49 87.10 1.35 GRAND 35.62 1.01 40.05 1.50 54.67 2.54 76.46 1.77 89.02 0.51 87.36 0.96 Sheaf (max) 37.81 1.15 56.34 1.32 68.04 1.58 76.70 1.57 89.49 0.40 86.90 1.13 GRAFFNL 35.96 0.95 59.01 1.31 71.38 1.47 76.81 1.12 89.81 0.50 87.81 1.13 GREAD 37.90 1.17 59.22 1.44 71.38 1.30 77.60 1.81 90.23 0.55 88.57 0.66 Graph CON 35.58 1.24 35.51 1.40 49.63 1.89 76.36 2.67 88.01 0.47 87.22 1.48 ACMP 34.93 1.26 40.05 1.53 57.59 2.09 76.71 1.77 87.79 0.47 87.71 0.95 GCN+Drop Edge 29.93 0.80 41.30 1.77 59.06 2.04 76.57 2.68 86.97 0.42 83.54 1.06 GAT+Drop Edge 28.95 0.76 41.27 1.76 58.95 2.13 76.13 2.20 86.91 0.45 83.54 1.06 FLODE 37.16 1.42 64.23 1.84 73.60 1.55 78.07 1.62 89.02 0.38 86.44 1.17 (b) Directed graphs. Film Squirrel Chameleon ACM 36.89 1.18 54.4 1.88 67.08 2.04 HLP 34.59 1.32 74.17 1.83 77.48 1.50 FSGNN 35.67 0.69 73.48 2.13 78.14 1.25 GRAFF 37.11 1.08 58.72 0.84 71.08 1.75 FLODE 37.41 1.06 74.03 1.58 77.98 1.05 (c) Heterophily-specific graphs. For Minesweeper, Tolokers and Questions the evaluation metric is the AUROC. 
Roman-empire Minesweeper Tolokers Questions Res Net 65.88 0.38 50.89 1.39 72.95 1.06 70.34 0.76 Res Net+SGC 73.90 0.51 70.88 0.90 80.70 0.97 75.81 0.96 Res Net+adj 52.25 0.40 50.42 0.83 78.78 1.11 75.77 1.24 GCN 73.69 0.74 89.75 0.52 83.64 0.67 76.09 1.27 Graph SAGE 85.74 0.67 93.51 0.57 82.43 0.44 76.44 0.62 GAT 80.87 0.30 92.01 0.68 83.70 0.47 77.43 1.20 GAT-sep 88.75 0.41 93.91 0.35 83.78 0.43 76.79 0.71 GT 86.51 0.73 91.85 0.76 83.23 0.64 77.95 0.68 GT-sep 87.32 0.39 92.29 0.47 82.52 0.92 78.05 0.93 FAGCN 60.11 0.52 89.71 0.31 73.35 1.01 63.59 1.46 CPGNN 63.96 0.62 52.03 5.46 73.36 1.01 65.96 1.95 H2GCN 64.85 0.27 86.24 0.61 72.94 0.97 55.48 0.91 FSGNN 79.92 0.56 90.08 0.70 82.76 0.61 78.86 0.92 Glo GNN 59.63 0.69 51.08 1.23 73.39 1.17 65.74 1.19 FAGCN 65.22 0.56 88.17 0.73 77.75 1.05 77.24 1.26 GBK-GNN 74.57 0.47 90.85 0.58 81.01 0.67 74.47 0.86 Jacobi Conv 71.14 0.42 89.66 0.40 68.66 0.65 73.88 1.16 FLODE 74.97 0.53 92.43 0.51 84.17 0.58 78.39 1.22 Table 5: Selected hyperparameters, learned exponent, step size, and Dirichlet energy in the last layer for real-world datasets. (a) Undirected. Film Squirrel Chameleon Citeseer Pubmed Cora learning rate 10 3 2.5 10 3 5 10 3 10 2 10 2 10 2 weight decay 5 10 4 5 10 4 10 3 5 10 3 10 3 5 10 3 hidden channels 256 64 64 64 64 64 num. layers 1 6 4 2 3 2 encoder layers 3 1 1 1 3 1 decoder layers 2 2 2 1 1 2 input dropout 0 1.5 10 1 0 0 5 10 2 0 decoder dropout 10 1 10 1 0 0 10 1 0 exponent 1.001 0.003 0.17 0.03 0.35 0.15 0.92 0.03 0.82 0.07 0.90 0.02 step size 0.991 0.002 1.08 0.01 1.22 0.03 1.04 0.02 1.12 0.02 1.06 0.01 Dirichlet energy 0.246 0.006 0.40 0.02 0.13 0.03 0.021 0.001 0.015 0.001 0.0227 0.0006 (b) Directed. Film Squirrel Chameleon learning rate 10 3 2.5 10 3 10 2 weight decay 5 10 4 5 10 4 10 3 hidden channels 256 64 64 num. layers 1 6 5 encoder layers 3 1 1 decoder layers 2 2 2 input dropout 0 10 1 0 decoder dropout 0.1 10 1 0 exponent 1.001 0.005 0.28 0.06 0.30 0.11 step size 0.990 0.002 1.22 0.02 1.22 0.05 Dirichlet energy 0.316 0.005 0.38 0.02 0.27 0.04 (c) Heterophily-specific graphs. Roman-empire Minesweeper Tolokers Questions learning rate 10 3 10 3 10 3 10 2 weight decay 0 0 0 5 10 4 hidden channels 512 512 512 128 num. layers 4 4 4 5 encoder layers 2 2 1 2 decoder layers 2 2 2 2 input dropout 0 0 0 0 decoder dropout 0 0 0 0 exponent 0.689 0.038 0.749 0.017 1.053 0.041 1.090 0.046 step size 0.933 0.015 0.984 0.004 0.993 0.009 0.789 0.062 Dirichlet energy 0.059 0.003 0.173 0.019 0.155 0.013 0.092 0.039 Table 6: Node classification accuracy of ordered DSBM graphs: top three models as 1st , 2nd and 3rd. (a) Varying edge density. 0.1 0.08 0.05 Cheb Net 19.9 0.6 20.0 0.7 20.0 0.7 GCN-D 68.9 2.1 67.6 2.7 58.5 2.0 APPNP-D 97.7 1.7 95.9 2.2 90.3 2.4 Graph SAGE-D 20.1 1.1 19.9 0.8 19.9 1.0 GIN-D 57.3 5.8 55.4 5.5 50.9 7.7 GAT-D 42.1 5.3 39.0 7.0 37.2 5.5 DGCN 84.9 7.2 81.2 8.2 64.4 12.4 Di Graph 82.1 1.7 77.7 1.6 66.1 2.4 Di Graph IB 99.2 0.5 97.7 0.7 89.3 1.7 Mag Net 99.6 0.2 98.3 0.8 94.1 1.2 FLODE 99.3 0.1 98.8 0.1 97.5 0.1 (b) Varying net flow. 
Following Zhang et al. (2021), we train our model in both experiments for 3000 epochs and use early stopping if the validation accuracy does not increase for 500 epochs. We select the best model based on the validation accuracy after sweeping over a few hyperparameters. Exact numerical values with standard errors are given in Table 6; we refer to Appendix A.2 for the chosen hyperparameters.

DSBM with varying edge density. In the first experiment, the model is evaluated on the DSBM with $\alpha_{i,j} = \alpha$ for $i \neq j$ and $\alpha \in \{0.1, 0.08, 0.05\}$, which changes the density of edges between different clusters. The other probabilities are fixed at $\alpha_{i,i} = 0.5$, $\beta_{i,i} = 0.5$, and $\beta_{i,j} = 0.05$ for $i > j$. The results are shown in Figure 6, with exact numerical values in Table 6a.

DSBM with varying net flow. In the second scenario, the model is tested on how it performs when the net flow from one cluster to another varies. This is achieved by keeping $\alpha_{i,j} = 0.1$ constant for all $i \neq j$ and letting $\beta_{i,j}$ vary from 0.05 to 0.4. The other probabilities are fixed at $\alpha_{i,i} = 0.5$ and $\beta_{i,i} = 0.5$. The results are shown in Figure 6, with exact numerical values in Table 6b.

Table 7: Selected hyperparameters for the DSBM dataset.

(a) Varying edge density.

| | $\alpha = 0.1$ | $\alpha = 0.08$ | $\alpha = 0.05$ |
|---|---|---|---|
| learning rate | $5 \cdot 10^{-3}$ | $5 \cdot 10^{-3}$ | $5 \cdot 10^{-3}$ |
| decay | $10^{-3}$ | $10^{-3}$ | $5 \cdot 10^{-4}$ |
| input dropout | $10^{-1}$ | $2 \cdot 10^{-1}$ | $10^{-1}$ |
| decoder dropout | $10^{-1}$ | $5 \cdot 10^{-2}$ | $10^{-1}$ |
| hidden channels | 256 | 256 | 256 |

(b) Varying net flow.

| | $\beta = 0.05$ | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 |
|---|---|---|---|---|---|---|---|---|
| learning rate | $5 \cdot 10^{-3}$ | $5 \cdot 10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| decay | $10^{-3}$ | $10^{-3}$ | $5 \cdot 10^{-4}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $5 \cdot 10^{-4}$ | $10^{-3}$ |
| input dropout | $10^{-1}$ | $10^{-1}$ | $2 \cdot 10^{-1}$ | $10^{-1}$ | $10^{-1}$ | $2 \cdot 10^{-1}$ | $5 \cdot 10^{-2}$ | $2 \cdot 10^{-1}$ |
| decoder dropout | $10^{-1}$ | $10^{-1}$ | $5 \cdot 10^{-2}$ | $5 \cdot 10^{-2}$ | $5 \cdot 10^{-2}$ | $10^{-1}$ | $2 \cdot 10^{-1}$ | $10^{-1}$ |
| hidden channels | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
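For concreteness, the following is a small NumPy sketch of the ordered-DSBM sampling procedure described above (undirected edges drawn with probabilities $\alpha_{i,j}$, then oriented with probabilities $\beta_{i,j}$). The function name, output format, and the reduced number of nodes in the usage example are illustrative; this is not the generator behind the reported numbers.

```python
import numpy as np

def sample_dsbm(n_nodes, n_clusters, alpha, beta, seed=0):
    """Sample a directed stochastic block model as described in the text.

    alpha[i, j] = alpha[j, i]: probability of an undirected edge between clusters i and j.
    beta[i, j] + beta[j, i] = 1: probability that an edge between C_i and C_j points i -> j.
    Returns the dense adjacency matrix A (A[u, v] = 1 for an edge u -> v) and cluster labels."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_clusters), n_nodes // n_clusters)
    adj = np.zeros((n_nodes, n_nodes))
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            i, j = labels[u], labels[v]
            if rng.random() < alpha[i, j]:       # create an undirected edge ...
                if rng.random() < beta[i, j]:    # ... and orient it u -> v,
                    adj[u, v] = 1.0
                else:                            # ... or v -> u with probability beta[j, i]
                    adj[v, u] = 1.0
    return adj, labels

if __name__ == "__main__":
    # Setting from the "varying net flow" experiment: alpha_{ij} = 0.1 (i != j), alpha_{ii} = 0.5,
    # beta_{ii} = 0.5, and beta_{ij} = 0.05 for i > j (hence beta_{ji} = 0.95).
    nc = 5
    alpha = np.full((nc, nc), 0.1)
    np.fill_diagonal(alpha, 0.5)
    beta = np.full((nc, nc), 0.5)
    for i in range(nc):
        for j in range(i):
            beta[i, j], beta[j, i] = 0.05, 0.95
    adj, labels = sample_dsbm(n_nodes=500, n_clusters=nc, alpha=alpha, beta=beta)  # smaller N for speed
    print("nodes:", adj.shape[0], "edges:", int(adj.sum()))
```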
A.3 Ablation Study

We perform an ablation study on Chameleon and Squirrel (directed, heterophilic) and on Citeseer (undirected, homophilic). For this, we sweep over different model options via grid search, using the same hyperparameters. The test accuracy corresponding to the hyperparameters that yielded the maximum validation accuracy is reported in Table 8. The ablation study on Chameleon demonstrates that all components of the model (the learnable exponent, the ODE framework with the Schrödinger equation, and directionality via the SNA) contribute to the performance of FLODE. The fact that performance drops when any of these components is removed suggests that they all play crucial roles in the model's ability to capture the structure and evolution of heterophilic graphs. It is important to note that the performance appears to depend more on the learnable fraction in the FGL than on the use of the ODE framework, illustrating that the fractional Laplacian alone can effectively capture long-range dependencies. However, when the ODE framework is additionally employed, a noticeable decrease in variance is observed.

From Theory to Practice. We conduct an ablation study to investigate the role of depth on the Chameleon, Citeseer, Cora, and Squirrel datasets. The results, depicted in Figure 8, demonstrate that the neural ODE framework enables GNNs to scale to large depths (256 layers). Moreover, we see that the fractional Laplacian improves over the standard Laplacian on the heterophilic graphs, which supports our claims in Section 5.2. We highlight that using only the fractional Laplacian, without the neural ODE framework, oftentimes outperforms the standard Laplacian with the neural ODE framework. This indicates the importance of the long-range connections built by the fractional Laplacian. We further demonstrate the close alignment of our theoretical and experimental results, which enables us to precisely anticipate when the models will exhibit HFD or LFD behavior. In this context, we compute the relevant spectral quantities (according to Theorem D.5) and illustrate, at each depth, the expected and observed behaviors. For Squirrel and Chameleon, which are heterophilic graphs, both the theoretical and the empirical behaviors are HFD, and the learned exponent is small. In contrast, for Cora and Citeseer, we observe the opposite. Finally, we employ the best hyperparameters in Table 5a to solve both the fractional heat and the fractional Schrödinger graph ODEs, further substantiating the intimate link between our theoretical advancements and practical applications.

Table 8: Ablation study on the node classification task (test accuracy, mean ± standard deviation, and Dirichlet energy in the last layer).

(a) Chameleon (directed, heterophilic).

| Update Rule | Test Accuracy | Dirichlet Energy |
|---|---|---|
| $x_{t+1} = x_t - ih\,L^{\alpha} x_t W$ | 77.79 ± 1.42 | 0.213 (t=5) |
| $x_{t+1} = x_t - ih\,L\,x_t W$ | 75.72 ± 1.13 | 0.169 (t=6) |
| $x_{t+1} = i\,L^{\alpha} x_t W$ | 77.35 ± 2.22 | 0.177 (t=4) |
| $x_{t+1} = i\,L\,x_t W$ | 69.61 ± 1.59 | 0.178 (t=4) |
| $x_{t+1} = x_t - ih\,L^{\alpha} x_t W$ | 73.60 ± 1.68 | 0.131 (t=4) |
| $x_{t+1} = x_t - ih\,L\,x_t W$ | 70.15 ± 0.86 | 0.035 (t=4) |
| $x_{t+1} = i\,L^{\alpha} x_t W$ | 71.25 ± 3.04 | 0.118 (t=4) |
| $x_{t+1} = i\,L\,x_t W$ | 67.19 ± 2.49 | 0.040 (t=4) |
| $x_{t+1} = x_t - h\,L^{\alpha} x_t W$ | 77.33 ± 1.47 | 0.378 (t=6) |
| $x_{t+1} = x_t - h\,L\,x_t W$ | 73.55 ± 0.94 | 0.165 (t=6) |
| $x_{t+1} = L^{\alpha} x_t W$ | 74.12 ± 3.60 | 0.182 (t=4) |
| $x_{t+1} = L\,x_t W$ | 68.47 ± 2.77 | 0.208 (t=4) |

(b) Squirrel (directed, heterophilic).

| Update Rule | Test Accuracy | Dirichlet Energy |
|---|---|---|
| $x_{t+1} = x_t - ih\,L^{\alpha} x_t W$ | 74.03 ± 1.58 | 0.38 ± 0.02 |
| $x_{t+1} = x_t - ih\,L\,x_t W$ | 64.04 ± 2.25 | 0.35 ± 0.02 |
| $x_{t+1} = i\,L^{\alpha} x_t W$ | 64.25 ± 1.85 | 0.46 ± 0.01 |
| $x_{t+1} = i\,L\,x_t W$ | 42.04 ± 1.58 | 0.29 ± 0.05 |
| $x_{t+1} = x_t - ih\,L^{\alpha} x_t W$ | 64.23 ± 1.84 | 0.40 ± 0.02 |
| $x_{t+1} = x_t - ih\,L\,x_t W$ | 55.19 ± 1.52 | 0.26 ± 0.03 |
| $x_{t+1} = i\,L^{\alpha} x_t W$ | 61.40 ± 2.15 | 0.43 ± 0.01 |
| $x_{t+1} = i\,L\,x_t W$ | 41.19 ± 1.95 | 0.20 ± 0.02 |
| $x_{t+1} = x_t - h\,L^{\alpha} x_t W$ | 71.86 ± 1.65 | 0.50 ± 0.01 |
| $x_{t+1} = x_t - h\,L\,x_t W$ | 59.34 ± 1.78 | 0.43 ± 0.03 |
| $x_{t+1} = L^{\alpha} x_t W$ | 42.91 ± 7.86 | 0.32 ± 0.08 |
| $x_{t+1} = L\,x_t W$ | 35.37 ± 1.69 | 0.25 ± 0.05 |
| $x_{t+1} = x_t - h\,L^{\alpha} x_t W$ | 62.95 ± 2.02 | 0.61 ± 0.08 |
| $x_{t+1} = x_t - h\,L\,x_t W$ | 52.19 ± 1.17 | 0.51 ± 0.07 |
| $x_{t+1} = L^{\alpha} x_t W$ | 59.04 ± 0.02 | 0.44 ± 0.02 |
| $x_{t+1} = L\,x_t W$ | 39.69 ± 1.54 | 0.20 ± 0.02 |

(c) Citeseer (undirected, homophilic).

| Update Rule | Test Accuracy | Dirichlet Energy |
|---|---|---|
| $x_{t+1} = x_t - ih\,L^{\alpha} x_t W$ | 78.07 ± 1.62 | 0.021 (t=5) |
| $x_{t+1} = x_t - ih\,L\,x_t W$ | 77.97 ± 2.29 | 0.019 (t=4) |
| $x_{t+1} = i\,L^{\alpha} x_t W$ | 77.27 ± 2.10 | 0.011 (t=6) |
| $x_{t+1} = i\,L\,x_t W$ | 77.97 ± 2.23 | 0.019 (t=4) |

Table 9: Learned α and spectrum of W.
According to Theorem 5.3, we denote FD := λK(W) fα(λ1(L)) λ1(W) and FD := ℑ(λK(W)) fα(λ1(L)) ℑ(λ1(W)) for the fractional heat (H) and Schrödinger (S) graph ODEs, respectively. The heterophilic graphs Squirrel and Chameleon exhibit HFD since FD < 0, while the homophilic Cora, Citeseer, Pubmed exhibit LFD since FD > 0. λ1(L) Film 0.9486 Squirrel 0.8896 Chameleon 0.9337 Citeseer 0.5022 Pubmed 0.6537 Cora 0.4826 α 1.008 0.007 0.19 0.05 0.37 0.14 0.89 0.06 1.15 0.08 0.89 0.01 λ1(W) 2.774 0.004 1.62 0.03 1.81 0.02 1.76 0.01 1.66 0.06 1.81 0.01 λK(W) 2.858 0.009 2.21 0.03 2.29 0.05 2.28 0.06 1.1 0.3 2.32 0.01 FD 0.367 0.001 0.54 0.02 0.42 0.04 0.52 0.02 0.97 0.09 0.60 0.01 α 1.000 0.002 0.17 0.03 0.34 0.11 0.90 0.07 0.76 0.07 0.90 0.02 ℑ(λ1(W)) 2.795 0.001 1.68 0.01 1.79 0.01 1.70 0.04 1.74 0.01 1.78 0.01 ℑ(λK(W)) 2.880 0.002 2.21 0.03 2.46 0.02 2.29 0.07 0.98 0.09 2.30 0.02 FD 0.4945 0.0001 0.48 0.03 0.62 0.03 0.46 0.06 1.03 0.05 0.59 0.01 0.2 0.4 0.6 0.8 Test accuracy xl+1 = xl h Lαxl W xl+1 = xl h Lxl W xl+1 = Lαxl W xl+1 = Lxl W Cora Citeseer Chameleon Squirrel Figure 8: Ablation study on the effect of different update rules and different number of layers on undirected datasets. The x-axis shows the number of layers 2L for L {0, . . . , 8}. FD is calculated according to Theorem 5.3. B Appendix for Section 3 Proposition 3.3. Let G be a directed graph with SNA L. For every λ λ(L), it holds |λ| 1 and λ(I L) = 1 λ(L). Proof. We show that the numerical range W(L) = x HLx : x Hx = 1 satisfies W(L) [ 1, 1]. As W(L) contains all eigenvalues of L the thesis follows. Let A be the adjacency matrix of G and x CN with x Hx = 1. Applying the Cauchy-Schwartz inequality in (2) and (3), we get j=1 ai,j |xi| |xj| q din i dout j j=1 ai,j |xj| q j=1 ai,j |xj|2 j=1 ai,j |xj|2 i=1 |xi|2 N X j=1 ai,j |xj|2 i=1 |xi|2 , where we used a2 i,j = ai,j. We have PN i=1 |xi|2 = x Hx = 1 such that W(L) [ 1, 1] follows. The second claim follows directly by (I L) v = v λv = (1 λ)v. Proposition 3.5. Let G be a directed graph with SNA L. Then 1 λ(L) if and only if the graph is weakly balanced. Suppose the graph is strongly connected; then 1 λ(L) if and only if the graph is weakly balanced with an even period. Proof. Since the numerical range is only a superset of the set of eigenvalues, we cannot simply consider when the inequalities (1) (3) in the previous proof are actual equalities. Therefore, we have to find another way to prove the statement. Suppose that the graph is weakly balanced, then dout j ki p = 0 , j {1, . . . , N} . We will prove that k = (ki)N i=1 is an eigenvector corresponding to the eigenvalue 1, din i dout j kj = 1 p dout j kj = 1 p din i ki = 1 For the other direction, suppose that there exists x RN such that x = 0 and x = Lx. Then for all i {1, . . . , N} 0 = (Lx)i xi = din i dout j xj xi = din i dout j xj dout j xi p hence, the graph is weakly balanced. By Perron-Frobenius theorem for irreducible non-negative matrices, one gets that L has exactly h eigenvalues with maximal modulus corresponding to the h roots of the unity, where h is the period of L. Hence, 1 is an eigenvalue of L if and only if the graph is weakly balanced and h is even. Proposition 3.6. For every x CN K, we have ℜ trace x H (I L) x = 1 Moreover, there exists x = 0 such that E (x) = 0 if and only if the graph is weakly balanced. Proof. 
By direct computation, it holds k=1 ai,j |xi,k|2 k=1 ai,j |xj,k|2 k=1 ai,j x i,kxj,k q din i dout j 1 k=1 ai,j xi,kx j,k q din i dout j k=1 |xi,k|2 + 1 k=1 |xj,k|2 1 k=1 ai,j x i,kxj,k q din i dout j 1 k=1 ai,j xi,kx j,k q din i dout j k=1 |xi,k|2 1 k=1 ai,j (x H)k,ixj,k q din i dout j 1 k=1 ai,j xi,k(x H)k,j q din i dout j k=1 |xi,k|2 1 k=1 ai,j (x H)k,ixj,k q din i dout j 1 k=1 ai,j (x H)k,ixj,k q din i dout j k=1 |xi,k|2 k=1 ai,j x i,kxj,k q din i dout j = ℜ trace x H (I L) x . The last claim can be proved as follows. For simplicity, suppose x RN. The = is clear since one can choose x to be k. To prove the = , we reason by contradiction. Suppose there exists a x = 0 such that E (x) = 0 and the underlying graph is not weakly connected, i.e., dout j xi p ! > 0 , i {1, . . . , N} , Then, since x = 0, 0 = E (x) = 1 where we used Cauchy-Schwartz and triangle inequalities. We give the following simple corollary. Corollary B.1. For every x RN K, it holds E (x) = 1 2ℜ vec(x)H(I (I L))vec(x) . C Appendix for Section 4 In this section, we provide some properties about FGLs. The first statement shows that the FGL of a normal SNA L only changes the magnitude of the eigenvalues of L. Lemma C.1. Let M be a normal matrix with eigenvalues λ1, . . . , λN and corresponding eigenvectors v1, . . . , v N. Suppose M = LΣRH is its singular value decomposition. Then it holds Σ = |Λ| , L = V , R = V exp (iΘ) , Θ = diag {θi}N i=1 , θi = atan2 (ℜλi, ℑλi) . Proof. By hypothesis, there exist a unitary matrix V such that M = VΛVH, then MHM = VΛ VHVΛVH = V |Λ|2 VH , MHM = RΣLHLΣRH = LΣ2LH . Therefore, Σ = |Λ| and L = V M = R |Λ| VH Finally, we note that it must hold R = V exp(iΘ) where Θ = diag {atan2(ℜλi, ℑλi)}N i=1 and atan2 is the 2-argument arctangent. We proceed by proving Theorem 4.1, which follows the proof of a similar result given in (Benzi et al., 2020) for the fractional Laplacian defined in the spectral domain of an in-degree normalized graph Laplacian. However, our result also holds for directed graphs and in particular for fractional Laplacians that are defined via the SVD of a graph SNA. Lemma C.2. Let M Rn n with singular values σ(M) [a, b]. For f : [a, b] R, define f(M) = Uf(Σ)VH, where M = UΣVH is the singular value decomposition of M. If f has modulus of continuity ω and d(i, j) 2, it holds |f(M)|i,j 1 + π2 2 |d(i, j) 1| 1 . Proof. Let g : [a, b] R be any function, then f(M) g(M) 2 = Uf(Σ)VH Ug(Σ)VH 2 = f(Σ) g(Σ) 2 = f(λ) g(λ) ,σ(M) . The second equation holds since the 2-norm is invariant under unitary transformations. By Jackson s Theorem, there exists for every m 1 a polynomial pm of order m such that f(M) pm(M) 2 f pm ,[a,b] 1 + π2 Fix i, j {1, . . . , n}. If d(i, j) = m + 1, then any power of M up to order m has a zero entry in (i, j), i.e., (Mm)i,j = 0. Hence, f(M)i,j = f(M)i,j pm(M)i,j, and we get |f(M)i,j| f(M) g(M) 2 ω 1 + π2 2 |d(i, j) 1| 1 from which the thesis follows. Finally, we give a proof of Theorem 4.1, which is a consequence of the previous statement. Proof of Theorem 4.1. The eigenvalues of L are in the unit circle, i.e., L 1. Hence, LLH 1, and the singular values of L are in [0, 1]. By Lemma C.2 and the fact that f(x) = xα has modulus of continuity ω(t) = tα the thesis follows. D Appendix for Section 5 In this section, we provide the appendix for Section 5. We begin by analyzing the solution of linear matrix ODEs. For this, let M CN N. For x0 CN, consider the initial value problem x (t) = Mx(t), x(0) = x0. (5) Theorem D.1 (Existence and uniqueness of linear ODE solution). 
The initial value problem given by (5) has a unique solution x(t) CN for any initial condition x0 CN. The solution of (5) can be expressed using matrix exponentials, even if M is not symmetric. The matrix exponential is defined as: where Mk is the k-th power of the matrix M. The solution of (5) can then be written as x(t) = exp( Mt)x0. (6) D.1 Appendix for Section 5.1 In this section, we analyze the solution to (2) and (3). We further provide a proof for Theorem 5.3. We begin by considering the solution to the fractional heat equation (2). The analysis for the Schrödinger equation (3) follows analogously. The fractional heat equation x (t) = Lαx W can be vectorized and rewritten via the Kronecker product as vec(x) (t) = W Lαvec(x)(t). (7) In the undirected case L and I L are both symmetric, and the eigenvalues satisfy the relation λi(I L) = 1 λi(L). The corresponding eigenvectors ψi(L) and ψi(I L) can be chosen to be the same for L and I L. In the following, we assume that these eigenvectors are orthonormalized. If L is symmetric, we can decompose it via the spectral theorem into L = UDUT , where U = [ψ1(L), . . . , ψN(L)] is an orthogonal matrix containing the eigenvectors of L, and D is the diagonal matrix of eigenvalues. Due to Lemma C.1, the fractional Laplacian Lα can be written as Lα = U fα(D)UT , where fα : R R is the map x 7 sign(x) |x|α and is applied element-wise. Clearly, the eigendecomposition of Lα is given by the eigenvalues {fα(λ1(L)), . . . , fα(λN(L))} and the corresponding eigenvectors {ψ1(L), . . . , ψN(L)}. Now, by well-known properties of the Kronecker product, one can write the eigendecomposition of W Lα as {λr(W) fα (λl(L))}r {1,...,K} , l {1,...,N} , {ψr(W) ψl(L)}r {1,...,K} , l {1,...,N} . Note that 1 λ (L) and, since trace (L) = 0, the SNA has at least one negative eigenvalue. This property is useful since it allows to retrieve of the indices (r, l) corresponding to eigenvalues with minimal real (or imaginary) parts in a simple way. The initial condition vec(x0) can be decomposed as l=1 cr,l ψr(W) ψl(L) , cr,l = vec(x0) , ψr(W) ψl(W) . Then, the solution vec(x)(t) of (7) can be written as vec(x)(t) = l=1 cr,l exp ( tλr (W) fα (λl (L))) ψr(W) ψl(L). (8) The following result shows the relationship between the frequencies of I L and the Dirichlet energy and serves as a basis for the following proofs. Lemma D.2. Let G be a graph with SNA L. Consider x(t) CN K such that there exists ϕ CN K \ {0} with vec(x)(t) vec(x)(t) 2 and (I (I L))vec(ϕ) = λvec(ϕ). Then, Proof. As vec(ϕ) is the limit of unit vectors, vec(ϕ) is a unit vector itself. We calculate its Dirichlet energy, E (vec(ϕ)) = 1 2ℜ vec(ϕ)H(I (I L))vec(ϕ) = 1 2ℜ λ vec(ϕ)Hvec(ϕ) = 1 Since x 7 E (x) is continuous, the thesis follows. Another useful result that will be extensively used in proving Theorem 5.3 is presented next. Lemma D.3. Suppose x(t) can be expressed as n=1 ck,n exp ( t λk,n) vk wn , for some choice of ck,n, λk,n, {vk}, {wn}. Let (a, b) be the unique index of λk,n with minimal real part and corresponding non-null coefficient ck,n, i.e. (a, b) := arg min (k,n) [K] [N] {ℜ(λk,n) : ck,n = 0} . Then x(t) x(t) 2 t ca,b va wb ca,b va wb 2 . Proof. The key insight is to separate the addend with index (a, b). It holds n=1 ck,n exp ( t λk,n) vn wm = exp ( t λa,b) ca,bva wb + X (k,n) [K] [N] (k,n) =(a,b) ck,n exp ( t (λk,n λa,b)) vk wn We note that lim t |exp ( t (λk,n λa,b))| = lim t |exp ( t ℜ(λk,n λa,b)) exp ( i t ℑ(λk,n λa,b))| = lim t exp ( t ℜ(λk,n λa,b)) for all (k, n) = (a, b), since ℜ(λk,n λa,b) > 0. 
Therefore, one gets x(t) x(t) 2 t ca,b va wb ca,b va wb 2 , where the normalization removes the dependency on exp ( t λa,b) When λa,b is not unique, it is still possible to derive a convergence result. In this case, x will converge to an element in the span generated by vectors corresponding to λa,b, i.e., x(t) x(t) 2 (a,b) A ca,b va wb (a,b) A ca,b va wb 2 , where A := {(k, n) : ℜ(λk,n) = ℜ(λa,b) , ck,n = 0}. A similar result to Lemma D.3 holds for a slightly different representation of x(t). Lemma D.4. Suppose x(t) can be expressed as n=1 ck,n exp (i t λk,n) vk wn , for some choice of ck,n, λk,n, {vk}, {wn}. Let (a, b) be the unique index of λk,n with minimal imaginary part and corresponding non-null coefficient ck,n, i.e. (a, b) := arg min (k,n) [K] [N] {ℑ(λk,n) : ck,n = 0} . Then x(t) x(t) 2 t ca,b va wb ca,b va wb 2 . Proof. The proof follows the same reasoning as in the proof of Lemma D.3. The difference is that the dominating frequency is the one with the minimal imaginary part, since ℜ(i λk,n) = ℑ(λk,n) , and, consequently, arg max (k,n) [K] [N] {ℜ(i λk,n)} = arg min (k,n) [K] [N] {ℑ(λk,n)} . D.1.1 Proof of Theorem 5.3 We denote the eigenvalues of L closest to 0 from above and below as λ+(L) := arg min l {λl(L) : λl(L) > 0} , λ (L) := arg max l {λl(L) : λl(L) < 0} . (9) We assume that the channel mixing W RK K and the graph Laplacians L, I L RN N are real matrices. Finally, we suppose the eigenvalues of a generic matrix M are sorted in ascending order, i.e., λi(M) λj(M), i < j. We now reformulate Theorem 5.3 for the fractional heat equation (2) and provide its full proof, which follows a similar frequency analysis to the one in (Di Giovanni et al., 2023, Theorem B.3) Theorem D.5. Let G be an undirected graph with SNA L. Consider the initial value problem in (2) with channel mixing matrix W RK K and α R. Then, for almost all initial conditions x0 RN K the following is satisfied. (α > 0) The solution to (2) is HFD if λK(W) fα (λ1(L)) < λ1(W) , and LFD otherwise. (α < 0) The solution to (2) is (1 λ (L))-FD if λK(W) fα (λ (L)) < λ1(W) fα (λ+(L)) , and (1 λ+(L))-FD otherwise. Proof of (α > 0). As derived in (8), the solution of (2) with initial condition x0 can be written in a vectorized form as vec(x)(t) = exp t WT Lα vec(x0) l=1 cr,l exp ( t λr(W) fα (λl(L))) ψr(W) ψl(L), where λr(W) are the eigenvalues of W with corresponding eigenvectors ψr(W), and λl(L) are the eigenvalues of L with corresponding eigenvectors ψl(L). The coefficients cr,l are the Fourier coefficients of x0, i.e., cr,l := vec(x0) , ψr(W) ψl(L) . The key insight is to separate the eigenprojection corresponding to the most negative frequency. By Lemma D.3, this frequency component dominates for t going to infinity. Suppose λK(W) fα (λ1(L)) < λ1(W) fα (λN(L)) = λ1(W) . In this case, λK(W) fα (λ1(L)) is the most negative frequency. Assume for simplicity that λK(W) has multiplicity one; the argument can be applied even if this is not the case, since the corresponding eigenvectors are orthogonal for higher multiplicities. For almost all initial conditions x0, the coefficient c K,1 is not null; hence vec(x)(t) vec(x)(t) 2 t c K,1 ψK (W) ψ1 (L) c K,1 ψK (W) ψ1 (L) 2 . By standard properties of the Kronecker product, we have (I L) (ψK (W) ψ1 (L)) = (I ψK (W)) (L ψ1 (L)) = λ1(L) ψK (W) ψ1 (L) , (10) i.e., ψK (W) ψ1 (L) is an eigenvector of I L corresponding to the eigenvalue λ1(L). Then, by Proposition 3.3, ψK (W) ψ1 (L) is also an eigenvector of I I L corresponding to the eigenvalue 1 λ1(L) = λN(I L). 
An application of Lemma D.2 finishes the proof. Similarly, we can show that if α > 0 and λK(W) fα (λ1 (L)) > λ1 (W) the lowest frequency component λ1(I L) is dominant. Proof of (α < 0). In this case either fα (λ+ (L)) λ1 (W) or fα (λ (L)) λK (W) are the most negative frequency components. Hence, if fα (λ (L)) λK (W) > fα (λ+ (L)) λ1 (W) the frequency fα (λ+ (L)) λ1 (W) is dominating and otherwise the frequency fα (λ (L)) λK (W). We can see this by following the exact same reasoning of (i). Remark D.6. In the proof of (α < 0), we are tacitly assuming that L has only non-zero eigenvalues. If not, we can truncate the SVD and remove all zeros singular values (which correspond to zeros eigenvalues). In doing so, we obtain the best invertible approximation of L to which the theorem can be applied. We now generalize the previous result to all directed graphs with normal SNA. Theorem D.7. Let G be a a strongly connected directed graph with normal SNA L such that λ1(L) R. Consider the initial value problem in (2) with channel mixing matrix W RK K and α > 0. Then, for almost all initial values x0 RN K the solution to (2) is HFD if λK(W)|λ1(L)|α < λ1(W)|λN(L)|α, and LFD otherwise. Proof. Any normal matrix is unitary diagonalizable, i.e., there exist eigenvalues λ1, . . . , λN and corresponding eigenvectors v1, . . . , v N such that L = VΛVH. Then, by Lemma C.1, the singular value decomposition of L is given by L = UΣVH, where Σ = |Λ| , U = V exp (iΘ) , Θ = diag {θi}N i=1 , θi = atan2 (ℜλi, ℑλi) . Hence, Lα = UΣαVH = V |Λ|α exp (iΘ) VH. Then, equivalent to the derivation of (8), the solution to the vectorized fractional heat equation vec(x) (t) = W Lαvec(x)(t) is given by vec(x)(t) = l=1 cr,l exp ( tλr (W) fα (λl (L))) ψr(W) ψl(L). with fα(λl(L)) = |λ(L)l|α exp(iθl). Now, equivalent to the proof of Theorem 5.3, we apply Lemma D.3. Therefore, the dominating frequency is given by the eigenvalue of W Lα with the most negative real part. The eigenvalues of W Lα are given by λr(W) fα(λl(L)) for r = 1, . . . , K, l = 1, . . . , N. The corresponding real parts are given by ℜ(λr(W) fα(λl(L))) = λr(W) |λ(L)i|α cos(θi) = λr(W) |λ(L)i|α 1 ℜ(λ(L)i). By Perron-Frobenius, the eigenvalue of L with the largest eigenvalues is given by λN(L) R. Hence, for all l = 1, . . . , N, |λ(L)l|α cos(θl) |λ(L)N|α . Similarly, for all l = 1, . . . , N with ℜ(λ(L)l) < 0, |λ(L)l|α cos(θl) |λ(L)1|α . Thus, the frequency with the most negative real part is either given by λK(W) fα (λ1(L)) or λ1(W) fα (λN(L)). The remainder of the proof is analogous to the proof of Theorem D.7. In the following, we provide the complete statement and proof for the claims made in Theorem 5.3 when the underlying ODE is the Schrödinger equation as presented in (3). Theorem D.8. Let G be a undirected graph with SNA L. Consider the initial value problem in (3) with channel mixing matrix W CK K and α R. Suppose that W has at least one eigenvalue with non-zero imaginary part and sort the eigenvalues of W in ascending order with respect to their imaginary part. Then, for almost initial values x0 CN K, the following is satisfied. (α > 0) Solutions of (3) are HFD if ℑ(λK(W)) fα (λ1(L)) < ℑ(λ1(W)) , and LFD otherwise. (α < 0) Let λ+(L) and λ (L) be the smallest positive and biggest negative non-zero eigenvalue of L, respectively. Solutions of (3) are (1 λ (L))-FD if ℑ(λK(W)) fα (λ (L)) < ℑ(λ1(W)) fα (λ+(L)) . Otherwise, solutions of (3) are (1 λ+(L))-FD. Proof. The proof follows the same reasoning as the proof for the heat equation in Theorem D.5. 
The difference is that we now apply Lemma D.4 instead of Lemma D.3. Therefore, the dominating frequency is either λK(W) fα (λ1(L)) or λ1(W) fα (λN(L)) if α > 0, and λK(W) fα (λ (L)) or λ1(W) fα (λ+(L)) if α < 0. D.2 Frequency Dominance for Numerical Approximations of the Heat Equation For n N and h R, h > 0, the solution of (2) at time nh > 0 can be approximated with an explicit Euler scheme vec(x)(n h) = hk( W Lα)kvec(x0) , which can be further simplified via the binomial theorem as vec(x)(n h) = (I h (W Lα))n vec(x0) . (11) Hence, it holds the representation formula vec(x)(n h) = X r,l cr,l (1 h λr (W) fα (λl(L)))n ψr (W) ψl (L) . In this case, the dominating frequency maximizes |1 h λr (W) fα (λl(L))|. When h < W 1, the product h λr (W) fα (λl(L)) is guaranteed to be in [ 1, 1], and |1 h λr (W) fα (λl(L))| = 1 h λr (W) fα (λl(L)) [0, 2] . Therefore, the dominating frequency minimizes h λr (W) fα (λl(L)). This is the reasoning behind the next result. Proposition D.9. Let h R, h > 0. Consider the fractional heat equation (2) with α R. Let {x(n h)}n N be the trajectory of vectors derived by approximating (2) with an explicit Euler scheme with step size h. Suppose h < W 1. Then, for almost all initial values x0 2 , if λK(W) fα (λ1(L)) < λ1(W) , 0 , otherwise . Proof. Define (λa, λb) := arg max r,l {|1 hλr (W) fα (λl(L))| : r {1, . . . , K} , l {1, . . . , N}} . By the hypothesis on h, this is equivalent to (λa, λb) = arg min r,l {λr (W) fα (λl(L)) : r {1, . . . , K} , l {1, . . . , N}} . Therefore, (λa, λb) is either (λ1(W), λN(L)) or (λK(W), λ1(L)). Hence, vec(x)(n h) vec(x)(n h) 2 n ca,b ψa (W) ψb (L) ca,b ψa (W) ψb (L) 2 . If the condition λK(W) fα (λ1(L)) < λ1(W) is satisfied, we have b = 1. Then by (10), the normalized vec(x) converges to the eigenvector of I I L corresponding to the largest frequency 1 λ1(L) = λN(I L). An application of Lemma D.2 finishes the proof. If λK(W) fα (λ1(L)) < λ1(W) is not satisfied, we have b = N, and the other direction follows with the same argument. Similarly to Proposition D.9 one can prove the following results for negative fractions. Proposition D.10. Let h R, h > 0. Consider the fractional heat equation (2) with α < 0. Let {x(n h)}n N be the trajectory of vectors derived by approximating the solution of (2) with an explicit Euler scheme with step size h. Suppose that h < W 1. The approximated solution is (1 λ (L))-FD if λ1(W) fα (λ+(L)) < λK(W) fα (λ (L)) , and (1 λ+(L))-FD otherwise. Proof. The proof follows the same reasoning as the proof of Proposition D.9 by realizing that the dominating frequencies (λa, λb) are either given by (λ1(W), λ+(L)) or (λK(W), λ (L)). D.3 Frequency Dominance for Numerical Approximations of the Schrödinger Equation For n N and h R, h > 0, the solution of (3) at time n h > 0 can be approximated with an explicit Euler scheme as well. Similarly to the previous section, we can write vec(x)(n h) = (I + i h (W Lα))n vec(x0) . and vec(x)(n h) = X r,l cr,l (1 + i h λr (W) fα (λl(L)))n ψr (W) ψl (L) . The dominating frequency will be discussed in the following theorem. Proposition D.11. Let h R, h > 0. Let {x(n h)}n N be the trajectory of vectors derived by approximating (3) with an explicit Euler scheme with sufficiently small step size h. Sort the eigenvalues of W in ascending order with respect to their imaginary part. Then, for almost all initial values x0 2 , if fα (λ1(L)) ℑ(λK(W)) < fα (λN(L)) ℑ(λ1(W)) 0 , otherwise. Proof. Define (λa, λb) := arg max r,l {|1 + i hλr (W) fα (λl(L))| : r {1, . . . , K} , l {1, . . . , N}} . 
By definition of a and b, for all r and l it holds |1 + i h λa (W) fα (λb(L))| > |1 + i h λr (W) fα (λl(L))| . (12) Hence, vec(x)(t) vec(x)(t) 2 t ca,bψa (W) ψb (L) ca,bψa (W) ψb (L) 2 . We continue by determining the indices a and b. To do so, we note that (12) is equivalent to fα (λl (L)) ℑ(λr (W)) fα (λb (L)) ℑ(λa (W)) fα (λl (L))2 |λr (W)|2 fα (λb(L))2 |λa (W)|2 for all r, l. Denote by ε the gap 0 < ε := min (r,l) =(a,b) {fα (λl (L)) ℑ(λr (W)) fα (λb (L)) ℑ(λa (W))} . Noting that fα (λl (L))2 |λr (W)|2 fα (λb(L))2 |λa (W)|2 h W 2 L 2α = h W 2 , fα (λl (L))2 |λr (W)|2 fα (λb(L))2 |λa (W)|2 < ε one gets that (12) is satisfied for h < ε W 2. Therefore, for sufficiently small h, the dominating frequencies are the ones with minimal imaginary part, i.e., either fα (λ1(L)) ℑ(λK(W)) or fα (λN(L)) ℑ(λ1(W)). If fα (λ1(L)) ℑ(λK(W)) < fα (λN(L)) ℑ(λ1(W)), then b = 1, and the normalized vec (x) converges to the eigenvector corresponding to the smallest frequency λ1(L). By (10), this is also the eigenvector of I I L corresponding to the largest frequency 1 λ1(L) = λN(I L). An application of Lemma D.2 finishes the proof. Finally, we present a similar result for negative powers. Proposition D.12. Let h R, h > 0. Consider the fractional Schrödinger equation (3) with α < 0. Let {x(n h)}n N be the trajectory of vectors derived by approximating the solution of (3) with an explicit Euler scheme with step size h. Suppose that h is sufficiently small. Sort the eigenvalues of W in ascending order with respect to their imaginary part. The approximated solution is (1 λ+(L))-FD if λ1(W) fα (λ+(L)) < λK(W) fα (λ (L)) , and (1 λ (L))-FD otherwise. Proof. Similar to Proposition D.11, we can prove the statement by realizing that the dominating frequencies (λa, λb) in (12) are either given by (λ1(W), λ+(L)) or (λK(W), λ (L)). E Appendix for Section 5.2 We begin this section by describing the solution of general linear matrix ODEs of the form (6) in terms of the Jordan decomposition of M. This is required when M is not diagonalizable. For instance, the SNA of a directed graph is not in general a symmetric matrix, hence, not guaranteed to be diagonalizable. We then proceed in Appendix E.1 with the proof of Theorem 5.6. For a given matrix M CN N, the Jordan normal form is given by M = PJP 1, where P CN N is an invertible matrix whose columns are the generalized eigenvectors of M, and J CN N is a block-diagonal matrix with Jordan blocks along its diagonal. Denote with λ1, . . . , λm the eigenvalues of M and with J1, . . . , Jm the corresponding Jordan blocks. Let kl be the algebraic multiplicity of the eigenvalue λl, and denote with ψi l(M) i {1,...,kl} the generalized eigenvectors of the Jordan block Jl. We begin by giving the following well-known result, which fully characterizes the frequencies for the solution of a linear matrix ODE. Lemma E.1. Let M = PJP 1 CN N be the Jordan normal form of M. Let x : [0, T] Rn be a solution to x (t) = Mx(t) , x(0) = x0. Then, x is given by l=1 exp (λl(M)t) (i j)!ψj l (M), i=1 ci l Pei l , and ei l : i {1, . . . kl} , l {1, . . . , m} is the standard basis satisfying Pei l = ψi l(M). Proof. By (Perko, 2001, Section 1.8), the solution can be written as exp (M t) x0 = P exp (J t) P 1 m X i=1 ci l Pei l = P exp (J t) i=1 ci lei l where exp (J t) = diag ({exp (Jl t)}m l=1) and exp (Jl t) = exp (λl(M) t) 1 t t2 2! tkl (kl 1)! 1 ... t2 2! ... t 1 Since exp (J t) = Lm l=1 exp (Jl t), we can focus on a single Jordan block. Fix l {1, . . . 
, m}, it holds P exp (Jl t) i=1 ci lei l = P exp (λl(M) t) c1 l e1 l + c2 l t e1 l + e2 l + c3 l 2!e1 l + t e2 l + e3 l = exp (λl(M) t) c1 l ψ1 l (M) + c2 l t ψ1 l (M) + ψ2 l (M) 2!ψ1 l (M) + t ψ2 l (M) + ψ3 l (M) + . . . = exp (λl(M) t) (i j)!ψj l (M) . Bringing the direct sums together, we get exp (M t) x0 = l=1 exp (λl(M) t) (i j)!ψj l (M) , from which the thesis follows. In the following, we derive a formula for the solution of ODEs of the form x (t) = Mx(t)W , x(0) = x0 , (13) for a diagonal matrix W CK K and a general square matrix M CN N with Jordan normal form PJP 1. By vectorizing, we obtain the equivalent linear system vec(x) (t) = W M vec(x)(t) , vec(x)(0) = vec(x0) . (14) Then, by properties of the Kronecker product, there holds W M = W (PJP 1) = (I P)(W J)(I P 1) = (I P)(W J)(I P) 1. Note that (I P)(W J)(I P) 1 is not the Jordan normal form of D M. However, we can characterize the Jordan form of W M as follows. Lemma E.2. The Jordan decomposition of W J is given by W J = P J P 1 where J is a block diagonal matrix with blocks wjλl(J) 1 wjλl(J) 1 ... wjλl(J) 1 wjλl(J) and P is a diagonal matrix obtained by concatenating Pj,l = diag w n+1 j kl n=1 Proof. As J = Lm l=1 Jl, we can focus on a single Jordan block. Fix l {1, . . . , m}. We have W Jl = diag {wj Jl}K j=1 = j=1 wj Jl , hence, we can focus once again on a single block. Fix j {1, . . . , K}; the Jordan decomposition of wj Jl is given by Pl = diag w n+1 j kl n=1 wjλl(J) 1 wjλl(J) 1 ... ... wjλl(J) 1 wjλl(J) To verify it, compute the (n, m) element Pl Jl P 1 l Since Pl is a diagonal matrix, the only non-null entries are on the diagonal; therefore, i = n and k = m = Pl m,m and the only non-null entries of Jl are when m = n or m = n + 1, hence n,n = wjλl (J) , m = n , Pl n+1,n+1 = w n+1 j wn j = wj , m = n + 1 . The thesis follows from assembling the direct sums back. Lemma E.2 leads to the following result that fully characterizes the solution of (14) in terms of the generalized eigenvectors and eigenvalues of M and W. Proposition E.3. Consider (14) with M = PJP 1 and W J = P J P 1, where J and P are given in Lemma E.2. The solution of (14) is vec(x)(t) = l2=1 exp (λl1(W)λl2(M)t) i=1 ci l1,l2 (i j)! (λl1(W))1 j el1 ψj l2(M) , where the coefficients ci l1,l2 are given by i=1 ci l1,l2(I P) Pel1 ei l2 where ei l2 : l2 {1, . . . , m} , i {1, . . . , kl2} is the standard basis satisfying Pei l2 = ψi l2(M). Proof. By Lemma E.2, the eigenvalues of W M and the corresponding eigenvectors and generalized eigenvectors are λl1(W)λl2(M) , el1 ψ1 l2(M) , (λl1(W)) i+1el1 ψi l2(M) for l1 {1, . . . , K}, l2 {1, . . . , m} and i {2, . . . , kl}. Hence, by Lemma E.1, the solution of (14) is given by vec(x)(t) = l2=1 exp (λl2(M)λl1(W)t) i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)) , where the coefficients ci l1,l2 are given by i=1 ci l1,l2(I P) P(el1 ei l2(M)) . E.1 Proof of Theorem 5.6 In the following, we reformulate and prove Theorem 5.6. Corollary E.4. Let G be a strongly connected directed graph with SNA L RN N. Consider the initial value problem in (2) with diagonal channel mixing matrix W RK K and α = 1. Then, for almost all initial values x0 RN K, the solution to (2) is HFD if λK(W)ℜλ1(L) < λ1(W)λN(L) and λ1(L) is the unique eigenvalue that minimizes the real part among all eigenvalues of L. Otherwise, the solution is LFD. Proof. Using the notation from Proposition E.3 and its proof, we can write the solution of the vectorized form of (2) as vec(x)(t) = l2=1 exp ( λl1(W)λl2(L)t) i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)). 
As done extensively, we separate the terms corresponding to the frequency with minimal real part. This frequency dominates as the exponential converges faster than polynomials for t going to infinity. Consider the case λK(W)ℜ(λ1(L)) < λ1(W)ℜ(λN(L)). As λ1(L) is unique, the product λK(W)ℜ(λ1(L)) is the unique most negative frequency. Assume without loss of generality that λK(W) has multiplicity one. The argument does not change for higher multiplicities as the corresponding eigenvectors are orthogonal since W is diagonal. Then, λK(W)λ1(L) has multiplicity one, and we calculate vec(x)(t) as l2=1 exp ( λl1(W)λl2(L)t) i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)) = ck1 K,1 exp ( tλK(W)λ1(L)) tk1 1 (k1 1)!(e K ψ1 1(L)) + ck1 K,1 exp ( tλK(W)λ1(L)) (k1 j)!(λK(W))1 j(e K ψj 1(L)) + exp ( tλK(W)λ1(L)) (i j)!(λK(W))1 j(e K ψj 1(L)) l2=2 exp ( λl1(W)λl2(L)t) i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)) = exp ( tλK(W)λ1(L)) tk1 1 ck1 K,1 1 (k1 1)!(e K ψ1 1(L)) (k1 j)!(λK(W))1 j(e K ψj 1(L)) 1 (i j)!ti j k1+1(λK(W))1 j(e K ψj 1(L)) l2=2 exp ( t(λl1(W)λl2(L) λK(W)λ1(L))) i=1 ci l1,l2 (i j)! (λl1(W))1 j(el1 ψj l2(L)) We can then write the normalized solution as ck1 K,1 (k1 1)!(e K ψ1 1(L)) + ck1 K,1 (k1 j)!(λK(W))1 j(e K ψj 1(L)) (i j)! (λK(W))1 j(e K ψj 1(L)) l2=2 e t(λK(W)λl2(L) λK(W)λ1(L)) kl2 X i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)) ck1 K,1 (k1 1)! e K ψ1 1(L) + ck1 K,1 (k1 j)!(λK(W))1 j(e K ψj 1(L)) (i j)! (λl1(W))1 j(e K ψj 1(L)) l2=2 exp ( t(λl1(W)λl2(L) λK(W)λ1(L))) i=1 ci l1,l2 (i j)!(λl1(W))1 j(el1 ψj l2(L)) All summands, except the first, converge to zero for t going to infinity. Hence, vec(x)(t) vec(x)(t) 2 ck1 K,1 (k1 1)!(e K ψ1 1(L)) ck1 K,1 (k1 1)!(e K ψ1 1(L)) We apply Lemma D.2 to finish the proof for the HFD case. Note that ψ1 1(L) is an eigenvector corresponding to λ1(L). The LFD case is equivalent. By Perron-Frobenius for irreducible nonnegative matrices, there is no other eigenvalue with the same real part as 1 λN(L) = λ1(I L). Remark E.5. If the hypotheses are met, the convergence result also holds for Lα. With the same reasoning, we can prove that the normalized solution converges to the eigenvector corresponding to the eigenvalue of Lα with minimal real part. It suffices to consider the eigenvalues and generalized eigenvectors of Lα. However, we do not know the relationship between the singular values of Lα, where we defined the fractional Laplacian, and the eigenvalues of L. Hence, it is much more challenging to draw conclusions on the Dirichlet energy. E.2 Explicit Euler In this subsection, we show that the convergence properties of the Dirichlet energy from Theorem 5.6 are also satisfied when (2) is approximated via an explicit Euler scheme. As noted in (11), the vectorized solution to (2) can be written as vec(x)(n h) = (I h (W L))n vec(x0) , when α = 1. We thus aim to analyze the Jordan decomposition of Ln for L Cn n and n N. Let L = PJP 1, where J is the Jordan form, and P is a invertible matrix of generalized eigenvectors. Consider a Jordan block Ji associated with the eigenvalue λi(M). For a positive integer n, the n-th power of the Jordan block can be computed as: Jn l = λl(L)n 1 n 1 λl(L) 1 n 2 λl(L) 2 n kl 1 λl(L) kl+1 1 n 1 λl(L) 1 n kl 2 λl(L) kl+2 1 ... ... n 1 λl(L) 1 We compute the n-th power of L as Ln = (PJP 1)n = PJn P 1, and we expand x0 as i=1 ci l Pei l , where ei l : i {1, . . . kl} , l {1, . . . , m} is the standard basis and Pei l = ψi l(L) are the generalized eigenvectors of L. 
It is easy to see that Lnx0 = PJn P 1 m X i=1 ci l Pei l i=1 ci lei l As Jn = Lm l=1 Jn l , we can focus on a single Jordan block. Fix l {1, . . . , m}, and compute i=1 ci lei l = P λl(M)nc1 l e1 l + P n 1 λl(M)n 1c1 l e1 l + λl(M)nc2 l e2 l λl(M)n 2c1 l e1 l + n 1 λl(L)n 1c2 l e2 l + λl(M)nc3 l e3 l We can summarize our findings in the following lemma. Lemma E.6. For any L = PJP 1 RN N and x0 = m P i=1 ci lψi l (L), we have min{kl,n 1} X λl(L)n i+jcj l ψj l (L) . We proceed with the main result of this subsection. Proposition E.7. Let G be a strongly connected directed graph with SNA L RN N. Consider the initial value problem in (2) with diagonal channel mixing matrix W RK K and α = 1. Approximate the solution to (2) with an explicit Euler scheme with a sufficiently small step size h. Then, for almost all initial values x0 CN K the following holds. If λ1(L) is unique and λK(W)ℜλ1(L) < λ1(W)ℜλN(L), (15) the approximated solution is HFD. Otherwise, the solution is LFD. Proof. As noted in (11), the vectorized solution to (2) with α = 1, can be written as vec(x)(n h) = (I h (W L))n vec(x0). Consider the Jordan decomposition of L = PJP 1 and the Jordan decomposition of W J = P J P 1, where J and P are specified in Lemma E.2. Then, vec(x)(n h) = I + h W (PJP 1) n vec(x0) = (I P)(I h W J)n(I P) 1vec(x0) = (I P)(I h P J P 1)n(I P) 1vec(x0) = (I P) P(I h J)n P 1(I P) 1vec(x0) = (I P) P(I h J)n((I P) P) 1vec(x0). Now, decompose x0 into the basis of generalized eigenvectors, i.e., i=1 ci l((I P) P)(el1 ei l2(L)). Then, by Lemma E.6, we have vec(x)(n h) = min kl2,t 1 X (1 hλl1(W)λl2(L))n i+j cj l1,l2 (λl1(W))1 j ψl1(W) ψj l2(L). Now, consider the maximal frequency, i.e., L1, L2 = arg max l1,l2 {|1 hλl1(W)λl2(L)|} . Then, the solution vec(x)(n h) can be written as min kl2,n 1 X (1 hλl1(W)λl2(L))n i+j cj l1,l2ψl1(W) ψj l2(L) = (1 hλL1(W)λL2(L))n min kl2,t 1 X (1 hλl1(W)λl2(L))n i+j (1 hλL1(W)λL2(L))n cj l1,l2ψl1(W) ψj l2(L). With a similar argument as in the proof of Theorem 5.6, we can then see that vec(x)(n h) vec(x)(n h) 2 t c1 L1,L2 ψL1(W) ψ1 L2(L) c1 L1,L2 ψL1(W) ψ1 L2(L) 2 , where ψ1 L2(L) is the eigenvector corresponding to λL2(L). Note that for almost all x, we have c1 L1,L2 = 0. Then ψ1 L2(L) is also an eigenvector of I L corresponding to the eigenvalue 1 λL2(L). By Lemma D.2, we have that the approximated solution is (1 λL2(L))-FD. We finish the proof by showing that L2 = 1 if (15) is satisfied, and L2 = N otherwise. First, note that either λK(W)ℜλ1(L) or λ1(W)ℜλN(L) are the most negative real parts among all {λl(W)ℜλr(L)}l {1,...,K},r {1...,N}. Assume first that λK(W)ℜλ1(L) has the most negative real part, i.e., (15) holds. Then, define ε := max l,r |λK(W)ℜλ1(L) λl(W)ℜλr(L)| , and assume h < ε W 2. Now it is easy to see that 2λK(W)ℜλ1(L) hλK(W)2|λ1(L)|2 < 2λl(W)ℜλr(L) hλl(W)2|λr(L)|2, which is equivalent to (K, 1) = (L1, L2). Hence, the dynamics are (1 λ1(L))-FD. As (1 λ1(L)) is highest frequency of I L, we get HFD dynamics. Similary, we can show that if λ1(W)ℜλN(L) is the most negative frequency, we get LFD dynamics. Note that for the HFD argument, we must assume that λ1(L) is the unique eigenvalue with the smallest real part. For the LFD argument, it is already given that λN(L) has multiplicity one by Perron-Frobenius Theorem. E.3 GCN oversmooths Proposition E.8. Let G be a strongly connected and aperiodic directed graph with SNA L RN N. A GCN with the update rule xt+1 = Lxt W, where x0 RN K are the input node features, always oversmooths. Proof. 
The proof follows similarly to the proof of Proposition E.7. The difference is that instead of (16), we can write the node features after t layers as min kl2,t 1 X (λl1(W)λl2(L))t i+j cj l1,l2ψl1(W) ψj l2(L). Now note that by Perron-Frobenius the eigenvalue λN(L) with the largest absolute value is real and has multiplicity one. Then, maxl1,l2 |λl1(W)λl2(L)| is attained at either λ1(W)λN(L) or λK(W)λN(L). Equivalently to the proof of Proposition E.7, we can show that the corresponding GCN is 1 λN(L)-FD. Now 1 λN(L) = λ1(I L), and λ1(I L)-FD corresponds to LFD, hence the GCN oversmooths. F Appendix for the Cycle Graph Example Consider the cycle graph with N nodes numbered from 0 to N 1. Since each node has degree 2, the SNA L = A/2 is a circulant matrix produced by the vector v = (e1 + e N 1) /2. Denote ω = exp (2πi/N), the eigenvectors can be computed as 1, ωj, ω2j, . . . , ω(N 1)j asociated to the eigenvalue λj = cos(2πj/N). First, we can note that λj = λN j for all j {1, . . . , N/2}; therefore, the multiplicity of each eigenvalue is 2 except λ0 and, if N is even, λN/2. Since the original matrix is symmetric, there exists a basis of real eigenvectors. A simple calculation Lℜvj + i Lℑvj = Lvj = λjvj = λjℜvj + iλjℑvj shows that ℜvj and ℑvj, defined as n=0 , ℑvj = 1 n=0 are two eigenvectors of the same eigenvalue λj. To show that they are linearly independent, we compute under which conditions 0 = aℜvj + bℑvj . We note that the previous condition implies that for all n / {0, N/2} 0 = a cos 2πj n + b sin 2πj n a2 + b2 sin 2πj n N + arctan b Suppose a, b = 0, then it must be N + arctan b which is equivalent to The left-hand side is always an integer, while the right-hand side is an integer if and only if b = 0. This reduces the conditions to a cos 2πj n |a| sin 2πj n which is true if and only if a = 0. Consider now an even number of nodes N; the eigenspace of λN/2 = 1 is N (( 1)n)N 1 n=0 hence, the maximal eigenvector of I L guarantees homophily 0. Consider now a number of nodes N divisible by 4; the eigenspace of λN/4 = 0 has basis n=0 , ℑv N/4 = 1 Their sum is then equivalent to ℜv N/4 + ℑv N/4 = 1 4 (2n + 1) N 1 N (1, 1, 1, 1, . . .) hence, the mid eigenvector of L guarantees homophily 1/2. A visual explanation is shown in Figure 4.
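As a numerical check of the cycle-graph discussion above, the sketch below builds the circulant SNA of a cycle (with $N$ divisible by 4), confirms the eigenvalues $\cos(2\pi j/N)$, and evaluates the homophily of the sign patterns of the eigenvectors associated with $\lambda_{N/2}$ and $\lambda_{N/4}$. The homophily measure used here (the fraction of edges joining equally signed entries) is our reading of the statement, added purely for illustration.

```python
import numpy as np

def cycle_sna(n):
    """SNA of the undirected cycle on n nodes: every node has degree 2, so L = A / 2."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[idx, (idx - 1) % n] = 1.0
    return adj / 2.0

def edge_homophily(adj, labels):
    """Fraction of undirected edges whose endpoints carry the same label."""
    src, dst = np.nonzero(np.triu(adj))
    return np.mean(labels[src] == labels[dst])

if __name__ == "__main__":
    n = 16                                   # divisible by 4
    L = cycle_sna(n)
    adj = 2.0 * L                            # recover the 0/1 adjacency

    eigvals = np.sort(np.linalg.eigvalsh(L))
    expected = np.sort(np.cos(2 * np.pi * np.arange(n) / n))
    print("max |eigenvalue error|:", np.abs(eigvals - expected).max())

    # lambda_{N/2} = -1: alternating eigenvector (-1)^k  ->  homophily 0.
    v_half = (-1.0) ** np.arange(n)
    print("homophily for v_{N/2}:", edge_homophily(adj, np.sign(v_half)))

    # lambda_{N/4} = 0: Re(v) + Im(v) has sign pattern (+, +, -, -, ...)  ->  homophily 1/2.
    v_quarter = np.cos(np.pi * np.arange(n) / 2) + np.sin(np.pi * np.arange(n) / 2)
    print("homophily for Re(v_{N/4}) + Im(v_{N/4}):", edge_homophily(adj, np.sign(v_quarter)))
```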