# spectroriemannian_graph_neural_networks__4d2dede6.pdf Published as a conference paper at ICLR 2025 SPECTRO-RIEMANNIAN GRAPH NEURAL NETWORKS Karish Grover1, , Haiyang Yu2, Xiang Song3, Qi Zhu3, Han Xie3, Vassilis N. Ioannidis3, Christos Faloutsos1 1Carnegie Mellon University, 2Texas A&M University, 3Amazon {karishg,christos}@cs.cmu.edu, {haiyang}@tamu.edu, {xiangsx,qzhuamzn,hanxie,ivasilei}@amazon.com Can integrating spectral and curvature signals unlock new potential in graph representation learning? Non-Euclidean geometries, particularly Riemannian manifolds such as hyperbolic (negative curvature) and spherical (positive curvature), offer powerful inductive biases for embedding complex graph structures like scale-free, hierarchical, and cyclic patterns. Meanwhile, spectral filtering excels at processing signal variations across graphs, making it effective in homophilic and heterophilic settings. Leveraging both can significantly enhance the learned representations. To this end, we propose Spectro-Riemannian Graph Neural Networks (CUSP) the first graph representation learning paradigm that unifies both CUrvature (geometric) and SPectral insights. CUSP is a mixed-curvature spectral GNN that learns spectral filters to optimize node embeddings in products of constant-curvature manifolds (hyperbolic, spherical, and Euclidean). Specifically, CUSP introduces three novel components: (a) Cusp Laplacian, an extension of the traditional graph Laplacian based on Ollivier-Ricci curvature, designed to capture the curvature signals better; (b) Cusp Filtering, which employs multiple Riemannian graph filters to obtain cues from various bands in the eigenspectrum; and (c) Cusp Pooling, a hierarchical attention mechanism combined with a curvaturebased positional encoding to assess the relative importance of differently curved substructures in our graph. Empirical evaluation across eight homophilic and heterophilic datasets demonstrates the superiority of CUSP in node classification and link prediction tasks, with a gain of up to 5.3% over state-of-the-art models. The code is available at: https://github.com/amazon-science/cusp. 1 INTRODUCTION Graph representation learning has garnered significant research interest in recent years, owing to its fundamental relevance in domains such as natural language processing (Mihalcea & Radev, 2011), biology (Zhang et al., 2021a), and social network analysis (Grover et al., 2022). Recent advances have rigorously examined constant curvature spaces to learn distortion-free graph representations, as they provide suitable inductive biases for particular structures while avoiding the intrinsic problems of very high dimensionality. For example, hyperbolic space (negative curvature) is optimal for hierarchical tree-like graphs (Chami et al., 2019), while spherical geometry (positive curvature) is best suited for cyclic graphs (Gu et al., 2019).The idea behind all Riemannian GNNs is that optimal embeddings are achieved when the underlying manifold s curvature aligns with the graphs discrete curvature. To model real-world graphs with complex topologies that cannot be adequately represented within a single constant-curvature manifold, various mixed-curvature GNNs have been proposed, that operate across multiple manifolds (Zhu et al., 2020). Despite their success in managing complex graph topologies, existing mixed-curvature GNNs still have significant limitations (Figure 1 (b)). (a) L1 (Low-pass filtering bias): State-of-the-art mixedcurvature Riemannian GNNs, such as κGCN (Bachmann et al., 2020) and QGCN (Xiong et al., 2022), inherently mimic low-pass spectral filters. These models are adaptations of the Euclidean GCN (Kipf & Welling, 2016) to Riemannian manifolds, which are known to operate under the homophily assumption (low-pass filters) predominantly. Consequently, their performance degrades on The work was done during Karish Grover s internship at Amazon, US. Published as a conference paper at ICLR 2025 Figure 1: Motivation behind CUSP. (a) Diversity of curvatures ( ve to +ve) and frequencies in homophilic (Cora) and heterophilic (Texas, Chameleon) real-world graphs (Section 5). (b) Spectral GNNs overlook the curvature, whereas Riemannian GNNs are restricted by a cut-off frequency. CUSP aims for the best of both a rich geometric representation + full access to the spectrum. graphs with varying degrees of heterophily. (b) L2 (Task-specific relevance of curvature): Different datasets and tasks may emphasize more on different curvatures within a graph (Figure 1 (a)). For example, the task of community detection focuses on clustered nodes typically associated with a positive curvature (Tian et al., 2023). In contrast, fake news detection on social media graphs prioritizes tree-like cascade structures characterized by a negative curvature (Grover et al., 2022). Current models do not adapt curvatures to tasks. (c) L3 (Lack of geometry-equivariance in spectral GNNs): Spectral GNNs like Bern Net (He et al., 2021) and GPRGNN (Chien et al., 2020) incorporate flexible graph filters to learn representations based on different parts of the eigenspectrum but assume a flat Euclidean manifold, ignoring the underlying geometry. As the geometry of the graph changes, these filters should adapt to reflect the changing geometric properties, which is currently not the case. To bridge these gaps, we introduce CUSP, a mixed-curvature spectral GNN that operates on a product manifold composed of multiple constant-curvature spaces and computes geometrically and spectrally parameterized graph filters. We begin by introducing the (a) Cusp Laplacian, a curvatureaware Laplacian operator for graphs, inspired by the equation of heat flow (Thanou et al., 2017) and the discrete Ollivier-Ricci curvature (Ollivier, 2007). At the core of our approach is (b) Cusp Filtering, where we propose a filter bank consisting of multiple mixed-curvature graph filters to ensure that our GNN captures information from multiple bands in the eigenspectrum. Using the Cusp Laplacian, we extend the generalized Page Rank (GPR)-based spectral GNN (Chien et al., 2020) to mixed-curvature spaces by incorporating operations based on the κ-stereographic model of Riemannian geometry (Bachmann et al., 2020). We chose GPRGNN as the spectral backbone of CUSP because of its ability to capture node features and topological graph signals simultaneously. Lastly, we introduce the (c) Cusp Pooling mechanism, complemented by functional curvature embeddings based on Bochner s Theorem (Xu et al., 2020), to weigh differently curved substructures, enhancing its ability to model real-world graphs with diverse geometric and spectral properties. We perform an extensive empirical evaluation of CUSP for Node Classification (NC) and Link Prediction (LP) tasks, on eight real-world datasets. CUSP records state-of-the-art performance, with a gain of up to 5.3%. We summarize our main contributions as follows: To the best of our knowledge, this is the first attempt towards a graph learning paradigm that seamlessly integrates both geometry and spectral cues. We introduce a curvature-aware Cusp Laplacian operator, design a mixed-curvature spectral graph filtering framework, Cusp Filtering, and propose a curvature embedding method using classical harmonic analysis and a hierarchical attention mechanism called Cusp Pooling. We conduct extensive experimentation on eight real-world benchmarking datasets, featuring homophilic and heterophilic graphs, for node classification and link prediction tasks. Published as a conference paper at ICLR 2025 2 RELATED WORKS Riemannian Geometry in Graph Neural Networks. Graph Neural Networks (GNNs) have set new benchmarks for tasks like node classification and link prediction. Recently, non-Euclidean (Riemannian) spaces particularly hyperbolic (Sala et al., 2018) and spherical (Liu et al., 2017; Wilson et al., 2014) geometries have garnered attention for their ability to produce less distorted representations, aligning well with hierarchical and cyclic data structures, respectively. Several approaches have emerged in this context. (a) Single Manifold GNNs: GNNs such as HGAT (Zhang et al., 2021b), HGCN (Chami et al., 2019), and HVAE (Sun et al., 2021) have demonstrated stateof-the-art performance on tree-like or hierarchical graphs by learning representations in hyperbolic space. (b) Mixed-Curvature GNNs: To model more complex topologies (for example, a tree branching from a cyclic graph), mixed-curvature GNNs have been proposed. Gu et al. (2019) pioneered this direction by embedding graphs in a product manifold combining spherical, hyperbolic, and Euclidean spaces. Building on this, models like κ-GCN (Bachmann et al., 2020) and Q-GCN (Xiong et al., 2022) extended the GCN architecture (Kipf & Welling, 2016) to constant-curvature spaces using the κ-stereographic model and pseudo-Riemannian manifolds, respectively. More recently, Sun et al. (2022) proposed a mixed-curvature GNN for self-supervised learning, while FPS-T (Cho et al., 2023) generalized the Graph Transformer (Min et al., 2022) to operate across multiple manifolds. Spectral Graph Neural Networks. Spectral GNNs employ spectral graph filters (Liao et al., 2024) to process graph data. These models either use fixed filters, as seen in APPNP (Gasteiger et al., 2018) and GNN-LF/HF (Zhu et al., 2021), or learnable filters, as demonstrated by Cheby Net (Defferrard et al., 2016) and GPRGNN (Chien et al., 2020), which approximate polynomial filters using Chebyshev polynomials and generalized Page Rank, respectively. Bern Net (He et al., 2021) expresses filtering operations through Bernstein polynomials. However, many of these methods focus primarily on low-frequency components of the eigenspectrum, potentially overlooking important information from other frequency bands particularly in heterophilic graphs. Models like GPRGNN and Bern Net address this by exploring the entire spectrum, performing well across both homophilic and heterophilic graphs. GPRGNN stands out among them because it can express several polynomial filters and incorporate node features and topological information. Despite these advances, current mixed-curvature and spectral GNNs face significant limitations (L1, L2, L3) that constrain their performance. To the best of our knowledge, this work is the first to unify geometric and spectral information within a single model. Before presenting the architecture of CUSP, we introduce some key preliminary concepts in the following section. 3 PRELIMINARIES We study graphs G = (V, E, A), where V is a finite set of |V| = n vertices, E is a set of edges and A Rn n is a weighted graph adjacency matrix. The nodes are associated with the node feature matrix F Rn df (df is the feature node dimension). A graph signal f : V R on the nodes of the graph may be regarded as a vector f Rn where fi is the value of f at the ith node. An essential operator in spectral graph analysis is the graph Laplacian L = D A Rn n where D Rn n is the diagonal degree matrix with Dii = P j Aij (Kipf & Welling, 2016). The normalized Laplacian is defined as Ln = I An where An = D 1/2AD 1/2 is the normalized adjacency matrix, and I Rn n is the identity matrix. As L is a real symmetric positive semidefinite matrix, it has a complete set of orthonormal eigenvectors U = {ul}n 1 l=0 Rn n, and their associated ordered real nonnegative eigenvalues {λl}n 1 l=0 Rn, identified as the frequencies of the graph. The Laplacian can be diagonalized as L = UΛU where Λ = diag( {λl}n 1 l=0 ) Rn n. Riemannian Geometry (Do Carmo & Flaherty Francis, 1992). A smooth manifold M generalizes the notion of surface to higher dimensions. Each point x M is associated with a tangent space Tx M, which is locally Euclidean. On tangent space Tx M, the Riemannian metric, gx( , ) : Tx M Tx M R, defines an inner product so that geometric notions (like distance, angle, etc.) can be induced. The pair (M, g) is called a Riemannian manifold. For x M, the exponential map at x, expx(v) : Tx M M, projects the vector v Tx M onto the manifold M, and the logarithmic map, logx(y) : M Tx M, projects the vector y M back to the tangent space Tx M. The Riemannian metric defines a curvature (κ) at each point on the manifold, indicating how the space is curved. There are three canonical types: positively curved Spherical (S) space (κ > 0), negatively curved Hyperbolic (H) space (κ < 0), and flat Euclidean (E) space (κ = 0). Published as a conference paper at ICLR 2025 Product Manifolds (Gu et al., 2019). Consider q constant-curvature manifolds {Mκi,di i }q i=1 with dimension di and curvature κi. Then, the product manifold is defined as the Cartesian product P = Mκ1,d1 1 Mκ2,d2 2 Mκq,dq q , with total dimension Pq i=1 di. Each Mi {H, S, E} is known as a component space, and the decomposition P = q i=1Mκi,di i is called the signature of P. κ Stereographic Model (Bachmann et al., 2020). In this work, we adopt the κ-Stereographic Model to define Riemannian algebraic operations across both positively and negatively curved spaces within a unified framework. This model eliminates the need for separate mathematical formulations for different geometries. In particular, Md κ is the stereographic sphere model for spherical manifold (κ > 0), while it is the Poincar e ball model (Ungar, 2001) for hyperbolic manifold (κ < 0). More mathematical details and intuitions have been discussed in Appendix 7.2.4. Ollivier-Ricci Curvature (ORC). Discrete data like graphs lack manifold structure (hence, no curvature). Several discrete analogs of manifold curvature have been defined, which satisfy properties similar to curvature. ORC (Ollivier, 2007) is a graph discrete analog of Ricci curvature (Tanno, 1988) and is defined by transport along an edge of the network, between neighborhoods of the vertex. In an unweighted graph, for a hyperparameter δ [0, 1], we endow each node s (x) neighbourhood (N(x)) with a probability measure, mδ x(z) := 1 δ |N (x)| z N(x), and mδ x(z) = δ when z = x, and analogously for mδ y(z). ORC for an edge (x, y) is then defined w.r.t. the Wasserstein-1 distance, W1 (Piccoli & Rossi, 2016), between these measures, i.e., eκ(x, y) := 1 W1(mδ x,mδ y) d G(x,y) . d G(x, y) is the shortest graph distance between nodes x and y. The Ollivier-Ricci curvature eκ(x) for a node x is defined as the average curvature of its adjacent edges i.e. eκ(x) = 1 |N (x)| P z N(x) eκ(x, z). Generalized Page Ranks (GPR). Generalized Page Rank methods originated in the context of unsupervised graph clustering, where they demonstrated notable improvements over the classical Personalized Page Rank (Kloumann et al., 2017; Li et al., 2019). The core idea behind GPRs is as follows: starting with a seed node s V (vertex set) within a graph cluster, an initial feature vector H(0) Rn 1 is set, where H(0) v = δvs (i.e., 1 for the seed node and 0 for all others). The GPR score is then defined as P k=0 γk e Ak n H(0) = P k=0 γk H(k), where γk R are the GPR weights that control the importance of higher-order neighbors. This iterative process propagates the feature information throughout the graph. Clustering is performed by locally thresholding the GPR scores. Refer to the Appendix for more details on ORC (7.2.2), product manifolds (7.2.3), and GPR (7.4). 4 PROPOSED METHOD: CUSP In this section, we present a comprehensive overview of the architecture of CUSP, as shown in Figure 2. We start by introducing and deriving the Cusp Laplacian (Section 4.1). Building on this, we propose Cusp Filtering, the core component of our approach, which is a GPR-based, mixed-curvature spectral graph filtering network (Section 4.2). Next, we introduce a curvature embedding technique grounded in classical harmonic analysis (Section 4.3), which acts as the positional encoding in the hierarchical attention-based Cusp Pooling mechanism (Section 4.4). Throughout this paper, we use κ to denote the continuous manifold curvature and eκ for the Ollivier-Ricci curvature. 4.1 CUSP LAPLACIAN Figure 2: Consider heat diffusion from y x. If eκ(x, y) < 0, there is a single path for the heat to diffuse from y x. When eκ(x, y) 0, heat can effectively diffuse through other paths from y x (dotted). To effectively incorporate geometric insights into spectral graph learning, we introduce the Cusp Laplacian, a curvature-aware Laplacian operator. We begin by examining the concept of heat flow on a graph (Weber, 2008). Suppose ψ describes a temperature distribution across a graph, where ψ(x) is the temperature at vertex x. According to Newton s law of cooling (He, 2024), the heat transferred from node x to node y is proportional to ψ(x) ψ(y) if nodes x and y are connected (if they are not Published as a conference paper at ICLR 2025 connected, no heat is transferred). Consequently, the heat diffusion equation on the graph can be expressed as dψ y Axy(ψ(x) ψ(y)), where β is a constant of proportionality and A denotes the adjacency matrix of the graph. Further insight can be gained by considering Fourier s law of thermal conductance (Liu, 1990), which states that heat flow is inversely proportional to the resistance to heat transfer. In this context, we leverage the Ollivier-Ricci curvature (ORC) to define the notion of resistance between nodes. Implicitly, ORC measures the transportation cost (W1(:, :)) between the neighborhoods of two nodes, reflecting the effort required to transport mass between these neighborhoods (Bauer et al., 2011). We interpret this transportation cost as the resistance between nodes. The vital takeaway here is that Heat flow between two nodes in a graph is influenced by the underlying Ollivier-Ricci curvature (ORC) distribution. The diffusion rate is faster on an edge with positive curvature (low resistance), and slower on an edge with negative curvature (high resistance). Intuitively, if the neighborhoods of two nodes overlap significantly (Figure 2(c)), the transportation cost between them is low (Rres xy = W1(mδ x,mδ y) d G(x,y) < 1), resulting in a positive curvature value for the edge connecting these nodes. In this scenario, messages (analogously, heat) can be transmitted efficiently between the neighborhoods. Conversely, if the neighborhoods have little overlap (Figure 2(a)), the transportation cost is high, leading to a negative curvature value (Rres xy > 1), and the edge acts as a bottleneck, impeding effective message-passing. As a result, the heat diffusion process is influenced by the underlying curvature (resistance). Definition 1. The Cusp Laplacian operator takes the form e Lψ(x) = X y x wxy (ψ(x) ψ(y)) = X 1 1 eκ(x,y) (ψ(x) ψ(y)) , (1) where x y denotes adjacency between nodes x and y, wxy = e 1 1 eκ(x,y) represents the curvaturebased weight between nodes x and y, and eκ(x, y) is the discrete Ollivier-Ricci curvature between x and y. The function ψ : V R is defined on the vertex set V of the graph. In the matrix form, we can write, e L = e D e A, where e D and e A are the degree and adjacency matrices based on wxy. Refer to Appendix 7.3 for the derivation of Definition 1. The curvature-based Laplacian operator allow us to incorporate geometric cues into the spectral perspective, which are otherwise not captured by the traditional graph Laplacian. 4.2 CUSP FILTERING Now, we will talk about how we create a filter bank (i.e. multiple graph filters) to fuse information from different parts of the eigenspectrum and learn node representations on a product manifold P. Product manifold construction. P can have multiple hyperbolic or spherical components with distinct learnable curvatures (parameters). This enables us to be representative of a wider range of curvatures. However, we only need one Euclidean space, since the Cartesian product of Euclidean space is Ed(e) = j i=1E d(i) i such that Pj i=1 d(i) = d(e). This is not the case with H or S (Eg. Torus i.e. S1 S1 is topologically distinct from Sphere i.e. S2). Thus, we can represent the product manifold signature, Pd M = Q q=1M κ(q),d(q) q = ( H h=1H κ(h),d(h) h ) ( S s=1S κ(s),d(s) s ) Ed(e) with total dimension d M = PH h=1 d(h) + PS s=1 d(s) + d(e). We use a simple combinatorial construction of the mixed-curvature space, induced by the cartesian product (Sun et al., 2022). In Section 5 and Appendix 7.2.3 we describe how we heuristically identify the signature of Pd M. Extending Page Rank GNN to P: We choose GPRGNN (Chien et al., 2020) as the spectral backbone of CUSP (See Appendix 7.4 for more background on GPRGNN), becasue it jointly optimizes node feature and topological information extraction, which is different from other spectral GNNs like Cheby Net (Defferrard et al., 2016). The GPR weights automatically adjust to the node label pattern (homophilic or heterophilic). Given the node feature matrix F Rn df and normalized symmetric Cusp adjacency matrix e An = e D 1 2 e A e D 1 2 , we first extract hidden state features for Published as a conference paper at ICLR 2025 Figure 3: Architecture of CUSP. The input graph (a) is used to construct the Cusp Laplacian (e L), based on the Ollivier-Ricci curvature (b), as illustrated in (c). The computed edge ORC (red), node ORC (Green), and curvature-based weights (blue) have been highlighted in the input graph. With e L, (d) Cusp Filtering introduces multiple graph filters to capture different parts of the eigenspectrum. Each node receives a curvature positional encoding Φ in (e) as part of the (f) Cusp Pooling mechanism, which computes the relative importance of different filters and manifold components. each node and then use GPR to propagate them. The process can be mathematically described as: M κ(q),d(q) q = exp κ(q) 0 (fθ(F)) ; H(l) M κ(q),d(q) q = e An κ(q) H(l 1) M κ(q),d(q) q (2) M κ(q),d(q) q = l {0,L} γl κ(q) H(l) M κ(q),d(q) q ; Z(L) Pd M = ||Q q=1Z(L) M κ(q),d(q) q (3) where fθ(.) : Rdf Rd M represents a neural network with parameter set {θ} that generates the hidden state features of dimension d M. Here, expκ 0 : Rd M Mκ,d M is the exponential map (Section 3) to transform the euclidean node features F to the component manifold M, and || is the attentional concatenation operator (discussed in Section 4.4). κ, κ and κ denote mobius addition, κ-right-matrix-multiplication and κ-left-matrix-multiplication respectively (Appendix 7.2.4). These operations generalize vector addition and multiplication on κ-stereographic model. The GPR weights γl are trained together with {θ} in an end-to-end fashion. The final mixed-curvature node embeddings after attention can be represented as Z(L) Pd M Pn d M, such that PQ q=1 d(q) = d M. Filter Bank. The GPR component of the network may be viewed as a polynomial graph filter (See Appendix 7.4). Let e An = UΛUT be the eigenvalue decomposition of e An. Then, the polynomial graph filter equals PL l=0 γl e Al n = Ugγ,L(Λ)UT , where gγ,L(Λ) is applied element-wise and gγ,L(λ) = PL l=0 γlλl. If one allows γl to be negative and learnt adaptively, the graph filter will pass relevant high frequencies. Consequently, CUSP performs exceptionally well on heterophilic graphs. Theorem 1 (Informal (Chien et al., 2020)). If γl 0 l {0, 1, ..., L}, PL l=0 γl = 1 and l > 0 such that γl > 0, then gγ,L( ) is a low-pass graph filter. Also, if γl = ( α)l, α (0, 1) and L is large enough, then gγ,L( ) is a high-pass graph filter. We encourage the reader to refer to Chien et al. (2020) for a detailed discussion on how different initializations of the GPR weights, as mentioned in the theorem above, assist in designing low-pass and high-pass filters. We present a proof of Theorem 1 in Appendix 7.4. Motivated by the limitation Published as a conference paper at ICLR 2025 L1, instead of a single filter, we construct a filter bank to focus on different parts of the eigenspectrum (to assist in both heterophilic and homophilic graphs). In an attempt to be representative of the higher order filters, the proposed filterbank is: ΩPd M = ZI Pd M, Z(1) Pd M, Z(2) Pd M, . . . , Z(L) Pd M . Here, ZI Pd M is the unfiltered case, where we pass the identity matrix I instead of e An for GPR propagation. 4.3 FUNCTIONAL CURVATURE ENCODING Recall our motivation that CUSP must be able to pay more attention to differently curved substructures in our model, depending on different tasks and datasets, while learning the final node representations. Specifically, our goal is to obtain a continuous functional mapping Φ : KP Pd C from curvature domain to the d C-dimensional product space to serve as positional encoding in our attention mechanism (Section 4.4). We approximate this in two steps, by considering the Ollivier Ricci (ORC) discretization of curvature on our graph and using the Bochner s theorem (Moeller et al., 2016) from classical harmonic analysis. First, we construct a translation-invariant Euclidean curvature encoding map, and then map it to the product manifold. Without loss of generality, we assume that the curvature domain can be represented by the interval: K = [ 1, 1], where 1 to 1 is the typical range1 of ORC in the observed data. Formally, we define the Curvature Kernel KR : K K R with KR(eκa, eκb) := ΦRd C (eκa), ΦRd C (eκb) and KR(eκa, eκb) = ΨR(eκa eκb), eκa, eκb K for some ΨR : [ 2, 2] R. The intuition behind choosing such a kernel for curvature values, lies in the fact that when comparing eκa and eκb we are only concerned with the relative difference between the two values. According to the Bochner s theroem (Moeller et al., 2016), a continuous, translation-invariant kernel K(x, y) = Ψ(x y) on Rd is positive definite if Ψ is the Fourier transform of a non-negative probability measure on R. Our kernel is translation-invariant, since KR(eκa+ec, eκb+ec) = ΨR((eκa+ec) (eκb+ec)) = KR(eκa, eκb) for any constant ec. The following theorems defines the Euclidean encoding and it s corresponding mapping to the product space. Definition 2. Let Rd C be the ambient Euclidean space for a node s curvature encoding. The functional curvature encoding ΦRd C : K Rd C for curvature eκ(x) K of node x, is defined as: ΦRd C (eκ(x)) = r h cos(ω1eκ(x)), sin(ω1eκ(x)), . . . , cos(ωd Ceκ(x)), sin(ωd Ceκ(x)) i , (4) where ω1, . . . , ωd C i.i.d p(ω) are sampled from a distribution p(ω). The corresponding mapping to the product manifold Pd C, is defined as: ΦPd C (eκ(x)) = gθ Q q=1exp κ(q) 0 (ΦRd C (eκ(x))) = gθ Q q=1Φ M κ(q),d(q) q (eκ(x)) , (5) where exp κ(q) 0 : Rd C M κ(q),d(q) q denotes the exponential map on the qth component manifold with curvature κ(q), || is the concatenation operator and gθ : Pd M Pd C is a Riemannian projector. It is easy to show that Φd C(eκa), Φd C(eκb) K(eκa, eκb). The unknown distribution p(ω) is estimated using the inverse cumulative distribution function (CDF) transformation as in Xu et al. (2020). Next, we prove the translation invariance of the curvature kernel in the product manifold setting. Theorem 2. The mixed-curvature kernel KP(eκa, eκb) := ΦPd C (eκa), ΦPd C (eκb) is translation invariant, i.e. KP(eκa, eκb) = ΨP(eκa eκb). Please refer to Appendix 7.5 for the detailed proofs (derivations) of Definition 2 and Theorem 2. We employ ΦPd C (eκ(x)) as the ORC-based positional encoding for each node x in the final pooling layer, enabling it to incorporate local curvature information and compute the task-specific relevance of substructures with varying curvatures within the graph. 4.4 CUSP POOLING The importance of constant-curvature component spaces depends on the downstream task. To this end, we propose a hierarchical attention mechanism called Cusp Pooling. We perform attentional 1ORC can theoretically vary from to 1, but in practice, its values usually lie within [ 1, 1] for the majority of graphs. For extreme scenarios, such as very sparse or highly clustered graphs, values may fall outside this range. In such instances, ORC is normalized to [ 1, 1]. Published as a conference paper at ICLR 2025 concatenation to fuse constant-curvature representations across component spaces so as to learn mixed-curvature representations in the product space, weighing them appropriately. Specifically, we lift component encodings to the common tangent space, where we can perform the usual euclidean operations, and compute their centorid by the mean pooling. Then, we model the importance of a component by the position of the embedding relative to the centorid, parameterized by θ. q=1 log κ(q) 0 Wq κ(q) Z(L) M κ(q),d(q) q (Centroid using linear transformation Wq) (6) τq = σ θ log κ(q) 0 Wq κ(q) Z(L) M κ(q),d(q) q µ(L) (Relative importance of Mq) (7) Finally, with the learnable attentional weights βq = eτq/ PQ q=0(eτq), we perform attentional concatenation. Further, as motivated earlier, for every node x, we fuse the curvature embedding as positional encoding: Z(L)(x) Pd M = Q βq κ(q) Z(L)(x) M κ(q),d(q) q ; ζ(L) (x) = Z(L)(x) Pd M ΦPd M(eκ(x)) (8) We weigh different filters in our filter bank to yield the final node representation ζ(x) for x as ζ(x) = PL l=1 ϵlζ(l) (x). Here, ζ Pn (d M+d C) contains the final node embeddings, and ϵl is a learnable parameter to weigh the importance of different filters in our bank. These node embeddings are then used for the final downstream tasks of node classification and link prediction. In the following section, we lay out the empirical results to validate the efficacy of CUSP. 5 EXPERIMENTATION Datasets. We evaluate CUSP on the Node Classification (NC) and Link Prediction (LP) tasks using eight benchmark datasets. These include (a) Homophilic datasets such as (i) Citation networks Cora, Citeseer and Pub Med (Sen et al., 2008; Yang et al., 2016), and (b) Heterophilic datasets, which comprise (i) Wikipedia graphs Chameleon and Squirrel (Rozemberczki et al., 2021), (ii) Actor co-occurrence network (Tang et al., 2009), and (iii) Webpage graphs from Web KB 2 Texas and Cornell. The dataset statistics and their respective homophily ratios are detailed in Table 1. Refer to Appendix 7.2.1 for the discrete curvature and eigenspectrum distributions of these datasets. Dataset Cora Citeseer Pub Med Chameleon Squirrel Actor Texas Cornell Classes 7 6 5 5 5 5 5 5 Features 1433 3703 500 2325 2089 932 1703 1703 Nodes 2708 3327 19717 2277 5201 7600 183 183 Edges 5278 4552 44324 31371 198353 26659 279 277 H 0.825 0.718 0.792 0.247 0.217 0.215 0.057 0.301 Table 1: Data statistics and homophily ratio (H). Baselines. To ensure a fair comparison, we evaluate CUSP against three types of baselines: (a) Spatial Euclidean, including traditional methods such as GCN (Kipf & Welling, 2016), GAT (Veliˇckovi c et al., 2017), and Graph SAGE (Hamilton et al., 2017). (b) Spatial Riemannian, comprising (i) Constant-curvature models like HGCN (Chami et al., 2019) and HGAT (Zhang et al., 2021b), and (ii) Mixed-curvature GNNs such as κGCN (Bachmann et al., 2020), QGCN (Xiong et al., 2022), and Self MGNN (Sun et al., 2022). (c) Spectral, including Cheby Net (Defferrard et al., 2016), Bern Net (He et al., 2021), GPRGNN (Chien et al., 2020), and Fi GURe (Ekbote et al., 2024) (More details in Appendix 7.6.2). There are no existing methods that integrate both spectral and geometric signals. Experimental Settings. For the transductive LP task, we randomly split edges into 85%/5%/10% for training, validation and test sets, while for transductive NC task, we use the 60%/20%/20% split. The results are averaged over 10 random splits and their 95% confidence intervals are reported. We report the AUC-ROC and F1 Score metrics for LP and NC respectively. For computing ORC (in Cusp Laplacian), we use δ = 0.5, i.e. equal probability mass is retained by the node and distributed among it s neighbors. We adopt the ORC implementation from Ni et al. (2019), and use the Sinkhorn algorithm (Sinkhorn & Knopp, 1967) to approximate the W1 distance. See Appendix 7.2.2 for more computational details and complexity analysis. Next, we adopt the implementation of the κ-stereographic product manifold from Geoopt library3. We heuristically determine the signature of our manifold P (i.e. component manifolds) using the discrete ORC curvature of the 2http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb 3https://github.com/geoopt/geoopt Published as a conference paper at ICLR 2025 Baseline Cora Citeseer Pub Med Chameleon Actor Squirrel Texas Cornell Av. Gain GCN 75.21 0.28 67.30 1.05 83.75 0.07 61.16 0.23 31.12 0.96 43.06 0.33 75.61 0.07 67.72 1.19 11.95 GAT 76.70 0.13 66.23 0.85 82.83 0.22 63.10 0.77 32.65 0.23 43.90 0.01 76.09 0.77 74.01 0.01 10.63 SAGE 71.88 0.91 70.01 0.64 81.09 0.13 59.99 0.89 36.73 0.01 41.11 1.16 77.11 0.45 69.91 0.24 11.59 HGCN 78.50 0.14 69.55 0.39 83.72 0.21 60.18 0.57 35.89 0.29 39.93 0.35 88.11 1.12 72.88 1.15 8.97 HGAT 77.12 0.01 70.12 0.92 84.02 0.19 62.43 0.59 35.12 0.27 41.78 0.37 85.56 1.10 73.12 0.18 8.91 κGCN 78.71 1.37 68.14 0.34 85.18 0.52 62.12 0.49 34.57 0.26 43.04 0.31 85.03 0.63 86.36 0.64 7.06 QGCN 79.64 0.38 71.15 1.11 84.76 0.13 61.83 1.01 32.24 0.65 46.65 0.90 82.76 0.07 83.90 0.71 7.20 Self MGNN 80.19 0.60 70.91 0.38 82.81 0.34 64.97 0.54 38.99 0.23 49.79 0.29 90.92 0.65 85.01 0.69 4.62 Cheby Net 71.09 0.91 66.67 0.38 83.83 0.42 59.96 1.51 38.02 0.01 45.67 0.11 79.08 0.96 71.33 1.04 10.61 Bern Net 73.34 0.53 62.12 2.09 82.15 0.13 62.03 0.12 33.55 0.24 42.81 0.66 75.11 0.14 65.56 1.02 12.98 GPRGNN 79.49 0.31 67.61 0.38 84.07 0.09 65.09 0.43 37.43 1.09 47.51 0.23 88.34 0.09 87.21 0.70 5.58 Fi GURe 80.01 0.09 71.26 0.41 83.89 0.11 67.18 0.02 38.31 0.36 48.71 1.02 86.66 0.62 85.01 0.68 4.94 CUSP 83.45 0.15 74.21 0.02 87.99 0.45 70.23 0.61 43.91 0.11 52.98 0.25 94.03 0.72 92.31 0.09 0.0 Imp. 3.26% 2.95% 2.81% 4.05% 5.32% 3.19% 3.11% 5.10% Table 3: Performance comparision of CUSP with baselines for NC task (Mean F1 Score 95% confidence interval). First, Second and Third best performing models are highlighted. Av. Gain represents the average gain of CUSP over the model in that row, averaged across the different datasets. Imp. implies the % improvement of CUSP over the second best performing baseline. input graph. The key idea is that the underlying curvature of the manifold must align with ORC. However, this discussion, along with how we initialise the learnable curvatures for the component manifolds, has been reserved for the Appendix 7.6.4. For all experiments, we choose the total manifold dimension as d M = 48 and learning rate as 4e-3. We use the filter bank ΩPd M = ZI Pd M, Z(1) Pd M, Z(2) Pd M, . . . , Z(L) Pd M , with L = 10. For the GPR weights, we experiment with different initializations, α {0.1, 0.3, 0.5, 0.9}. We list the hyperparameter settings in Appendix 7.6.5. Analysis. Tables 3 and 8 present a comparative performance analysis of CUSP against baseline models for the NC and LP tasks respectively. (a) CUSP consistently outperforms all baseline models across both homophilic and heterophilic datasets, achieving an improvement up to 5.32% in F1-score for NC, up to 5.11% increase in ROC-AUC score for LP. (b) Riemannian baselines perform well for homophilic datasets, while performing poorly in heterophilic cases (Evidence of limitation L1). (c) While spectral baselines like Cheby Net and Bern Net perform poorly on heterophilic tasks because they act as low-pass, owing to the fixed filters, Fi GURe performs well across all tasks because it uses a filter bank to handle different bands in the eigenspectrum. (d) Mixed-curvature baselines (κGCN and QGCN) outperform constant-curvature GNNs (HGCN and HGAT) because they capture the complex graph geometry (Evidence of limitation L2). CUSP Cora Citeseer Pub Med Chameleon Actor Squirrel Texas Cornell H24 S24 82.50 0.18 73.20 0.16 87.20 0.26 69.42 0.01 42.70 0.22 52.00 0.20 81.97 0.24 87.20 0.02 (H8)2 (S8)2 E16 82.80 0.20 73.50 0.18 85.50 0.28 69.80 0.03 43.06 0.19 50.76 0.18 91.60 0.23 91.60 0.24 H8 S8 E32 82.70 0.32 73.40 0.17 86.40 0.29 69.60 0.07 40.73 0.20 51.20 0.19 94.03 0.72 92.31 0.09 H16 (S16)2 81.83 0.20 72.72 0.13 85.90 0.60 70.23 0.61 42.50 0.21 52.98 0.25 92.50 0.22 83.11 0.36 (H16)2 E16 81.90 0.17 72.50 0.15 87.99 0.45 65.30 0.28 43.91 0.11 51.10 0.24 93.20 0.19 91.20 0.45 H24 E24 81.60 0.16 72.30 0.14 87.50 0.64 67.50 0.23 43.50 0.16 48.30 0.23 91.45 0.61 91.39 0.98 S24 E24 80.80 0.41 71.80 0.19 80.99 0.31 69.20 0.21 36.44 0.23 51.83 0.21 90.99 0.21 90.00 0.10 H16 S16 E16 83.45 0.15 74.21 0.02 85.80 0.27 70.17 0.17 43.20 0.18 52.71 0.17 93.80 0.18 91.80 0.21 (S8)2 E32 80.50 0.30 71.50 0.18 79.30 0.32 68.18 0.20 37.14 0.24 51.60 0.22 92.80 0.22 90.80 0.53 (H16)3 81.00 0.19 72.00 0.16 87.70 0.92 68.11 0.25 43.70 0.17 45.67 0.25 88.15 0.25 85.01 0.51 Table 2: Performance comparison of CUSP with different manifold signatures for Node Classification (NC). Best performing signatures are in Bold, and cases with a large decline in performance because of manifold mismatch are in Blue. Table 4 presents the learnt filter weights (ϵ), highlighting the distinct preferences of homophilic and heterophilic datasets. Homophilic datasets, such as Citeseer and Pub Med, emphasize low-pass filters, with the highest weights assigned to lower-order filters, Z(2) and Z(3), at 45.94% and 47.97% respectively. In contrast, heterophilic datasets like Actor and Cornell favor high-pass filters, attributing 40.09% and 28.56% to higher-order filters, Z(5) and Z(9). In the next section, we perform extensive ablation studies to evaluate the effectiveness of all components of CUSP. 5.1 ABLATION STUDY Impact of product manifold signatures. Different datasets leverage substructures with varying curvatures as inductive biases for downstream tasks, leading to differing performance across product manifold signatures. Table 5 shows the learned curvature and weight (β) for the best-performing Published as a conference paper at ICLR 2025 Dataset I Z(1) Z(2) Z(3) Z(4) Z(5) Z(6) Z(7) Z(8) Z(9) Z(10) Cora 0.1729 0.1586 0.0695 0.0438 0.0446 0.0221 0.1525 0.0526 0.0711 0.0739 0.1385 Citeseer 0.0478 0.0506 0.4594 0.0183 0.0765 0.0187 0.1093 0.0755 0.0686 0.0232 0.0522 Pub Med 0.1666 0.0342 0.0132 0.4795 0.1116 0.0289 0.0057 0.0153 0.0564 0.0442 0.0444 Chameleon 0.0232 0.1679 0.0536 0.1696 0.0121 0.0137 0.0392 0.2178 0.1356 0.0140 0.1532 Actor 0.0360 0.0753 0.0332 0.0337 0.0101 0.4009 0.2419 0.0096 0.1155 0.0089 0.0347 Squirrel 0.0306 0.0796 0.0719 0.2562 0.0351 0.0423 0.0415 0.1322 0.1050 0.1105 0.0951 Texas 0.1052 0.0308 0.0664 0.0490 0.0766 0.0274 0.0995 0.0402 0.1741 0.0639 0.2667 Cornell 0.1159 0.0302 0.0671 0.0434 0.0747 0.0249 0.0966 0.0358 0.1691 0.2856 0.0568 Table 4: Learned filter weights (NC) for the top-performing split, distinguishing between homophilic (favoring low-pass filters) and heterophilic (favoring high-pass filters). First, second, and third highest filter weights are highlighted. Dataset Signature Cora H16( 0.36, 0.34) S16(+0.68, 0.29) E16(0, 0.37) Citeseer H16( 0.62, 0.36) S16(+.55, 0.49) E16(0, 0.15) Pub Med H16( 0.91, 0.40) H16( 0.37, 0.36) E16(0, 0.24) Chameleon H16( 0.21, 0.22) S16(+0.13, 0.29) S16(+0.59, 0.49) Actor H16( 0.87, 0.39) H16( 0.53, 0.36) E16(0, 0.25) Squirrel H16( 0.28, 0.14) S16(+0.31, 0.39) S16(+0.67, 0.53) Texas H8( 0.45, 0.32) S8(+0.34, 0.23) E32(0, 0.45) Cornell H8( 0.50, 0.14) S8(+0.23, 0.19) E32(0, 0.67) Table 5: Learning results (NC) of CUSP for the best performing product signature manifold (dim) (curvature, weight). configurations. Table 2 highlights performance across multiple signatures (which are heuristically determined as discussed in Appendix 7.6.4), with large degradation (marked in blue) when there is a mismatch between manifold and graph curvature. For example, Squirrel, which has a predominantly positively curved structure (Figure 7.2.1), performs poorly (drop of 6.5%) on signatures dominated by negative curvature, such as S24 E24 and (H16)3. Impact of mixed-curvature space. Table 6 outlines the comparison of CUSP with its Euclidean variant CUSPeuc, where all representations are learned in Euclidean space, omitting Cusp pooling. On average, CUSPeuc experiences a 3.2% performance drop across all datasets for the NC task, highlighting the importance of capturing mixed-curvature signals. Notably, the performance degradation is more pronounced on heterophilic datasets ( 4%) compared to homophilic datasets ( 2%). This suggests that CUSP s integration of spectral and geometric properties is crucial, particularly for heterophilic data where capturing the eigenspectrum is more relevant. Dataset CUSP CUSPeuc CUSPlap CUSPenc CUSPpool CUSPfil Cora 83.45 0.15 80.78 0.13 81.95 0.34 82.61 0.25 80.11 0.22 81.20 0.29 Citeseer 74.21 0.02 72.50 0.19 72.73 0.30 71.95 0.28 73.10 0.30 72.87 0.27 Pub Med 87.99 0.45 85.23 0.25 86.67 0.10 87.11 0.12 86.23 0.10 87.29 0.09 Chameleon 70.23 0.61 66.47 0.56 68.12 0.35 67.83 0.31 68.47 0.32 66.12 0.34 Actor 43.91 0.11 39.03 0.09 43.03 0.27 41.12 0.28 40.03 0.27 38.81 0.29 Squirrel 52.98 0.25 49.92 0.36 51.39 0.35 50.92 0.35 51.03 0.11 48.13 0.17 Texas 94.03 0.72 90.15 0.61 92.15 0.52 93.15 0.55 92.15 0.60 90.27 0.62 Cornell 92.31 0.09 89.46 0.13 90.46 0.17 91.06 0.28 90.46 0.07 89.73 0.09 Avg. Gain 0 3.19 1.57 1.67 2.19 3.08 Table 6: Ablation study (NC) results on benchmark datasets. Av. Gain represents the average gain of CUSP over the model in that column, averaged across the different datasets. Impact of Cusp Laplacian. To assess the effectiveness of the Cusp Laplacian, we conducted an ablation study by replacing the Cusp Laplacian-based adjacency matrix e A, with the standard graph adjacency matrix (for Cusp filtering) in CUSPlap. This resulted in a performance drop of 1.57% (avg.) across all datasets. The relatively modest degradation highlights the Cusp Laplacian s role in capturing curvature information. Impact of the filter bank, curvature encoding and Cusp pooling. CUSPfil replaces the filter bank with a single learnable filter, as a result the performance degrades largely on heterophilic datasets ( 4%), while the degradation in performance is not that large on homophilic datasets ( 2.5%). Further, we remove the curvaturebased positional encoding in CUSPenc and replace the Cusp pooling mechanism with simple concatenation in CUSPpool. We observe a consistent decline in performance across both these ablations. Owing to spatial limitations, we present the experimental results for the LP task in Appendix 7.6.3. 6 CONCLUSION In this paper, we propose to unify Spectral and Curvature signals in a graph for learning optimal graph representations, aiming to inspire further research in this area. CUSP introduces a graph learning paradigm parameterized by spectral filters on a mixed-curvature product manifold. We propose a new curvature-informed Cusp Laplacian operator to capture the underlying geometry, use it to define a novel GPR-based spectral filter-bank (Cusp Filtering) and introduce an attentionbased pooling mechanism to fuse representations from multiple mixed-curvature graph filters (Cusp Pooling). CUSP outperforms the state-of-the-art baselines for node classification and link prediction over homophilic and heterophilic datasets, highlighting the efficacy of combining spectrum and curvature (geometry), in learning graph representations. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS We would like to sincerely thank Prof. Geoffrey J. Gordon (ggordon@cs.cmu.edu) from Carnegie Mellon University for his invaluable feedback and comments throughout multiple iterations of this paper, which greatly helped improve its quality. Gregor Bachmann, Gary B ecigneul, and Octavian Ganea. Constant curvature graph convolutional networks. In International conference on machine learning, pp. 486 496. PMLR, 2020. Frank Bauer, J urgen Jost, and Shiping Liu. Ollivier-ricci curvature and the spectrum of the normalized graph laplace operator. ar Xiv preprint ar Xiv:1105.3803, 2011. Mikhail Belkin, Jian Sun, and Yusu Wang. Discrete laplace operator on meshed surfaces. In Proceedings of the twenty-fourth annual symposium on Computational geometry, pp. 278 287, 2008. Ines Chami, Zhitao Ying, Christopher R e, and Jure Leskovec. Hyperbolic graph convolutional neural networks. Advances in neural information processing systems, 32, 2019. Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. ar Xiv preprint ar Xiv:2006.07988, 2020. Sungjun Cho, Seunghyuk Cho, Sungwoo Park, Hankook Lee, Honglak Lee, and Moontae Lee. Curve your attention: Mixed-curvature transformers for graph representation learning. ar Xiv preprint ar Xiv:2309.04082, 2023. Micha el Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in neural information processing systems, 29, 2016. Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian geometry, volume 2. Springer, 1992. Chanakya Ekbote, Ajinkya Deshpande, Arun Iyer, Sundararajan Sellamanickam, and Ramakrishna Bairi. Figure: Simple and efficient unsupervised node representations with filter augmentations. Advances in Neural Information Processing Systems, 36, 2024. Johannes Gasteiger, Aleksandar Bojchevski, and Stephan G unnemann. Predict then propagate: Graph neural networks meet personalized pagerank. ar Xiv preprint ar Xiv:1810.05997, 2018. Karish Grover, SM Angara, Md Shad Akhtar, and Tanmoy Chakraborty. Public wisdom matters! discourse-aware hyperbolic fourier co-attention for social text classification. Advances in Neural Information Processing Systems, 35:9417 9431, 2022. Albert Gu, Frederic Sala, Beliz Gunel, and Christopher R e. Learning mixed-curvature representations in products of model spaces. In International conference on learning representations, volume 5, 2019. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34:14239 14251, 2021. Zhuo-Chen He. Constrained heat kernel graph diffusion convolution: A high-dimensional statistical approximation via information theory. IEEE Access, 2024. Alec Jacobson and Olga Sorkine-Hornung. A cotangent laplacian for images as surfaces. Technical Report/ETH Zurich, Department of Computer Science, 757, 2012. J. Jost and S. Liu. Ollivier s Ricci curvature, local clustering and curvature-dimension inequalities on graphs. Discrete & Computational Geometry, 51(2):300 322, 2014. Published as a conference paper at ICLR 2025 Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ar Xiv preprint ar Xiv:1609.02907, 2016. Isabel M Kloumann, Johan Ugander, and Jon Kleinberg. Block models and personalized pagerank. Proceedings of the National Academy of Sciences, 114(1):33 38, 2017. Taewook Ko, Yoonhyuk Choi, and Chong-Kwon Kim. A spectral graph convolution for signed directed graphs via magnetic laplacian. Neural Networks, 164:562 574, 2023. H. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2 (1-2):83 97, 1955. Pan Li, I Chien, and Olgica Milenkovic. Optimizing generalized pagerank methods for seedexpansion community detection. Advances in Neural Information Processing Systems, 32, 2019. Ningyi Liao, Haoyu Liu, Zulun Zhu, Siqiang Luo, and Laks VS Lakshmanan. Benchmarking spectral graph neural networks: A comprehensive study on effectiveness and efficiency. ar Xiv preprint ar Xiv:2406.09675, 2024. Y. Lin, L. Lu, and S. Yau. Ricci curvature of graphs. Tohoku Mathematical Journal, 63(4):605 627, 2011. I-Shih Liu. On fourier s law of heat conduction. Continuum mechanics and Thermodynamics, 2: 301 305, 1990. Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212 220, 2017. Rada Mihalcea and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge university press, 2011. Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu, Kangfei Zhao, Wenbing Huang, Peilin Zhao, Junzhou Huang, Sophia Ananiadou, and Yu Rong. Transformer for graphs: An overview from architecture perspective. ar Xiv preprint ar Xiv:2202.08455, 2022. John Moeller, Vivek Srikumar, Sarathkrishna Swaminathan, Suresh Venkatasubramanian, and Dustin Webb. Continuous kernel learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part II 16, pp. 657 673. Springer, 2016. Chien-Chun Ni, Yu-Yao Lin, Feng Luo, and Jie Gao. Community detection on networks with ricci flow. Scientific reports, 9(1):1 12, 2019. Yann Ollivier. Ricci curvature of metric spaces. Comptes Rendus Mathematique, 345(11):643 646, 2007. Benedetto Piccoli and Francesco Rossi. On properties of the generalized wasserstein distance. Archive for Rational Mechanics and Analysis, 222:1339 1365, 2016. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021. Frederic Sala, Chris De Sa, Albert Gu, and Christopher R e. Representation tradeoffs for hyperbolic embeddings. In International conference on machine learning, pp. 4460 4469. PMLR, 2018. Aliaksei Sandryhaila and Jos e MF Moura. Discrete signal processing on graphs: Graph fourier transform. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6167 6170. IEEE, 2013. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93 93, 2008. R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343 348, 1967. Published as a conference paper at ICLR 2025 Li Sun, Zhongbao Zhang, Jiawei Zhang, Feiyang Wang, Hao Peng, Sen Su, and S Yu Philip. Hyperbolic variational graph neural network for modeling dynamic graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 4375 4383, 2021. Li Sun, Zhongbao Zhang, Junda Ye, Hao Peng, Jiawei Zhang, Sen Su, and S Yu Philip. A selfsupervised mixed-curvature graph neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 4146 4155, 2022. Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 807 816, 2009. Shukichi Tanno. Ricci curvatures of contact riemannian manifolds. Tohoku Mathematical Journal, Second Series, 40(3):441 448, 1988. Dorina Thanou, Xiaowen Dong, Daniel Kressner, and Pascal Frossard. Learning heat diffusion graphs. IEEE Transactions on Signal and Information Processing over Networks, 3(3):484 499, 2017. Yu Tian, Zachary Lubberts, and Melanie Weber. Curvature-based clustering on graphs. ar Xiv preprint ar Xiv:2307.10155, 2023. Abraham A Ungar. Hyperbolic trigonometry and its application in the poincar e ball model of hyperbolic geometry. Computers & Mathematics with Applications, 41(1-2):135 147, 2001. Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. ar Xiv preprint ar Xiv:1710.10903, 2017. Andreas Weber. Analysis of the physical laplacian and the heat flow on a locally finite graph. ar Xiv preprint ar Xiv:0801.0812, 2008. Richard C Wilson, Edwin R Hancock, El zbieta Pekalska, and Robert PW Duin. Spherical and hyperbolic embeddings of data. IEEE transactions on pattern analysis and machine intelligence, 36(11):2255 2269, 2014. Bo Xiong, Shichao Zhu, Nico Potyka, Shirui Pan, Chuan Zhou, and Steffen Staab. Pseudoriemannian graph convolutional networks. Advances in Neural Information Processing Systems, 35:3488 3501, 2022. Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. ar Xiv preprint ar Xiv:2002.07962, 2020. Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40 48. PMLR, 2016. Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. Graph neural networks and their current applications in bioinformatics. Frontiers in genetics, 12:690049, 2021a. Yiding Zhang, Xiao Wang, Chuan Shi, Xunqiang Jiang, and Yanfang Ye. Hyperbolic graph attention network. IEEE Transactions on Big Data, 8(6):1690 1701, 2021b. Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pp. 1215 1226, 2021. Shichao Zhu, Shirui Pan, Chuan Zhou, Jia Wu, Yanan Cao, and Bin Wang. Graph geometry interaction learning. Advances in Neural Information Processing Systems, 33:7548 7558, 2020. Published as a conference paper at ICLR 2025 Table of Contents 14 7 Appendix 15 7.1 Notation Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 7.2 More on Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 7.2.1 Analysis of Real-world Graphs . . . . . . . . . . . . . . . . . . . . . . . . 16 7.2.2 Ollivier-Ricci Curvature (ORC) . . . . . . . . . . . . . . . . . . . . . . . . 17 7.2.3 Product Manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 7.2.4 Kappa-Stereographic Model . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.2.5 Spectral Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.3 More on Cusp Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 7.3.1 Relevent Theorems for Cusp Laplacian . . . . . . . . . . . . . . . . . . . 19 7.4 Generalised Pageranks and GPRGNN . . . . . . . . . . . . . . . . . . . . . . . . 21 7.4.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 7.4.2 Why GPRGNN as the backbone of CUSP? . . . . . . . . . . . . . . . . . 22 7.5 Theorems for Curvature Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7.5.1 Proof of Definition 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7.5.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 7.6 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.6.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.6.3 Experimental Results for Link Prediction . . . . . . . . . . . . . . . . . . 25 7.6.4 Estimating Product Manifold Signature . . . . . . . . . . . . . . . . . . . 26 7.6.5 More Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . 27 Published as a conference paper at ICLR 2025 7.1 NOTATION TABLE Notation Reference H Hyperbolic manifold S Spherical manifold E Euclidean manifold Pd M Product manifold of dimension d M κ Continous manifold curvature eκ Ollivier-Ricci Curvature (ORC) eκ(x, y) ORC of edge {x, y} eκ(x) ORC of node x mδ x Probability mass assigned to node x for ORC computation δij Kroneckner delta function N(x) Neighbourhood set of node x x y This implies that x and y are adjacent nodes δ ORC neighbourhood weighting parameter W1(.) Wasserstein-1 distance d G(x, y) Shortest path (graph distance) between nodes x and y on graph G Mκi,di i Constant-curvature manifold with dimension di and curvature κi. Mi {H, S, E} e L, e D, e A Cusp Laplacian operator; e L = e D e A e Ln, e An Normalized Cusp Laplacian and Adjacency matrices; e An = e D 1 2 e A e D 1 2 L, D, A Traditional graph Laplacian, Adjacency and Degree matrices; L = D A df Input graph feature dimension d M Total dimension of the product manifold d C Total dimension of the curvature embedding F Rn df Input feature matrix expκ 0 : Rd M Mκ Exponential map, to map from tangent plane (Euclidean) to the product manifold κ Mobius addition κ κ-right-matrix-multiplication κ κ-left-matrix-multiplication ζ(x) Final node representation for node x ϵl Learnable weight of lth filter ζ Pn (d M+d C) Matrix containing final node embeddings βq Learnable weight of qth component manifold τq Relative importance of manifold Mq in Cusp Pooling L Total number of filters Q Total number of components in product manifold l Used to denote the lth filter Z(L)(x) Pd M The GPR-based node representation for filter with L layers, on manifold Pd M κ(q) Manifold curvature of qth component in product manifold d(q) Manifold dimension of qth component in product manifold H(l) M κ(q),d(q) q Hidden state representation after l layers of GPR, on qth manifold component with curvature κ(q) and dimension d(q) γl GPR weight for lth layer in the filter, while propagation GPR score. α Initialising parameter for GPR Signature Pd M = Q q=1M κ(q),d(q) q = ( H h=1H κ(h),d(h) h ) ( S s=1S κ(s),d(s) s ) Ed(e) ψ : V R A function defined on vertex set V ΦPd C (eκ(x)) Curvature encoding on product manifold KP(eκa, eκb) KP(eκa, eκb) := ΦPd C (eκa), ΦPd C (eκb) is the curvature kernel λi ith eigenvalue of e An eλi ith eigenvalue of e Ln fθ(.) : Rdf Rd M Neural network with parameter set {θ} that generates the hidden state features before feeding input to Cusp Filtering. gθ(.) : Rd M Rd C Neural network projector used in curvature encoding Published as a conference paper at ICLR 2025 7.2 MORE ON PRELIMINARIES 7.2.1 ANALYSIS OF REAL-WORLD GRAPHS Figure 4: Distribution of Ollivier-Ricci curvatures eκ(x, y) of edges across different datasets. The histograms illustrate the frequencies of edge-based Ollivier-Ricci curvature values for each dataset, highlighting the topological diversity in both homophilic and heterophilic settings, and hence the need of learning representations in manifolds with different curvatures. Figure 5: Distribution of Ollivier-Ricci curvatures eκ(x) of nodes across different datasets. Figure 6: Distribution of Spectral Energy with the respective eigenvalues of the graph Laplacian. Spectral Energy, Ei = f 2 i /P j f 2 j , where fj is the jth Fourier mode of the graph Laplacian. Low frequency implies homophily and high frequency components correspond to heterophily. These plots highlight the importance of capturing signals from different parts of the eigenspectrum for designing a GNN that works well across multiple tasks. Published as a conference paper at ICLR 2025 7.2.2 OLLIVIER-RICCI CURVATURE (ORC) In an unweighted graph, the neighborhood of each node x, denoted as N(x), is assigned a probability distribution according to a lazy random walk formulation (Lin et al., 2011). Specifically, we define the distribution as follows: α, if z = x, 1 α |N (x)|, if z N(x), 0, otherwise. (9) Here, α controls the probability that a random walk will remain at the current node, while the remaining probability mass (1 α) is uniformly distributed across the neighboring nodes. This formulation connects ORC with lazy random walks and influences the balance between local exploration and the likelihood of revisiting a node. In this work, we use α = 0.5, meaning that equal probability mass is distributed between the node itself and its neighbors, striking a balance between breadth-first and depth-first search strategies. The choice of α is crucial and depends on the topology of the graph. A smaller α value encourages more local exploration, while a larger α favors revisiting nodes, thereby promoting a lazy walk. For our experiments, α = 0.5 was chosen to reflect an equal probability mass distribution between the node and its neighbors. Computational Considerations. Computing ORC can be computationally intensive due to the need to calculate the Wasserstein-1 distance (W1), between the neighborhood distributions of connected nodes. In a discrete setting, this corresponds to solving a linear program. Typically, W1(mx, my) between two nodes x and y is computed using the Hungarian algorithm (Kuhn, 1955), which has a cubic time complexity. However, this becomes prohibitively expensive as the graph size increases. Alternatively, the Wasserstein-1 distance can be approximated using the Sinkhorn algorithm (Sinkhorn & Knopp, 1967), which reduces the complexity to quadratic time. For this work, we employ the Sinkhorn approximation to compute ORC efficiently. Below, we provide an alternative to approximate ORC of an edge in linear time, in case of very large (million-scale) real-world graphs. Approximating ORC in Linear Time. Even with the quadratic complexity of the Sinkhorn algorithm, scaling to large networks remains challenging. To address this, a linear-time combinatorial approximation of ORC can be employed, as suggested by Tian et al. (2023). This method approximates the Wasserstein distance by utilizing local structural information, making it much more computationally feasible. The approximation of ORC builds on classical bounds first introduced by Jost & Liu (2014). Let #(x, y) denote the number of triangles formed by the edge (x, y), and define a b = min(a, b), a b = max(a, b) and dx is the degree of node x. The following bounds on ORC can be derived for an edge e = x, y: Theorem 3 ( Jost & Liu (2014)). For an unweighted graph, the Ollivier-Ricci curvature of an edge e = x, y satisfies the following bounds: 1. Lower bound: eκ(x, y) 1 1 + + #(x, y) 2. Upper bound: eκ(x, y) #(x, y) dx dy . (10) The ORC of an edge, can then be approximated as the arithmetic mean of these bounds: bκ(x, y) := 1 κupper(x, y) + κlower(x, y) . (11) The proof of these bounds has been detailed in Tian et al. (2023). This approximation is computationally efficient, with linear-time complexity, and can be parallelized easily across edges, making it suitable for largescale graphs. The computation relies solely on local structural information, such as the degree of the nodes and the number of triangles. 7.2.3 PRODUCT MANIFOLDS Let M1, M2, . . . , Mk denote a set of smooth manifolds. Their Cartesian product forms a product manifold, denoted by P , such that P = M1 M2 Mk. Any point p P is characterized by its coordinates p = (p1, p2, . . . , pk), where each pi corresponds to a point on the individual manifold Mi. Similarly, a tangent vector v Tp P can be expressed as v = (v1, v2, . . . , vk), where each vi Tpi Mi represents the projection of v in the tangent space of the respective component manifold Mi. Optimization over manifolds requires the notion of taking steps along the manifold, which can be achieved by moving in the tangent space and mapping those movements back onto the manifold through the exponential map. The exponential map at a point p P, Published as a conference paper at ICLR 2025 denoted expp : Tp P P, allows for this transfer. For product manifolds, the exponential map decomposes into individual component exponential maps. Specifically, given a tangent vector v = (v1, v2, . . . , vk) at p = (p1, p2, . . . , pk) P, the exponential map on P can be expressed as: expp(v) = (expp1(v1), expp2(v2), . . . , exppk(vk)) (12) 7.2.4 KAPPA-STEREOGRAPHIC MODEL The κ-stereographic model (Bachmann et al., 2020) unifies Hyperbolic and Spherical geometries under gyrovector formalism. This model leverages the framework of gyrovector spaces to represent all three constant curvature geometries hyperbolic, Euclidean, and spherical simultaneously. Additionally, it facilitates smooth transitions between these constant curvature geometries, enabling the joint learning of space curvatures alongside the embeddings. It is a smooth manifold Mκ,d = {z Rd| κ||z||2 2 < 1}, whose origin is 0 Rd, equipped with a Riemannian metric gκ z = (λκ z)2I, where λκ z is given by λκ z = 2 1 + κ||z||2 2 1 . The Riemannian operations under this model are elaborated in the table below:f Operation Formalism in Ed Unified formalism in κ-stereographic model (Hd/ Sd) Distance Metric dκ M(x, y) = x y 2 dκ M(x, y) = 2 |κ| tan 1 κ p |κ| x κ y 2 Exp. Map expκ x(v) = x + v expκ x(v) = x κ |κ| λκ x v 2 Log. Map logκ x(y) = x y logκ x(y) = 2 |κ|λκ x tan 1 κ p |κ| x κ y 2 x κy x κy 2 Addition x κ y = x + y x κ y = (1+2κx T y+K y 2)x+(1 κ||x||2)y 1+2κx T y+κ2||x||2||v||2 Table 7: Operations in Hyperbolic Hd, Spherical Sd and Euclidean space Ed. κ-right-matrix-multiplication. Given a matrix X Rn d holding κ-stereographic embeddings in its rows and weights W Rd e, the Euclidean right multiplication can be written row-wise as (XW)i = Xi W. Then the κ-right-matrix-multiplication is defined row-wise as (X κ W)i = expκ 0 ((logκ 0(X)W)i ) = tanκ αi tan 1 κ (||X i||) (XW)i ||(XW)i || (13) where αi = ||(XW)i || ||Xi || and expκ 0 and logκ 0 denote the exponential and logarithmic map in the κ-stereo. model. κ-left-matrix-multiplication. Given a matrix X Rn d holding κ-stereographic embeddings in its rows and weights A Rn n, the κ-left-matrix-multiplication is defined row-wise as (A κ X)i := ( X j Aij) κ mκ(X1 , , Xn ; Ai ). (14) 7.2.5 SPECTRAL GRAPH THEORY Graph Fourier Transform (GFT) (Sandryhaila & Moura, 2013) lays the foundation for Graph Neural Networks (GNNs). A GFT is defined using a reference operator R which admits a spectral decomposition. Traditionally, in the case of GNNs, this reference operator has been the symmetric normalized Laplacian Ln = I An (Kipf & Welling, 2016). The graph Fourier transform of a signal f Rn is then defined as ˆf = U f Rn, and its inverse as f = Uˆf. A graph filter is an operator that acts independently on the entire eigenspace of a diagonalisable and symmetric reference operator R, by modulating their corresponding eigenvalues. A graph filter is defined via the graph filter function g(.) operating on the reference operator as g(R) = Ug(Λ)U . 7.3 MORE ON CUSP LAPLACIAN Spectral graph theory has shown significant progress in relating geometric characteristics of graphs to properties of spectrum of graph Laplacians and related matrices. Several variants of the graph Laplacian matrices have been shown to capture specific inductive biases for different tasks (Ko et al., 2023; Belkin et al., 2008; Jacobson & Sorkine-Hornung, 2012). Proof of Definition 1. Say the function ψ : V R is defined on the vertex set V of the graph. Suppose ψ describes a temperature distribution across a graph, where ψ(x) is the temperature at vertex x. According to Newton s law of cooling (He, 2024), the heat transferred from node x to node y is proportional to ψ(x) ψ(y) Published as a conference paper at ICLR 2025 if nodes x and y are connected (if they are not connected, no heat is transferred). Consequently, the heat diffusion equation on the graph can be expressed as dψ y Axy(ψ(x) ψ(y)), where β is a constant of proportionality and A denotes the adjacency matrix of the graph. Further insight can be gained by considering Fourier s law of thermal conductance (Liu, 1990), which states that heat flow is inversely proportional to the resistance to heat transfer. ORC measures the transportation cost (W1(:, :)) between the neighborhoods of two nodes, reflecting the effort required to transport mass between these neighborhoods (Bauer et al., 2011). We interpret this transportation cost as the resistance between nodes. The vital takeaway here is that Heat flow between two nodes in a graph is influenced by the underlying Ollivier-Ricci curvature (ORC) distribution. The diffusion rate is faster on an edge with positive curvature (low resistance), and slower on an edge with negative curvature (high resistance). Thus, the ratio Rres xy = W1(mx,my) d G(x,y) represents the resistance from node x to node y, i.e. dψxy dt 1 Rres xy . It can be observed that 1 Rres xy = d G(x,y) W1(mx,my) = 1 1 eκ(x,y) (From the definition of ORC) would tend to infinity when W1(mx, my) = 0 (i.e. eκ(x, y) = 1). Thus, to ensure continuity, we create a new ratio as 1 R xy = e Rres xy = e 1 1 eκ(x,y) . Thus, we can modify the above heat flow equation as: Axy(ψ(x) ψ(y)) R xy (Inversely proportional to R xy) y Axy(ψ(x) ψ(y)) e 1 1 eκ(x,y) = β 1 1 eκ(x,y) 1 1 eκ(x,y) ( Dxx = X 1 1 eκ(x,y) Dxx e 1 1 eκ(x,y) Axy ψ(y) (δxy is the Kronecker delta.) 1 1 eκ(x,y) Lxy ψ(y) (L, D, A are Laplacian, Degree and Adjacency matrices.) 1 1 eκ(x,y) ψ = βe Lψ ( is the element-wise product) This gives us the standard heat equation on graphs. Here, β is the updated constant of proportionality. e L = e D e A is the Cusp Laplacian operator, where e D and e A are the updated Degree and Adjacency matrices, to represent that the graph is transformed under edge weights wxy = 1 R xy = e 1 1 eκ(x,y) . Finally, our Cusp Laplacian operator can be written as (x y means xy is an edge in the graph): e Lψ(x) = X y x wxy (ψ(x) ψ(y)) = X 1 1 eκ(x,y) (ψ(x) ψ(y)) (15) Why is e Rres xy = e 1 1 eκ(x,y) the right choice? To mathematically justify that e 1 1 eκ(x,y) is an appropriate choice, we must verify its properties: 1. Asymptotics. As eκ(x, y) 1, e 1 1 eκ(x,y) 0 , indicating that nodes with high positive curvature experience very fast heat diffusion (minimal resistance). Conversely, as eκ(x, y) 1 , e 1 1 eκ(x,y) 1 e , meaning that nodes with high negative curvature have slow heat diffusion (higher resistance). 2. Continuity. The exponential function is smooth and continuous, ensuring that even small changes in the curvature result in smooth changes in the heat flow dynamics, which is crucial for stable numerical simulations and theoretical consistency. 3. Monotonicity. For eκ(x, y) > 0, e 1 1 eκ(x,y) is a decreasing function with respect to eκ(x, y). This means as curvature increases, the resistance decreases, aligning with the physical intuition of heat flow. 7.3.1 RELEVENT THEOREMS FOR CUSP LAPLACIAN Theorem 4 (Positive Semidefiniteness of Cusp Laplacian). The normalized Laplacian operator e L is positive semidefinite, i.e., for any real vector u Rn, we have u T e Lnu 0. Published as a conference paper at ICLR 2025 Proof. We start by showing that the normalized Cusp Laplacian e Ln = I e D 1/2 e A e D 1/2 = I e An (16) is positive semi-definite. Let u be any real vector of unit norm and f = e D 1/2u, then we have u T e Lnu = u T u u T e D 1/2 e A e D 1/2u = x,y=1 fxfy e Axy (17) x=1 e Dxxf 2 x x,y=1 fxfy e Axy = 1 x=1 e Dxxf 2 x 2 x,y=1 fxfy e Axy + y=1 e Dyyf 2 y ) (18) x,y=1 e Axy(fx fy)2 = 1 1 1 eκ(x,y) Axy(fx fy)2, (19) where the last step follows from the definition of the degree. We know that e 1 1 eκ(x,y) > 0 κ(x, y), hence our Cusp Laplacian is positive semidefinite. Theorem 5. The eigenvalues {eλi}n i=1 of the normalized Cusp Laplacian e Ln lie in the interval [0, 2]. Proof. We begin by noting that Theorem 4 shows that the normalized Cusp Laplacian e Ln has real, nonnegative eigenvalues, meaning we need only to prove that the largest eigenvalue, denoted as λn, is less than or equal to 2. Before moving to that, we show that 0 is indeed an eigenvalue of e L associated with the unit eigenvector τ where τ = e Dii P v e Dvv . Let 1 be the all one vector. Then, a direct calculation reveals that e Lsymτ = τ e D 1/2 e A e D 1/2τ = τ e D 1/2 e A e D 1/2 e D1/21 1 q P v e Dvv (20) = τ e D 1/2 e A1 1 q P v e Dvv = τ e D 1/2 e D1 1 q P v e Dvv (21) = τ e D1/21 1 q P v e Dvv = τ τ = 0. (22) Combining this result with the positive semi-definite property of the Laplacian shows that 0 is indeed the smallest eigenvalue of e Lsym associated with the eigenvector τ. For the second part, using the Courant-Fischer theorem, we know that the largest eigenvalue can be expressed as: λn = max u =0 u e Lnu Substituting the definition of the normalized Cusp Laplacian e Ln = I e An into this expression, and letting f = e D 1/2u, we have: λn = max u =0 u e D 1/2e L e D 1/2u u u = max f =0 f e Lf The degree matrix, can be expressed in the quadratic form as f e Df = Pn x=1 e Dxx|fx|2. For the numerator involving e L, we expand the quadratic form: x,y=1 e Axy(fx fy)2 = 1 1 1 eκ(x,y) Axy(fx fy)2 (23) 1 1 eκ(x,y) Axy(fx + fy)2 2 x=1 |fx|2 n X x=1 Dxx|f(x)|2. (24) The last inequality follows from the fact that e 1 1 eκ(x,y) 1 e as eκ(x, y) 1 implies that it is always < 1. Thus, we can conclude that, f e Lf 2f e Df, and the Rayleigh quotient is bounded f e Lf f e Df 2. This shows that the largest eigenvalue of the normalized Cusp Laplacian e Ln is bounded by 2, completing the proof that the eigenvalues of e Ln are contained within the interval [0, 2]. Published as a conference paper at ICLR 2025 Figure 7: Architecture of GPRGNN. 7.4 GENERALISED PAGERANKS AND GPRGNN Equivalence of the GPR method and polynomial graph filtering. If we truncate the infinite series in the GPR definition at some integer K, PK k=0 γk e Ak n becomes a polynomial graph filter of degree K. Consequently, optimizing the GPR weights is tantamount to optimizing the polynomial graph filter. It is important to note that any graph filter can be approximated using a polynomial graph filter, enabling the GPR method to handle a wide variety of node label patterns. Additionally, increasing K enhances the approximation of the optimal graph filter. This again illustrates the advantage of large-step propagation. GPRGNN architecture. GPR-GNN initially derives hidden state features for each node and subsequently employs GPR to disseminate them. The GPR-GNN procedure can be represented as: ˆP = softmax(Z), Z = k=0 γk H(k), H(k) = e Asym H(k 1), H(0) i: = fθ(Xi:), (25) Here, fθ(.) denotes a neural network parametrized by {θ}, which produces the hidden state features H(0). The GPR weights γk are optimized alongside {θ} in an end-to-end manner. The GPR-GNN model is straightforward to interpret: As previously mentioned, GPR-GNN is capable of adaptively managing the contribution of each propagation step to fit the node label pattern. Analyzing the trained GPR weights also aids in understanding the topological properties of a graph, such as identifying the optimal polynomial graph filter. 7.4.1 PROOF OF THEOREM 1 We first state the formal version of Theorem 1 Theorem 6 (Formal version of Theorem 1). Assume the graph G is connected. Let λ1 λ2 ... λn and eλ1 eλ2 ... eλn be the eigenvalues of e An and e Ln, respectively. If γl 0 l {0, 1, ..., L}, PL l=0 γl = 1 and l > 0 such that γl > 0, then gγ,L(λi) gγ,L(λ1) < 1 i 2. Also, if γl = ( α)l, α (0, 1) and L , then lim L gγ,L(λi) lim L gγ,L(λ1) > 1 i 2. 1. Note that gγ,L(λi) gγ,L(λ1) < 1 i 2 implies the low-pass case as after applying the graph filter gγ,L, the lowest frequency component (correspond to λ1) further dominates. 2. Unfiltered case. Recall that in the unfiltered case, we do not multiply with e An. It can also be viewed as multiplying the identity matrix I, where the eigenvalue ratio is |λi|0 |λ1|0 = 1. Hence gγ,L acts like a low pass filter in this case. 3. In contrast, lim L gγ,L(λi) lim L gγ,L(λ1) > 1 i 2 implies that after applying the graph filter, the lowest frequency component (correspond to λ1) no longer dominates. This corresponds to the high pass filter case. Proof. We start with the low pass filter result. From Theorem 5, we know that 0 eλ1 eλ2 ... eλn 2. Given the spectrum of e An, we know that e An has spectrum negatives of e An, and I e An adds one to each eigenvalue of e An . Hence, 1 λ1 λ2 ... λn 1 follows directly. Now, we know that λ1 = 1 and |λi| < 1, i 2. Further, we have assumed that gγ,L(λ1) = PL l=0 γl = 1. Hence, proving Theorem 6 is equivalent to show |gγ,L(λi)| < 1 i 2. Published as a conference paper at ICLR 2025 This is obvious since gγ,L(λ) = PL l=0 γlλl is a polynomial of order L with nonnegative coefficients. It is easy to check that l 1, |λ|l < 1, |λ| < 1. Combine with the fact that all γk s are nonnegative we have l=0 γl|λl| = l=0 γl|λ|l (a) l=0 γl = 1. Finally, note that the only possibility that the equality holds is γl = δ0,l since l 1, |λ|l < 1, |λ| < 1. However, by assumption PL l=0 γl = 1 and l > 0 such that γl > 0 we know that this is impossible. Hence (a) is a strict inequality <. For the high pass filter result, it can be observed that: lim L gγ,L(λ) = lim L l=0 γlλl = lim L l=0 ( αλ)l = 1 1 + αλ, where the last step is due to the fact that α (0, 1) and thus lim L ( αλ)L = 0, |λ| 1. Thus we have lim L gγ,L(λi) lim L gγ,L(λ1) = 1 + α 1 + αλi (b)> 1 i 2. The strict inequalities (b) is from the fact that |λi| < 1, i 2. Notably, supλ [1, 1) 1 1+αλ happens at the boundary λ = 1, which corresponds the the bipartite graph. It further shows that the graph filter with respect to the choice γl = ( α)l emphasizes high frequency components and thus it is indeed acting as a high pass filter. 7.4.2 WHY GPRGNN AS THE BACKBONE OF CUSP? In this section, we elaborate on why GPR is an ideal backbone when compared to other Spectral GNNs. 1. Adaptive Filter Design. GPR learns the filter coefficients directly, allowing the spectral response to adapt to the task and dataset. This flexibility is critical for modeling both homophilic and heterophilic graphs. 2. Universality. Unlike fixed low-pass filters like Cheb Net, which excel primarily in homophilic settings, GPR s learnable filters enable it to balance low-pass and high-pass components, making it suitable for both homophilic and heterophilic graphs. This is one of the main goals of our paper - to achieve superior performance on homophilic and heterophilic tasks. Fixed polynomial filters in Cheb Net and Bernsteinbased methods approximate spectral responses up to a fixed order, limiting their ability to model complex spectral properties. 3. GPRGNN escapes oversmoothing. GPR weights are adaptively learnable, which allows GPR-GNN to avoid over-smoothing and trade node and topology feature informativeness. See Section 4 of Chien et al. (2020) for more theoretical analysis on the same and proofs, which is beyond the scope of this work. GPR not only mitigates feature over-smoothing, but also works on highly diverse node label patterns (See Section 4 and 5 of Chien et al. (2020)). 4. Capturing node features and graph topology. In many important graph data processing applications, the acquired information includes both node features and observations of the graph topology. GPRGNN jointly optimizes node feature and topological information extraction, regardless of the extent to which the node labels are homophilic or heterophilic. 5. Filter Bank Construction. Using GPR based spectral filters, helps us to effectively construct a filter bank where each adaptive filter contributes to a specific spectral profile, enabling the model to aggregate information across different spectral bands. This approach captures diverse patterns in node features and topology, unlike Cheb Net or Bernstein-based methods, which rely on fixed polynomial approximations and lack such flexibility. 7.5 THEOREMS FOR CURVATURE ENCODING Theorem 7 (Bochner s Theorem). (Moeller et al., 2016) A continuous, translation-invariant kernel K(x, y) = Ψ(x y) on Rd is positive definite if and only if there exists a non-negative measure on R such that Ψ is the Fourier transform of the measure. 7.5.1 PROOF OF DEFINITION 2 Proof. Using the Bochner s theorem stated above, our curvature kernel KR has the expression: KR(eκa, eκb) = ΨR(eκa, eκa) = Z R eiω(eκa eκb)p(ω)dω = Eω[ξω(eκa)ξω(eκb) ], (26) Published as a conference paper at ICLR 2025 where ξω(eκ) = eiωeκ. Since kernel KR and measure p(ω) are real, we extract the real part of (26): KR(eκa, eκb) = Eω cos(ω(eκa eκb)) = Eω cos(ωeκa) cos(ωeκb) + sin(ωeκa) sin(ωeκb) . (27) The above formulation suggests approximating the expectation by the Monte Carlo integral, i.e. KR(eκa, eκb) 1 d C Pd C i=1 cos(ωieκa) cos(ωieκb)+sin(ωieκa) sin(ωieκb), with ω1, . . . , ωd C i.i.d p(ω). Therefore, we propose the finite dimensional functional mapping to Rd C as: ΦRd C (eκ) = h cos(ω1eκ), sin(ω1eκ), . . . , cos(ωd Ceκ), sin(ωd Ceκ) i (28) The unknown probability distribution p(ω) is estimated using the inverse cumulative distribution function (CDF) transformation as in Xu et al. (2020). Since our GNN is operating in the mixed-curvature space, we must map our defined curvature kernel based representations to the product manifold. We do so using the exponential map, for a node x with ORC curvature eκ(x) as: ΦPd C (eκ(x)) = gθ Q q=1exp κ(q) 0 (ΦRd C (eκ(x))) (29) = gθ Q q=1Φ M κ(q),d(q) q (eκ(x)) (30) where exp κ(q) 0 : Rd C M κ(q),d(q) q denotes the exponential map on the qth component manifold with curvature κ(q), || is the concatenation operator and gθ : Pdf Pd C is a Riemannian projector. We need gθ because we maintain a single product manifold for CUSP with total dimension df. So, upon taking the exponential map with respect to this product manifold, we are required to project the curvature embeddings to the required dimension d C. 7.5.2 PROOF OF THEOREM 2 Proof. We begin by recalling that in Euclidean space, the curvature kernel KR is: KR(eκa, eκb) = ΦRd C (eκa), ΦRd C (eκb) = ΨR(eκa eκb). The key property here is translation invariance: KR(eκa + ec, eκb + ec) = KR(eκa, eκb) = ΨR(eκa eκb). Next, we move to the product manifold Pd C, which consists of multiple components of different curvatures, such as hyperbolic, spherical, and Euclidean spaces. For each component manifold M κ(q),d(q) q with curvature κ(q), the stereographic inner product ., . κ x : Tx Mn κ Tx Mn κ R, is defined on the tangent plane of the Riemannian manifold as: u, v κ x = u T gκ xv = (λκ x)2 u, v , where the conformal factor λκ x is defined as: λκ x = 2 1 + κ x 2 2 . This conformal factor modulates the stereographic projection in the curved space, and it ensures that distances are mapped correctly in the manifold space. Now, consider the inner product between the curvature embeddings ΦPd C (eκa) and ΦPd C (eκb) in the mixed-curvature space. KP(eκa, eκb) = q=1 Φ M κ(q),d(q) q (eκa), Φ M κ(q),d(q) q (eκb) κ(q), where each component manifold M κ(q),d(q) q contributes to the overall inner product in the product manifold Pd C. Using the stereographic inner product in each component, we can write: KP(eκa, eκb) = λ κ(q) x 2 ΦRd C (eκa), ΦRd C (eκb) . We now need to show that translation invariance holds in the mixed-curvature product manifold. Since the conformal factor λκ x depends only on the norm x 2, any translation by a constant ec does not affect the relative difference between curvature embeddings. Specifically, for any constant shift ec: KP(eκa + ec, eκb + ec) = λ κ(q) x 2 ΨR((eκa + ec) (eκb + ec)) = ΨP(eκa eκb). Thus, the kernel in the mixed-curvature space remains invariant to translation, completing the proof. Published as a conference paper at ICLR 2025 7.6 EXPERIMENTATION 7.6.1 DATASETS The performance of CUSP is evaluated over eight benchmark datasets for two primary tasks: Node Classification (NC) and Link Prediction (LP). These datasets encompass both homophilic and heterophilic domains. Detailed descriptions of each dataset are provided below. 1. Citation Networks. Cora, Pub Med, and Citeseer (Sen et al., 2008; Yang et al., 2016) are citation networks in which nodes symbolize research papers, and edges denote citation links between them. Each node is labeled with its subject category. This dataset is commonly utilized for node classification because of its pronounced homophilic properties. 2. Wikipedia graphs. Chameleon and Squirrel (Rozemberczki et al., 2021) are heterophilic graphs derived from Wikipedia articles. Nodes represent articles, and edges represent hyperlinks between them. Node labels correspond to website traffic levels. 3. Actor Co-occurrence Network. Actor (Tang et al., 2009) is a heterophilic graph dataset where nodes depict actors and edges signify co-occurrences on the same Wikipedia page. The node labels correspond to the professional background of the actors. 4. Webpage graphs. Texas and Cornell4 are parts of the Web KB dataset, and are sparsely connected heterophilic graphs. Here, nodes represent web pages, and edges represent hyperlinks between them. Labels reflect different types of webpages. 7.6.2 BASELINES This part offers an in-depth discussion of the baseline models against which CUSP is compared. We classify the baseline models into three categories: Spatial, Riemannian, and Spectral methods, which each address a distinct facet of graph neural network architectures. Spatial baselines. The first kind of baselines includes the traditional spatial methods, which directly operate on the node features and their immediate neighborhoods. 1. GCN (Kipf & Welling, 2016). Graph Convolutional Networks (GCNs) represent one of the foundational graph neural networks that utilize spectral graph convolution in the spatial domain. They derive node embeddings by combining features from neighboring nodes via a linear combination involving the adjacency matrix and the nodes features. 2. GAT (Veliˇckovi c et al., 2017). Graph Attention Network (GAT) introduces attention mechanisms to graph neural networks. Each node assigns learnable attention weights to its neighbors and aggregates their features based on these weights. 3. Graph SAGE (Hamilton et al., 2017). Graph SAGE is an inductive technique designed to learn node embeddings by sampling and aggregating features from a fixed set of neighboring nodes, instead of processing the entire graph. This method enables Graph SAGE to create embeddings for nodes not encountered during training by using efficient neighborhood sampling. Riemannian Baselines. Riemannian models function within non-Euclidean spaces (such as hyperbolic or spherical manifolds) and are tailored for graph data characterized by intricate geometric properties (e.g. hierarchical or cyclic structures). 1. HGCN (Chami et al., 2019). Hyperbolic Graph Convolutional Networks utilize hyperbolic geometry to represent the hierarchical and tree-like characteristics of graphs. This approach extends GCN to hyperbolic space by introducing a hyperbolic variant of the convolutional operation. It is especially suitable for datasets that exhibit hierarchical or tree-like configurations. 2. HGAT (Zhang et al., 2021b). Hyperbolic Graph Attention Network (HGAT) extends Graph Attention Networks (GAT) into hyperbolic space by integrating attention mechanisms with hyperbolic geometry, and calculates the attention weights among the nodes in the hyperbolic space to enhance the aggregation of features. 3. κGCN (Bachmann et al., 2020). κGCN allows for learning the curvature of each node in a graph and generalizes GCN to operate in mixed-curvature spaces. The curvature parameter κ determines whether a node lies in hyperbolic, spherical, or Euclidean space. By learning curvature adaptively, κGCN offers flexibility in modeling graphs with regions of different geometries, providing a better fit for graphs with complex structures. 4http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb Published as a conference paper at ICLR 2025 4. QGCN (Xiong et al., 2022). Pseudo-Riemannian GCN extends a GCN to a pseudo-Riemannian manifold, enabling functionality in mixed-curvature spaces. This network is capable of modeling graph regions with both positive and negative curvature. 5. Self MGNN (Sun et al., 2022). Self MGNN generates embeddings within a mixed-curvature space through self-supervision. It dynamically allocates varied curvatures to different regions of the graph by utilizing a mixed-curvature embedding space. This approach incorporates both self-supervision and mixed-curvature learning to improve performance on heterogeneous graphs. Spectral Baselines. These techniques utilize the eigenvalues of either the graph Laplacian or adjacency matrix to establish convolutional filters that function in the frequency domain. 1. Cheby Net (Defferrard et al., 2016). Cheby Net implements spectral convolutions through a polynomial approximation of the graph Laplacian, sidestepping the expensive process of eigenvalue decomposition. Instead, it approximates the convolution using Chebyshev polynomials. This approach allows Cheby Net to execute localized graph convolutions efficiently, making it well-suited for handling larger graphs. 2. Bern Net (He et al., 2021). Bern Net employs Bernstein polynomials to approximate graph filters, providing flexible management over the filter s frequency response. This method extends polynomial-based graph filters and is adaptable to various frequency elements in graphs. 3. GPRGNN (Chien et al., 2020). The Generalized Page Rank Graph Neural Network (GPRGNN) builds upon the Personalized Page Rank (PPR) approach, integrating it into the framework of graph neural networks. It propagates node features through the graph by using a weighted sum of adjacency matrix powers, dynamically adjusting to both homophilic and heterophilic graphs. 4. Fi GURe (Ekbote et al., 2024). Fi GURe employs adaptive filters to capture various sections of the graph spectrum, enabling it to learn both high-pass and low-pass filters specific to the task. It dynamically selects the optimal filter bank to accurately represent the graph s architecture. 7.6.3 EXPERIMENTAL RESULTS FOR LINK PREDICTION Baseline Cora Citeseer Pub Med Chameleon Actor Squirrel Texas Cornell Av. Gain GCN 88.54 0.51 85.42 0.89 91.31 0.73 86.07 0.64 85.12 0.78 90.01 0.15 69.08 0.99 73.09 0.92 9.58 GAT 85.45 0.66 87.23 0.11 87.65 0.04 88.99 0.13 87.33 0.08 90.23 0.14 68.79 0.72 75.12 0.77 9.31 SAGE 87.12 0.82 90.71 0.65 90.09 0.90 90.01 0.58 86.06 0.73 91.02 0.61 76.54 0.69 77.98 0.88 6.97 HGCN 91.63 0.55 94.13 0.67 91.04 0.79 91.45 0.62 90.01 0.80 92.34 0.01 69.99 0.84 74.03 0.57 6.34 HGAT 90.43 0.03 91.02 0.16 88.99 0.89 89.77 0.02 90.99 0.01 89.22 0.04 71.58 0.89 72.03 0.22 7.66 κGCN 92.04 0.70 93.33 0.57 92.45 0.85 92.03 0.63 90.45 0.88 91.35 0.60 76.09 0.76 73.05 0.71 5.56 QGCN 92.17 0.79 92.75 0.52 92.16 0.09 91.67 0.05 91.07 0.06 90.98 0.92 75.44 0.10 73.89 0.26 5.65 Self MGNN 93.12 0.04 92.99 0.91 90.99 0.17 93.51 0.14 91.98 0.19 95.01 0.16 74.51 0.62 78.99 0.81 4.28 Cheby Net 88.23 0.85 89.22 0.06 86.54 0.29 90.01 0.23 88.09 0.44 92.13 0.57 73.45 0.01 79.01 0.18 7.33 Bern Net 86.34 0.13 87.09 0.60 85.34 0.82 87.15 0.37 87.22 0.15 91.22 0.55 77.65 0.87 78.34 0.19 8.12 GPRGNN 91.16 0.72 93.05 0.81 92.03 0.01 91.22 0.16 89.76 0.62 92.34 0.23 76.05 0.18 80.04 0.12 4.96 Fi GURe 91.98 0.69 94.33 0.15 92.67 0.83 93.09 0.31 90.11 0.29 95.43 0.65 76.99 0.16 80.12 0.58 3.82 CUSP 95.08 0.13 96.88 0.65 96.01 0.01 97.66 0.33 96.04 0.38 97.17 0.11 81.23 0.14 85.23 0.05 0 Imp. 1.96 2.55 3.34 4.15 4.06 1.74 3.58 5.11 Table 8: Performance comparision of CUSP with baselines for LP task (Mean AUC Score 95% confidence interval). First, Second and Third best performing models are highlighted. Dataset CUSP CUSPeuc CUSPlap CUSPenc CUSPpool CUSPfil Cora 95.08 0.13 92.45 0.25 93.21 0.42 93.78 0.33 92.13 0.40 93.02 0.27 Citeseer 96.88 0.65 94.03 0.30 95.34 0.22 94.08 0.29 94.01 0.31 94.56 0.26 Pub Med 96.01 0.01 93.52 0.60 94.81 0.40 94.92 0.38 94.11 0.35 94.16 0.39 Chameleon 97.66 0.33 95.02 0.44 96.23 0.52 96.13 0.28 95.13 0.40 95.67 0.57 Actor 96.04 0.38 91.45 0.55 93.99 0.32 92.81 0.42 91.98 0.47 92.13 0.49 Squirrel 97.17 0.11 93.14 0.22 95.56 0.37 94.33 0.43 93.75 0.32 92.89 0.14 Texas 81.23 0.14 78.45 0.36 79.88 0.52 80.12 0.42 79.23 0.37 77.81 0.54 Cornell 85.23 0.05 82.89 0.32 83.91 0.39 84.23 0.37 82.31 0.48 82.45 0.42 Avg. Gain 0 3.0437 1.5462 1.8625 2.8312 2.8262 Table 9: Ablation study (LP) results. CUSPeuc is the Euclidean variant, CUSPlap uses the traditional Laplacian, CUSPenc gets rid of curvature encoding, CUSPpool replaces Cusp pooling with concatenation, and CUSPfil uses a single filter instead of a filter bank. Av. Gain represents the average gain of CUSP over the ablation model in that column, averaged across the different datasets. Published as a conference paper at ICLR 2025 CUSP Cora Citeseer Pub Med Chameleon Actor Squirrel Texas Cornell H24 S24 93.10 0.22 94.33 0.35 95.41 0.25 97.28 0.45 95.25 0.44 96.05 0.51 75.43 0.31 76.95 0.38 (H8)2 (S8)2 E16 93.00 0.27 95.11 0.28 94.52 0.34 96.66 0.39 95.34 0.38 95.40 0.57 80.95 0.42 84.80 0.43 H8 S8 E32 93.20 0.24 93.25 0.30 95.65 0.30 97.11 0.37 94.92 0.40 95.51 0.45 81.23 0.14 85.23 0.05 H16 (S16)2 93.50 0.25 94.12 0.33 95.32 0.45 97.66 0.33 95.20 0.34 97.17 0.11 80.30 0.45 78.02 0.47 (H16)2 E16 93.25 0.21 93.05 0.32 96.01 0.01 96.80 0.40 96.04 0.38 96.15 0.55 79.50 0.38 84.60 0.40 H24 E24 92.70 0.30 94.71 0.35 95.45 0.42 96.51 0.47 95.00 0.37 88.95 0.53 79.70 0.54 81.20 0.60 S24 E24 88.50 0.39 91.23 0.35 89.99 0.44 91.12 0.43 79.90 0.52 91.20 0.51 79.08 0.52 83.75 0.48 H16 S16 E16 95.08 0.13 96.88 0.65 95.55 0.41 94.44 0.34 95.55 0.35 96.25 0.38 81.07 0.29 84.60 0.30 (S8)2 E32 88.05 0.40 90.35 0.42 86.30 0.48 89.34 0.45 93.33 0.54 90.25 0.48 80.10 0.43 82.01 0.55 (H16)3 89.10 0.44 91.01 0.50 93.93 0.55 96.19 0.50 95.25 0.49 92.30 0.52 77.77 0.53 79.44 0.58 Table 10: Performance comparison of CUSP with different manifold signatures for Link Prediction (LP). Best performing signatures are in Bold, and cases with a large decline in performance because of manifold mismatch are in Blue. Dataset Signature Cora H16( 0.21, 0.31) S16(+0.49, 0.38) E16(0, 0.31) Citeseer H16( 0.78, 0.29) S16(+0.55, 0.39) E16(0, 0.32) Pub Med H16( 0.76, 0.56) H16( 0.28, 0.41) E16(0, 0.03) Chameleon H16( 0.34, 0.09) S16(+0.71, 0.25) S16(+0.55, 0.34) Actor H16( 0.77, 0.17) H16( 0.39, 0.42) E16(0, 0.41) Squirrel H16( 0.17, 0.23) S16(+0.54, 0.38) S16(+0.38, 0.39) Texas H8( 0.38, 0.31) S8(+0.18, 0.19) E32(0, 0.50) Cornell H8( 0.41, 0.19) S8(+0.09, 0.26) E32(0, 0.55) Table 11: Learning results of CUSP on Link Prediction (LP) task for the best performing product signature. Format of entries manifold type (dim) (learnt curvature, learnt manifold weight). Dataset I Z(1) Z(2) Z(3) Z(4) Z(5) Z(6) Z(7) Z(8) Z(9) Z(10) Cora 0.1125 0.2393 0.0520 0.0504 0.0676 0.0272 0.1531 0.1251 0.0939 0.0095 0.0694 Citeseer 0.2131 0.0150 0.2195 0.0283 0.0922 0.1530 0.0350 0.0320 0.0282 0.1254 0.0582 Pub Med 0.0279 0.1793 0.0285 0.0296 0.0563 0.3870 0.0603 0.0324 0.0612 0.0271 0.1104 Chameleon 0.0601 0.1136 0.1694 0.0924 0.1329 0.1605 0.1026 0.0157 0.0475 0.0344 0.0709 Actor 0.1230 0.0321 0.0324 0.1147 0.1287 0.3700 0.0136 0.0389 0.0604 0.0691 0.0172 Squirrel 0.0182 0.0289 0.1056 0.2099 0.0209 0.0355 0.0854 0.2159 0.0475 0.1042 0.1279 Texas 0.0890 0.1983 0.0179 0.4570 0.0035 0.1429 0.0177 0.0087 0.0474 0.0131 0.0046 Cornell 0.0886 0.2026 0.0181 0.4460 0.0034 0.1472 0.0183 0.0089 0.0487 0.0134 0.0048 Table 12: Learned filter weights (Link Prediction) for the top-performing split, distinguishing between homophilic (favoring low-pass filters) and heterophilic (favoring high-pass filters). First, second, and third highest filter weights are highlighted. 7.6.4 ESTIMATING PRODUCT MANIFOLD SIGNATURE In our model, the mixed-curvature product manifold Pd C is essential for representing the geometric structure of the data. The curvature configuration needed for each dataset depends on the intrinsic geometry of its graph. To generalize across various datasets, we aim to determine the optimal signature of the product manifold, specifically the proportions of hyperbolic, spherical, and Euclidean components. This estimation is based on analyzing the Ollivier-Ricci curvature (ORC) distribution as a heuristic. See Figures 4 and 5 in Appendix 7.2.1 for the ORC distribution of multiple datasets. For datasets with many positively curved edges, we select a Spherical component, and for those with negatively curved edges, we choose a Hyperbolic component. For example, the curvature distribution of Pub Med s edges in Figure 4 shows two significant peaks around 0.45 and +0.25. Given this distribution, we opt for Spherical and Hyperbolic components when evaluating CUSP on Pub Med. Empirically, the best performance for Pub Med is achieved with the signature (H16)2 E16. We initialize the curvatures of Pub Med s component manifolds with these prominent values: H with 0.45 and S with +0.25. We select two hyperbolic manifolds to capture different curvature ranges. An overview of this simple idea is provided in Algorithm 1. By systematically analyzing the curvature distribution, our heuristic-based algorithm identifies the manifold signature that best represents the dataset s underlying geometric structure. We heuristically cluster the curvature distribution and identify the centroid curvatures without altering their order or frequencies. The use of predefined dimensions allows for flexibility based on experimental settings. Since optimal dimension allocations can vary and are complex to analyze, we manually set the dimensions of the component manifolds as a hyperparameter. This ensures fair and uniform comparison across multiple datasets, as different datasets may perform best with different configurations. We do not claim that this algorithm finds the best possible, optimal combination of component manifolds, rather, it estimates a potential signature that might be a good fit for a particular dataset. Published as a conference paper at ICLR 2025 Algorithm 1 Product manifold signature estimation and curvature initialisation Require: Edge curvature histogram C = {(κi, fi)}N i=1 Threshold ϵ to distinguish between curved and flat regions Maximum number of Hyperbolic (Hmax) and Spherical (Smax) components. Total product manifold dimension d M (Optional) Preferred component manifold dimensions dpre (h), dpre (s), dpre (e) Ensure: Product manifold signature Pd M = Q q=1M κ(q),d(q) q 1: Normalize frequencies: f i = fi PN j=1 fj 2: Construct weighted curvature set: C = {(κi, f i)}N i=1 3: Determine optimal number of clusters K using methods like the elbow method, constrained by K Hmax + Smax + 1 There can be only 1 Euclidean component 4: Cluster C into K clusters using weighted clustering (e.g., weighted K-means) 5: Initialize empty lists H, S, E 6: for each cluster c in clusters do 7: Compute cluster centroid curvature κc = P (κi,f i) c κi |c| 8: Compute total frequency weight wc = P (κi,f i) c f i 9: if κc < ϵ and |H| Hmax then Negative curvature 10: Assign manifold component: Mq = Hκc Curvature initialization 11: Add (Mq, wc) to H 12: else if κc > ϵ and |S| Smax then Positive curvature 13: Assign manifold component: Mq = Sκc Curvature initialization 14: Add (Mq, wc) to S 15: else 16: Assign manifold component: Mq = E Approximate zero curvature, i.e. κc [ ϵ, ϵ] 17: Add (Mq, wc) to E 18: end if 19: end for 20: if Predefined dimensions dpre (h), dpre (s), dpre (e) are provided then 21: Assign dimensions d(q) to each component q as per predefined values Dimension assignment 22: else 23: Set total number of components Q = |H| + |S| + |E| Dimension assignment 24: Allocate dimensions d(q) to each component q: d(q) = d M wq PQ p=1 wp Proportional to weights 25: Adjust d(q) to ensure PQ q=1 d(q) = d M 26: end if 27: Formulate manifold signature: Pd M = |H| h=1H κ(h),d(h) h |S| s=1S κ(s),d(s) s Ed(e) 7.6.5 MORE EXPERIMENTAL SETTINGS Hyperparameter Tuning Configurations Description L {5, 10, 15, 20, 25} Total number of graph filters. δ {0.2, 0.5, 0.7} Neighbourhood weight distribution parameter for ORC α {0.1, 0.3, 0.5, 0.9} Alpha (initialization) parameter for GPR propagation d C {8, 16, 32, 64} Total dimension of curvature embeddings. d M {32, 48, 64, 128, 256} Total dimension of the product manifold. dropout {0.2, 0.3, 0.5} Dropout rate epochs {50, 100, 300} Number of training epochs lr {1e 4, 4e 3, 0.001, 0.01} Learning rate weight decay {0, 1e 4, 5e 4} Weight decay Table 13: Hyperparameter configurations used in the experiments for all baselines. Some of the hyperparameters are specific to CUSP. We highlight the final configuration of CUSP for NC in Red.